

Applied Soft Computing 11 (2011) 326–336

Contents lists available at ScienceDirect

Applied Soft Computing

journal homepage: www.elsevier.com/locate/asoc

Application of particle swarm optimization to association rule mining

R.J. Kuo a,∗, C.M. Chao b, Y.T. Chiu c

a Department of Industrial Management, National Taiwan University of Science and Technology, No. 43, Section 4, Kee-Lung Road, Taipei 106, Taiwan
b Department of Business Management, National Taipei University of Technology, No. 1, Section 3, Chung-Hsiao East Road, Taipei 106, Taiwan
c Department of Industrial Engineering and Management, National Taipei University of Technology, No. 1, Section 3, Chung-Hsiao East Road, Taipei 106, Taiwan

Article info

Article history:
Received 7 September 2007
Received in revised form 18 October 2009
Accepted 17 November 2009
Available online 24 November 2009

Keywords:
Association rule mining
Particle swarm optimization algorithm

Abstract

In the area of association rule mining, most previous research had focused on improving computational efficiency. However, determination of the threshold values of support and confidence, which seriously affect the quality of association rule mining, is still under investigation. Thus, this study intends to propose a novel algorithm for association rule mining in order to improve computational efficiency as well as to automatically determine suitable threshold values. The particle swarm optimization algorithm first searches for the optimum fitness value of each particle and then finds corresponding support and confidence as minimal threshold values after the data are transformed into binary values. The proposed method is verified by applying the FoodMart2000 database of Microsoft SQL Server 2000 and compared with a genetic algorithm. The results indicate that the particle swarm optimization algorithm really can suggest suitable threshold values and obtain quality rules. In addition, a real-world stock market database is employed to mine association rules to measure investment behavior and stock category purchasing. The computational results are also very promising. © 2009 Elsevier B.V. All rights reserved.

1. Introduction

With the development of information technology, there are many different kinds of information databases, such as scientific data, medical data, financial data, and marketing transaction data. How to effectively analyze and apply these data and find the critical hidden information in these databases has become a very important issue. Data mining has been the most widely discussed and frequently applied tool in recent decades. It has been successfully applied in the areas of scientific analysis, business application, and medical research. Not only are its applications getting broader, but its computational efficiency and accuracy are also improving. Data mining can be categorized into several models, including association rules, clustering and classification. Among these models, association rule mining is the most widely applied method. The Apriori algorithm is the most representative algorithm, and many modified algorithms derived from it focus on improving its efficiency and accuracy. However, two parameters, minimal support and confidence, are always determined by the decision-maker him/herself or through trial-and-error; thus, the algorithm lacks both objectiveness and efficiency. Therefore, the main purpose of this study is to propose an improved algorithm that can provide feasible threshold values for minimal support and confidence.

For the purpose of comparison, this study first employs the embedded database of Microsoft SQL Server 2000 to assess the proposed algorithm. The simulation results show that the proposed algorithm has better computational performance than genetic algorithms. Moreover, real-world data provided by a well-known security firm also indicate that the proposed algorithm can mine the relationship between investors' transaction behavior in different industrial categories.

The rest of this study is organized as follows. Section 2 briefly presents the general background, while the proposed method is explained in Section 3. Sections 4 and 5 illustrate the computational results of the Microsoft SQL Server 2000 database and a stock company database, respectively. The concluding remarks are finally made in Section 6.

2. Literature review

This section will briefly present the general backgrounds of association rule mining and the particle swarm optimization method, respectively.

2.1. Association rule

In many applications of data mining technology, applying association rules is the most broadly discussed method. This method is capable of finding interesting associative and relative characteristics from commercial transaction records and helping decision-makers formulate business strategy.

∗ Corresponding author. Tel.: +886 2 2737 6328; fax: +886 2 2737 6344. E-mail address: [email protected] (R.J. Kuo).

1568-4946/$ – see front matter © 2009 Elsevier B.V. All rights reserved. doi:10.1016/j.asoc.2009.11.023

2.1.1. Definition of association rule

Agrawal et al. [31] first proposed the issue of mining association rules in 1993. They pointed out that some hidden relationships exist between purchased items in transactional databases. Therefore, mining results can help decision-makers understand customers' purchasing behavior. An association rule is in the form of $X \rightarrow Y$, where $X$ and $Y$ represent itemsets, or products, drawn from the set $I = \{i_1, i_2, \ldots, i_m\}$ of all possible items, and $X \cap Y = \emptyset$. In a general transaction database $D = \{T_1, T_2, \ldots, T_k\}$, the rule can represent the possibility that a customer will buy product $Y$ after buying product $X$. However, a mined association rule must satisfy two parameters at the same time:

(1) Minimal support: Finding frequent itemsets with their supports above the minimal support threshold.

$$\mathrm{Support}(X \rightarrow Y) = \frac{\#\ \text{of transactions which contain } X \text{ and } Y}{\#\ \text{of transactions in the database}} \tag{1}$$

(2) Minimal confidence: Using frequent itemsets found in Eq. (1) to generate association rules that have confidence levels above the minimal confidence threshold.

$$\mathrm{Confidence}(X \rightarrow Y) = \frac{\#\ \text{of transactions which contain } X \text{ and } Y}{\#\ \text{of transactions which contain } X} \tag{2}$$

2.1.2. Association rule algorithm

The most representative association rule algorithm is the Apriori algorithm, which was proposed by Agrawal et al. in 1993. The Apriori algorithm repeatedly generates candidate itemsets and uses minimal support and minimal confidence to filter these candidate itemsets to find high-frequency itemsets. Association rules can be figured out from the high-frequency itemsets. The process of finding high-frequency itemsets from candidate itemsets is introduced in Fig. 1 [1].

In Fig. 1, Step 1.1 finds the frequent itemset, represented as $L_1$. In Steps 1.2 through 1.10, $L_1$ is utilized to generate candidate itemset $C_k$ to find $L_k$. The process "Apriori gen" generates candidate itemsets and processes join and prune. The join procedure, Steps 2.1–2.4, combines $L_{k-1}$ into candidate itemsets. The prune procedure, Steps 2.5–2.7, deletes infrequent candidate itemsets. Infrequent itemsets are tested in "has infrequent subset." After the Apriori algorithm has generated frequent itemsets, association rules can be generated. As long as the calculated confidence of a frequent itemset is larger than the predefined minimal confidence, its corresponding association rule can be accepted.

Fig. 1. The Apriori algorithm.

2.1.3. Extension of the association rule algorithm

Since the processing of the Apriori algorithm requires plenty of time, its computational efficiency is a very important issue. In order to improve the efficiency of Apriori, many researchers have proposed modified association rule-related algorithms. Savasere et al. [2] introduced a partition algorithm for mining association rules that is fundamentally different from the classic algorithm. First, the partition algorithm scans the database once to generate a set of all potentially large itemsets, and then the supports for all the itemsets are measured in the second scan of the database. The key to the correctness of the partition algorithm is that a potentially large itemset appears as a large itemset in at least one of the partitions. The algorithm logically divides the database into a number of non-overlapping partitions, which can be held in the main memory. The partitions are considered one at a time and all large itemsets are generated for that partition. These large itemsets are further merged to create a set of all potential large itemsets. Then the supports for these itemsets are generated [2].
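The two-scan partition scheme can be illustrated with a minimal Python sketch. It is not the implementation of Savasere et al.: the function names, the `max_len` cap, and the brute-force enumeration inside each partition are assumptions made only to keep the example short. Support is computed as the fraction of transactions containing the itemset, as in Eq. (1).

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset` (Eq. (1))."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def local_frequent_itemsets(partition, min_support, max_len):
    """Brute-force enumeration of the itemsets that are frequent in one partition.

    Hypothetical helper: a real system would run Apriori-style candidate
    generation here instead of trying every combination.
    """
    items = sorted(set().union(*partition))
    frequent = set()
    for k in range(1, max_len + 1):
        for combo in combinations(items, k):
            candidate = frozenset(combo)
            if support(candidate, partition) >= min_support:
                frequent.add(candidate)
    return frequent


def partition_mine(transactions, n_partitions, min_support, max_len=3):
    """Two-scan partition scheme: local candidates first, global check second."""
    size = -(-len(transactions) // n_partitions)  # ceiling division
    partitions = [transactions[i:i + size]
                  for i in range(0, len(transactions), size)]
    # Scan 1: merge the itemsets that are frequent in at least one partition.
    candidates = set()
    for part in partitions:
        candidates |= local_frequent_itemsets(part, min_support, max_len)
    # Scan 2: measure global support and keep the itemsets that reach the threshold.
    supports = {c: support(c, transactions) for c in candidates}
    return {c: s for c, s in supports.items() if s >= min_support}
```

Any itemset that is frequent in the whole database must be frequent in at least one partition, so the merged candidate set from the first scan cannot miss a globally frequent itemset; the second scan only removes false positives.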

Park et al. proposed the DHP algorithm in 1995. DHP can be derived from Apriori by introducing additional control. To this purpose, DHP makes use of an additional hash table that aims at limiting the generation of candidates as much as possible. DHP also progressively trims the database by discarding attributes in transactions or even by discarding entire transactions when they appear to be subsequently useless. Therefore, DHP includes two major features, the efficient generation of large itemsets and the effective reduction of transaction database sizes.

Toivonen proposed the sampling algorithm in 1996. This algorithm is involved in finding association rules to reduce database activity. The sampling algorithm applies the level-wise method to the sample, along with a lower minimal support threshold, to mine the superset of a large itemset. The method produces exact association rules, but in some cases it does not generate all the association rules, that is, some missing association rules might exist. Therefore, this approach requires only one full pass over the database in most cases, but two passes in the worst of cases [4].

2.1.3.1. The dynamic itemset counting (DIC) algorithm. The DIC algorithm was proposed by Brin et al. in 1997. One of the main design motivations was to limit the total number of passes performed over databases. DIC partitions a database into several blocks marked by start points and repeatedly scans the database. In contrast to Apriori, DIC can add new candidate itemsets at any start point, instead of just at the beginning of a new database scan. At each start point, DIC estimates the support of all itemsets that are currently counted and adds new itemsets to the set of candidate itemsets if all its subsets are estimated to be frequent [5].

2.1.3.2. The Pincer-Search algorithm. The Pincer-Search algorithm was proposed by Lin et al. in 1998 and can efficiently discover the maximum frequent set. The Pincer-Search algorithm combines both the bottom-up and top-down directions. The main search direction is still bottom-up, but a restricted search is conducted in the top-down direction. This search is used only for maintaining and updating a new data structure, namely, the maximum frequent candidate set. It is used to prune candidates in the bottom-up search. Another very important characteristic of the algorithm is that it is not necessary to explicitly examine every frequent itemset. Therefore, it performs well even when some maximal frequent itemsets are long. The Pincer-Search algorithm can reduce both the number of times the database is read and the number of candidates considered [6].

A single minimal support is insufficient for association rule mining since it cannot reflect the nature and frequency differences of the items in the database. In real-life applications, such differences can be very large. It is not satisfactory to set the minimal support too high, nor is it satisfactory to set it too low. Therefore, the Multiple Minimal Support algorithm was proposed by Liu et al. in 1999. It is a more flexible and powerful model, allowing the user to specify multiple minimal item supports. This model enables us to find rare item rules with frequent items [7].

2.1.3.3. An efficient hash-based method for discovering the maximal frequent set (HMFS) algorithm. In 2001, Yang et al. proposed the efficient hash-based method, HMFS, for discovering maximal frequent itemsets. The HMFS method combines the advantages of both the DHP and the Pincer-Search algorithm. The combination of the two methods leads to two advantages. First, the HMFS method, in general, can reduce the number of database scans. Second, the HMFS can filter the infrequent candidate itemsets and use the filtered itemsets to find the maximal frequent itemsets. These two advantages can reduce the overall computing time of finding the maximal frequent itemsets. In addition, the HMFS method also provides an efficient mechanism to construct the maximal frequent candidate itemsets so as to reduce the search space [8].

Genetic algorithms have also been applied in association rule mining [9]. This study uses weighted items to represent the importance of individual items. These weighted items are applied to the fitness function of heuristic genetic algorithms to estimate the value of different rules. These genetic algorithms can generate suitable threshold values for association rule mining. In addition, Saggar et al. proposed an approach concentrating on optimizing the rules generated using genetic algorithms. The most important aspect of their approach is that it can predict the rule that contains negative attributes [10]. In another study, a genetic algorithm was employed to mine the association rule oriented to the dataset in a manufacturing information system (MIS). According to the test results, the genetic algorithm had considerably higher efficiency [11].

In another study, an ant colony system was also employed for data mining under multi-dimensional constraints. The computational results showed that the proposed method could provide more condensed rules than the Apriori method. In addition, the computation time was also reduced [12]. This method was further integrated with a clustering method to provide more precise rules [13]. First, the dataset is clustered with the self-organizing map (SOM) network and the association rules in each cluster are then mined by an ACS-based association rule mining system. The results show that the new mining framework can provide better rules.

The improved algorithms described above have dramatically improved the efficiency of the Apriori algorithm. Among these improvements, some studies have focused on solving the problem of setting minimal support and minimal confidence to achieve more objective and efficient association rules. An increasing number of studies combine meta-heuristic methods, such as genetic algorithms and ant colony systems, with the Apriori algorithm. These studies have proven that such integration can improve Apriori's efficiency and discover association rules more precisely. However, the following problems still exist. First, if the experience rule is employed during association rule threshold decisions, such as in the determination of minimal support and minimal confidence, experimental data stored in the test database are linear; as a result, these experiences cannot completely reflect the actual situation. In addition, the technique of combining heuristic methods with genetic algorithms to search for ideal association rule thresholds still requires further improvement. This is because the requirements of parameter settings, like crossover and mutation, make the procedures more complicated.

 2.2. Particle swarm optimization algorithm

Kennedy and Eberhart proposed the particle swarm optimization (PSO) algorithm in 1995. The PSO algorithm has become an evolutionary computation technique and an important heuristic algorithm in recent years. The main concept of PSO originates from the study of fauna behavior.

 


2.2.1. Concept of particle swarm optimization algorithm

The PSO algorithm simulates the behaviors of bird flocking. Consider the following scenario: a group of birds are randomly searching for food in an area. There is only one piece of food in the area being searched. None of the birds know where the food is. However, the birds do know how far the food is during all iterations. So what is the best strategy to find the food? The most effective strategy is to follow the bird that is nearest to the food. PSO learned from such a scenario and used it to solve optimization problems. In PSO, each single solution is a "bird" in the search space. We refer to each solution as a "particle." All particles have fitness values, which are evaluated by the fitness function to be optimized. The particles also have velocities which direct the flight of the particles. Particles fly through the problem space by following the current optimum particles.

PSO is initialized with a group of random particles (solutions) and then searches for optima by updating generations. During all iterations, each particle is updated by following the two "best" values. The first one is the best solution (fitness) it has achieved so far. The fitness value is also stored. This value is called "pbest." The other "best" value that is tracked by the particle swarm optimizer is the best value obtained so far by any particle in the population. This best value is a global best and is called "gbest." After finding the two best values, each particle updates its corresponding velocity and position with Eqs. (3) and (4), as follows [14]:

$$v_{id}^{new} = v_{id}^{old} + c_1\,\mathrm{rand}()\,(pbest - x_{id}) + c_2\,\mathrm{rand}()\,(gbest - x_{id}) \tag{3}$$

$$x_{id}^{new} = x_{id}^{old} + v_{id}^{new} \tag{4}$$
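A minimal Python sketch of one update per Eqs. (3) and (4) is given below. The function name `pso_step`, the list-based particle representation, and the injectable `rng` argument are illustrative assumptions; `c1` and `c2` are the two acceleration coefficients of Eq. (3), and a fresh random number is drawn for each term and each dimension. Python's `random.random()` samples from [0, 1), which is used here as a stand-in for rand().

```python
import random


def pso_step(position, velocity, pbest, gbest, c1=2.0, c2=2.0, rng=random):
    """Return (new_position, new_velocity) for one particle, per Eqs. (3)-(4).

    All four vectors are lists with one entry per search-space dimension.
    `rng` only needs a `random()` method returning a float in [0, 1).
    """
    # Eq. (3): old velocity plus the pulls toward pbest and gbest.
    new_velocity = [
        v + c1 * rng.random() * (pb - x) + c2 * rng.random() * (gb - x)
        for x, v, pb, gb in zip(position, velocity, pbest, gbest)
    ]
    # Eq. (4): move the particle by its new velocity.
    new_position = [x + nv for x, nv in zip(position, new_velocity)]
    return new_position, new_velocity
```

A particle that already sits on both pbest and gbest keeps its old velocity, because both difference terms in Eq. (3) vanish.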

The variable definitions are as follows: $v_{id}$ is the particle velocity of the idth particle; $x_{id}$ is the idth, or current, particle; $i$ is the particle's number; $d$ is the dimension of the searching space; rand() is a random number in (0, 1); $c_1$ is the individual factor; $c_2$ is the societal factor. Usually $c_1$ and $c_2$ are set to be 2 [15]. In addition, all particles have fitness values calculated by the fitness function.

The velocities of particles in each dimension are clamped to a maximum velocity $V_{max}$. If the sum of accelerations would cause the velocity of that dimension to exceed $V_{max}$, which is a parameter specified by the user, then the velocity of that dimension is limited to $V_{max}$. This method is called the "$V_{max}$ method" [16].

In 1998, Shi and Eberhart proposed another method called the "inertia weight method." In this method, the particle updates its velocity and position with Eqs. (5) and (6), as follows:

$$v_{id}^{new} = w\,v_{id}^{old} + c_1\,\mathrm{rand}()\,(pbest - x_{id}) + c_2\,\mathrm{rand}()\,(gbest - x_{id}) \tag{5}$$

$$x_{id}^{new} = x_{id}^{old} + v_{id}^{new} \tag{6}$$

The variable $w$ plays the role of balancing the global search and local search. It can be a positive constant or even a positive linear or nonlinear function of time [17].

In 1999, Clerc also proposed another method called the "constriction factor method." In Clerc's method, the particle updates its velocity and position with Eqs. (7) and (8) [18]:

$$v_{id}^{new} = K\left[v_{id}^{old} + c_1\,\mathrm{rand}()\,(pbest - x_{id}) + c_2\,\mathrm{rand}()\,(gbest - x_{id})\right] \tag{7}$$

$$x_{id}^{new} = x_{id}^{old} + v_{id}^{new} \tag{8}$$

$$K = \frac{2}{\left|2 - \varphi - \sqrt{\varphi^{2} - 4\varphi}\right|}, \quad \text{where } \varphi = c_1 + c_2,\ \varphi > 4 \tag{9}$$

Social interaction is an important factor to improve PSO performance. To enhance the social interactions in the algorithm, Zhao et al. [29] put forward a new method of improved PSO. They propose the use of an improved adaptation strategy with enhanced social interactions for PSO. This adaptation strategy uses the information of more particles to control the mutation operation. It also extends the original formulas of PSO to search for the global optimal solution more effectively. This is similar to human society in that a group of leaders can make better decisions.

2.2.2. Particle swarm optimization-related applications

After the PSO algorithm was proposed in 1995, besides the above mentioned modifications, many different kinds of applications have been developed. PSO has been applied to train neural networks, and it can classify the XOR problem precisely. The results have shown that PSO can learn simple neural networks. Moreover, PSO was also utilized to develop the weight and structure of a neural network in 1998. It is more efficient than traditional training algorithms. Applications of PSO are gradually increasing, like in the medical treatment of human tremors in diseases such as Parkinson's disease, and in industrial automation of computer-aided design and manufacturing (CAD/CAM) [19].

In addition, PSO has been applied in clustering analysis. Cohen and Castro [26] presented a modified PSOA that featured self-organization of the updating rule for clustering analysis. In their PSOA, it is not necessary to calculate fitness value. The results show that it is better than the K-means method. Kuo et al. [27] proposed a PSKO which combined PSO-clustering with K-means. The PSKOA was evaluated in four datasets, and compared with the performance of K-means clustering, PSO-clustering and hybrid PSO. The experimental results show that the PSKO algorithms outperform other algorithms. In addition, Kuo and Lin [28] further used binary PSO to solve a clustering analysis problem and applied it to an order clustering problem. Chen and Ye proposed a PSO-based clustering algorithm [20], which they called PSO-clustering. This method used a minimal target function in PSO to automatically search for the data group center in multi-dimensional space. Compared with traditional clustering algorithms, PSO-clustering requires fewer parameter settings and avoids local optimal solutions.

In 2003, Sousa et al. applied PSO to classification problems in a decision support system [21]. Three kinds of algorithms generated from PSO, discrete PSO, linear decreasing weight PSO, and constricted PSO, were compared with a genetic algorithm. The results showed that PSO has better convergence. Eberhart and Shi showed that PSO can be applied in image recognition to search for the most likely distortion-displacement and the rotation of recognizable fingerprint features, since PSO has high feasibility and fast search capability to obtain correct variables in limited solution space. Once distortions of fingerprint are found, it is easy to recognize whether or not two fingerprint images are the same [22]. Wang et al. employed PSO to solve a traveling salesman problem (TSP) in 2003 [23]. The results indicated the search space of possible solutions diminishes and convergence can be achieved very quickly under the condition that the PSO has the same optimal solution, as is the case with traditional methods. Another study utilized PSO to find the lowest cost and best purchase strategy, or select the best vendor. This study used a heuristic algorithm to generate a set of initial solutions, and then combined the PSO algorithm to solve the problem. The results showed that the proposed method could increase efficiency and quality [24].

3. Methodology

According to the background presented in the above section, this section proposes a novel algorithm, which applies the particle swarm optimization algorithm in generating association rules from a database. A detailed explanation of the proposed algorithm is given as follows.

 


Fig. 3.  Data type transformation.

 3.1. The proposed algorithm

The proposed algorithm comprises two parts, preprocessing and mining. The first part provides procedures related to calculating the fitness values of the particle swarm. Thus, the data are transformed and stored in a binary format. Then, the search range of the particle swarm is set using the IR (itemset range) value. In the second part of the algorithm, which is the main contribution of this study, the PSO algorithm is employed to mine the association rules. First, we proceed with particle swarm encoding; this step is similar to chromosome encoding of genetic algorithms. The next step is to generate a population of particle swarms according to the calculated fitness value. Finally, the PSO searching procedure proceeds until the stop condition is reached, which means the best particle is found. The support and confidence of the best particle can represent the minimal support and minimal confidence. Thus, we can use this minimal support and minimal confidence for further association rule mining. Fig. 2 illustrates the algorithm structure.

Fig. 2.  The proposed association rule mining algorithm.

 3.2. Preprocessing of PSO association rule mining

 3.2.1. Binary transformation

This study adopts the approach proposed by Wur and Leu in 1998 [25] to transform transaction data into binary type data, each recorded and stored as either 0 or 1. This approach can accelerate the database scanning operation, and it calculates support and confidence more easily and quickly. The transformation approach is explained by an example in Fig. 3. In Fig. 3, there are five records, say T1 to T5, in the original data. Each of these records is transformed and stored as a binary type. For instance, there are a total of only four different products in the database, so four cells exist for each transaction. Take B4 as an example, this transaction only purchased products 2 and 3, so the values of cells 2 and 3 are both “1s,” whereas cells 1 and 4 are both “0s.”

 3.2.2. Calculation of IR value

This study applies the PSO algorithm in association rule discovery, as well as in the calculation of IR value which is included in chromosome encoding. The purpose of such an inclusion is to produce more meaningful association rules. Moreover, search efficiency is increased when IR analysis is utilized to decide the rule

length generated by chromosomes in particle swarm evolution. IR analysis avoids searching for too many association rules, which are meaningless itemsets in the process of particle swarm evolution. This method addresses the front and back partition points of each chromosome, and the range decided by these two points is called the IR, which is shown in Eq. (10):

$$\mathrm{IR} = \left[\log\bigl(m\,\mathrm{TransNum}(m)\bigr) + \log\bigl(n\,\mathrm{TransNum}(n)\bigr)\right]\frac{\mathrm{Trans}(m,n)}{\mathrm{TotalTrans}} \tag{10}$$

In Eq. (10), m ≠ n and m < n. “m” represents the length of the itemset and TransNum(m) means the number of transaction records containing m products. “n” is the length of the itemset, and TransNum(n) means the number of transaction records containing n products. Trans(m, n) means the number of transaction records purchasing m to n products. TotalTrans represents the number of total transactions.
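As an illustration only, Eq. (10) can be sketched in a few lines of Python. The sketch rests on assumptions that the text does not settle: the logarithm is taken to be base 10, the term m TransNum(m) is read as a product, “containing m products” is read as “containing exactly m products,” and a range whose end point has no records is scored as zero. The names `itemset_range` and `best_range` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def itemset_range(transactions, m, n):
    """IR(m -> n) following Eq. (10), under the assumptions stated above."""
    if not 1 <= m < n:
        raise ValueError("Eq. (10) requires 1 <= m < n")
    sizes = Counter(len(t) for t in transactions)  # number of records per basket size
    trans_num_m, trans_num_n = sizes[m], sizes[n]
    if trans_num_m == 0 or trans_num_n == 0:
        return 0.0  # assumption: a range with an empty end point scores zero
    # Records that purchased between m and n products (inclusive).
    trans_m_to_n = sum(c for size, c in sizes.items() if m <= size <= n)
    log_term = math.log10(m * trans_num_m) + math.log10(n * trans_num_n)
    return log_term * trans_m_to_n / len(transactions)


def best_range(transactions):
    """Return the (m, n) pair with the largest IR value."""
    max_len = max(len(t) for t in transactions)
    pairs = combinations(range(1, max_len + 1), 2)  # all pairs with m < n
    return max(pairs, key=lambda p: itemset_range(transactions, *p))


# Toy data: five transactions over four products.
baskets = [{1}, {1, 2}, {2, 3}, {1, 2, 3}, {1, 2, 3, 4}]
print(best_range(baskets), round(itemset_range(baskets, 2, 4), 4))
```

The pair returned by `best_range` plays the role of the maximum IR range that bounds the front and back partition points during encoding.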

 3.3. Application of PSO to association rule mining 

Applying PSO to association mining is the main part of this study. We use PSO as a module to mine the best fitness value. The algorithmic process is quite similar to that of genetic algorithms, but the proposed procedures include only encoding, fitness value calculation, population generation, best particle search, and termination condition. Each of the steps in the PSO algorithm and the process of generating association rules are explained as follows:

1. Encoding: According to the definition of association rule mining, the intersection of the association rule of itemset X to itemset Y (X → Y) must be empty. Items which appear in itemset X do not appear in itemset Y, and vice versa. Hence, both the front and back partition points must be given for the purpose of chromosome encoding. The itemset before the front partition point is called “itemset X,” while that between the front partition and back partition points is called “itemset Y.”
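The role of the two partition points can be sketched as follows. This is a minimal Python illustration, not the authors' implementation; it assumes that a front point f and a back point b are counted in items from the left of the chromosome, so that X is the first f items and Y is the items from position f + 1 to b. The function name `decode_rule` and the example chromosome are hypothetical.

```python
def decode_rule(chromosome, front, back):
    """Split a string-encoded chromosome into (X, Y) at the two partition points."""
    if not 1 <= front < back <= len(chromosome):
        raise ValueError("need 1 <= front < back <= len(chromosome)")
    if len(set(chromosome)) != len(chromosome):
        # Distinct items guarantee that X and Y share no item.
        raise ValueError("items in a chromosome must be distinct")
    antecedent = tuple(chromosome[:front])       # itemset X
    consequent = tuple(chromosome[front:back])   # itemset Y
    return antecedent, consequent


# A six-item chromosome with front point 2 and back point 3 encodes {2, 5} -> {3}.
x_items, y_items = decode_rule((2, 5, 3, 1, 4, 6), front=2, back=3)
print(x_items, "->", y_items)
```

Items after the back partition point do not take part in the rule under this reading.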

 

 

R.J. Kuo et al. / Applied Soft Computing 11 (2011) 326–336


Fig. 4.  Chromosome encoding relationship.

The chromosome encoding approach in this study is “string encoding.” Each value represents a different item name, which means that item 1 is encoded as “1” and item 2 is encoded as “2.” The representative value of each item is encoded into a string type chromosome by the corresponding order. An example of a chromosome is illustrated in Fig. 4. This chromosome has six different items, indicating that several possible association rules can be generated. Thereafter, the IR value calculated in the previous step is employed to choose the front and back partition points of the chromosomes. In this example, the corresponding maximum IR, IR{2 → 5}, means that the minimal front partition point and the maximum back partition point are 2 and 5, respectively.

2. Fitness value calculation: The fitness value in this study is utilized to evaluate the importance of each particle. The fitness value of each particle comes from the fitness function. Here, we employ the target function proposed by Kung [30] to determine the fitness function value as shown in Eq. (11).

$$\mathrm{Fitness}(k) = \mathrm{confidence}(k) \times \log\bigl(\mathrm{support}(k) \times \mathrm{length}(k) + 1\bigr) \tag{11}$$
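Eq. (11) translates directly into code. The sketch below is illustrative Python; the base of the logarithm is not stated in the text, so the natural logarithm is assumed here, and the function name `fitness` is hypothetical.

```python
import math


def fitness(support, confidence, length):
    """Eq. (11): confidence * log(support * length + 1), natural log assumed."""
    if not (0.0 <= support <= 1.0 and 0.0 <= confidence <= 1.0):
        raise ValueError("support and confidence must lie in [0, 1]")
    if length < 1:
        raise ValueError("a rule contains at least one item")
    return confidence * math.log(support * length + 1.0)


# With equal confidence and length, the rule with higher support scores higher.
print(fitness(0.2, 0.5, 2) > fitness(0.1, 0.5, 2))
```

Because the argument of the logarithm is at least 1, the value is never negative, and a rule with zero support scores exactly zero whatever its confidence.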

Fig. 5.  Database search method for calculating support.
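The binary search for support that Fig. 5 depicts can be approximated by the following Python sketch. It assumes the standard definitions, support(X → Y) = count(X ∪ Y) / N and confidence(X → Y) = count(X ∪ Y) / count(X), and numbers items from 1 while columns are indexed from 0. The function names and the toy transactions are hypothetical.

```python
def to_binary(transactions, n_items):
    """One row per transaction, one 0/1 cell per product (items numbered from 1)."""
    return [[1 if item in t else 0 for item in range(1, n_items + 1)]
            for t in transactions]


def count_rows(matrix, columns):
    """Count rows holding a 1 in every listed column; other columns are never read."""
    return sum(all(row[c] for c in columns) for row in matrix)


def support_and_confidence(matrix, x_items, y_items):
    x_cols = [i - 1 for i in x_items]
    xy_cols = x_cols + [i - 1 for i in y_items]
    n_xy = count_rows(matrix, xy_cols)
    n_x = count_rows(matrix, x_cols)
    support = n_xy / len(matrix)
    confidence = n_xy / n_x if n_x else 0.0  # assumption: no X rows means confidence 0
    return support, confidence


baskets = [{2, 3, 5}, {2, 5}, {1, 2, 3, 5}, {3, 4}, {2, 3, 5, 6}]
matrix = to_binary(baskets, n_items=6)
print(matrix[0])                                      # first row as 0/1 cells
print(support_and_confidence(matrix, (2, 5), (3,)))   # rule {2, 5} -> {3}
```

Only the columns named by the rule are inspected, which is the source of the time saving attributed to the binary representation.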

Fitness(k) is the fitness value of association rule type k. Confidence(k) is the confidence of association rule type k. Support(k) is the actual support of association rule type k. Length(k) is the length of association rule type k. The objective of this fitness function is maximization. The larger the particle support and confidence, the greater the strength of the association, meaning that it is an important association rule. In Eq. (11), support, confidence and itemset length must be calculated before calculating fitness value. This study uses the binary type data search method. This method first arranges the original data into a two-dimensional matrix where rows represent data records and columns represent product items. For example, to calculate the support of {2, 5} → {3}, it should first be transformed into a binary data type “011010.” Then, only search for columns whose values are 1s, say columns 1, 2, and 4. The search proceeds column by column, as shown in Fig. 5. Because columns whose values are 0, say columns 0, 3 and 5, are skipped, search time is greatly reduced and calculation efficiency increases.

3. Population generation: In order to apply the evolution process of the PSO algorithm, it is necessary to first generate the initial population. In this study, we select particles which have larger fitness values as the population. The particles in this population are called initial particles.

4. Search the best particle: First, the particle with the maximum fitness value in the population is selected as the “gbest.” The initial velocity of the particle is set to be v0 = 0, while the initial position is x0 = (2, 5, 1, 3, 4). The particle’s initial “pbest” is its initial position, and it is updated by Eqs. (3) and (4). Since the values calculated by these two equations may not always be an integer or fall in the range (1, 5), we designed a method to constrain the search. The constrained method is to calculate the distance between the particle’s new position and all the possible particles inside the constrained range before the particle’s position is updated. The particle with the smallest distance is then selected and treated as the particle’s new position. In the distance measuring function, we use traditional “Euclidean distance” as shown in Eq. (12),

$$\mathrm{dist}(x^{n}, y^{m}) = \sqrt{\sum_{i=1}^{d} \left(x_i^{n} - y_i^{m}\right)^{2}} \tag{12}$$

where $x^n$ is the position of the particle at the nth update and $y^m$ is the possible particle number m in the constrained range. In addition, d is the dimension of the search space. The nearest possible particle is selected to be the target particle’s new position. This method can prevent a particle from falling beyond the search space when its position is updated.

5. Termination condition: To complete particle evolution, the design of a termination condition is necessary. In this study, the evolution terminates when the fitness values of all particles are the same. In other words, the positions of all particles are fixed. Another termination condition occurs after 100 iterations and the evolution of the particle swarm is completed. Finally, after the best particle is found, its support and confidence are recommended as the value of minimal support and minimal confidence. These parameters are employed for association rule mining to extract valuable information.

4. Model evaluation results and discussion

This section will use the database provided by Microsoft SQL Server 2000 to verify the feasibility of the proposed algorithm. A detailed discussion is provided as follows.

4.1. Experimental platform and database

This study’s experiment was conducted in the environment of Microsoft Windows XP using an IBM compatible computer with Intel Pentium IV 1.60 GHz and 512 MB RAM. The algorithm was coded by Borland C++ Builder 6. In regard to the experimental testing database, its source was a FoodMart2000 retail transaction database embedded in a Microsoft SQL Server 2000, as illustrated in Fig. 6. Since there are different kinds of transaction databases in FoodMart2000, we only select the sales fact 1997 data table for assessment. The number of product items in this data table is 1560. In order to effectively mine meaningful association rules, this experiment categorizes the products into groups according to the product category provided by the data table. Thus, products are

 


 

classified into 34 categories, each with a corresponding product category id. In regard to data selection, 6000 customers are randomly selected along with their corresponding transaction data at different times. After arrangement, there are a total of 12,100 transaction records for these 6000 customers.

Fig. 6.  The data table of the FoodMart2000 database.

 Table 1 The results of the two-dimensional PSO association rules for FoodMart2000.

Possible results | High-frequency itemset (≥minimal support) | Association itemset (≥minimal confidence)
1 | Minimal support = 0.15628: {Snack Foods, Vegetables} | Minimal confidence = 0.40675: {Snack Foods → Vegetables}
2 | Minimal support = 0.15628: {Snack Foods, Vegetables} | Minimal confidence = 0.38497: {Snack Foods → Vegetables}, {Vegetables → Snack Foods}
3 | Minimal support = 0.09578: {Dairy, Vegetables}, {Snack Foods, Vegetables} | Minimal confidence = 0.40709: {Dairy → Vegetables}
Final results | {Snack Foods, Vegetables}, {Dairy, Vegetables} | {Snack Foods → Vegetables}, {Vegetables → Snack Foods}, {Dairy → Vegetables}

4.2. Mining results via PSO algorithm

In this experiment, every transaction record in the FoodMart2000 has 1–13 items. After calculating the IR values, we find

that IR(1 → 6) = 5.86822 is the largest. Therefore, we can generate five different dimensions of encoding types for the particle swarm. They are two dimensions, 1 → 2, three dimensions, 1 → 3 and 2 → 3, four dimensions, 1 → 4, 2 → 4, and 3 → 4, five dimensions, 1 → 5, 2 → 5, 3 → 5, and 4 → 5, and six dimensions, 1 → 6, 2 → 6, 3 → 6, 4 → 6, and 5 → 6. According to these five dimensions, we can implement the PSO mining process. An example of two dimensions is illustrated in Fig. 7. Since the computational results are different for each replication, a total of 30 replications are conducted in order to get the final experimental results. The results show that there are three possibilities as listed in Table 1. Next, we can conduct the three-dimensional PSO association rule mining. The population size is 20. The computational results can be found in Appendix A. The results indicate that the maximal setup value is 0.03652 for the minimal support threshold value. Fig. 8 also illustrates the number of high-frequency itemsets under different setups of minimal support threshold values. Because the support threshold value is smaller than 0.05, the number of high-frequency itemsets mined is too large. This means that more meaningless rules are generated. In the current experiment, since the maximal setup values of minimal support for three, four, five, and six dimensions are all smaller than 0.05, none of the mining results are used.

Fig. 7.  A demonstration of the implementation of PSO association rule mining using the FoodMart2000 database.

Fig. 8.  Number of high-frequency itemsets under different minimal supports.

4.3. Performance evaluation analysis

4.3.1. Comparison of PSO and GA

Though Section 4.2 has shown that the proposed PSO algorithm can provide very promising results, further investigation is still necessary. Thus, we compare it with the genetic algorithm proposed in [9]. For both the PSO algorithm and genetic algorithm, the number of product items and the number of transaction records are 25 and 140,000, respectively. Basically, the two algorithms were implemented under the same conditions. We attempt to ascertain whether or not computation time is related to the population size or number of evolutions, which are illustrated in Figs. 9 and 10, respectively.

Figs. 9 and 10 clearly indicate that the proposed PSO algorithm outperforms the genetic algorithm both in population size and number of evolutions. Table 2 also shows a similar outcome, showing that PSO can converge faster. This is very important as the database is very large.

Fig. 9.  Relationship between population size and computation time for PSO and GA.

Fig. 10.  Relationship between number of evolutions and computation time for PSO and GA.

 Table 2 Comparison table of PSO and GA.

Comparison | PSO | GA
Computational procedures | Using particle velocity | Crossover and mutation
Memory | Yes | No
Search message | gbest provides the movement message to the particles and the search direction follows the current pbest | The chromosomes can share the message, thus the movement toward the optimal solution is more uniform

4.3.2. Performance evaluation of the PSO algorithm’s searching capability

This subsection intends to discuss the searching performance of the PSO algorithm. The first issue is the relationship between population size and speed. The parameter setup is as follows:

(1) Number of dimensions: 2.
(2) Number of data: 12,100.
(3) Number of replications: 20.
(4) Population size: 5, 10, 20, 30, 40 and 50 particles.

The average running times for different population sizes are illustrated in Fig. 11. In addition, if most of the parameter setups are the same except that the size of population is 20, termination condition is 100 generations and the number of experimental replications is 50, then the relationship between the number of evolutions and running speed is presented in Fig. 12. In summary, though the proposed PSO algorithm for association mining requires more computation time with increasing population size, the increase is not significant. In addition, the numbers of evolutions mostly fall within the range from 1 to 10. This means that the convergence of computation is very fast. Thus, only a small number of evolutions are good enough for real application. Furthermore, in regard to the selection of threshold value setup, this study can provide the most feasible minimal support and confidence. This dramatically decreases the time consumed by trial-and-error. Thus, the proposed PSO algorithm is better than the traditional Apriori algorithm since it does not need to subjectively set up the threshold values for minimal support and confidence. This can also save computation time and enhance performance.

Fig. 11.  Relationship between population size and computation time using the FoodMart2000 database.

Fig. 12.  Relationship between number of evolutions and computation time.

 Table 3 Names and codes of industrial groups.

Code | Name | Code | Name | Code | Name
1 | Elec. and Mach. | 7 | Transportation | 13 | Plastics
2 | Glass and Ceramics | 8 | Tourism | 14 | Foods
3 | Steel and Iron | 9 | Wholesale and Retail | 15 | Chemicals
4 | Rubber | 10 | Finance | 16 | Elec. Appliance and Cable
5 | Electronics | 11 | Textiles | 17 | Paper and Pulp
6 | Construction | 12 | Cement | 18 | Others

 Table 4 Names and codes of the categories used in the second experiment.

Code | Name | Code | Name
1 | Low-Price Electronics | 7 | Medium-high-Price Electronics
2 | Steel and Iron | 8 | Chemicals
3 | Tourism | 9 | Elec. Appliance and Cable
4 | Medium-Price Electronics | 10 | High-Price Electronics
5 | Wholesale and Retail | 11 | Others
6 | Finance | |

5. Case study of the stock market

In this section, the PSO algorithm proposed for association rule mining is applied to investors’ stock purchase behavior. The data are the daily trading records of 50 individual brokerage accounts of a security firm in Taiwan from April 2005 to June 2005. The total number of securities purchased by investors during this period is 827, and they are grouped into 18 industry categories. Table 3 shows the names and codes of these industry categories. Using association rule mining, this study tries to examine the evidence of the stock preferences of individual investors.

In the first experiment, this study examines the association rules of investors’ purchasing behavior of stocks within different industrial groups. The data are transformed into binary form with 1330 trading records. After the calculation of IR values, we find that IR(1 → 2) = 5.23184 is the largest. Given the particle population of 10, this study implements the PSO mining process and the final results show seven association itemsets. They are {Steel and Iron → Electronics}, {Tourism → Electronics}, {Wholesale and Retail → Electronics}, {Finance → Electronics}, {Chemicals → Electronics}, {Elec. Appliance and Cable → Electronics} and {Others → Electronics}. According to the results of the association itemsets, investors who purchase stocks from the industrial groups including Steel and Iron, Tourism, Wholesale and Retail, Finance, Chemicals, Elec. Appliance and Cable and Others tend to purchase stocks from the Electronics industry.

However, in addition to the industrial categories, the price of stocks is another factor that may affect people’s purchasing decision. Owing to a limited amount of data, this study selects the Electronics industry to conduct the experiment. Given that the trading value of stocks of the Electronics industry accounts for 70% of the trading value of the Taiwan Stock Exchange Market on average, it is a relatively important industry group. Therefore, the Electronics industry is suitable for evaluating the relationship between the price of the stocks and investors’ purchasing behavior. In this experiment, we focus on the eight industrial groups with significant purchasing associations observed in the first experiment. The stocks of the Electronics industry are further categorized according to their stock prices. Stocks whose market prices are lower than NT$20, between NT$21 and NT$60, between NT$61 and NT$100, and over NT$100, are categorized as “Low-Price Electronics,” “Medium-Price Electronics,” “Medium-high-Price Electronics,” and “High-Price Electronics,” respectively. The categories used in the second experiment are shown in Table 4. The data are transformed into binary form with 1263 trading records. After the calculation of IR values, we find that IR(1 → 2) = 4.79758 is the largest. Given the particle population of 10, this study implements the PSO mining process and the final results show four association itemsets. They are {Low-Price Electronics → Medium-Price Electronics}, {Tourism → Medium-Price Electronics}, {Medium-high-Price Electronics → Medium-Price Electronics}, and {High-Price Electronics → Medium-Price Electronics}. According to the results of the association itemsets, most investors purchase medium-price stocks in the Electronics industry. Our results indicate that price matters when investors make purchasing decisions. Investors who purchase Tourism industry stocks tend to purchase medium-price stocks in the Electronics industry. The reason may be that the price range of “Medium-Price Electronics” is similar to that of the stocks of the Tourism industry. Moreover, those who purchase stocks in the highest price range are less likely to purchase stocks in the lowest price range, and vice versa. This implies that when making investment decisions, investors’ personal preference may lead them to purchase stocks within a certain price range. In addition, investors seem to prefer stocks in the same industry category. Those people who purchase stocks from the Electronics industry under a certain price range are more likely to purchase stocks of the same industry with a different price range.

In the third experiment, this study examines whether investors prefer stocks that are included in the Morgan Stanley Capital International (MSCI) Taiwan Index. The MSCI Taiwan Index measures the performance of a set of equity securities over time, and most of the stocks are high-quality stocks. Many international institutional investors take MSCI indices as an important reference for their investment decisions and asset allocation. Moreover, information about the stock compositions of the MSCI index is available in the mass media. Therefore, the results can give us some hints as to whether investors engage in attention-based buying. Accordingly, this experiment separates the stocks of the eight industrial groups into “MSCI stocks” and “non-MSCI stocks.” Since there are no stocks included in the MSCI index for Tourism, Wholesale & Retail, and Others industry, there are only 13 categories, as shown in Table 5. The data are transformed into binary form with 1263 trading records. After the calculation of IR values, we find that IR(1 → 2) = 5.26242 is the largest. Given the particle population of 10, this study implements the PSO mining process and the final results show four association itemsets. They are {Tourism, non-MSCI → Electronics, non-MSCI}, {Tourism, non-MSCI → Electronics, MSCI}, {Others, non-MSCI → Electronics, non-MSCI}, and {Electronics, MSCI → Electronics, non-MSCI}. According to the results of the association itemsets, investors’ purchasing decisions are less likely to be affected by the presence, or lack of presence, of the stocks in the MSCI Taiwan Index. Those investors who purchase stocks in the Tourism industry tend to purchase stocks in the Electronics industry regardless of whether they are included in the MSCI index. On the other hand, those investors who bought MSCI-included stocks from the Electronics industry tend to purchase non-MSCI stocks from the same industry, and are less likely to purchase the MSCI-included stocks of other industrial groups.

 Table 5 Names and codes of the categories used in the third experiment.

Code | Name | Code | Name
1 | Steel and Iron, non-MSCI | 8 | Others, non-MSCI
2 | Electronics, non-MSCI | 9 | Steel and Iron, MSCI
3 | Tourism, non-MSCI | 10 | Electronics, MSCI
4 | Wholesale and Retail, non-MSCI | 11 | Finance, MSCI
5 | Finance, non-MSCI | 12 | Chemicals, MSCI
6 | Chemicals, non-MSCI | 13 | Elec. Appliance and Cable, MSCI
7 | Elec. Appliance and Cable, non-MSCI | |

In summary, this section examines the stock preferences of individual investors as revealed by their purchasing activities. By using the proposed PSO algorithm, it is not necessary to test different values of support and confidence. However, the traditional Apriori algorithm has to subjectively set up the threshold values for minimal support and confidence. As an application of particle swarm optimization for association rule mining, the results imply that investors tend to purchase stocks within a certain price range, and are more likely to purchase stocks in the same industrial category. Owing to considerations of confidentiality, we can only obtain the trading records of 50 individual brokerage accounts. Given the limited amount of data, our experimental results cannot represent the investment patterns of individual investors in Taiwan. However, it does shed some light on the issues for further research.
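The price bands used in the second experiment of this section can be written as a small lookup, shown here as an illustrative Python sketch. The text leaves the exact boundaries open (for example a price of exactly NT$20), so the upper edge of each band is assumed to be inclusive; the function name `price_band` is hypothetical.

```python
def price_band(price_ntd):
    """Map an Electronics stock price in NT$ to its second-experiment category."""
    if price_ntd <= 0:
        raise ValueError("price must be positive")
    if price_ntd <= 20:       # assumption: NT$20 itself counts as low price
        return "Low-Price Electronics"
    if price_ntd <= 60:
        return "Medium-Price Electronics"
    if price_ntd <= 100:
        return "Medium-high-Price Electronics"
    return "High-Price Electronics"


for price in (15, 45, 80, 150):
    print(price, price_band(price))
```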

6. Conclusions

Following the development of information technology, searching for meaningful information in large databases has become a very important issue. That explains why association rule mining is the most popular technique in data mining. However, the traditional Apriori algorithm has a very critical drawback in that the minimal support and confidence are determined subjectively. This study has demonstrated that using the PSO algorithm can determine these two parameters quickly and objectively, thus enhancing mining performance for large databases by applying the FoodMart2000 database. By applying this algorithm to stock selection behavior, this study also further proves that the proposed PSO algorithm can find the association between industrial categories. The mining results can

 


provide security for customers’ transaction behavior and also provide a reference for the formulation of marketing strategy. Future studies can focus on testing different updating rules. Moreover, different product items may have different importance. A weighted PSO mining algorithm could be further investigated in order to provide more practical approaches for industries.

 Appendix A. Three-dimensional experimental results of FoodMart2000

Possible result | High-frequency itemset (≥minimal support) | Association itemset (≥minimal confidence)
1 | Minimal support = 0.03652: {Dairy, Snack Foods, Vegetables} | Minimal confidence = 0.41424: {Dairy, Snack Foods → Vegetables}
2 | Minimal support = 0.03239: {Can, Snack Foods, Vegetables}, {Dairy, Snack Foods, Vegetables} | Minimal confidence = 0.4237: {Can, Snack Foods → Vegetables}
3 | Minimal support = 0.02950: {Can, Snack Foods, Vegetables}, {Dairy, Snack Foods, Vegetables}, {Fruit, Snack Foods, Vegetables}, {Jams and Jellies, Snack Foods, Vegetables}, {Meat, Snack Foods, Vegetables} | Minimal confidence = 0.4204: {Can, Snack Foods → Vegetables}, {Fruit, Snack Foods → Vegetables}, {Jams and Jellies, Snack Foods → Vegetables}
4 | Minimal support = 0.025785: {Can, Snack Foods, Vegetables}, {Beverages, Snack Foods, Vegetables}, {Dairy, Snack Foods, Vegetables}, {Fruit, Snack Foods, Vegetables}, {Jams and Jellies, Snack Foods, Vegetables}, {Meat, Snack Foods, Vegetables} | Minimal confidence = 0.41106: {Can, Snack Foods → Vegetables}, {Beverages, Snack Foods, Vegetables}, {Dairy, Snack Foods, Vegetables}, {Fruit, Snack Foods → Vegetables}, {Jams and Jellies, Snack Foods → Vegetables}

References

[1] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, New York, 2000.
[2] A. Savasere, E. Omiecinski, S. Navathe, An efficient algorithm for mining association rules in large database, in: Proceedings of the 21st VLDB Conference, 1995, pp. 432–444.
[4] H. Toivonen, Sampling large databases for association rules, in: Proceedings of the 22nd VLDB Conference, 1996, pp. 134–145.
[5] S. Brin, R. Motwani, J.D. Ullman, S. Tsur, Dynamic itemset counting and implication rules for market basket data, in: Proceedings of the ACM SIGMOD, 1997, pp. 255–264.
[6] D.I. Lin, Z.M. Kedem, Pincer search: a new algorithm for discovering the maximum frequent set, in: Proceeding of the 6th International Conference on Extending Database Technology: Advances in Database Technology, 1998, pp. 105–119.
[7] B. Liu, W. Hsu, Y. Ma, Mining association rules with multiple minimal supports, in: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, San Diego, CA, USA, 1999, pp. 337–341.
[8] D.L. Yang, C.T. Pan, Y.C. Chung, An efficient hash-based method for discovering the maximal frequent set, in: Proceeding of the 25th Annual International Conference on Computer Software and Applications, 2001, pp. 516–551.
[9] S.S. Gun, Application of genetic algorithm and weighted itemset for association rule mining, Master Thesis, Department of Industrial Engineering and Management, Yuan-Chi University, 2002.
[10] M. Saggar, A.K. Agrawal, A. Lad, Optimization of association rule mining using improved genetic algorithms, in: Proceeding of the IEEE International Conference on Systems Man and Cybernetics, vol. 4, 2004, pp. 3725–3729.
[11] C. Li, M. Yang, Association rule data mining in manufacturing information system based on genetic algorithms, in: Proceeding of the 3rd International Conference on Computational Electromagnetics and Its Applications, 2004, pp. 153–156.
[12] R.J. Kuo, C.W. Shih, Association rule mining through the ant colony system for National Health Insurance Research Database in Taiwan, Computers and Mathematics with Applications 54 (11–12) (2007) 1303–1318.
[13] R.J. Kuo, S.Y. Lin, C.W. Shih, Discovering association rules through ant colony system for medical database in Taiwan, to appear, International Journal of Expert Systems with Applications 33 (November (3)) (2007).
[14] Particle Swarm Optimization: Tutorial, http://www.swarmintelligence.org/tutorials.php.
[15] M.P. Song, G.C. Gu, Research on particle swarm optimization: a review, in: Proceedings of the IEEE International Conference on Machine Learning and Cybernetics, 2004, pp. 2236–2241.
[16] R.C. Eberhart, J. Kennedy, A new optimizer using particle swarm theory, in: Proceedings of the 6th International Symposium on Micro Machine and Human Science, Nagoya, Japan, 1995, pp. 39–43.
[17] Y. Shi, R.C. Eberhart, A modified particle swarm optimizer, in: Proceedings of the IEEE International Conference on Evolutionary Computation, Piscataway, 1998, pp. 69–73.
[18] M. Clerc, The swarm and the queen: towards a deterministic and adaptive particle swarm optimization, in: Proceedings of the Congress of Evolutionary Computation, Washington, 1995, pp. 1951–1957.
[19] R.C. Eberhart, Y. Shi, Particle swarm optimization: developments, applications and resources, in: Proceedings of the IEEE Congress on Evolutionary Computation, 2001, pp. 81–86.
[20] C.Y. Chen, F. Ye, Particle swarm optimization algorithm and its application to clustering analysis, in: Proceedings of the IEEE International Conference on Networking, Sensing and Control, Taipei, Taiwan, 2004, pp. 21–23.
[21] T. Sousa, A. Neves, A. Silva, Swarm optimization as a new tool for data mining, in: Proceedings of the International Parallel and Distributed Processing Symposium, 2003.
[22] C.J. Tzan, A methodology for intelligent recognition system, Master Thesis, Department of Electronic Engineering, I-So University, Taiwan, 2000.
[23] K.P. Wang, L. Huang, C.G. Zhou, W. Pang, Particle swarm optimization for traveling salesman problem, in: Proceedings of the Second International Conference on Machine Learning and Cybernetics, 2003, pp. 1583–1585.
[24] L.W. Yeh, Optimal purchasing decision for multiple products and multiple suppliers under limited supplier capacities and price discount, Master Thesis, Department of Industrial Engineering and Management, Yuan-Chi University, Taiwan, 2002.
[25] S.Y. Wur, Y. Leu, An effective Boolean algorithm for mining association rules in large databases, in: Proceedings of the 6th International Conference on Database Systems for Advanced Applications, 1998, pp. 179–186.
[26] S.C.M. Cohen, L.N.D. Castro, Data clustering with particle swarms, IEEE Congress on Evolutionary Computations (2006) 1792–1798.
[27] R.J. Kuo, M.J. Wang, T.W. Huang, Application of clustering analysis to reduce SMT setup time: a case study on an industrial PC manufacturer in Taiwan, in: Proceedings of International Conference on Enterprise Information Systems, Milan, Italy, 2008.
[28] R.J. Kuo, F.J. Lin, Application of particle swarm optimization to reduce SMT setup time for industrial PC manufacturer in Taiwan, International Journal of Innovative Computing, Information, and Control, in press.
[29] B. Zhao, C.X. Guo, B.R. Bai, Y.J. Cao, An improved particle swarm optimization algorithm for unit commitment, International Journal of Electrical Power & Energy Systems 28 (7) (2006) 482–490.
[30] S.H. Kung, Applying Genetic Algorithm and Weight Item to Association Rule, Master Thesis, Department of Industrial Engineering and Management, Yuan Ze University, Taiwan, 2002.
[31] R. Agrawal, T. Imieliński, A. Swami, Mining association rules between sets of items in large databases, ACM SIGMOD Record 22 (2) (1993) 207–216.
