Submitted in partial fulfillment of the requirements of the degree of Bachelor of Engineering in Computer Engineering by Abhishek Aggarwal, Roll No: 37, under the guidance of Mrs. R. G. Mehta
Department of Computer Engineering, S. V. National Institute of Technology, Surat
Seminar Approval Sheet
This is to certify that the seminar entitled "Data Mining", submitted by Abhishek Aggarwal, is approved for the degree of Bachelor of Engineering in Computer Engineering conferred by S. V. National Institute of Technology, Surat.
1. 2. Date: Place:
I would like to acknowledge the contribution of certain distinguished people, without whose support and guidance this seminar would not have been concluded. I take this opportunity to express my sincere thanks and deep sense of gratitude to my seminar guide, Mrs. R. G. Mehta, for her guidance and moral support during the preparation of this seminar report. Finally, I would like to thank my family and friends for their constant support and help in each and every aspect of my seminar preparation. Abhishek Aggarwal
INTRODUCTION OF DATA MINING
DEFINITION OF DATA MINING
FOUNDATION OF DATA MINING
NEED OF DATA MINING
ARCHITECTURE OF DATA MINING
RELATION BETWEEN DATA MINING AND DATA WAREHOUSE
WORKING OF DATA MINING
DATA MINING PROCESS
DATA MINING TECHNIQUES
BUSINESS OF DATA MINING
DATA MINING CAN BRING POINT TO POINT ACCURACY IN SALE
BENEFITS AND APPLICATION OF DATA MINING
KEY SUCCESS FACTOR OF DATA MINING PROJECTS
DATA MINING PROJECT METHODOLOGY
DATA MINING TOOLS AND TECHNOLOGY
DATA MINING: A THREAT TO PRIVACY
A POSSIBLE SCENARIO OF DATA MINING IN FUTURE
CONCLUSION
Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time-consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. Most companies already collect and refine massive quantities of data. Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources, and can be integrated with new products and systems as they are brought on-line. When implemented on high-performance client/server or parallel processing computers, data mining tools can analyze massive databases to deliver answers to questions such as, "Which clients are most likely to respond to my next promotional mailing, and why?" This report provides an introduction to the basic technologies of data mining. Examples of profitable applications illustrate its relevance to today's business environment, and a basic description shows how data warehouse architectures can evolve to deliver the value of data mining to end users.
1.1 Introduction to Data Mining
Discovering hidden value in your data warehouse. The major reason that data mining has attracted a great deal of attention in the information industry in recent years is the wide availability of huge amounts of data and the immediate need to turn this data into useful information and knowledge. The information and knowledge gained can be used for applications ranging from business management and production control to market analysis.
1.1.1 Definition of Data Mining
Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time-consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. Data mining derives its name from the similarities between searching for valuable business information in a large database and mining a mountain for a vein of valuable ore. Both processes require either sifting through an immense amount of material or intelligently probing it to find exactly where the value resides.
1.1.2 The Foundations of Data Mining
Data mining techniques are the result of a long process of research and product development. This evolution began when business data was first stored on computers, continued with improvements in data access, and more recently generated technologies that allow users to navigate through their data in real time. Data mining takes this evolutionary process beyond retrospective data access and navigation to prospective and proactive information delivery. Data mining is ready for application in the business community because it is supported by three technologies that are now sufficiently mature:
• Massive data collection
• Powerful multiprocessor computers
• Data mining algorithms
Commercial databases are growing at unprecedented rates. A recent META Group survey of data warehouse projects found that 19% of respondents are beyond the […]-gigabyte level, while 59% expect to be there by the second quarter of 1996. In some industries, such as retail, these numbers can be much larger. The accompanying need for improved computational engines can now be met in a cost-effective manner with parallel multiprocessor computer technology. Data mining algorithms embody techniques that have existed for at least […] years, but have only recently been implemented as mature, reliable, understandable tools that consistently outperform older statistical methods.
In the evolution from business data to business information, each new step has built upon the previous one. For example, dynamic data access is critical for drill-through in data navigation applications, and the ability to store large databases is critical to data mining. The core components of data mining technology have been under development for decades, in research areas such as statistics, artificial intelligence, and machine learning. Today, the maturity of these techniques, coupled with high-performance relational database engines and broad data integration efforts, makes these technologies practical for current data warehousing.
1.1.3 Need of Data Mining
Today, the business climate is extremely competitive. Globalization has created new competitors, eroding margins, and smarter customers. In a data-driven, fast-paced economy, managers are being asked to make more mission-critical business decisions than ever before. The data deluge and the lack of appropriate automated knowledge-creation processes lead to a "knowledge gap". As a result, managers do not have all of the information they need to make the right decisions with confidence, and end up making risky decisions based on instinct. Clear, fact-based decision-making provides managers with the information they need to avoid making the wrong decisions. Competitors who use fact-based decision-making by mining corporate data will have a competitive knowledge advantage. Data mining is one of the most effective ways to close the knowledge gap. It helps decision-makers extract intelligence from huge amounts of data for better understanding, and makes fuller use of the data warehouse and business data mart. Data mining provides a strategic knowledge advantage in a fast-paced, changing marketplace.
1.2 Architecture for Data Mining
Data mining techniques are best applied when they are fully integrated with a data warehouse as well as with flexible interactive business analysis tools. Many data mining tools currently operate outside of the warehouse, requiring extra steps for extracting, importing, and analyzing the data. Furthermore, when new insights require operational implementation, integration with the warehouse simplifies the application of results from data mining. The resulting analytic data warehouse can be applied to improve business processes throughout the organization, in areas such as promotional campaign management, fraud detection, new product rollout, and so on. Figure 1 illustrates an architecture for advanced analysis in a large data warehouse.
Figure 1 - Integrated Data Mining Architecture
The ideal starting point is a data warehouse containing a combination of internal data tracking all customer contact, coupled with external market data about competitor activity. Background information on potential customers also provides an excellent basis for prospecting. This warehouse can be implemented in a variety of relational database systems (Sybase, Oracle, Redbrick, and so on) and should be optimized for flexible and fast data access.
An OLAP (On-Line Analytical Processing) server enables a more sophisticated end-user business model to be applied when navigating the data warehouse. The multidimensional structures allow users to analyze the data as they want to view their business, summarizing by product line, region, and other key perspectives. The Data Mining Server must be integrated with the data warehouse and the OLAP server to embed ROI-focused business analysis directly into this infrastructure. An advanced, process-centric metadata template defines the data mining objectives for specific business issues like campaign management, prospecting, and promotion optimization. Integration with the data warehouse enables operational decisions to be directly implemented and tracked. As the warehouse grows with new decisions and results, the organization can continually mine the best practices and apply them to future decisions. This design represents a fundamental shift from conventional decision support systems. Rather than simply delivering data to the end user through query and reporting software, the Advanced Analysis Server applies users' business models directly to the warehouse and returns a proactive analysis of the most relevant information. These results enhance the metadata in the OLAP Server by providing a dynamic metadata layer that represents a distilled view of the data. Reporting, visualization, and other analysis tools can then be applied to plan future actions and confirm the impact of those plans.
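The kind of summarizing by product line or region that an OLAP server performs can be illustrated with a small roll-up. This is only a sketch: the `SALES` records and the `roll_up` helper are hypothetical names invented for this example, not part of any OLAP product.

```python
from collections import defaultdict

# Hypothetical sales facts: (product_line, region, revenue).
SALES = [
    ("phones", "north", 120.0),
    ("phones", "south", 80.0),
    ("modems", "north", 50.0),
    ("modems", "south", 70.0),
    ("phones", "north", 30.0),
]

def roll_up(facts, dimension):
    """Sum revenue along one dimension (0 = product line, 1 = region)."""
    totals = defaultdict(float)
    for fact in facts:
        totals[fact[dimension]] += fact[2]
    return dict(totals)

by_product = roll_up(SALES, 0)  # the business viewed by product line
by_region = roll_up(SALES, 1)   # the same facts viewed by region
```

The same fact table yields different summaries depending on the dimension chosen, which is the essence of viewing the business from several perspectives.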
1.3 Relation between Data Warehouse and Data Mining
Data warehouses store the information used in the data mining process. They provide a consolidated source for data from numerous other databases. Data warehouses are systems intended for storing massive amounts of data in a central location, allowing the use of access, reporting and analysis tools to interpret the data. Since a data warehouse has consistent data definitions and includes the contents of many databases, it can be used to support decision-making and planning, as well as data mining tools.
1.4 Working of Data Mining
How exactly is data mining able to tell you important things that you didn't know, or what is going to happen next? The technique that is used to perform these feats is called modeling. Modeling is simply the act of building a model in one situation where you know the answer and then applying it to another situation where you don't. For instance, if you were looking for a sunken Spanish galleon on the high seas, the first thing you might do is research the times when Spanish treasure had been found by others in the past. You might note that these ships often tend to be found off the coast of Bermuda, that there are certain characteristics to the ocean currents, and that certain routes were likely taken by the ships' captains in that era. You note these similarities and build a model that includes the characteristics common to the locations of these sunken treasures. With this model in hand, you sail off looking for treasure where your model indicates it is most likely to be, given similar situations in the past. Hopefully, if you've got a good model, you find your treasure.
This act of model building is thus something that people have been doing for a long time, certainly before the advent of computers or data mining technology. What happens on computers, however, is not much different from the way people build models. Computers are loaded with lots of information about a variety of situations where an answer is known, and then the data mining software must run through that data and distill the characteristics that should go into the model. Once the model is built, it can be used in similar situations where you don't know the answer.
For example, say that you are the director of marketing for a telecommunications company and you'd like to acquire some new long distance phone customers. You could just randomly go out and mail coupons to the general population, just as you could randomly sail the seas looking for sunken treasure. In neither case would you achieve the results you desired, and of course you have the opportunity to do much better than random: you could use the business experience stored in your database to build a model. As the marketing director, you have access to a lot of information about all of your customers: their age, sex, credit history and long distance calling usage. The good news is that you also have a lot of information about your prospective customers: their age, sex, credit history, etc. Your problem is that you don't know the long distance calling usage of these prospects (since they are most likely now customers of your competition). You'd like to concentrate on those prospects who have large amounts of long distance usage. You can accomplish this by building a model.
The goal in prospecting is to make some calculated guesses about the missing information based on the model that we build. For instance, a simple model for a telecommunications company might be: 98% of my customers who make more than […],000/year spend more than […] on long distance. This model could then be applied to the prospect data to try to tell something about the proprietary information that this telecommunications company does not currently have access to. With this model in hand, new customers can be selectively targeted. Test marketing is an excellent source of data for this kind of modeling. Mining the results of a test market representing a broad but relatively small sample of prospects can provide a foundation for identifying good prospects in the overall market.
If someone told you that he had a model that could predict customer usage, how would you know if he really had a good model? The first thing you might try would be to ask him to apply his model to your customer base, where you already know the answer. With data mining, the best way to accomplish this is by setting aside some of your data in a vault to isolate it from the mining process. Once the mining is complete, the results can be tested against the data held in the vault to confirm the model's validity. If the model works, its observations should hold for the vaulted data.
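The "vault" idea can be sketched in a few lines. Everything here is an assumption made for illustration: the synthetic customers, the 60,000 income level used to generate them, and the one-rule threshold model are hypothetical and are not taken from the report.

```python
import random

def make_customers(n, seed=0):
    """Generate hypothetical (income, heavy_user) records for illustration."""
    rng = random.Random(seed)
    customers = []
    for _ in range(n):
        income = rng.uniform(20_000, 100_000)
        # Assumed ground truth: higher earners are heavy users, with 5% noise.
        heavy = (income > 60_000) != (rng.random() < 0.05)
        customers.append((income, heavy))
    return customers

def accuracy(cut, records):
    """Fraction of records where 'income > cut' matches the known answer."""
    return sum((income > cut) == heavy for income, heavy in records) / len(records)

def fit_threshold(training):
    """Pick the income cut-off that classifies the training records best."""
    return max((income for income, _ in training),
               key=lambda cut: accuracy(cut, training))

customers = make_customers(400)
vault, training = customers[:100], customers[100:]  # the vault is never mined
cut = fit_threshold(training)
vault_accuracy = accuracy(cut, vault)  # honest check on unseen data
```

If the accuracy on the vault is close to the accuracy on the mined data, the model's observations hold for data it never saw; a large drop would suggest the model merely memorized the training records.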
1.4.1 Data Mining Processes
From a process-oriented view, there are three classes of data mining activity: discovery, predictive modeling and forensic analysis, as shown in the figure below.
Discovery is the process of looking in a database to find hidden patterns without a predetermined idea or hypothesis about what the patterns may be. In other words, the program takes the initiative in finding what the interesting patterns are, without the user thinking of the relevant questions first.
In predictive modeling, patterns discovered from the database are used to predict the future. Predictive modeling thus allows the user to submit records with some unknown field values, and the system will guess the unknown values based on previous patterns discovered from the database. While discovery finds patterns in data, predictive modeling applies the patterns to guess values for new data items.
Forensic analysis is the process of applying the extracted patterns to find anomalous or unusual data elements. To discover the unusual, we first find what is the norm, and then we detect those items that deviate from the norm by more than a given threshold. Discovery helps us find "usual knowledge," but forensic analysis looks for unusual and specific cases.
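As a minimal sketch of forensic analysis, the norm can be taken as the mean of the data and the deviation measured in standard deviations. This is one simple choice among many; the `find_anomalies` helper, the threshold of 2.0 and the sample amounts are assumptions made for this example.

```python
from statistics import mean, pstdev

def find_anomalies(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    centre = mean(values)      # the "norm"
    spread = pstdev(values)    # typical deviation from the norm
    if spread == 0:
        return []              # all values identical: nothing is unusual
    return [v for v in values if abs(v - centre) / spread > threshold]

# Hypothetical daily transaction amounts for one account.
amounts = [42, 38, 45, 40, 41, 39, 44, 43, 400, 37]
unusual = find_anomalies(amounts)  # only the 400 transaction is flagged
```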
1.4.2 Data Mining Techniques
Data mining has three major components: Clustering or Classification, Association Rules and Sequence Analysis.
Classification
The clustering techniques analyze a set of data and generate a set of grouping rules that can be used to classify future data. The mining tool automatically identifies the clusters by studying the patterns in the training data. Once the clusters are generated, classification can be used to identify the particular cluster to which an input belongs. For example, one may classify diseases and provide the symptoms that describe each class or subclass.
Association
An association rule is a rule that implies certain association relationships among a set of objects in a database. In this process we discover a set of association rules at multiple levels of abstraction from the relevant set(s) of data in a database. For example, one may discover a set of symptoms often occurring together with certain kinds of diseases and further study the reasons behind them.
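Association rules are commonly evaluated with two measures, support and confidence. The sketch below computes both for a symptom-and-disease rule; the patient records are hypothetical, and the two helper functions are written for this example rather than taken from any library.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent); assumes the antecedent occurs at least once."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical patient records: each is a set of observed symptoms and diagnoses.
records = [
    {"fever", "cough", "flu"},
    {"fever", "cough", "flu"},
    {"fever", "rash"},
    {"cough"},
    {"fever", "cough", "cold"},
]
# Rule under test: {fever, cough} -> {flu}
rule_support = support(records, {"fever", "cough", "flu"})          # 2 of 5 records
rule_confidence = confidence(records, {"fever", "cough"}, {"flu"})  # 2 of the 3 with both symptoms
```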
In sequential analysis, we seek to discover patterns that occur in sequence. This deals with data that appear in separate transactions (as opposed to data that appear in the same transaction, as in the case of association), e.g. a shopper buys item A in the first week of the month and then buys item B in the second week.
Neural Nets and Decision Trees
For any given problem, the nature of the data will affect the techniques you choose. Consequently, you'll need a variety of tools and technologies to find the best possible model. Classification models are among the most common, so the more popular ways of building them are explained here. Classification typically involves at least one of two workhorse statistical techniques: logistic regression (a generalization of linear regression) and discriminant analysis. However, as data mining becomes more common, neural nets and decision trees are also getting more consideration. Although complex in their own way, these methods require less statistical sophistication on the part of the user.
Neural nets use many parameters (the nodes in the hidden layer) to build a model that takes and combines a set of inputs to predict a continuous or categorical variable. The value from each hidden node is a function of the weighted sum of the values from all the preceding nodes that feed into it. The process of building a model involves finding the connection weights that produce the most accurate results by "training" the neural net with data. The most common training method is back-propagation, in which the output result is compared with known correct values. After each comparison, the weights are adjusted and a new result computed. After enough passes through the training data, the neural net typically becomes a very good predictor.
Decision trees represent a series of rules that lead to a class or value. For example, you may wish to classify loan applicants as good or bad credit risks. The figure below shows a simple decision tree that solves this problem. Armed with this tree and a loan application, a loan officer could determine whether an applicant is a good or bad credit risk. An individual with "Income > […],000" and "High Debt" would be classified as a "Bad Risk," whereas an individual with "Income < […],000" and "Job > 5 Years" would be classified as a "Good Risk."
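The loan example can be written as a pair of nested rules. This is a sketch under stated assumptions: the income cut-off of 40,000 is a hypothetical value, and the two branches not described in the text (low debt, and five or fewer years in the job) are filled in with the natural opposite outcome.

```python
INCOME_CUTOFF = 40_000  # hypothetical threshold chosen for this example

def classify_applicant(income, high_debt, years_in_job):
    """Walk a two-level decision tree and return 'Good Risk' or 'Bad Risk'."""
    if income > INCOME_CUTOFF:
        # Higher-income branch: the debt level decides.
        return "Bad Risk" if high_debt else "Good Risk"
    # Lower-income branch: job stability decides.
    return "Good Risk" if years_in_job > 5 else "Bad Risk"

# The two cases described in the text:
high_income_high_debt = classify_applicant(50_000, True, 1)
low_income_long_job = classify_applicant(30_000, False, 6)
```

Each path from the root to a leaf reads as one plain rule, which is why a loan officer can apply such a tree by hand.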
Decision trees have become very popular because they are reasonably accurate and, unlike neural nets, easy to understand. Decision trees also take less time to build than neural nets. Neural nets and decision trees can also be used to perform regressions, and some types of neural nets can even perform clustering.
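The weighted-sum computation performed by each node of a neural net, described earlier in this section, can be sketched as a single forward pass. The sigmoid squashing function, the network shape and the weights below are assumptions for illustration; the weights are untrained, and back-propagation is not shown.

```python
import math

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def node_output(inputs, weights, bias):
    """Value of one node: a function of the weighted sum of its incoming values."""
    return sigmoid(sum(w * v for w, v in zip(weights, inputs)) + bias)

def forward(inputs, hidden_layer, output_node):
    """One forward pass: inputs -> hidden nodes -> single output."""
    hidden = [node_output(inputs, w, b) for w, b in hidden_layer]
    weights, bias = output_node
    return node_output(hidden, weights, bias)

# Hypothetical, untrained weights chosen only to make the example run.
hidden_layer = [([0.5, -0.4], 0.1), ([-0.3, 0.8], 0.0)]
output_node = ([1.2, -0.7], 0.05)
prediction = forward([1.0, 2.0], hidden_layer, output_node)
```

Training would repeatedly compare `prediction` with a known correct value and adjust the weights to shrink the difference.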
1.5 The Business of Data Mining
For the past several years, corporations have been inundated with data. Globalization and technological advances have transformed the way companies conduct business. Every day companies receive data from customers, vendors, employees and others that needs to be managed and mined for greater insight. Every business needs to manage and mine its data to gain a greater understanding of customer preferences and behavior, products and services, so that decision-makers can make profitable decisions faster than the competition. Every industry, from finance to telecommunications to pharmaceuticals, needs to extract intelligence from its data. Transactional Web data is growing exponentially as millions of "surfers" interact with companies, from customer support to e-commerce transactions. The data explosion has led to a growing need for a data warehouse strategy to manage terabytes of data, so that data miners can access corporate knowledge easily. Typically, data warehouse projects are complex and can cost companies millions of dollars to store and manage data in a central repository.
Today many companies have data warehouses available and to some degree operational. Now the question becomes, "Who ordered this?" or "What are you going to do with all that data?" or "What does it mean in business terms?" For most companies, confidence in the quality of the stored data and knowledge about its business interpretation is often missing. Most companies do not use their data warehouses effectively and have not discovered efficient ways to harness the power of their data. Data mining can provide the ROI needed to make data warehouse strategy and implementation pay. Data mining is essential to help decision-makers, from senior management to line managers, transform data into actionable business knowledge. Data mining uses statistical analysis and predictive models to uncover hidden patterns, trends, and relationships in big datasets to make more profitable decisions.
Data mining makes use of "good old statistics" but is much more than that.
1.5.1 Data Mining Can Bring Pinpoint Accuracy to Sales
Data warehousing, the practice of creating huge, central stores of customer data that can be used throughout the enterprise, is becoming more and more commonplace. But data warehouses are useless if companies don't have the proper applications for accessing and using the data.
Two popular types of applications that leverage companies' investments in data warehousing are data mining and campaign management software. Data mining enables companies to identify trends within the data warehouse (such as "families with teenagers are likely to have two phone lines," in the case of a telephone company's data). Campaign management software enables them to leverage these trends via highly targeted and automated direct marketing campaigns (such as a telemarketing campaign intended to sell second phone lines to families with teenagers). Data mining and campaign management have been successfully deployed by hundreds of Fortune […] companies around the world, with impressive results. But recent advances in technology have enabled companies to couple these technologies more tightly, with the following benefits: increased speed with which they can plan and execute marketing campaigns; increased accuracy and response rates of campaigns; and higher overall marketing return on investment.
Data mining automates the detection of patterns in a database and helps marketing professionals improve their understanding of customer behavior, and then predict behavior. For example, a pattern might indicate that married males with children are twice as likely to drive a particular sports car as married males with no children. A marketing manager for an auto manufacturer might find this somewhat surprising pattern quite valuable. The data mining process can model virtually any customer activity. The key is to find patterns relevant to current business problems. Typical patterns that data mining uncovers include which customers are most likely to drop a service, which are likely to purchase merchandise or services, and which are most likely to respond to a particular offer.
The data mining process results in the creation of a model. A model embodies the discovered patterns and can be used to make predictions for records for which the true behavior is unknown. These predictions, usually called scores, are numerical values that are assigned to each record in the database and indicate the likelihood that the customer will exhibit a particular behavior. These numerical values are used to select the most appropriate prospects for a targeted marketing campaign.
Campaign management and data mining, when closely integrated, are potent tools. Campaign management software enables companies to deliver timely, pertinent, and coordinated offers to customers and prospects, and also manages and monitors customer communications across all channels. In addition, it automates and integrates the planning, execution, assessment and refinement of possibly tens to hundreds of highly segmented campaigns running monthly, weekly, daily or intermittently. Dynamic scoring avoids manual integration of scores with the database and eliminates the need to score an entire database. Instead, dynamic scoring marks only relevant customer subsets, and only when needed. This shrinks marketing cycle times and assures fresh, up-to-date results. Once a model is in the campaign management system, the user can start to build marketing campaigns based upon it simply by choosing it from a menu of options.
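Scoring and dynamic scoring can be illustrated with a toy model for the second-phone-line campaign mentioned above. The score weights, the customer fields and the `select_prospects` helper are all hypothetical; a real model would be produced by the mining process rather than written by hand.

```python
def score(customer):
    """Hypothetical model: likelihood (0..1) of buying a second phone line."""
    s = 0.1
    if customer["teenagers"] > 0:
        s += 0.5
    if customer["lines"] == 1:
        s += 0.3
    return s

def select_prospects(customers, segment, cutoff):
    """Score only the customers in `segment`, at the moment the campaign needs them."""
    scored = [(score(c), c["id"]) for c in customers if segment(c)]
    return [cid for s, cid in sorted(scored, reverse=True) if s >= cutoff]

customers = [
    {"id": "a", "teenagers": 2, "lines": 1, "region": "west"},
    {"id": "b", "teenagers": 0, "lines": 1, "region": "west"},
    {"id": "c", "teenagers": 1, "lines": 2, "region": "west"},
    {"id": "d", "teenagers": 3, "lines": 1, "region": "east"},
]
# Only the western segment is scored; customer "d" is never touched.
targets = select_prospects(customers, lambda c: c["region"] == "west", cutoff=0.5)
```

Because scores are computed on request for the chosen subset, there is no stored score column to keep in step with the database.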
Any company that is creating or has created a data warehouse should be considering the use of integrated data mining and campaign management applications, which unlock the data and put it to use. By discovering customer behavior patterns and then acting upon them quickly, companies can stave off competition and increase customer retention, cross-selling and up-selling, all of which ultimately contribute to higher overall revenues.
1.6 The Benefits and Application of Data Mining
Data mining streamlines and increases the efficiency of business processes. It supports better decisions and accelerates the knowledge extraction process. Data mining can help optimize decision-making processes, reducing time-to-market for competitive advantage. However, this can only be achieved if there is a detailed analysis and understanding of the specific business issues that are to be solved with the help of data mining. There is no "silver bullet" or "one size fits all" approach.
Businesses, governments, and other organizations use data mining to look for patterns and relationships in their data so that they can make "proactive, knowledge-driven" decisions. The important thing about data mining is that it is able to predict future trends, rather than performing the retrospective functions of other types of analysis tools such as online analytical processing (OLAP) or statistical software. Organizations use data mining to target new customers and to cultivate relationships with current ones, to reduce fraud, and to study use of the Internet. Virtually any process, from pharmacology to customer service, can be studied, understood, and improved using data mining. The top three end uses of data mining are, not surprisingly, in the marketing area: customer profiling, targeted marketing, and market-basket analysis.
Attracting new customers
A company with a large data warehouse can use data mining tools to analyze its current customers, discovering the common characteristics of those who use the company's product or service the most (creating a "model" or customer profile). The company can then collect information on potential customers through census data, credit bureaus and other sources of public data, select those who have characteristics similar to those found in the model, and target those customers in its next advertising campaign. This approach reflects the current practice of niche marketing. As mass marketing becomes less popular, data mining creates the opportunity to target potential customers in much more specific ways.
Cultivating relationships with current customers
A company can use its customer data to identify the best prospects for a new product or service among its current customers. The attributes of those customers who are most likely to choose a new product can be determined by using a test mailing and analyzing the characteristics of those who respond to create a model. The model can then be applied to the full database, and those customers who fit the profile would be included in a targeted mailing campaign. Both this approach and the one above reduce marketing costs by focusing on those customers who are most profitable to the company.
In customer profiling, characteristics of good customers are identified with the goals of predicting who will become one and helping marketers target new prospects. Data mining can find patterns in a customer database that can be applied to a prospect database so that customer acquisition can be appropriately targeted. For example, by identifying good candidates for mail offers or catalogs, direct-mail marketers can reduce expenses and increase their sales. Targeting specific promotions to existing and potential customers offers similar benefits. Another common use of data mining in many organizations is to help manage customer relationships. By determining the characteristics of customers who are likely to leave for a competitor, a company can take action to retain those customers, which is usually far less expensive than acquiring new ones.
Reducing fraud
Data mining can be used to identify patterns of fraudulent credit card usage, and to find behavior patterns of risky customers or those who are most likely to exhibit patterns of fraudulent behavior. One example of this would be to establish credit risk by looking at the ratio of debt to income; another would be using data mining to determine whether a particular transaction is out of the normal range of a person's activity and flagging that transaction for verification. Fraud detection is of great interest to telecommunications firms, credit-card companies, insurance companies, stock exchanges, and government agencies. The aggregate total for fraud losses is enormous. But with data mining, these companies can identify potentially fraudulent transactions and contain the damage.
Monitoring the Internet
Internet advertising companies, as well as other web-based organizations, use "cookies" to collect data about those viewing their web sites. These cookies are used to create profiles of users in order to better target advertising. The information collected, via the user's IP address, determines the user's geographic location and the sites that the user views. Advertisers may also be able to determine the user's company name, and the type and size of the organization. This can be combined with personal information requested on the web page, if the user chooses to fill in registration forms or logs on to use the site.
Market Analysis
Market-basket analysis helps retailers understand which products are purchased together, or by an individual over time. With data mining, retailers can determine which products to stock in which stores, and even how to place them within a store. Data mining can also help assess the effectiveness of promotions and coupons. Financial companies use data mining to determine market and industry characteristics as well as to predict individual company and stock performance. Another interesting application is in the medical field: data mining can help predict the effectiveness of surgical procedures, diagnostic tests, medications, service management, and process control.
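As a rough illustration of the market-basket idea, the sketch below counts how often each pair of products appears in the same basket and reports that share as the pair's support. The baskets and the function name `pair_support` are invented for this example; a production system would more likely use a dedicated association-rule algorithm.

```python
from collections import Counter
from itertools import combinations


def pair_support(baskets):
    """Return {(item_a, item_b): fraction of baskets that contain both items}."""
    counts = Counter()
    for basket in baskets:
        # Sort the de-duplicated items so each pair has one canonical order.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    total = len(baskets)
    return {pair: count / total for pair, count in counts.items()}


# Hypothetical purchase data, one list per shopping basket.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "milk", "eggs"],
]

support = pair_support(baskets)
# Print the pairs from most to least frequently bought together.
for pair, share in sorted(support.items(), key=lambda kv: -kv[1]):
    print(pair, share)
```

Pairs with high support are candidates for joint placement or joint promotion; the threshold that counts as "high" is a business decision, not something the code can settle.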
Others
• A pharmaceutical company can analyze its recent sales force activity and the results to improve targeting of high-value physicians and determine which marketing activities will have the greatest impact in the next few months. The data needs to include competitor market activity as well as information about the local health care systems. The results can be distributed to the sales force via a wide-area network that enables the representatives to review the recommendations from the perspective of the key attributes in the decision process. The ongoing, dynamic analysis of the data warehouse allows best practices from throughout the organization to be applied in specific sales situations.
• A credit card company can leverage its vast warehouse of customer transaction data to identify customers most likely to be interested in a new credit product. Using a small test mailing, the attributes of customers with an affinity for the product can be identified. Recent projects have indicated more than a 20-fold decrease in costs for targeted mailing campaigns over conventional approaches.
• A diversified transportation company with a large direct sales force can apply data mining to identify the best prospects for its services. Using data mining to analyze its own customer experience, this company can build a unique segmentation identifying the attributes of high-value prospects. Applying this segmentation to a general business database such as those provided by Dun & Bradstreet can yield a prioritized list of prospects by region.
• A large consumer package goods company can apply data mining to improve its sales process to retailers. Data from consumer panels, shipments, and competitor activity can be applied to understand the reasons for brand and store switching. Through this analysis, the manufacturer can select promotional strategies that best reach its target customer segments.
Each of these examples has a clear common ground. They leverage the knowledge about customers implicit in a data warehouse to reduce costs and improve the value of customer relationships. These organizations can now focus their efforts on the most important (profitable) customers and prospects, and design targeted marketing strategies to best reach them.

To understand this clearly, take as an example a direct mailing campaign from creation to delivery. More than one third of the campaign's success is accounted for by the quality of the selected target prospects. The remaining two thirds is related to the creative (brochure, pictures, colors, etc.) and product features.
Generating a good target list can increase response and purchase rates, and with them corporate revenue. Data mining is the ideal way to develop selection models. As an example, think of a statistical model based on an Artificial Neural Network that has been trained to predict the likelihood (or score) that a given prospect will purchase the product. Here three attributes are used to make the prediction: the customer's age, gender, and salary. Purchase rates will increase because only prospects with high scores are targeted.
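The scoring step can be sketched in a few lines. The example below is only an illustration: it stands in a single logistic unit for the trained neural network, and the weights, the bias, the 0.5 threshold and the prospect records are all invented rather than taken from any real model.

```python
import math

# Hypothetical weights standing in for a trained model (assumed, not from the report).
WEIGHTS = {"age": 0.03, "is_female": 0.4, "salary_k": 0.02}
BIAS = -3.0


def purchase_score(age, is_female, salary_k):
    """Map age, gender flag and salary (in thousands) to a score between 0 and 1."""
    z = (BIAS
         + WEIGHTS["age"] * age
         + WEIGHTS["is_female"] * is_female
         + WEIGHTS["salary_k"] * salary_k)
    return 1.0 / (1.0 + math.exp(-z))  # logistic function


def select_targets(prospects, threshold=0.5):
    """Keep only the names of prospects whose score reaches the threshold."""
    return [p["name"] for p in prospects
            if purchase_score(p["age"], p["is_female"], p["salary_k"]) >= threshold]


# Invented prospect records for illustration.
prospects = [
    {"name": "A", "age": 25, "is_female": 0, "salary_k": 30},
    {"name": "B", "age": 50, "is_female": 1, "salary_k": 90},
    {"name": "C", "age": 40, "is_female": 0, "salary_k": 60},
]

targets = select_targets(prospects)
print(targets)
```

In practice the weights would come from training on the responses to a test mailing, and the threshold would be chosen from the campaign budget rather than fixed in advance.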
Very high positive impacts on the purchase rate have been observed in various data mining projects. An increase of the size given in the example above is entirely realistic, and even higher values have been measured.

To show the effect data mining has on a business, consider the following example. You can compute the customer lifetime value (CLV) of a given customer portfolio in a CRM context. As an approximation we assume that the whole lifetime has a duration of three years. We start by defining the business scenario before introducing the supporting data mining processes.
When you apply data mining processes you can expect the following results:
• Reduced churn rate: Data mining attrition models reveal which customers are the most profitable. Profitable customers can be targeted with retention programs for continued loyalty, reducing a company's churn rate.
• Increased turnover per customer: Higher-performing cross-selling and customer acquisition result from predictive models that target the customers with the highest likelihood of purchasing a given product. These models also help spot not only likely purchasers but also good purchasers (those who not only buy but buy a lot).
• Reduced costs: Better targeting helps reduce marketing budgets and increases returns.

The next table makes some realistic assumptions about increasing the retention rate by [?].25%, increasing customer turnover by 5%, and reducing cost by 7%. The resulting gains in the customer lifetime value are significant.
The cost of a data mining project would be easily covered by the projected ROI. This shows the impact data mining has when introduced in a company: it generates sustained growth and a substantial return on investment.
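To make the lifetime-value reasoning concrete, here is a small sketch of a three-year calculation. It assumes a very simple model (yearly margin weighted by the chance that the customer is still active, no discounting), and every number in it is invented: the baseline revenue, cost and retention rate, as well as the retention uplift. Only the 5% turnover increase and 7% cost reduction echo the assumptions quoted above.

```python
def lifetime_value(revenue, cost, retention, years=3):
    """Sum the yearly margin, weighted by the probability the customer is still active.

    Year 0 counts in full; each later year is scaled by retention ** year.
    """
    return sum((revenue - cost) * retention ** year for year in range(years))


# Hypothetical baseline per customer and year.
base = lifetime_value(revenue=1000.0, cost=400.0, retention=0.80)

# Same customer after data mining: +5% turnover, -7% cost,
# and an assumed retention improvement of two percentage points.
improved = lifetime_value(revenue=1000.0 * 1.05, cost=400.0 * 0.93, retention=0.82)

gain = improved / base - 1.0
print(f"baseline CLV: {base:.2f}")
print(f"improved CLV: {improved:.2f}")
print(f"relative gain: {gain:.1%}")
```

The point of the sketch is that modest changes to three inputs compound: a few percent on retention, turnover and cost together move the lifetime value by a noticeably larger percentage.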
Financial Services Example
Data mining offers significant ROI for the financial services industry. Assume our goal is to support a direct mail campaign for sales of specific mutual funds. Each customer contact has a certain cost associated with it, which in this example is $1.5 per piece of mail sent out to a customer (price per contact). More differentiated cost models arise if, for instance, a multi-channel approach is taken in which some customers are contacted through the call center, others by e-mail, and yet others by direct mail. Each contact type has a different cost, which has to be reflected in the cost function. Here we follow the simple example. On the other side we have revenue from each purchase; in this case we assume a fixed average revenue per purchase and year.

The data mining process provides a statistical model that assigns to each customer a probability of purchasing the mutual fund. If we put all customers in descending order according to this probability, the top candidates for purchase appear at the top. Depending on model performance and on our business objectives, we can select the top percentage of customers to be targeted by a specific mutual funds campaign. Apart from the target list based on data mining, we can also provide a target list according to traditional selection methods that do not use advanced analytics but reasoning such as: "include all customers between ages 18 and 25 and domiciled in region X because we think this group could have potential".

Below we give an example of the model results and their financial implications, taking the cost and revenue figures as defined above. For this campaign we decided to target only the top 20th percentile of our customers and expect to sell to 1572 customers, i.e. a purchase rate of 3.9%, for a profit of approximately $39,000. As we execute the campaign we track the results and arrive at the actual figures for the traditional and model selections.
Here you can clearly see the financial impact: an increase in the purchase rate with respect to the traditional selection can result in a corresponding increase in profit.
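The selection and profit arithmetic of this example can be sketched as follows. The cost of 1.5 per contact and the 1572 expected buyers follow the text; the 40,000 contacts and the revenue of 63 per purchase are assumptions chosen only because they roughly reproduce the quoted 3.9% purchase rate and profit of about 39,000. The customer scores are invented.

```python
def top_percent(scored, percent):
    """Return the highest-scoring `percent` percent of (customer_id, score) pairs."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return ranked[: len(ranked) * percent // 100]


def campaign_profit(contacts, buyers, cost_per_contact, revenue_per_purchase):
    """Revenue from buyers minus the cost of contacting everyone on the list."""
    return buyers * revenue_per_purchase - contacts * cost_per_contact


# Invented model scores: probability that each customer buys the fund.
scored = [("c1", 0.02), ("c2", 0.31), ("c3", 0.07), ("c4", 0.12), ("c5", 0.01)]
target_list = top_percent(scored, 20)

# Assumed campaign size and revenue per purchase (not given in the report).
contacts, buyers = 40_000, 1572
rate = buyers / contacts
profit = campaign_profit(contacts, buyers,
                         cost_per_contact=1.5, revenue_per_purchase=63.0)

print(target_list)
print(f"purchase rate: {rate:.1%}, profit: {profit:,.0f}")
```

With a multi-channel campaign, `cost_per_contact` would become a per-customer value depending on the channel used, which is the more differentiated cost function mentioned above.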
1.7 Key Success Factors for Data Mining Projects
Today many companies have invested millions of dollars in building data warehouse solutions to store corporate data. Data mining can help companies extract intelligence from this data to gain a greater understanding of market conditions, build customer relationships, and increase ROI. However, some companies find that data mining projects are not delivering the expected return on investment.

There are several reasons why corporate data mining projects fail. Companies need a customized data warehouse and data mining solution to solve business problems. Generic "one size fits all" approaches often fail in practice. There is no "silver bullet". You will always have to enrich these technical platforms with business-specific knowledge to obtain the information you need to make competitive decisions.

Successful data mining projects require consideration of various business issues at different levels within an enterprise. Data mining affects a company in many different ways, from aligning with the strategic business planning process to integrating with the company's technical infrastructure. Data mining should always have an immediate impact on business performance; these business and technical issues therefore need to be discussed before launching a major data mining initiative. Whatever your current situation is, there is always a good way to begin a successful data mining initiative, for example:
• Pick the low hanging fruit first: Discover where the greatest business pain is and where immediate results could be achieved.
• Start with what is currently available; improvise and be pragmatic: Use the data that is available now. You do not need to build a data warehouse first. Ask for immediate ROI.
• Improve incrementally: Learn from your first experiences with data mining and incrementally evolve into a more sophisticated solution.

The following issues will help you address the continuous evolution of your data mining initiative from conception to delivery.
1.8 Data Mining Project Methodology
The key to successful data mining is using a proven methodology that includes the following dimensions:
• Choose a customizable approach: There are several standardized "one size fits all" approaches offered on the market. They give the impression of early returns, which they might even be able to deliver. However, for mid- to long-term success they have limited business impact. Insufficient customization to integrate business issues into the solution severely restricts a data mining investment.
• An evolutionary approach to delivery provides early wins: Adopting a pragmatic approach towards data mining projects can be successful. Early wins are mandatory for any such project to receive internal buy-in from senior management and business units. A clear, systematic process for delivering a large solution must address the added business value of the solution. Ideally this process delivers tangible results at intervals of (at most) three months. It is critical to keep in mind the final vision you want to achieve and the roadmap leading to it.
• Create actionable results from data mining: Strategic knowledge drives actionable decision-making that can reduce corporate risk and improve profitability. For example, greater knowledge can provide a list of likely high-risk, high-debt customers to be called when certain business conditions or signals arise in the data.
• Understand the business processes, gaps and objectives: This is crucial for guaranteeing that data mining will really leverage and enhance the processes in the way it should. It requires a clear understanding of the business processes and of how data mining fits into them as a sub-process serving larger corporate benefits.
1.9 Data Mining Technology and Tools
Choosing the correct technology and tools for data mining is essential. The following list shows some of the issues to be considered in making the right choice:
• Robust analytics: Tools for advanced analytics must be increasingly tolerant of faulty input or handling because users are often non-technical. Robustness is therefore a synonym for comfort of tool handling.
• Scalable analytics: In order to cope with today's data flood you need flexible, scalable tools that meet the increasing demand for efficient handling of larger data sets.
• Pre-packaged data models: If available, use tools that include specialized data models for data mining. This reduces the time required for doing this type of data modeling internally. It is generally better to use a pre-packaged model and customize it to your company's needs.
• Automated data mining increases productivity: Be sure to choose a data mining platform that deploys statistical models efficiently. Automating business processes increases productivity. Data mining solutions can perform iterative tasks, freeing skilled quantitative analysts to analyze new and relevant business issues.
• Interactive data mining to explore new opportunities: Data mining offers a platform for interactive ad hoc exploration to reveal new opportunities and previously unknown patterns in business behavior. To this end you need a powerful workbench for creating new insight.
• Data quality: The importance of data quality is often underestimated. A thorough assessment of data quality is essential for successful data mining solutions. The common rule "garbage in, garbage out" holds in such an environment. Analytic processes and tools need to be able to handle data quality issues.
1.10 Data Mining: A Threat to Privacy
The Canadian Information Processing Society (CIPS) includes in the definition of personal information "the individual's telephone number which is generally publicly available, as well as sensitive information about [the] individual's age, sex, sexual orientation, medical, criminal, and education history, or financial and welfare transactions. Personal information would also include biometric information, such as blood type, fingerprints and genetic makeup".

Way of Collecting Personal Information
Consumers produce data in the process of conducting their daily business: using bank machines, paying with debit and credit cards, using loyalty cards, borrowing money, writing cheques, renting videos and cars, making phone calls, sending e-mail and browsing the Web. Businesses encourage the transition from print to electronic transactions by making tasks more convenient, and by providing discounts and bonuses in exchange for personal information. These electronic transactions may be collected by large organizations in their data warehouses, and may be subject to data mining. The conflict with privacy occurs when the collection of data takes place without the knowledge or consent of the individual, when the information is used in ways that the individual is not aware of, or when the information is disclosed to others without the express consent of the individual.
Consumers can do the following things to ensure their privacy is maintained:
• Ask to see a business's privacy or confidentiality policy. Assess it against your expectations of how you want your personal information to be handled. If the policy does not meet your expectations, contact the business and inform it of your expectations. If no policy exists, inform the business that you expect respectful and fair handling of your personal information.
• Give only the minimum amount of personal information needed to complete a transaction. If you are in doubt about the relevance of any information that is requested, ask why it is needed, and ask that all the uses of the requested information be identified.

Security in Data Mining
There are a number of issues relating to data mining that need to be addressed by those who collect information.

Data Quality
Data must be relevant for the purpose for which it is to be used. It should be accurate, complete and up to date. Factors that affect this are the accuracy of input and the steps taken to ensure that the data is clean. Clean data refers to the data's age and accuracy.

Purpose Specification
The reason that data is being collected should be specified when the data is collected. Subsequent use of the data should be limited to those purposes or to other purposes that are compatible with those specified.

Use Limitation
Data should not be used for purposes other than those specified unless the individual gives consent or the use is required by law. Data collected should be relevant to and sufficient for the purpose, but not excessive. Data mining conflicts with this principle by taking data collected for one purpose and using it for another, secondary purpose. A means of requesting permission to perform data mining must be developed.

Openness
The business or organization must be open about developments, practices, and policies relating to personal data. It should be possible for an individual to determine the nature and existence of the data held about him/herself, the purpose of the data's use, and the identity of the data controller who is accountable for the data's accuracy. Customers must be informed that their data is being used for data mining purposes.

Individual Participation
An individual should have the right to confirm whether the data controller has information about him/herself and to find out what that information is. This should be done in a reasonable length of time, at reasonable or no charge, in a reasonable manner, and in a form that is intelligible to the individual. If the request is denied, the individual should be given the reason and should be able to challenge the denial. He/she should be able to challenge the data and have it erased, corrected, completed or amended as necessary.
1.11 A Possible Scenario for the Future of Data Mining
What does the future have in store for data mining? In the end, much of what is called data mining will likely end up as standard tools built into database or data warehouse software products. As a motivation for this statement, I would like to use the field of spell checking software as an example.

Just look back ten years to the infancy of computer word processing. Many companies made spell-checking software. You would usually buy a spell checker as a separate piece of software for use with whatever word processor you might have. Sometimes the spell checker wouldn't understand a particular word processor's file format. Some spell checkers might even have required you to dump your document as an ASCII file before they would check the spelling (on the ASCII file). In that case, you would have had to make corrections manually in the original document. Eventually the spell checkers became more user friendly and understood every possible document format. Functionality also increased. The future of spell checking probably looked pretty rosy.

So, where are the spell checking companies today? Where is the spell checking software? If you look at your local computer store you won't find much there. Instead you will find that your new word processor comes with a built-in spell checker. As word processor software increased in sophistication and functionality, it was a natural progression to include spell checking in the standard system.

The future of data mining may very well parallel the history of spell checking. The functionality of database marketing products will increase to integrate with relational database products (no more dumping an RDBMS into a flat file!) and with key DSS application environments. It will stress the business problem rather than the technology, and present the process to the user in a friendly manner. Database marketing will start losing some of the hype and begin to provide real value to users. This will make database marketing an important business in and of itself. The larger RDBMS and data warehouse companies have already expressed an interest in integrating data mining into their database products. In the end, this new market and its business opportunities will drive mainstream database companies to database marketing. Ten years from now there may be only a few independent data mining companies left in existence. The real survivors will likely be the ones with the foresight to develop a strong relationship with the mainstream database industry.
Database marketing software applications will have a tremendous impact on how business is done in the future. Although the core data mining technology is here today, developers need to take what already exists and turn it into something that business users can work with. The successful database marketing applications will combine data mining technology with a thorough understanding of business problems and present the results in a way that the user can understand. At that point the knowledge contained in a database will be understood by people who can turn what is known into what can be done.

Comprehensive data warehouses that integrate operational data with customer, supplier, and market information have resulted in an explosion of information. Competition requires timely and sophisticated analysis on an integrated view of the data. However, there is a growing gap between more powerful storage and retrieval systems and the users' ability to effectively analyze and act on the information they contain. Both relational and OLAP technologies have tremendous capabilities for navigating massive data warehouses, but brute force navigation of data is not enough. A new technological leap is needed to structure and prioritize information for specific end-user problems. Data mining tools can make this leap. Quantifiable business benefits have been proven through the integration of data mining with current information systems, and new products are on the horizon that will bring this integration to an even wider audience of users.

Data mining offers great promise in helping organizations uncover hidden patterns in their data. However, data mining tools must be guided by users who understand the business, the data, and the general nature of the analytical methods involved. Realistic expectations can yield rewarding results across a wide range of applications, from improving revenues to reducing costs. Building models is only one step in knowledge discovery. It is vital to collect and prepare the data properly and to check models against the real world. The "best" model is often found after building models of several different types and trying out various technologies or algorithms.

The data mining area is still relatively young, and tools that support the whole of the data mining process in an easy-to-use fashion are rare. One of the most important issues facing researchers is the use of techniques against very large data sets. Most of the mining techniques originate in Artificial Intelligence, where they are generally executed against small sets of data that fit in memory. In data mining applications, however, these techniques must be applied to data held in very large databases. Approaches to this problem include the use of parallelism and the development of new database-oriented techniques. Much work is still required before data mining can be successfully applied to large data sets. Only then will the true potential of data mining be realized.