DATA MINING: WHAT IS IT AND HOW IS IT USED?
By Barry Keating
Data mining is a way to gain market intelligence from a huge amount of data ... the problem today is not the lack of data, but how to learn from it ... in data mining, the data tell the story, but it is up to you how to use that information.
Data mining is used to search the mounds of data collected over time for valuable information that could be used in decision making. The information may be certain patterns and/or relationships that exist in the data.
With data mining, a retail store may find that certain products sell more in one channel of distribution than in others; that certain products sell together; that certain products sell more in one geographical location than in others; and that certain products sell when a certain event occurs. Wal-Mart, for example, has found that the sales of beer increase when a hurricane is imminent. This means that they have to hold more than the usual supply of beer when a hurricane is expected. With data mining, a financial analyst would like to know the characteristics of a company becoming insolvent; human resource managers would like to know the characteristics of a successful prospective employee; credit card departments would like to know which potential customers are more likely to pay back their debt and, when a credit card is swiped, which transaction is fraudulent and which one is legitimate; direct marketers would like to know which customers purchase which types of products; booksellers like Amazon would like to know which customers purchase which types of books (fiction, detective stories, or any other kind); and so on.
With this type of information available, decision makers can make better choices. Human resource people can hire the right individuals. Credit departments can target those prospective customers who are less prone to become delinquent and/or less likely to be involved in fraudulent activities. Direct marketers can target those customers who are more likely to purchase their products. With the insight gained from data mining, businesses may wish to reconfigure their product offering and/or emphasize specific features of a product.
These are not the only uses of data mining. Police use this tool to determine when and where a crime is likely to occur, and what the nature of that crime would be. Organized stock exchanges detect fraudulent activities with data mining. Pharmaceutical companies mine data to predict the efficacy of compounds as well as to uncover new chemical entities that may be useful for a particular disease. The airline industry uses it to predict which flights are likely to be delayed (well before the flight is scheduled to depart). Weather analysts determine weather patterns with data mining to predict when there will be rain, sunshine, a hurricane, or snow. Nonprofit organizations use data mining to predict the likelihood of individuals making a donation for a certain cause. The uses of data mining are far reaching and its benefits may be quite significant.
BARRY KEATING Dr. Keating is the Jesse H. Jones Professor of Business Economics at the University of Notre Dame. He specializes in understanding how not-for-profit organizations function; more specifically, how they respond to incentives, changes in revenue and cost conditions, and changes in regulatory mechanisms. He is widely published and is the co-author of the book Business Forecasting, published by McGraw-Hill. He was a Heritage Foundation Fellow (1992-1996), is a Heartland Institute Research Fellow, and serves on the Board of Advisors of both the Indiana Policy Review Group and the Institute of Business Forecasting.
DATA MINING IN HISTORICAL PERSPECTIVE
The job of a data miner is to extract valuable information from the data available. The approach is to find patterns and relationships present in the data, which, of course, is not new. Indeed, people have looked for patterns in almost every endeavor they have undertaken. Early man looked for patterns in the sky at night, in the movement of stars and planets, and in the weather. Modern man still hunts for
THE JOURNAL OF BUSINESS FORECASTING, FALL 2008
patterns in early election returns, in global temperature changes, and in the sales data of new and mature products.
Over the last 25 years or so, there has been a gradual evolution from data processing to data mining. In the 1960s, businesses routinely collected data and processed it using database management techniques that allowed an orderly listing and tabulation of the data as well as some query activity. On-line Transaction Processing (OLTP) became routine, data retrieval from stored data became faster and more efficient because of the availability of new and better storage devices, and data processing became quicker and more efficient because of advancements in computer technology. Database management advanced rapidly to include highly sophisticated query systems, and became popular not only in business applications but also in scientific inquiries. Databases began to grow at previously unheard-of rates. The amount of data in all of the world's databases is now estimated to double in less than two years.
Businesses currently deploy what we call data warehouses and data marts. A "data warehouse" is a firm's repository of historic data, containing information on every relevant activity that occurred in the past. A "data mart," on the other hand, is a subset of a data warehouse. It holds some special information, or information that has been grouped to help businesses make better decisions. Data used here are usually derived from a data warehouse. The first organized use of such large databases started with Online Analytical Processing (OLAP). Data mining tools use and analyze the data that exist in databases, data marts, and data warehouses.
Researchers have been doing data mining for a long time, though they called it by different names. Some called it Exploratory Data Analysis; others called it Business Intelligence, Data Driven Discovery, Deductive Learning, Discovery Science, or Knowledge Discovery in Databases (KDD).
DATA MINING VS. FORECASTING MODELS
A decade ago, one of the most pressing problems of a forecaster was lack of data. But today we are overwhelmed with data. Data are collected whenever you swipe a credit card at the grocery store or retail store, or whenever you make a purchase or click for some information. The amount of data being collected is exploding with no end in sight. The presence of large, cheap storage devices makes it possible to store every piece of information or data produced. The pressing problem now is not the lack of data, but how to learn from it. Data mining is the set of tools that helps to accomplish that.
Data mining is quite different from the statistical modeling used in forecasting. In traditional statistical forecasting, forecasters first determine the patterns that exist in a dataset; that is, whether it has a trend, seasonality, cyclicality, etc., and then they search for a model that captures those patterns. Remember, each model captures a certain pattern. If we know the pattern, then we know which model to use. The model may be regression, exponential smoothing, or any other model. In the case of new product forecasting, for example, forecasters may assume that new products will "roll out" with a life cycle that looks like an "s-curve." In that sense, they impose the pattern on the data because they believe that pattern describes the data.
But with data mining, the tables are turned. Forecasters do not presume what pattern or family of patterns may fit a particular set of data. Many times they don't even know what kind of pattern they are going to find. This may sound strange, because data mining is not a method of attacking the data; on the contrary, it is a way of learning from the data and then using that information. For that reason, we need a new mindset in data mining. We must be open to finding relationships and patterns that we never imagined existed. We let the data tell us the story rather than impose a model on the data that we feel will replicate the actual patterns.
Perhaps the most common misconception about data mining is that it will automatically extract all the valuable information embedded in a database without any intervention on the part of the researcher. In fact, every large database contains numerous sets of patterns, which may very well be as many in number as the items in the database itself. But most of the patterns could be irrelevant to the researcher's task. So the researcher, before he or she starts the data mining process, sets goals and research parameters. This way he or she will eliminate many patterns that are irrelevant to the task and concentrate on the ones that are important and pertinent. As in traditional statistical forecasting, the researcher remains an important part of the analysis.
Data mining usually uses very large datasets, oftentimes far larger than the datasets used in business forecasting. You may well be familiar with many of the statistical tools available to us, but the tools used in data mining, and the way they are used, differ from those of traditional business forecasting. They are discussed in the next section. The premise of data mining is that there is a great deal of information locked up in a database; it's up to the researcher to unlock it. Data mining tools help to unlock that information.
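To make the contrast concrete, here is a minimal sketch of the traditional approach, in which the forecaster chooses the model before looking for anything else in the data. Simple exponential smoothing is used only as an illustration; the function name, the smoothing constant of 0.5, the choice to seed with the first observation, and the sales figures are all assumptions made for this sketch and do not come from the article.

```python
def simple_exponential_smoothing(series, alpha):
    """Return one-step-ahead forecasts for each period plus the next one.

    forecasts[t] is the forecast for series[t]; the final element is the
    forecast for the period after the last observation.
    """
    if not series:
        raise ValueError("series must contain at least one observation")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")

    # Assumption: seed the first forecast with the first observation.
    forecasts = [series[0]]
    for actual in series:
        # New forecast = weighted average of the latest actual and the
        # previous forecast; the forecaster has fixed this form in advance.
        forecasts.append(alpha * actual + (1 - alpha) * forecasts[-1])
    return forecasts


# Hypothetical monthly sales figures, for illustration only.
sales = [100, 110, 105, 120]
forecasts = simple_exponential_smoothing(sales, alpha=0.5)
print(forecasts)  # [100, 100.0, 105.0, 105.0, 112.5]
```

Whatever the data look like, this model can only ever return a smoothed level. That is the sense in which the forecaster imposes a pattern, whereas a data mining tool is asked to report whatever patterns it finds.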
TOOLS OF DATA MINING
There are four categories of data mining tools:
1. Prediction
2. Classification
3. Clustering Analysis
4. Association Rules Discovery
Prediction Tools: These are methods derived from traditional statistical forecasting for predicting a variable's value.
Classification Tools: The most commonly used tools in data mining, classification tools attempt to distinguish different classes of objects or actions. For instance, a particular credit card transaction may be either normal or fraudulent. These tools could classify it as one or the other, thereby saving the credit card company a considerable amount of money. In another instance, an advertiser may want to know which aspect of its promotion is most appealing to consumers. Is it the price, quality, and/or reliability of a product? Maybe it is a special feature that is missing on competitive products. Classification tools help provide such information on all the products, making it possible to use the advertising budget in the most effective manner.
Clustering Analysis Tools: These are very powerful tools for clustering products into groups that naturally fall together. These groups are identified by the program and not by the researchers. Most of the clusters discovered may not be useful in business decisions. However, researchers may find one or two that are extremely important, the ones the company can take advantage of. The most common use for clustering tools is probably in what economists refer to as "market segmentation." In market segmentation, a company divides the customer base into segments based on characteristics such as income, wealth, geographic location, lifestyle, and so on. Each segment is then treated with a different marketing approach, one suited precisely to that particular segment.
Association Rules Discovery: Here the data mining tools discover associations; e.g., what kinds of books certain groups of people read, what products certain groups of people purchase, what movies certain groups of people watch, etc. Businesses use this information in targeting their markets. Netflix, for example, recommends movies based on movies people have watched and rated in the past. Amazon does much the same thing in recommending books.
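As a rough illustration of association rules discovery, the sketch below scores rules between pairs of items using three standard measures: support (the share of baskets containing both items), confidence (how often the second item appears when the first does), and lift (confidence relative to the second item's overall share). The article gives no formulas or data, so the baskets, the item names, the function name, and the 0.4 minimum-support threshold are all hypothetical choices made for this example.

```python
from itertools import combinations


def pair_rules(transactions, min_support=0.4):
    """Return (lhs, rhs, support, confidence, lift) for frequent item pairs."""
    baskets = [set(t) for t in transactions]
    n = len(baskets)
    items = sorted(set().union(*baskets))

    # Share of baskets that contain each single item.
    item_support = {i: sum(i in b for b in baskets) / n for i in items}

    rules = []
    for a, b in combinations(items, 2):
        # Share of baskets that contain both items.
        both = sum(a in bk and b in bk for bk in baskets) / n
        if both < min_support:
            continue  # pair is too rare to be worth reporting
        for lhs, rhs in ((a, b), (b, a)):
            confidence = both / item_support[lhs]
            lift = confidence / item_support[rhs]
            rules.append((lhs, rhs, both, confidence, lift))
    return rules


# Hypothetical shopping baskets, for illustration only.
transactions = [
    ["beer", "batteries", "water"],
    ["beer", "water"],
    ["beer", "batteries"],
    ["bread", "water"],
    ["bread", "milk"],
]

rules = pair_rules(transactions)
for lhs, rhs, support, confidence, lift in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, "
          f"confidence={confidence:.2f}, lift={lift:.2f}")
```

A lift above 1 suggests the two items appear together more often than their individual popularity would predict. Commercial tools search over larger item sets and far more transactions, but the measures are the same in spirit.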
SOFTWARE USED IN DATA MINING
The two major pieces of software used at the moment for data mining are SPSS Clementine and SAS Enterprise Miner. Both packages include an array of data mining techniques that encompass all four of the data mining categories mentioned above. Current users of SAS products are probably inclined to use SAS Enterprise Miner when selecting a data mining program, and current users of SPSS products are likely to choose SPSS Clementine; however, both data mining programs are actually stand-alone packages that do not require the user to be a customer of the other offerings from a particular company. Both software packages can import and process data in virtually any format.
In addition, there are several smaller software packages that are used only for data mining. These packages do not have a full set of statistical tools like SAS and SPSS. Newcomers to data mining can use an Excel add-in called XLMiner™, which is available from Resampling Stats, Inc. This Excel add-in lets potential data miners not only examine the usefulness of such a program but also get familiar with some of the data mining techniques. Although Excel is quite limited in the number of observations it can handle, it certainly can give a flavor of how valuable data mining can be to a company. Software for data mining also gives a full range of diagnostic statistics that can be used in assessing the information obtained from these techniques. Once the user recognizes the value of data mining, he or she can switch to more advanced software that can handle large datasets.
Data mining is here to stay, and forecasters will find it a useful extension to their toolkit. Although these tools are new, their use is similar to that of the statistical techniques currently used in forecasting.