Data Warehousing OLAP and Data Mining

Published on January 2017 | Categories: Documents | Downloads: 75 | Comments: 0 | Views: 700
Copyright © 2006, New Age International (P) Ltd., Publishers
Published by New Age International (P) Ltd., Publishers
All rights reserved.
No part of this ebook may be reproduced in any form, by photostat, microfilm,
xerography, or any other means, or incorporated into any information retrieval
system, electronic or mechanical, without the written permission of the publisher.
All inquiries should be emailed to [email protected]
PUBLISHING FOR ONE WORLD
NEW AGE INTERNATIONAL (P) LIMITED, PUBLISHERS
4835/24, Ansari Road, Daryaganj, New Delhi - 110002
Visit us at www.newagepublishers.com
ISBN (13) : 978-81-224-2705-9
Dedicated
To
Beloved Friends
PREFACE
This book is intended for Information Technology (IT) professionals who have been
hearing about or have been tasked to evaluate, learn or implement data warehousing
technologies. This book also aims at providing fundamental techniques of KDD and Data
Mining as well as issues in practical use of Mining tools.
Far from being just a passing fad, data warehousing technology has grown much in scale
and reputation in the past few years, as evidenced by the increasing number of products,
vendors, organizations, and yes, even books, devoted to the subject. Enterprises that have
successfully implemented data warehouses find it strategic and often wonder how they ever
managed to survive without it in the past. Also Knowledge Discovery and Data Mining (KDD)
has emerged as a rapidly growing interdisciplinary field that merges together databases,
statistics, machine learning and related areas in order to extract valuable information and
knowledge in large volumes of data.
Volume-I is intended for IT professionals, who have been tasked with planning, managing,
designing, implementing, supporting, maintaining and analyzing the organization’s data
warehouse.
The first section introduces the Enterprise Architecture and Data Warehouse concepts,
the basis of the reasons for writing this book.
The second section focuses on three of the key People in any data warehousing initiative:
the Project Sponsor, the CIO, and the Project Manager. This section is devoted to
addressing the primary concerns of these individuals.
The third section presents a Process for planning and implementing a data warehouse
and provides guidelines that will prove extremely helpful for both first-time and experienced
warehouse developers.
The fourth section focuses on the Technology aspect of data warehousing. It lends order
to the dizzying array of technology components that you may use to build your data
warehouse.
The fifth section opens a window to the future of data warehousing.
The sixth section deals with On-Line Analytical Processing (OLAP), by providing different
features to select the tools from different vendors.
Volume-II shows how to achieve success in understanding and exploiting large databases
by uncovering valuable information hidden in data; learn what data has real meaning and
what data simply takes up space; examining which data methods and tools are most effective
for the practical needs; and how to analyze and evaluate obtained results.
S. NAGABHUSHANA
ACKNOWLEDGEMENTS
My sincere thanks to Prof. P. Rama Murthy, Principal, Intell Engineering College,
Anantapur, for his able guidance and valuable suggestions - in fact, it was he who brought
my attention to the writing of this book. I am grateful to Smt. G. Hampamma, Lecturer in
English, Intell Engineering College, Anantapur and her whole family for their constant
support and assistance while writing the book. Prof. Jeffrey D. Ullman, Department of Computer
Science, Stanford University, U.S.A., deserves my special thanks for providing all the
necessary resources. I am also thankful to Mr. R. Venkat, Senior Technical Associate at Virtusa,
Hyderabad, for going through the script and encouraging me.
Last but not least, I thank Mr. Saumya Gupta, Managing Director, New Age International
(P) Limited, Publishers, New Delhi, for their interest in the publication of the book.
CONTENTS
Preface
Acknowledgements
VOLUME I: DATA WAREHOUSING
IMPLEMENTATION AND OLAP
PART I : INTRODUCTION
Chapter 1. The Enterprise IT Architecture
1.1 The Past: Evolution of Enterprise Architectures
1.2 The Present: The IT Professional’s Responsibility
1.3 Business Perspective
1.4 Technology Perspective
1.5 Architecture Migration Scenarios
1.6 Migration Strategy: How do We Move Forward?
Chapter 2. Data Warehouse Concepts
2.1 Gradual Changes in Computing Focus
2.2 Data Warehouse Characteristics and Definition
2.3 The Dynamic, Ad Hoc Report
2.4 The Purposes of a Data Warehouse
2.5 Data Marts
2.6 Operational Data Stores
2.7 Data Warehouse Cost-Benefit Analysis/Return on Investment
PART II : PEOPLE
Chapter 3. The Project Sponsor
3.1 How does a Data Warehouse Affect Decision-Making Processes?
3.2 How does a Data Warehouse Improve Financial Processes? Marketing? Operations?
3.3 When is a Data Warehouse Project Justified?
3.4 What Expenses are Involved?
3.5 What are the Risks?
3.6 Risk-Mitigating Approaches
3.7 Is Organization Ready for a Data Warehouse?
3.8 How the Results are Measured?
Chapter 4. The CIO
4.1 How is the Data Warehouse Supported?
4.2 How Does Data Warehouse Evolve?
4.3 Who should be Involved in a Data Warehouse Project?
4.4 What is the Team Structure Like?
4.5 What New Skills will People Need?
4.6 How Does Data Warehousing Fit into IT Architecture?
4.7 How Many Vendors are Needed to Talk to?
4.8 What should be Looked for in a Data Warehouse Vendor?
4.9 How Does Data Warehousing Affect Existing Systems?
4.10 Data Warehousing and its Impact on Other Enterprise Initiatives
4.11 When is a Data Warehouse not Appropriate?
4.12 How to Manage or Control a Data Warehouse Initiative?
Chapter 5. The Project Manager
5.1 How to Roll Out a Data Warehouse Initiative?
5.2 How Important is the Hardware Platform?
5.3 What are the Technologies Involved?
5.4 Are the Relational Databases Still Used for Data Warehousing?
5.5 How Long Does a Data Warehousing Project Last?
5.6 How is a Data Warehouse Different from Other IT Projects?
5.7 What are the Critical Success Factors of a Data Warehousing Project?
PART III : PROCESS
Chapter 6. Warehousing Strategy
6.1 Strategy Components
6.2 Determine Organizational Context
6.3 Conduct Preliminary Survey of Requirements
6.4 Conduct Preliminary Source System Audit
6.5 Identify External Data Sources (If Applicable)
6.6 Define Warehouse Rollouts (Phased Implementation)
6.7 Define Preliminary Data Warehouse Architecture
6.8 Evaluate Development and Production Environment and Tools
Chapter 7. Warehouse Management and Support Processes
7.1 Define Issue Tracking and Resolution Process
7.2 Perform Capacity Planning
7.3 Define Warehouse Purging Rules
7.4 Define Security Management
7.5 Define Backup and Recovery Strategy
7.6 Set Up Collection of Warehouse Usage Statistics
Chapter 8. Data Warehouse Planning
8.1 Assemble and Orient Team
8.2 Conduct Decisional Requirements Analysis
8.3 Conduct Decisional Source System Audit
8.4 Design Logical and Physical Warehouse Schema
8.5 Produce Source-to-Target Field Mapping
8.6 Select Development and Production Environment and Tools
8.7 Create Prototype for this Rollout
8.8 Create Implementation Plan of this Rollout
8.9 Warehouse Planning Tips and Caveats
Chapter 9. Data Warehouse Implementation
9.1 Acquire and Set Up Development Environment
9.2 Obtain Copies of Operational Tables
9.3 Finalize Physical Warehouse Schema Design
9.4 Build or Configure Extraction and Transformation Subsystems
9.5 Build or Configure Data Quality Subsystem
9.6 Build Warehouse Load Subsystem
9.7 Set Up Warehouse Metadata
9.8 Set Up Data Access and Retrieval Tools
9.9 Perform the Production Warehouse Load
9.10 Conduct User Training
9.11 Conduct User Testing and Acceptance
PART IV : TECHNOLOGY
Chapter 10. Hardware and Operating Systems
10.1 Parallel Hardware Technology
10.2 The Data Partitioning Issue
10.3 Hardware Selection Criteria
Chapter 11. Warehousing Software
11.1 Middleware and Connectivity Tools
11.2 Extraction Tools
11.3 Transformation Tools
11.4 Data Quality Tools
11.5 Data Loaders
11.6 Database Management Systems
11.7 Metadata Repository
11.8 Data Access and Retrieval Tools
11.9 Data Modeling Tools
11.10 Warehouse Management Tools
11.11 Source Systems
Chapter 12. Warehouse Schema Design
12.1 OLTP Systems Use Normalized Data Structures
12.2 Dimensional Modeling for Decisional Systems
12.3 Star Schema
12.4 Dimensional Hierarchies and Hierarchical Drilling
12.5 The Granularity of the Fact Table
12.6 Aggregates or Summaries
12.7 Dimensional Attributes
12.8 Multiple Star Schemas
12.9 Advantages of Dimensional Modeling
Chapter 13. Warehouse Metadata
13.1 Metadata Defined
13.2 Metadata are a Form of Abstraction
13.3 Importance of Metadata
13.4 Types of Metadata
13.5 Metadata Management
13.6 Metadata as the Basis for Automating Warehousing Tasks
13.7 Metadata Trends
Chapter 14. Warehousing Applications
14.1 The Early Adopters
14.2 Types of Warehousing Applications
14.3 Financial Analysis and Management
14.4 Specialized Applications of Warehousing Technology
PART V: MAINTENANCE, EVOLUTION AND TRENDS
Chapter 15. Warehouse Maintenance and Evolution
15.1 Regular Warehouse Loads
15.2 Warehouse Statistics Collection
15.3 Warehouse User Profiles
15.4 Security and Access Profiles
15.5 Data Quality
15.6 Data Growth
15.7 Updates to Warehouse Subsystems
15.8 Database Optimization and Tuning
15.9 Data Warehouse Staffing
15.10 Warehouse Staff and User Training
15.11 Subsequent Warehouse Rollouts
15.12 Chargeback Schemes
15.13 Disaster Recovery
Chapter 16. Warehousing Trends
16.1 Continued Growth of the Data Warehouse Industry
16.2 Increased Adoption of Warehousing Technology by More Industries
16.3 Increased Maturity of Data Mining Technologies
16.4 Emergence and Use of Metadata Interchange Standards
16.5 Increased Availability of Web-Enabled Solutions
16.6 Popularity of Windows NT for Data Mart Projects
16.7 Availability of Warehousing Modules for Application Packages
16.8 More Mergers and Acquisitions Among Warehouse Players
PART VI: ON-LINE ANALYTICAL PROCESSING
Chapter 17. Introduction
17.1 What is OLAP?
17.2 The Codd Rules and Features
17.3 The Origins of Today’s OLAP Products
17.4 What’s in a Name
17.5 Market Analysis
17.6 OLAP Architectures
17.7 Dimensional Data Structures
Chapter 18. OLAP Applications
18.1 Marketing and Sales Analysis
18.2 Click stream Analysis
18.3 Database Marketing
18.4 Budgeting
18.5 Financial Reporting and Consolidation
18.6 Management Reporting
18.7 EIS
18.8 Balanced Scorecard
18.9 Profitability Analysis
18.10 Quality Analysis
VOLUME II: DATA MINING
Chapter 1. Introduction
1.1 What is Data Mining
1.2 Definitions
1.3 Data Mining Process
1.4 Data Mining Background
1.5 Data Mining Models
1.6 Data Mining Methods
1.7 Data Mining Problems/Issues
1.8 Potential Applications
1.9 Data Mining Examples
Chapter 2. Data Mining with Decision Trees
2.1 How a Decision Tree Works
2.2 Constructing Decision Trees
2.3 Issues in Data Mining with Decision Trees
2.4 Visualization of Decision Trees in System CABRO
2.5 Strengths and Weakness of Decision Tree Methods
Chapter 3. Data Mining with Association Rules
3.1 When is Association Rule Analysis Useful?
3.2 How does Association Rule Analysis Work?
3.3 The Basic Process of Mining Association Rules
3.4 The Problem of Large Datasets
3.5 Strengths and Weakness of Association Rules Analysis
Chapter 4. Automatic Clustering Detection
4.1 Searching for Clusters
4.2 The K-means Method
4.3 Agglomerative Methods
4.4 Evaluating Clusters
4.5 Other Approaches to Cluster Detection
4.6 Strengths and Weakness of Automatic Cluster Detection
Chapter 5. Data Mining with Neural Network
5.1 Neural Networks for Data Mining
5.2 Neural Network Topologies
5.3 Neural Network Models
5.4 Iterative Development Process
5.5 Strengths and Weakness of Artificial Neural Network
VOLUME I
DATA WAREHOUSING
IMPLEMENTATION AND OLAP
PART I : INTRODUCTION
The term Enterprise Architecture refers to a collection of
technology components and their interrelationships, which are
integrated to meet the information requirements of an
enterprise. This section introduces the concept of Enterprise
IT Architectures with the intention of providing a framework
for the various types of technologies used to meet an
enterprise’s computing needs.
Data warehousing technologies belong to just one of the many
components in IT architecture. This section aims to define
how data warehousing fits within the overall IT architecture,
in the hope that IT professionals will be better positioned to
use and integrate data warehousing technologies with the
other IT components used by the enterprise.
This chapter begins with a brief look at changing business requirements and how they have,
over time, influenced the evolution of Enterprise Architectures. The InfoMotion (“Information
in Motion”) Enterprise Architecture is introduced to provide IT professionals with a framework
with which to classify the various technologies currently available.
1.1 THE PAST: EVOLUTION OF ENTERPRISE ARCHITECTURES
The IT architecture of an enterprise at a given time depends on three main factors:
• the business requirements of the enterprise;
• the available technology at that time; and
• the accumulated investments of the enterprise from earlier technology generations.
The business requirements of an enterprise are constantly changing, and the changes
are coming at an exponential rate. Business requirements have, over the years, evolved
from the day-to-day clerical recording of transactions to the automation of business processes.
Exception reporting has shifted from tracking and correcting daily transactions that have
gone astray to the development of self-adjusting business processes.
Technology has likewise advanced by delivering exponential increases in computing
power and communications capabilities. However, for all these advances in computing
hardware, a significant lag exists in the realms of software development and architecture
definition. Enterprise Architectures thus far have displayed a general inability to gracefully
evolve in line with business requirements, without either compromising on prior technology
investments or seriously limiting their own ability to evolve further.
In hindsight, the evolution of the typical Enterprise Architecture reflects the continuous,
piecemeal efforts of IT professionals to take advantage of the latest technology to improve
the support of business operations. Unfortunately, this piecemeal effort has often resulted
in a morass of incompatible components.
CHAPTER 1
THE ENTERPRISE IT ARCHITECTURE
1.2 THE PRESENT: THE IT PROFESSIONAL’S RESPONSIBILITY
Today, the IT professional continues to have a two-fold responsibility: Meet business
requirements through Information Technology and integrate new technology into the existing
Enterprise Architecture.
Meet Business Requirements
The IT professional must ensure that the enterprise IT infrastructure properly supports
a myriad of requirements from different business users, each of whom has different and
constantly changing needs, as illustrated in Figure 1.1.
[Figure: business users voicing different needs: “I need to find out why our sales in the South are dropping...”; “We need to get this modified order quickly to our European supplier...”; “Where can I find a copy of last month’s Newsletter?”; “Someone from XYZ, Inc. wants to know what the status of their order is...”]
Figure 1.1. Different Business Needs
Take Advantage of Technology Advancements
At the same time, the IT professional must also constantly learn new buzzwords, review
new methodologies, evaluate new tools, and maintain ties with technology partners. Not all
the latest technologies are useful; the IT professional must first sift through the technology
jigsaw puzzle (see Figure 1.2) to find the pieces that meet the needs of the enterprise, then
integrate the newer pieces with the existing ones to form a coherent whole.
[Figure: jigsaw puzzle pieces labeled Decision Support, Web Technology, OLAP, OLTP, Intranet, Data Warehouse, Flash Monitoring & Reporting, Legacy, and Client/Server.]
Figure 1.2. The Technology Jigsaw Puzzle
One of the key constraints the IT professional faces today is the current Enterprise IT
Architecture itself. At this point, therefore, it is prudent to step back, assess the current
state of affairs and identify the distinct but related components of modern Enterprise
Architectures.
The two orthogonal perspectives of business and technology are merged to form one
unified framework, as shown in Figure 1.3.
[Figure: the InfoMotion Enterprise Architecture diagram. Components are arranged in three layers under four business-need columns (Virtual Corp., Informational, Decisional, Operational):
Logical Client Layer: Transactional Web Scripts; Informational Web Scripts; Decision Support Applications; Flash Monitoring & Reporting; OLTP Applications; Workflow Management Clients
Logical Server Layer: Transactional Web Services; Informational Web Services; Data Warehouse; Operational Data Store; Active Database; Workflow Management Services
Legacy Layer: Legacy Systems]
Figure 1.3. The InfoMotion Enterprise Architecture
1.3 BUSINESS PERSPECTIVE
From the business perspective, the requirements of the enterprise fall into the four
categories illustrated in Figure 1.4 and described below.
Operational
Technology supports the smooth execution and continuous improvement of day-to-day
operations, the identification and correction of errors through exception reporting and
workflow management, and the overall monitoring of operations. Information retrieved
about the business from an operational viewpoint is used to either complete or optimize the
execution of a business process.
Decisional
Technology supports managerial decision-making and long-term planning. Decision-
makers are provided with views of enterprise data from multiple dimensions and in varying
levels of detail. Historical patterns in sales and other customer behavior are analyzed.
Decisional systems also support decision-making and planning through scenario-based
modeling, what-if analysis, trend analysis, and rule discovery.
Informational
Technology makes current, relatively static information widely and readily available to
as many people as need access to it. Examples include company policies, product and service
information, organizational setup, office location, corporate forms, training materials and
company profiles.
[Figure: the four business-need categories: Decisional, Virtual Corporation, Informational, and Operational.]
Figure 1.4. The InfoMotion Enterprise Architecture
Virtual Corporation
Technology enables the creation of strategic links with key suppliers and customers to
better meet customer needs. In the past, such links were feasible only for large companies
because of economies of scale. Now, the affordability of Internet technology provides any
enterprise with this same capability.
1.4 TECHNOLOGY PERSPECTIVE
This section presents each architectural component from a technology standpoint and
highlights the business need that each is best suited to support.
Operational Needs
Legacy Systems
The term legacy system refers to any information system currently in use that was built
using previous technology generations. Most legacy systems are operational in nature, largely
because the automation of transaction-oriented business processes had long been the priority
of Information Technology projects.
OPERATIONAL
• Legacy System
• OLTP Application
• Active Database
• Operational Data Store
• Flash Monitoring and Reporting
• Workflow Management (Groupware)
OLTP Applications
The term Online Transaction Processing refers to systems that automate and capture
business transactions through the use of computer systems. In addition, these applications
traditionally produce reports that allow business users to track the status of transactions.
OLTP applications and their related active databases compose the majority of client/server
systems today.
Active Databases
Databases store the data produced by Online Transaction Processing applications. These
databases were traditionally passive repositories of data manipulated by business applications.
It is not unusual to find legacy systems with processing logic and business rules contained
entirely in the user interface or randomly interspersed in procedural code.
With the advent of client/server architecture, distributed systems, and advances in
database technology, databases began to take on a more active role through database
programming (e.g., stored procedures) and event management. IT professionals are now
able to bullet-proof the application by placing processing logic in the database itself. This
contrasts with the still-popular practice of replicating processing logic (sometimes in an
inconsistent manner) across the different parts of a client application or across different
client applications that update the same database. Through active databases, applications
are more robust and conducive to evolution.
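The idea of placing processing logic in the database itself can be illustrated with a minimal sketch. It assumes an in-memory SQLite database accessed through Python's standard sqlite3 module; the tables (orders, order_audit), the triggers, and the place_order helper are hypothetical names chosen for illustration and do not come from any particular product. One trigger enforces a business rule and a second one reacts to an event, so every client application that updates the database is subject to the same logic.

```python
import sqlite3

# In-memory database used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    quantity INTEGER NOT NULL,
    status   TEXT NOT NULL DEFAULT 'OPEN'
);
CREATE TABLE order_audit (
    order_id   INTEGER,
    old_status TEXT,
    new_status TEXT
);
-- Business rule kept in the database: quantity must be positive.
CREATE TRIGGER orders_check_quantity
BEFORE INSERT ON orders
WHEN NEW.quantity <= 0
BEGIN
    SELECT RAISE(ABORT, 'quantity must be positive');
END;
-- Event management: record every change of order status.
CREATE TRIGGER orders_log_status
AFTER UPDATE OF status ON orders
WHEN OLD.status <> NEW.status
BEGIN
    INSERT INTO order_audit VALUES (OLD.order_id, OLD.status, NEW.status);
END;
""")


def place_order(connection, quantity):
    """Insert an order; the database, not the client, enforces the rule."""
    try:
        with connection:  # commits on success, rolls back on error
            connection.execute(
                "INSERT INTO orders (quantity) VALUES (?)", (quantity,)
            )
        return True
    except sqlite3.DatabaseError:
        # Raised when the trigger aborts the insert.
        return False
```

Because the rule lives in the trigger, a second client application written later would not need to repeat the quantity check in its own code.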
Operational Data Stores
An Operational Data Store or ODS is a collection of integrated databases designed to
support the monitoring of operations. Unlike the databases of OLTP applications (that are
function oriented), the Operational Data Store contains subject-oriented, volatile, and current
enterprise-wide detailed information; it serves as a system of record that provides
comprehensive views of data in operational systems.
Data are transformed and integrated into a consistent, unified whole as they are obtained
from legacy and other operational systems to provide business users with an integrated and
current view of operations (see Figure 1.5). Data in the Operational Data Store are constantly
refreshed so that the resulting image reflects the latest state of operations.
[Figure: Legacy Systems X, Y, and Z and other systems feed an “Integration and Transformation of Legacy Data” step, which populates the Operational Data Store.]
Figure 1.5. Legacy Systems and the Operational Data Store
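The integration and transformation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two legacy record layouts (fields such as CUSTNO, AMT_CENTS, and last_change) and the function names are hypothetical, and the Operational Data Store is reduced to a dictionary keyed by customer. The sketch shows records from differently structured systems being converted to one consistent format, with the store keeping only the latest state.

```python
from datetime import datetime


def from_legacy_x(record):
    """Hypothetical System X: text customer number, amount in cents."""
    return {
        "customer_id": int(record["CUSTNO"]),
        "order_total": record["AMT_CENTS"] / 100,
        "updated_at": datetime.strptime(record["UPD"], "%Y%m%d%H%M"),
        "source": "X",
    }


def from_legacy_y(record):
    """Hypothetical System Y: numeric customer id, ISO timestamps."""
    return {
        "customer_id": record["customer"],
        "order_total": float(record["total"]),
        "updated_at": datetime.fromisoformat(record["last_change"]),
        "source": "Y",
    }


def refresh_ods(ods, unified_records):
    """Keep only the most recent record per customer (current, volatile)."""
    for rec in unified_records:
        current = ods.get(rec["customer_id"])
        if current is None or rec["updated_at"] > current["updated_at"]:
            ods[rec["customer_id"]] = rec
    return ods
```

Calling refresh_ods repeatedly with newly extracted records overwrites older values, which mirrors the constantly refreshed character of the store; no history is retained here.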
Flash Monitoring and Reporting
These tools provide business users with a dashboard, that is, meaningful online information on
the operational status of the enterprise, by making use of the data in the Operational Data
Store. The business user obtains a constantly refreshed, enterprise-wide view of operations
without creating unwanted interruptions or additional load on transaction processing systems.
Workflow Management and Groupware
Workflow management systems are tools that allow groups to communicate and
coordinate their work. Early incarnations of this technology supported group scheduling,
e-mail, online discussions, and resource sharing. More advanced implementations of this
technology are integrated with OLTP applications to support the execution of business
processes.
Decisional Needs
Data Warehouse
The data warehouse concept developed as IT professionals increasingly realized that
the structure of data required for transaction reporting was significantly different from the
structure required to analyze data.
DECISIONAL
• Data Warehouse
• Decision Support Application
(OLAP)
The data warehouse was originally envisioned as a separate architectural component
that converted and integrated masses of raw data from legacy and other operational systems
and from external sources. It was designed to contain summarized, historical views of data
in production systems. This collection provides business users and decision-makers with a
cross functional, integrated, subject-oriented view of the enterprise.
The introduction of the Operational Data Store has now caused the data warehouse
concept to evolve further. The data warehouse now contains summarized, historical views
of the data in the Operational Data Store. This is achieved by taking regular “snapshots”
of the contents of the Operational Data Store and using these snapshots as the basis for
warehouse loads.
In doing so, the enterprise obtains the information required for long term and historical
analysis, decision-making, and planning.
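The snapshot-based warehouse load described above can be sketched as follows. The sketch assumes, purely for illustration, that the Operational Data Store is a list of rows with hypothetical region and sales fields and that the warehouse is a dictionary of summarized totals; the function names are likewise assumptions. Each snapshot is stamped with its date, summarized, and added to the warehouse, so history accumulates even though the store itself holds only current data.

```python
from collections import defaultdict
from datetime import date


def take_snapshot(ods_rows, snapshot_date):
    """Copy the current ODS rows, stamping each with the snapshot date."""
    return [dict(row, snapshot_date=snapshot_date) for row in ods_rows]


def load_warehouse(warehouse, snapshot):
    """Add summarized sales per (snapshot_date, region) to the warehouse."""
    totals = defaultdict(float)
    for row in snapshot:
        totals[(row["snapshot_date"], row["region"])] += row["sales"]
    warehouse.update(totals)
    return warehouse
```

A usage example: load one snapshot at the end of each month. Later changes to the store do not alter the totals already loaded, so the warehouse retains a summarized, historical view.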
Decision Support Applications
Also known as OLAP (Online Analytical Processing), these applications provide
managerial users with meaningful views of past and present enterprise data. User-friendly
formats, such as graphs and charts, are frequently employed to quickly convey meaningful
data relationships.
Decision support processing typically does not involve the update of data; however,
some OLAP software allows users to enter data for budgeting, forecasting, and “what-if ”
analysis.
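As a rough illustration of viewing data from multiple dimensions and at varying levels of detail, the sketch below aggregates a small set of sales facts along whichever dimensions the user chooses. The sample facts, the dimension names (region, product, quarter), and the roll_up function are assumptions made for this example; real OLAP products provide far richer operations.

```python
from collections import defaultdict

# Illustrative facts only; the values are made up for the example.
SAMPLE_FACTS = [
    {"region": "South", "product": "A", "quarter": "Q1", "sales": 100.0},
    {"region": "South", "product": "B", "quarter": "Q1", "sales": 40.0},
    {"region": "North", "product": "A", "quarter": "Q1", "sales": 60.0},
    {"region": "South", "product": "A", "quarter": "Q2", "sales": 80.0},
]


def roll_up(facts, dimensions, measure):
    """Sum `measure` over the facts, grouped by the chosen dimensions."""
    totals = defaultdict(float)
    for fact in facts:
        key = tuple(fact[d] for d in dimensions)
        totals[key] += fact[measure]
    return dict(totals)
```

Grouping by ["region"] gives a coarse view, grouping by ["region", "quarter"] drills down to more detail, and an empty dimension list yields the grand total.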
Informational Needs
Informational Web Services and Scripts
Web browsers provide their users with a universal tool or front-end for accessing
information from web servers. They provide users with a new ability to both explore and
publish information with relative ease. Unlike other technologies, web technology makes
any user an instant publisher by enabling the distribution of knowledge and expertise, with
no more effort than it takes to record the information in the first place.
INFORMATIONAL
• Informational Web Services
By its very nature, this technology supports a paperless distribution process. Maintenance
and update of information is straightforward since the information is stored on the web server.
Virtual Corporation Needs
Transactional Web Services and Scripts
Several factors now make Internet technology and electronic commerce a realistic option
for enterprises that wish to use the Internet for business transactions.
VIRTUAL
CORPORATION
• Transactional Web Services
• Cost. The increasing affordability of Internet access allows businesses to establish
cost-effective and strategic links with business partners. This option was originally
open only to large enterprises through expensive, dedicated wide-area networks or
metropolitan area networks.
• Security. Improved security and encryption for sensitive data now provide customers
with the confidence to transact over the Internet. At the same time, improvements
in security provide the enterprise with the confidence to link corporate computing
environments to the Internet.
• User-friendliness. Improved user-friendliness and navigability from web technology
make Internet technology and its use within the enterprise increasingly popular.
Figure 1.6 recapitulates the architectural components for the different types of business
needs. The majority of the architectural components support the enterprise at the operational
level. However, separate components are now clearly defined for decisional and informational
purposes, and the virtual corporation becomes possible through Internet technologies.
Other Components
Other architectural components are so pervasive that most enterprises have begun to
take their presence for granted. One example is the group of applications collectively known
as office productivity tools (such as Microsoft Office or Lotus SmartSuite). Components of
this type can and should be used across the various layers of the Enterprise Architecture
and, therefore, are not described here as a separate item.
OPERATIONAL
• Legacy Systems
• OLTP Application
• Active Database
• Operational Data Store
• Flash Monitoring and Reporting
• Workflow Management (Groupware)
DECISIONAL
• Data Warehouse
• Decision Support Applications (OLAP)
INFORMATIONAL
• Informational Web Services
VIRTUAL CORPORATION
• Transactional Web Services
Figure 1.6. InfoMotion Enterprise Architecture Components (Applicability to Business Needs)
1.5 ARCHITECTURE MIGRATION SCENARIOS
Given the typical path that most Enterprise Architectures have followed, an enterprise
will find itself in need of one or more of the following six migration scenarios. Each
scenario describes a need and the approach recommended for fulfilling it.
Legacy Integration
The Need
The integration of new and legacy systems is a constant challenge because of the
architectural templates upon which legacy systems were built. Legacy systems often attempt
to meet all types of information requirements through a single architectural component;
consequently, these systems are brittle and resistant to evolution.
Despite attempts to replace them with new applications, many legacy systems remain
in use because they continue to meet a set of business requirements, because they represent
significant investments that the enterprise cannot afford to scrap, or because their massive
replacement would result in unacceptable levels of disruption to business operations.
The Recommended Approach
The integration of legacy systems with the rest of the architecture is best achieved
through the Operational Data Store and/or the data warehouse. Figure 1.7 modifies
Figure 1.5 to show the integration of legacy systems.
Legacy programs that produce and maintain summary information are migrated to the
data warehouse. Historical data are likewise migrated to the data warehouse. Reporting
functionality in legacy systems is moved either to the flash reporting and monitoring tools
(for operational concerns), or to decision support applications (for long-term planning and
decision-making). Data required for operational monitoring are moved to the Operational
Data Store. Table 1.1 summarizes the migration avenues.
The Operational Data Store and the data warehouse present IT professionals with a
natural path for legacy migration. By migrating legacy systems to these two
components, enterprises can gain a measure of independence from legacy components that
were designed with old, possibly obsolete, technology. Figure 1.8 highlights how this approach
fits into the Enterprise Architecture.
[Figure: Legacy Systems 1, 2, ..., N and other systems feed an “Integration and Transformation of Legacy Data” step, which leads to the Operational Data Store and the Data Warehouse.]
Figure 1.7. Legacy Integration
Operational Monitoring
The Need
Today’s typical legacy systems are not suitable for supporting the operational monitoring
needs of an enterprise. Legacy systems are typically structured around functional or
organizational areas, in contrast to the cross-functional view required by operations monitoring.
Different and potentially incompatible technology platforms may have been used for different
systems. Data may be available in legacy databases but are not extracted in the format
required by business users. Or data may be available but may be too raw to be of use for
operational decision-making (further summarization, calculation, or conversion is required).
And lastly, several systems may contain data about the same item but may examine the data
from different viewpoints or at different time frames, therefore requiring reconciliation.
Table 1.1. Migration of Legacy Functionality to the Appropriate Architectural Component
Functionality in Legacy Systems     Should be Migrated to . . .
Summary Information                 Data Warehouse
Historical Data                     Data Warehouse
Operational Reporting               Flash Monitoring and Reporting Tools
Data for Operational Monitoring     Operational Data Store
Decisional Reporting                Decision Support Applications
[Figure: InfoMotion architecture diagram (Virtual Corp., Informational, Decisional, and
Operational columns across the Logical Client, Logical Server, and Legacy layers), shown
for Legacy Integration.]
Figure 1.8. Legacy Integration: Architectural View
The Recommended Approach
An integrated view of current, operational information is required for the successful
monitoring of operations. Extending the functionality of legacy applications to meet this
requirement would merely increase the enterprise’s dependence on increasingly obsolete
technology. Instead, an Operational Data Store, coupled with flash monitoring and reporting
tools, as shown in Figure 1.9, meets this requirement without sacrificing architectural integrity.
Like a dashboard on a car, flash monitoring and reporting tools keep business users
apprised of the latest cross-functional status of operations. These tools obtain data from the
Operational Data Store, which is regularly refreshed with the latest information from legacy
and other operational systems.
Business users are consequently able to step in and correct problems in operations
while they are still small or, better, to prevent problems from occurring altogether. Once
alerted of a potential problem, the business user can manually intervene or make use of
automated tools (i.e., control panel mechanisms) to fine-tune operational processes. Figure
1.10 highlights how this approach fits into the Enterprise Architecture.
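The dashboard idea above can be sketched in a few lines of Python: an Operational Data Store snapshot is refreshed from source extracts, and a flash report flags metrics that fall below a threshold. This is only an illustration under stated assumptions; the metric names, threshold values, and function names are hypothetical and do not come from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    metric: str
    minimum: float  # raise an alert when the observed value falls below this


def refresh_ods(ods, *source_extracts):
    """Merge the latest extracts from legacy and other systems into the ODS."""
    for extract in source_extracts:
        ods.update(extract)
    return ods


def flash_report(ods, thresholds):
    """Return alert messages for metrics that are missing or below threshold."""
    alerts = []
    for threshold in thresholds:
        value = ods.get(threshold.metric)
        if value is None:
            alerts.append(f"{threshold.metric}: no data")
        elif value < threshold.minimum:
            alerts.append(f"{threshold.metric}: {value} is below {threshold.minimum}")
    return alerts


# Hypothetical extracts from two source systems.
ods = refresh_ods({}, {"orders_shipped_today": 40}, {"stock_on_hand": 500})
alerts = flash_report(ods, [
    Threshold("orders_shipped_today", 50),
    Threshold("stock_on_hand", 100),
    Threshold("open_tickets", 0),
])
```

A business user who sees the first alert can intervene while the shortfall is still small, which is the point of regularly refreshing the Operational Data Store.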
[Figure: Legacy Systems 1 through N and other systems feed an Integration and
Transformation of Legacy Data step, which refreshes the Operational Data Store; Flash
Monitoring and Reporting tools draw on the Operational Data Store.]
Figure 1.9. Operational Monitoring
Process Implementation
The Need
In the early 1990s, the popularity of business process reengineering (BPR) caused businesses
to focus on the implementation of new and redefined business processes.
Raymond Manganelli and Mark Klein, in their book The Reengineering Handbook
(AMACOM, 1994, ISBN: 0-8144-0236-4), define BPR as “the rapid and radical redesign of
strategic, value-added business processes–and the systems, policies, and organizational
structures that support them–to optimize the work flow and productivity in an organization.”
Business processes are redesigned to achieve desired results in an optimum manner.
[Figure: InfoMotion architecture diagram (Virtual Corp., Informational, Decisional, and
Operational columns across the Logical Client, Logical Server, and Legacy layers), shown
for Operational Monitoring.]
Figure 1.10. Operational Monitoring: Architectural View
The Recommended Approach
With BPR, the role of Information Technology shifted from simple automation to enabling
radically redesigned processes. Client/server technology, such as OLTP applications serviced
by active databases, is particularly suited to supporting this type of business need. Technology
advances have made it possible to build and modify systems quickly in response to changes
in business processes. New policies, procedures and controls are supported and enforced by
the systems.
In addition, workflow management systems can be used to supplement OLTP
applications. A workflow management system converts business activities into a goal-directed
process that flows through the enterprise in an orderly fashion (see Figure 1.11). The
workflow management system alerts users through the automatic generation of notification
messages or reminders and routes work so that the desired business result is achieved in
an expedited manner.
Figure 1.12 highlights how this approach fits into the Enterprise Architecture.
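The routing and notification behavior described for workflow management systems can be illustrated with a small Python sketch. It assumes a strictly sequential process; the class name, step descriptions, and assignees are hypothetical and are not taken from any specific workflow product.

```python
from collections import deque


class Workflow:
    """Routes a work item through an ordered list of (step, assignee) pairs."""

    def __init__(self, steps):
        self.steps = list(steps)
        self.notifications = deque()  # (recipient, message) pairs, oldest first

    def start(self, item):
        """Route the item to the first step and return its position."""
        return self._route(item, 0)

    def complete(self, item, position):
        """Mark the step at `position` done and route the item onward."""
        return self._route(item, position + 1)

    def _route(self, item, position):
        if position >= len(self.steps):
            # No steps left: tell the requester that the process has finished.
            self.notifications.append(("requester", f"{item}: finished"))
            return None
        step, assignee = self.steps[position]
        # Generate the notification that alerts the next person in line.
        self.notifications.append((assignee, f"{item}: please {step}"))
        return position


wf = Workflow([
    ("review the application", "loan officer"),
    ("approve the credit", "credit manager"),
])
position = wf.start("loan #1")
position = wf.complete("loan #1", position)
position = wf.complete("loan #1", position)
```

Each completed step immediately generates the next notification, so work does not sit idle waiting for someone to notice it.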
Figure 1.11. Process Implementation
Decision Support
The Need
It is not possible to anticipate the information requirements of decision makers for the
simple reason that their needs depend on the business situation that they face. Decision-
makers need to review enterprise data from different dimensions and at different levels of
detail to find the source of a business problem before they can attack it. They likewise need
information for detecting business opportunities to exploit.
Decision-makers also need to analyze trends in the performance of the enterprise.
Rather than waiting for problems to present themselves, decision-makers need to proactively
mobilize the resources of the enterprise in anticipation of a business situation.
[Figure: InfoMotion architecture diagram (Virtual Corp., Informational, Decisional, and
Operational columns across the Logical Client, Logical Server, and Legacy layers), shown
for Process Implementation.]
Figure 1.12. Process Implementation: Architectural View
Since these information requirements cannot be anticipated, the decision maker often
resorts to reviewing pre-designed inquiries or reports in an attempt to find or derive needed
information. Alternatively, the IT professional is pressured to produce an ad hoc report from
legacy systems as quickly as possible. An unlucky IT professional will find that the data
needed for the report are scattered throughout different legacy systems. An even unluckier
one may find that the processing required to produce the report takes a toll on the operations
of the enterprise.
These delays are not only frustrating for both the decision-maker and the IT professional,
but also dangerous for the enterprise. The information that eventually reaches the
decision-maker may be inconsistent, inaccurate, or, worse, obsolete.
The Recommended Approach
Decision support applications (or OLAP) that obtain data from the data warehouse are
recommended for this particular need. The data warehouse holds transformed and integrated
enterprise-wide operational data appropriate for strategic decision-making, as shown in
Figure 1.13. The data warehouse also contains data obtained from external sources whenever
these data are relevant to decision-making.
[Figure: Legacy Systems 1 through N feed the Data Warehouse, which serves the decision
support tools: OLAP, Report Writers, EIS/DSS, Data Mining, and Alert System/Exception
Reporting.]
Figure 1.13. Decision Support
Decision support applications analyze and make data warehouse information available
in formats that are readily understandable by decision-makers. Figure 1.14 highlights how
this approach fits into the Enterprise Architecture.
Hyperdata Distribution
The Need
Past informational requirements were met by making data available in physical form
through reports, memos, and company manuals. This practice resulted in an overflow of
documents providing much data and not enough information.
Paper-based documents also have the disadvantage of becoming dated. Enterprises
encountered problems in keeping different versions of related items synchronized. There
was a constant need to update, republish and redistribute documents.
[Figure: InfoMotion architecture diagram (Virtual Corp., Informational, Decisional, and
Operational columns across the Logical Client, Logical Server, and Legacy layers), shown
for Decision Support.]
Figure 1.14. Decision Support: Architectural View
In response to this problem, enterprises made data available to users over a network
to eliminate the paper. It was hoped that users could selectively view the data whenever
they needed it. This approach likewise proved to be insufficient because users still had to
navigate through a sea of data to locate the specific item of information that was needed.
The Recommended Approach
Users need the ability to browse through nonlinear presentations of data. Web technology
is particularly suitable to this need because of its extremely flexible and highly visual
method of organizing information (see Figure 1.15).

[Figure: Examples of hyperdata: corporate forms and training materials; company profiles,
product, and service information; company policies and organizational setup.]
Figure 1.15. Hyperdata Distribution
Web technology allows users to display charts and figures; navigate through large
amounts of data; visualize the contents of database files; seamlessly navigate across charts,
data, and annotation; and organize charts and figures in a hierarchical manner. Users are
therefore able to locate information with relative ease. Figure 1.16 highlights how this
approach fits into the Enterprise Architecture.
[Figure: InfoMotion architecture diagram (Virtual Corp., Informational, Decisional, and
Operational columns across the Logical Client, Logical Server, and Legacy layers), shown
for Hyperdata Distribution.]
Figure 1.16. Hyperdata Distribution: Architectural View
Virtual Corporation
The Need
A virtual corporation is an enterprise that has extended its business processes to
encompass both its key customers and suppliers. Its business processes are newly redesigned;
its product development or service delivery is accelerated to better meet customer needs and
preferences; its management practices promote new alignments between management and
labor, as well as new linkages among enterprise, supplier and customer. A new level of
cooperation and openness is created and encouraged between the enterprise and its key
business partners.
The Recommended Approach
Partnerships at the enterprise level translate into technological links between the
enterprise and its key suppliers or customers (see Figure 1.17). Information required by
each party is identified, and steps are taken to ensure that this data crosses organizational
boundaries properly. Some organizations seek to establish a higher level of cooperation with
their key business partners by jointly redesigning their business processes to provide greater
value to the customer.
Internet and web technologies are well suited to supporting redesigned, transactional
processes, thanks to decreasing Internet costs, improved security measures, and improved
user-friendliness and navigability. Figure 1.18 highlights how this approach fits into the
Enterprise Architecture.
[Figure: Supplier, Enterprise, and Customer linked together.]
Figure 1.17. Virtual Corporation
[Figure: InfoMotion architecture diagram (Virtual Corp., Informational, Decisional, and
Operational columns across the Logical Client, Logical Server, and Legacy layers), shown
for the Virtual Corporation.]
Figure 1.18. Virtual Corporation: Architectural View
1.6 MIGRATION STRATEGY: HOW DO WE MOVE FORWARD?
The strategies presented in the previous section enable organizations to move from
their current technology architectures into the InfoMotion Enterprise Architecture. This
section describes the tasks for any migration effort.
Review the Current Enterprise Architecture
As simple as this may sound, the starting point is a review of the current Enterprise
Architecture. It is important to know what is already available before
planning further improvements.
The IT department or division should have this information readily available, although
it may not necessarily be expressed in terms of the architectural components identified
above. A short and simple exercise of mapping the current architecture of an enterprise to
the architecture described above should quickly highlight any gaps in the current architecture.
Identify Information Architecture Requirements
Knowing that the Enterprise IT Architecture has gaps is not sufficient. It is important
to know whether these can be considered real gaps when viewed within the context of the
enterprise’s requirements. Gaps should cause concern only if the absence of an architectural
component prevents the IT infrastructure from meeting present requirements or from
supporting long-term strategies.
For example, if transactional web scripts are not critical to an enterprise given its
current needs and strategies, there should be no cause for concern.
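The two steps above (mapping the current architecture to the reference architecture, then keeping only the gaps that matter) can be expressed as simple set operations. In the Python sketch below, the component names are those shown in the InfoMotion architecture figures; the sample `current` and `required` sets and the function names are hypothetical.

```python
# Components named in the InfoMotion Enterprise Architecture figures.
REFERENCE_COMPONENTS = {
    "Transactional Web Scripts", "Informational Web Scripts",
    "Decision Support Applications", "Flash Monitoring & Reporting",
    "OLTP Applications", "Workflow Management Clients",
    "Transactional Web Services", "Informational Web Services",
    "Data Warehouse", "Operational Data Store", "Active Database",
    "Workflow Management Services", "Legacy Systems",
}


def architecture_gaps(current):
    """Components in the reference architecture that the enterprise lacks."""
    return REFERENCE_COMPONENTS - set(current)


def real_gaps(current, required):
    """Gaps that matter: missing components that requirements actually need."""
    return architecture_gaps(current) & set(required)


# Hypothetical enterprise: what it has today and what its strategy requires.
current = {"Legacy Systems", "OLTP Applications", "Active Database"}
required = {"Data Warehouse", "Decision Support Applications", "OLTP Applications"}

all_gaps = architecture_gaps(current)
gaps_to_address = real_gaps(current, required)
```

In this example, transactional web scripts are missing but not required, so they do not appear among the gaps to address, which mirrors the example in the text.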
Develop a Migration Plan Based on Requirements
It is not advisable for an enterprise to use this list of architectural gaps to justify a
dramatic overhaul of its IT infrastructure; such an undertaking would be expensive and
would cause unnecessary disruption of business operations. Instead, the enterprise would do
well to develop a migration plan that consciously maps coming IT projects to the InfoMotion
Enterprise Architecture.
The Natural Migration Path
While developing the migration plan, the enterprise should consider the natural migration
path that the InfoMotion architecture implies, as illustrated in Figure 1.19.
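[Figure: Technology layers of the migration roadmap; labels recovered from the figure,
outermost first: Internet, Intranet, Client/Server, Legacy Integration.]
Figure 1.19. Natural Migration Roadmap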
• The Legacy layer is at the very core of the Enterprise Architecture. For most companies,
this core layer is where the majority of technology investments have been made.
It should also be the starting point of any architecture migration effort, i.e., the
enterprise should start from this core technology before focusing its attention on
newer forms or layers of technology.
• The Legacy Integration layer insulates the rest of the Enterprise Architecture from
the growing obsolescence of the Legacy layer. It also provides the succeeding
technology layers with a more stable foundation for future evolution.
• Each of the succeeding technology layers (i.e., Client/Server, Intranet, Internet)
builds upon its predecessors.
• At the outermost layer, the public Internet infrastructure itself supports the
operations of the enterprise.
The Customized Migration Path
Depending on the priorities and needs of the enterprise, one or more of the migration
scenarios described in the previous section will be helpful starting points. The scenarios
provide generic roadmaps that address typical architectural needs.
The migration plan, however, must be customized to address the specific needs of the
enterprise. Each project defined in the plan must individually contribute to the enterprise
in the short term, while laying the groundwork for achieving long-term enterprise and IT
objectives.
By incrementally migrating its IT infrastructure (one component and one project at a
time), the enterprise will find itself slowly but surely moving towards a modern, resilient
Enterprise Architecture, with minimal and acceptable levels of disruption in operations.
Monitor and Update the Migration Plan
The migration plan must be monitored, and the progress of the different projects fed
back into the planning task. One must not lose sight of the fact that a modern Enterprise
Architecture is a moving target; new technology makes the continuous evolution of
the Enterprise Architecture inevitable.
In Summary
An enterprise has longevity in the business arena only when its products and services
are perceived by its customers to be of value.
Likewise, Information Technology has value in an enterprise only when its cost is
outweighed by its ability to increase and guarantee quality, improve service, cut costs or
reduce cycle time, as depicted in Figure 1.20.
$$\text{Value} = \frac{\text{Quality} \times \text{Service}}{\text{Cost} \times \text{Cycle Time}}$$
Figure 1.20. The Value Equation
The Enterprise Architecture is the foundation for all Information Technology efforts. It
therefore must provide the enterprise with the ability to:
• distill information of value from the data that surrounds it and that it continuously
generates (information/data); and
• get that information to the right people and processes at the right time (motion).
These requirements form the basis for the InfoMotion equation, shown in Figure 1.21.
$$\text{InfoMotion} = \frac{\text{Information}}{\text{Data}} \times \text{Motion}$$
Figure 1.21. The InfoMotion Equation
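Returning briefly to the Value Equation of Figure 1.20, a small numeric sketch shows how the ratio responds when one factor changes. The scores below are hypothetical and unitless; the equation is a conceptual aid, and the book does not prescribe how to measure its terms.

```python
def value(quality, service, cost, cycle_time):
    """Value = (Quality x Service) / (Cost x Cycle Time), as in Figure 1.20."""
    if cost <= 0 or cycle_time <= 0:
        raise ValueError("cost and cycle_time must be positive")
    return (quality * service) / (cost * cycle_time)


# Hypothetical scores: halving cycle time doubles value, all else being equal.
baseline = value(quality=8, service=6, cost=4, cycle_time=3)    # 48 / 12
faster = value(quality=8, service=6, cost=4, cycle_time=1.5)    # 48 / 6
```

The same reasoning applies to the other terms: an IT investment raises value only if its gains in quality, service, or cycle time outweigh the cost it adds to the denominator.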
By identifying distinct architectural components and their interrelationships, the
InfoMotion Enterprise Architecture increases the capability of the IT infrastructure to meet
present business requirements while positioning the enterprise to leverage emerging trends,
such as data warehousing, in both business and technology. Figure 1.22 shows the InfoMotion
Enterprise Architecture, the elements of which we have discussed.
[Figure: The complete InfoMotion Enterprise Architecture, with Virtual Corp., Informational,
Decisional, and Operational columns. Logical Client Layer: Transactional Web Scripts,
Informational Web Scripts, Decision Support Applications, Flash Monitoring & Reporting,
OLTP Applications, Workflow Management Clients. Logical Server Layer: Transactional Web
Services, Informational Web Services, Data Warehouse, Operational Data Store, Active
Database, Workflow Management Services. Legacy Layer: Legacy Systems.]
Figure 1.22. The InfoMotion Architecture
CHAPTER 2: DATA WAREHOUSE CONCEPTS
This chapter explains how computing has changed its focus from operational to decisional
concerns. It also defines data warehousing concepts and cites the typical reasons for building
data warehouses.
2.1 GRADUAL CHANGES IN COMPUTING FOCUS
In retrospect, it is easy to see how computing has shifted its focus from operational to
decisional concerns. The differences in operational and decisional information requirements
presented new challenges that old computing practices could not meet. Below, we elaborate
on how this change in computing focus became the impetus for the development of data
warehousing technologies.
Early Computing Focused on Operational Requirements
The Business Cycle (depicted in Figure 2.1) shows that any enterprise must operate at
three levels: operational (i.e., the day-to-day running of the business), tactical (i.e., the
definition of policy and the monitoring of operations) and strategic (i.e., the definition of
the organization’s vision, goals and objectives).
[Figure: The business cycle across the Strategic, Tactical, and Operational levels; labels
recovered from the figure: Strategic, Monitoring (Decisional Systems), Policy, Operations
(Operational Systems).]
Figure 2.1. The Business Cycle
In Chapter 1, it is noted that much of the effort and money in computing has been
focused on meeting the operational business requirements of enterprises. After all, without
the OLTP applications that record thousands, even millions, of discrete transactions each
day, it would not be possible for any enterprise to meet customer needs while enforcing
business policies consistently. Nor would it be possible for an enterprise to grow without
significantly expanding its manpower base.
With operational systems deployed and day-to-day information needs being met by the
OLTP systems, the focus of computing has over recent years shifted naturally to meeting
the decisional business requirements of an enterprise. Figure 2.1 illustrates the business
cycle as it is viewed today.
Decisional Requirements Cannot be Fully Anticipated
Unfortunately, it is not possible for IT professionals to anticipate the information
requirements of an enterprise’s decision-makers, for the simple reason that their information
needs and report requirements change as the business situation changes.
Decision-makers themselves cannot be expected to know their information requirements
ahead of time; they review enterprise data from different perspectives and at different levels
of detail to find and address business problems as the problems arise. Decision-makers also
need to look through business data to identify opportunities that can be exploited. They
examine performance trends to identify business situations that can provide competitive
advantage, improve profits, or reduce costs. They analyze market data and make the tactical
as well as strategic decisions that determine the course of the enterprise.
Operational Systems Fail to Provide Decisional Information
Since these information requirements cannot be anticipated, operational systems (which
correctly focus on recording and completing different types of business transactions) are
unable to provide decision-makers with the information they need. As a result, business
managers fall back on the time-consuming and often frustrating process of going through
operational inquiries or reports already supported by operational systems in an attempt to
find or derive the information they really need. Alternatively, IT professionals are pressured
to produce an ad hoc report from the operational systems as quickly as possible.
It will not be unusual for the IT professional to find that the data needed to produce
the report are scattered throughout different operational systems and must first be carefully
integrated. Worse, it is likely that the processing required to extract the data from each
operational system will demand so much of the system resources that the IT professional
must wait until non-operational hours before running the queries required to produce the
report.
These delays are not only time-consuming and frustrating for both the IT professionals
and the decision-makers, but also dangerous for the enterprise. When the report is finally
produced, the data may be inconsistent, inaccurate, or obsolete. There is also the very real
possibility that this new report will trigger the request for another ad hoc report.
Decisional Systems have Evolved to Meet Decisional Requirements
Over the years, decisional systems have been developed and implemented in the hope
of meeting these information needs. Some enterprises have actually succeeded in developing
and deploying data warehouses within their respective organizations, long before the term
data warehouse became fashionable.
Most decisional systems, however, have failed to deliver on their promises. This book
introduces data warehousing technologies and shares lessons learnt from the successes and
failures of those who have been on the “bleeding edge.”
2.2 DATA WAREHOUSE CHARACTERISTICS AND DEFINITION
A data warehouse can be viewed as an information system with the following attributes:
• It is a database designed for analytical tasks, using data from multiple applications.
• It supports a relatively small number of users with relatively long interactions.
• Its usage is read-intensive.
• Its content is periodically updated (mostly additions).
• It contains current and historical data to provide a historical perspective of
information.
• It contains a few large tables.
Each query frequently results in a large result set and involves frequent full table scans
and multi-table joins.
What is a data warehouse? William H. Inmon in Building the Data Warehouse (QED
Technical Publishing Group, 1992, ISBN: 0-89435-404-3) defines a data warehouse as “a
collection of integrated subject-oriented databases designed to supply the information required
for decision-making.”
A more thorough look at the above definition yields the following observations.
Integrated
A data warehouse contains data extracted from the many operational systems of the
enterprise, possibly supplemented by external data. For example, a typical banking data
warehouse will require the integration of data drawn from the deposit systems, loan systems,
and the general ledger.
Each of these operational systems records different types of business transactions and
enforces the policies of the enterprise regarding these transactions. If each of the operational
systems has been custom built or an integrated system is not implemented as a solution,
then it is unlikely that these systems are integrated. Thus, Customer A in the deposit
system and Customer B in the loan system may be one and the same person, but there is
no automated way for anyone in the bank to know this. Customer relationships are managed
informally through relationships with bank officers.
A data warehouse brings together data from the various operational systems to provide
an integrated view of the customer and the full scope of his or her relationship with the
bank.
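The banking example can be sketched as follows. The sketch assumes that both systems happen to record a common identifier (a hypothetical `tax_id` field) on which records can be matched; in practice, finding such a key is often the hard part of integration. All names and amounts are made up for illustration.

```python
# Hypothetical extracts from two unintegrated operational systems.
deposit_system = [
    {"customer": "A", "name": "Maria Santos", "tax_id": "111-22-333", "balance": 5000},
    {"customer": "C", "name": "Lee Wong", "tax_id": "444-55-666", "balance": 1200},
]
loan_system = [
    {"customer": "B", "name": "M. Santos", "tax_id": "111-22-333", "principal": 20000},
]


def blank_view():
    """An empty integrated record for one person."""
    return {"deposit_ids": [], "loan_ids": [], "total_deposits": 0, "total_loans": 0}


def integrate(deposits, loans, key="tax_id"):
    """Build one record per person, combining deposit and loan relationships."""
    customers = {}
    for row in deposits:
        view = customers.setdefault(row[key], blank_view())
        view["deposit_ids"].append(row["customer"])
        view["total_deposits"] += row["balance"]
    for row in loans:
        view = customers.setdefault(row[key], blank_view())
        view["loan_ids"].append(row["customer"])
        view["total_loans"] += row["principal"]
    return customers


integrated = integrate(deposit_system, loan_system)
```

After integration, Customer A of the deposit system and Customer B of the loan system appear as one person with both a deposit and a loan relationship.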
Subject Oriented
Traditional operational systems focus on the data requirements of a department or
division, producing the much-criticized “stovepipe” systems of model enterprises. With the
advent of business process reengineering, enterprises began espousing process-centered teams
and case workers. Modern operational systems, in turn, have shifted their focus to the
operational requirements of an entire business process and aim to support the execution of
the business process from start to finish.
A data warehouse goes beyond traditional information views by focusing on enterprise-wide
subjects such as customers, sales, and profits. These subjects span both organizational
and process boundaries and require information from multiple sources to provide a complete
picture.
Databases
Although the term data warehousing technologies is used to refer to the gamut of
technology components that are required to plan, develop, manage, implement, and use a data
warehouse, the term data warehouse itself refers to a large, read-only repository of data.
At the very heart of every data warehouse lie the large databases that store the integrated
data of the enterprise, obtained from both internal and external data sources. The term
internal data refers to all data that are extracted from the operational systems of the
enterprise. External data are data provided by third-party organizations, including business
partners, customers, government bodies, and organizations that choose to make a profit by
selling their data (e.g., credit bureaus).
Also stored in the databases are the metadata that describe the contents of the data
warehouse. A more thorough discussion on metadata and their role in data warehousing is
provided in Chapter 3.
Required for Decision-Making
Unlike the databases of operational systems, which are often normalized to preserve
and maintain data integrity, a data warehouse is designed and structured in a denormalized
manner to better support the usability of the data warehouse. Users are better able to
examine, derive, summarize, and analyze data at various levels of detail, over different
periods of time, when using a denormalized data structure.
The database is denormalized to mimic a business user’s dimensional view of the business.
For example, while a finance manager is interested in the profitability of the various products
of a company, a product manager will be more interested in the sales of the product in the
various sales regions. In data warehousing parlance, users need to “slice and dice” through
different areas of the database at different levels of detail to obtain the information they
need. In this manner, a decision-maker can start with a high-level view of the business, then
drill down to get more detail on the areas that require his or her attention, or vice versa.
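As an illustration of slicing and dicing, the Python sketch below loads a small denormalized fact table into an in-memory SQLite database and answers the finance manager's and the product manager's questions from the same table. The table name, column names, and figures are hypothetical; SQLite is used only because it ships with Python, not because it is a typical warehouse platform.

```python
import sqlite3

DIMENSIONS = {"product", "region", "quarter"}
MEASURES = {"sales", "profit"}


def build_fact_table():
    """Create a tiny denormalized fact table with made-up figures."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales_fact "
        "(product TEXT, region TEXT, quarter TEXT, sales REAL, profit REAL)"
    )
    conn.executemany(
        "INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)",
        [
            ("Widget", "Asia", "2Q", 100.0, 20.0),
            ("Widget", "Europe", "2Q", 80.0, 10.0),
            ("Gadget", "Asia", "2Q", 50.0, 15.0),
            ("Gadget", "Europe", "2Q", 70.0, 5.0),
        ],
    )
    return conn


def slice_and_dice(conn, dimension, measure, **filters):
    """Total one measure along one dimension, optionally filtered by others."""
    if dimension not in DIMENSIONS or measure not in MEASURES:
        raise ValueError("unknown dimension or measure")
    if not set(filters) <= DIMENSIONS:
        raise ValueError("filters must name known dimensions")
    # Column names are validated above; filter values are bound as parameters.
    sql = f"SELECT {dimension}, SUM({measure}) FROM sales_fact"
    if filters:
        sql += " WHERE " + " AND ".join(f"{column} = ?" for column in filters)
    sql += f" GROUP BY {dimension} ORDER BY {dimension}"
    return conn.execute(sql, tuple(filters.values())).fetchall()


conn = build_fact_table()
profit_by_product = slice_and_dice(conn, "product", "profit")            # finance view
widget_sales_by_region = slice_and_dice(conn, "region", "sales", product="Widget")
```

Because every row already carries its product, region, and quarter, neither question needs a join; this is the usability benefit the text attributes to a denormalized structure.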
Each Unit of Data is Relevant to a Point in Time
Every data warehouse will inevitably have a Time dimension; each data item (also
called facts or measures) in the data warehouse is time-stamped to support queries or
reports that require the comparison of figures from prior months or years.
The time-stamping of each fact also makes it possible for decision-makers to recognize
trends and patterns in customer or market behavior over time.
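A short Python sketch shows why the time stamp matters: once each fact carries a date, comparing a month with the same month of the prior year is a simple lookup. The fact layout (a date and an amount) and the sample figures are hypothetical.

```python
from collections import defaultdict
from datetime import date


def monthly_totals(facts):
    """Sum time-stamped facts into totals keyed by (year, month)."""
    totals = defaultdict(float)
    for stamp, amount in facts:
        totals[(stamp.year, stamp.month)] += amount
    return dict(totals)


def change_vs_prior_year(facts, year, month):
    """Fractional change against the same month one year earlier, or None."""
    totals = monthly_totals(facts)
    current = totals.get((year, month))
    prior = totals.get((year - 1, month))
    if current is None or not prior:
        return None  # nothing to compare against
    return (current - prior) / prior


# Hypothetical sales facts, each stamped with its transaction date.
facts = [
    (date(2004, 6, 10), 400.0),
    (date(2004, 6, 25), 600.0),
    (date(2005, 6, 12), 1250.0),
]
growth = change_vs_prior_year(facts, 2005, 6)
```

Without the date on each fact, the June 2004 and June 2005 figures could not be told apart, and no trend could be derived.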
A Data Warehouse Contains both Atomic and Summarized Data
Data warehouses hold data at different levels of detail. Data at the most detailed level,
i.e., the atomic level, are used to derive the summarized aggregated values. Aggregates
(presummarized data) are stored in the warehouse to speed up responses to queries at
higher levels of granularity.
If the data warehouse stores data only at summarized levels, its users will not be able
to drill down on data items to get more detailed information. However, the storage of very
detailed data results in larger space requirements.
2.3 THE DYNAMIC, AD HOC REPORT
The ideal scenario for enterprise decision-makers (and for IT professionals) is to
have a repository of data and a set of tools that will allow decision-makers to create their
own set of dynamic reports. The term dynamic report refers to a report that can be quickly
modified by its user to present either greater or lesser detail, without any additional
programming required. Dynamic reports are the only kind of reports that provide true,
ad hoc reporting capabilities. Figure 2.2 presents an example of a dynamic report.
For Current Year, 2Q
Sales Region      Targets (’000s)   Actuals (’000s)
Asia                       24,000            25,550
Europe                     10,000            12,200
North America               8,000             2,000
Africa                      5,600             6,200
. . .
Figure 2.2. The Dynamic Report: Summary View
A decision-maker should be able to start with a short report that summarizes the
performance of the enterprise. When the summary calls attention to an area that bears
closer inspection, the decision-maker should be able to point to that portion of the report,
then obtain greater detail on it dynamically, on an as-needed basis, with no further
programming. Figure 2.3 presents a detailed view of the summary shown in Figure 2.2.
For Current Year, 2Q
Sales Region      Country          Targets (’000s)   Actuals (’000s)
Asia              Philippines               14,000            15,050
                  Hong Kong                 10,000            10,500
Europe            France                     4,000             4,050
                  Italy                      6,000             8,150
North America     United States              1,000             1,500
                  Canada                     7,000               500
Africa            Egypt                      5,600             6,200
Figure 2.3. The Dynamic Report: Detailed View
By providing business users with the ability to dynamically view more or less of the
data on an ad hoc, as-needed basis, the data warehouse eliminates delays in getting
information and removes the IT professional from the report-creation loop.
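The relationship between the two views of the dynamic report can be sketched in Python: the summary view of Figure 2.2 is a roll-up of the detail rows of Figure 2.3, and drilling down is a filter on the same rows. The figures below are transcribed from Figure 2.3; the function names are hypothetical, and a real reporting tool would run such queries against the warehouse rather than an in-memory list.

```python
# Detail rows from Figure 2.3: (sales region, country, target, actual), in '000s.
DETAIL = [
    ("Asia", "Philippines", 14000, 15050),
    ("Asia", "Hong Kong", 10000, 10500),
    ("Europe", "France", 4000, 4050),
    ("Europe", "Italy", 6000, 8150),
    ("North America", "United States", 1000, 1500),
    ("North America", "Canada", 7000, 500),
    ("Africa", "Egypt", 5600, 6200),
]


def summary_view(detail):
    """Roll detail rows up to one (target, actual) pair per sales region."""
    summary = {}
    for region, _country, target, actual in detail:
        total_target, total_actual = summary.get(region, (0, 0))
        summary[region] = (total_target + target, total_actual + actual)
    return summary


def drill_down(detail, region):
    """Return the country-level rows behind one line of the summary."""
    return [(country, target, actual)
            for row_region, country, target, actual in detail
            if row_region == region]


summary = summary_view(DETAIL)
north_america = drill_down(DETAIL, "North America")
```

The summary shows North America well short of its target; drilling down shows that the shortfall comes from Canada, while the United States exceeded its target.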
2.4 THE PURPOSES OF A DATA WAREHOUSE
At this point, it is helpful to summarize the typical reasons that enterprises undertake
data warehousing initiatives.
To Provide Business Users with Access to Data
The data warehouse provides access to integrated enterprise data previously locked away in unfriendly, difficult-to-access environments. Business users can now establish, with minimal effort, a secured connection to the warehouse through their desktop PC. Security is enforced by the warehouse front-end application, by the server database, or by both.
Because of its integrated nature, a data warehouse spares business users from the need to learn, understand, or access operational data in their native environments and data structures.
To Provide One Version of the Truth
The data in the data warehouse are consistent and quality assured before being released to business users. Since a common source of information is now used, the data warehouse puts to rest all debates about the veracity of data used or cited in meetings. The data warehouse becomes the common information resource for decisional purposes throughout the organization.
Note that "one version of the truth" is often possible only after much discussion and debate about the terms used within the organization. For example, the term customer can have different meanings to different people. It is not unusual for some people to refer to prospective clients as "customers," while others in the same organization may use the term "customers" to mean only actual, current clients.
While these differences may seem trivial at first glance, the subtle nuances that exist depending on the context may result in misleading numbers and ill-informed decisions. For example, when the Western Region sales manager asks for the number of customers, he probably means the "number of customers from the Western Region," not the "number of customers served by the entire company."
To Record the Past Accurately
Many of the figures and numbers that managers receive have little meaning unless compared to historical figures. For example, reports that compare the company's present performance with that of the previous year are quite common. Reports that show the company's performance for the same month over the past three years are likewise of interest to decision-makers.
The operational systems will not be able to meet this kind of information need, and for a good reason. A data warehouse should be used to record the past accurately, leaving the OLTP systems free to focus on recording current transactions and balances. Actual historical values are neither stored on the operational system nor derived by adding or subtracting transaction values against the latest balance. Instead, historical data are loaded and integrated with other data in the warehouse for quick access.
To Slice and Dice Through Data
As stated earlier in this chapter, dynamic reports allow users to view warehouse data from different angles, at different levels of detail. Business users with the means and the ability to slice and dice through warehouse data can actively meet their own information needs.
The ready availability of different data views also improves business analysis by reducing the time and effort required to collect, format, and distill information from data.
To Separate Analytical and Operational Processing
Decisional processing and operational information processing have totally divergent architectural requirements. Attempts to meet both decisional and operational information needs through the same system or through the same system architecture merely increase the brittleness of the IT architecture and will create system maintenance nightmares.
Data warehousing disentangles analytical from operational processing by providing a separate system architecture for decisional implementations. This makes the overall IT architecture of the enterprise more resilient to changing requirements.
To Support the Reengineering of Decisional Processes
At the end of each BPR initiative come the projects required to establish the technological and organizational systems to support the newly reengineered business process.
Although reengineering projects have traditionally focused on operational processes, data warehousing technologies make it possible to reengineer decisional business processes as well. Data warehouses, with their focus on meeting decisional business requirements, are the ideal systems for supporting reengineered decisional business processes.
2.5 DATA MARTS
A discussion of data warehouses is not complete without a note on data marts. The concept of the data mart is causing a lot of excitement and attracts much attention in the data warehouse industry. Mostly, data marts are presented as an inexpensive alternative to a data warehouse that takes significantly less time and money to build. However, the term data mart means different things to different people. A rigorous definition of this term is a data store that is subsidiary to a data warehouse of integrated data. The data mart is directed at a partition of data (often called a subject area) that is created for the use of a dedicated group of users. A data mart might, in fact, be a set of denormalized, summarized, or aggregated data. Sometimes, such a set could be placed on the data warehouse database rather than in a physically separate store of data. In most instances, however, the data mart is a physically separate store of data and is normally resident on a separate database server, often on the local area network. Some enterprises use relational OLAP technology, which creates highly denormalized star schema relational designs or hypercubes of data for analysis by groups of users with a common interest in a limited portion of the database. In other cases, the data warehouse architecture may incorporate data mining tools that extract sets of data for a particular type of analysis. All these types of data marts, called dependent data marts because their data content is sourced from the data warehouse, have a high value because no matter how many are deployed and no matter how many different enabling technologies are used, the different users are all accessing information views derived from the same single integrated version of the data.
Unfortunately, misleading statements about the simplicity and low cost of data marts sometimes result in organizations or vendors incorrectly positioning them as an alternative to the data warehouse. This viewpoint defines independent data marts that in fact represent fragmented point solutions to a range of business problems in the enterprise. This type of implementation should rarely be deployed in the context of an overall technology or applications architecture. Indeed, it is missing the ingredient that is at the heart of the data warehousing concept: data integration. Each independent data mart makes its own assumptions about how to consolidate the data, and the data across several data marts may not be consistent.
Moreover, the concept of an independent data mart is dangerous: as soon as the first data mart is created, other organizations, groups, and subject areas within the enterprise embark on the task of building their own data marts. As a result, an environment is created in which multiple operational systems feed multiple non-integrated data marts that often overlap in data content, job scheduling, connectivity, and management. In other words, the complex many-to-one problem of building a data warehouse from operational and external data sources is transformed into a many-to-many sourcing and management nightmare.
Another consideration against independent data marts is related to the potential scalability problem: the first simple and inexpensive data mart was most probably designed without any serious consideration of scalability (for example, an expensive parallel computing platform for an "inexpensive" and "small" data mart would not be considered). But, as usage begets usage, the initial small data mart needs to grow (i.e., in data sizes and the number of concurrent users), without any ability to do so in a scalable fashion.
It is clear that the point-solution independent data mart is not necessarily a bad thing, and it is often a necessary and valid solution to a pressing business problem, thus achieving the goal of rapid delivery of enhanced decision support functionality to end users. The business drivers underlying such developments include:
• Extremely urgent user requirements.
• The absence of a budget for a full data warehouse strategy.
• The absence of a sponsor for an enterprise-wide decision support strategy.
• The decentralization of business units.
• The attraction of easy-to-use tools and a mind-sized project.
To address data integration issues associated with data marts, the recommended approach proposed by Ralph Kimball is as follows. For any two data marts in an enterprise, the common dimensions must conform to the equality and roll-up rule, which states that these dimensions are either the same or that one is a strict roll-up of another.
Thus, in a retail store chain, if the purchase orders database is one data mart and the sales database is another data mart, the two data marts will form a coherent part of an overall enterprise data warehouse if their common dimensions (e.g., time and product) conform. The time dimensions from both data marts might be at the individual day level, or, alternatively, one time dimension might be at the day level and the other at the week level. Because days roll up to weeks, the two time dimensions are conformed. The time dimensions would not be conformed if one time dimension were weeks and the other a fiscal quarter. The resulting data marts could not usefully coexist in the same application.
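
The equality and roll-up rule can be expressed as a small check. The sketch below is a hedged illustration rather than Kimball's own formulation: the ROLLS_UP_TO table and the function names are assumptions introduced here, and the table assumes fiscal quarters are made up of whole calendar months.

```python
# Hypothetical roll-up paths for a time dimension: each grain maps to the
# coarser grains it aggregates into directly. A week can straddle a quarter
# boundary, so "week" has no path to "fiscal_quarter".
ROLLS_UP_TO = {
    "day": {"week", "month"},
    "week": set(),
    "month": {"fiscal_quarter"},
    "fiscal_quarter": {"fiscal_year"},
    "fiscal_year": set(),
}


def rolls_up(fine, coarse):
    """True if `fine` aggregates into `coarse` through one or more steps."""
    pending = list(ROLLS_UP_TO[fine])
    seen = set()
    while pending:
        grain = pending.pop()
        if grain == coarse:
            return True
        if grain not in seen:
            seen.add(grain)
            pending.extend(ROLLS_UP_TO[grain])
    return False


def conformed(grain_a, grain_b):
    """Equality and roll-up rule: same grain, or one is a strict roll-up of the other."""
    return grain_a == grain_b or rolls_up(grain_a, grain_b) or rolls_up(grain_b, grain_a)
```

Under these assumptions, conformed("day", "week") holds because days roll up to weeks, while conformed("week", "fiscal_quarter") does not, matching the two cases described in the text.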
In summary, data marts present two problems: (1) scalability in situations where an initially small data mart grows quickly in multiple dimensions and (2) data integration. Therefore, when designing data marts, organizations should pay close attention to system scalability, data consistency, and manageability issues. The key to a successful data mart strategy is the development of an overall scalable data warehouse architecture, and a key step in that architecture is identifying and implementing the common dimensions.
A number of misconceptions exist about data marts and their relationships to data warehouses. We discuss two of those misconceptions below.
MISCONCEPTION
Data Warehouses and Data Marts cannot Coexist
There are parties who strongly advocate the deployment of data marts as opposed to the deployment of data warehouses. They correctly point out the difficulties of building an enterprise-wide data warehouse in one large project and lead unsuspecting organizations down the "data mart vs. data warehouse" path.
What many do not immediately realize is that data warehouses and data marts can coexist within the same organization; the correct approach is "data mart and data warehouse." This is discussed more thoroughly in the "Warehousing Architectures" section of Chapter 5.
Data Marts can be Built Independently of One Another
Some enterprises find it easier to deploy multiple data marts independently of one another. At first glance, such an approach is indeed easier since there are no integration issues. Different groups of users are involved with each data mart, which implies fewer conflicts about the use of terms and about business rules. Each data mart is free to exist within its own isolated world, and all the users are happy.
Unfortunately, what enterprises fail to realize until much later is that by deploying one isolated data mart after another, the enterprise has actually created new islands of automation. While at the outset those data marts are certainly easier to develop, the task of maintaining many unrelated data marts is exceedingly complex and will create data management, synchronization, and consistency issues. Each data mart presents its own version of "the truth" and will quite naturally provide information that conflicts with the reports from other data marts.
Multiple data marts are definitely appropriate within an organization, but these should be implemented only under the integrating framework of an enterprise-wide data warehouse. Each data mart is developed as an extension of the data warehouse and is fed by the data warehouse. The data warehouse enforces a consistent set of business rules and ensures the consistent use of terms and definitions.
2.6 OPERATIONAL DATA STORES
Data warehouse discussions will also naturally lead to Operational Data Stores, which at first glance may appear no different from data warehouses.
Although both technologies support the decisional information needs of enterprise decision-makers, the two are distinctly different and are deployed to meet different types of decisional information needs.
Definition of Operational Data Stores
In Building the Operational Data Store (John Wiley & Sons, 1996, ISBN: 0-471-12822-8), W.H. Inmon, C. Imhoff, and G. Battas define an Operational Data Store as "the architectural construct where collective integrated operational data is stored." An ODS can also be defined as a collection of integrated databases designed to support operational monitoring. Unlike the databases of OLTP applications (which are operational or function oriented), the Operational Data Store contains subject-oriented, enterprise-wide data. However, unlike data warehouses, the data in Operational Data Stores are volatile, current, and detailed.
However, some significant challenges of the ODS still remain. Among them are:
• Location of the appropriate sources of data.
• Transformation of the source data to satisfy the ODS data model requirements.
• Complexity of near-real-time propagation of changes from the operational systems to the ODS (including tasks to recognize, obtain, synchronize, and move changes from a multitude of disparate systems).
• A DBMS that combines effective query processing with transactional processing capabilities that ensure the ACID transaction properties.
Table 2.1. Comparison of the Data Warehouse with the Operational Data Store
                  Data Warehouse                Operational Data Store
Purpose:          Strategic Decision Support    Operational Monitoring
Similarities:     Integrated Data               Integrated Data
                  Subject-Oriented              Subject-Oriented
Differences:      Static Data                   Volatile Data
                  Historical Data               Current Data
                  Summarized Data               More Detailed
The ODS provides an integrated view of the data in the operational systems. Data are transformed and integrated into a consistent, unified whole as they are obtained from legacy and other operational systems to provide business users with an integrated and current view of operations. Data in the Operational Data Store are constantly refreshed so that the resulting image reflects the latest state of operations.
Flash Monitoring and Reporting Tools
As mentioned in Chapter 1, flash monitoring and reporting tools are like a dashboard that provides meaningful online information on the operational status of the enterprise. This service is achieved by the use of ODS data as inputs to the flash monitoring and reporting tools, to provide business users with a constantly refreshed, enterprise-wide view of operations without creating unwanted interruptions or additional load on transaction-processing systems. Figure 2.4 diagrams how this scheme works.
[Diagram boxes: Legacy System 1, Legacy System 2, ..., Legacy System N; Other Systems; Integration and Transformation of Legacy Data; Operational Data Store; Flash Monitoring and Reporting]
Figure 2.4. Operational Monitoring
Relationship of Operational Data Stores to Data Warehouse
Enterprises with Operational Data Stores find themselves in the enviable position of being able to deploy data warehouses with considerable ease. Since operational data stores are integrated, many of the issues related to extracting, transforming, and transporting data from legacy systems have been addressed by the ODS, as illustrated in Figure 2.5.
[Diagram boxes: Legacy System 1, ..., Legacy System N; Integration and Transformation of Operational Data; Operational Data Store; Flash Monitoring and Reporting; Data Warehouse; Decision Support]
Figure 2.5. The Operational Data Store Feeds the Data Warehouse
The data warehouse is populated by means of regular snapshots of the data in the Operational Data Store. However, unlike the ODS, the data warehouse maintains the historical snapshots of the data for comparisons across different time frames. The ODS is free to focus only on the current state of operations and is constantly updated in real time.
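
The division of labor between the two stores can be sketched as follows. This is a toy model under stated assumptions, not a description of any product: the class names, method names, and the account key used in the usage note are all hypothetical, and a real ODS would of course hold far more than a dictionary of balances.

```python
from datetime import date


class OperationalDataStore:
    """Holds only the current state; each change overwrites the previous value."""

    def __init__(self):
        self.current = {}

    def apply_change(self, key, value):
        self.current[key] = value


class DataWarehouse:
    """Keeps dated snapshots of the ODS so that periods can be compared."""

    def __init__(self):
        self.snapshots = {}

    def load_snapshot(self, as_of, ods):
        # Copy the ODS image so later ODS updates cannot alter recorded history.
        self.snapshots[as_of] = dict(ods.current)

    def change_between(self, key, earlier, later):
        return self.snapshots[later][key] - self.snapshots[earlier][key]
```

For example, if a hypothetical balance "acct-1" is 100 when the January snapshot is loaded and 250 when the February snapshot is loaded, the ODS afterwards holds only 250, while the warehouse can still report the change of 150 between the two dates.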
2.7 DATA WAREHOUSE COST-BENEFIT ANALYSIS/RETURN ON INVESTMENT
Senior management typically requires a cost-benefit analysis (CBA) or a study of return on investment (ROI) prior to embarking on a data warehousing initiative. Although the task of calculating ROI for data warehousing initiatives is unique to each enterprise, it is possible to classify the types of benefits and costs that are associated with data warehousing.
Benefits
Data warehousing benefits can be expected from the following areas:
• Redeployment of staff assigned to old decisional systems. The cost of producing today's management reports is typically undocumented and unknown within an enterprise. The quantification of such costs in terms of staff hours and erroneous data may yield surprising results. Benefits of this nature, however, are typically minimal, since warehouse maintenance and enhancements require staff as well. At best, staff will be redeployed to more productive tasks.
• Improved productivity of analytical staff due to availability of data. Analysts go through several steps in their day-to-day work: locating data, retrieving data, analyzing data to yield information, presenting information, and recommending a course of action. Unfortunately, much of the time (sometimes up to 40 percent) spent by enterprise analysts on a typical day is devoted to locating and retrieving data. The availability of integrated, readily accessible data (in the data warehouse) should significantly reduce the time that analysts spend on data collection tasks and increase the time available to actually analyze the data they have collected. This leads either to shorter decision cycle times or to improvements in the quality of the analysis.
• Business improvements resulting from analysis of warehouse data. The most significant business improvements in warehousing result from the analysis of warehouse data, especially if the easy availability of information yields insights heretofore unknown to the enterprise. The goal of the data warehouse is to meet decisional information needs; it therefore follows naturally that the greatest benefits of warehousing are obtained when decisional information needs are actually met and sound business decisions are made at both the tactical and strategic levels. Understandably, such benefits are more significant and, therefore, more difficult to project and quantify.
Costs
Data warehousing costs typically fall into one of four categories. These are:
• Hardware. This item refers to the costs associated with setting up the hardware and operating environment required by the data warehouse. In many instances, this setup may require the acquisition of new equipment or the upgrade of existing equipment. Larger warehouse implementations naturally imply higher hardware costs.
• Software. This item refers to the costs of purchasing the licenses to use software products that automate the extraction, cleansing, loading, retrieval, and presentation of warehouse data.
• Services. This item refers to services provided by systems integrators, consultants, and trainers during the course of a data warehouse project. Enterprises typically rely more on the services of third parties in early warehousing implementations, when the technology is quite new to the enterprise.
• Internal staff costs. This item refers to costs incurred by assigning internal staff to the data warehousing effort, as well as to costs associated with training internal staff on new technologies and techniques.
ROI Considerations
The costs and benefits associated with data warehousing vary significantly from one enterprise to another. The differences are chiefly influenced by:
• The current state of technology within the enterprise;
• The culture of the organization in terms of decision-making styles and attitudes towards technology; and
• The company's position in its chosen market vs. its competitors.
The effect of data warehousing on the tactical and strategic management of an enterprise is often likened to cleaning the muddy windshield of a car. It is difficult to quantify the value of driving a car with a cleaner windshield. Similarly, it is difficult to quantify the value of managing an organization with better information and insight.
Lastly, it is important to note that data warehouse justification is often complicated by the fact that much of the benefit may take some time to realize and is therefore difficult to quantify in advance.
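
Even so, the arithmetic of a simple ROI estimate is easy to lay out once figures have been agreed. The sketch below uses the common definition of ROI as net benefit divided by total cost, which is an assumption here since the text does not prescribe a formula. The cost categories and benefit areas are the ones listed above; the function name is introduced only for illustration.

```python
def simple_roi(benefits, costs):
    """Return (total benefits - total costs) / total costs.

    `benefits` and `costs` map a category name to an estimated amount in the
    same currency unit over the same evaluation period.
    """
    total_benefit = sum(benefits.values())
    total_cost = sum(costs.values())
    if total_cost == 0:
        raise ValueError("total cost must be non-zero")
    return (total_benefit - total_cost) / total_cost
```

With purely hypothetical estimates of 400 for hardware, 300 for software, 200 for services, and 100 for internal staff, set against benefits of 50 from staff redeployment, 450 from analyst productivity, and 1,000 from business improvements, the function returns 0.5, that is, a 50 percent return. The hard part, as noted above, is arriving at defensible benefit estimates in the first place.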
In Summary
Data warehousing technologies have evolved as a result of the unsatisfied decisional information needs of enterprises. With the increased stability of operational systems, information technology professionals have increasingly turned their attention to meeting the decisional requirements of the enterprise.
A data warehouse, according to Bill Inmon, is a collection of integrated, subject-oriented databases designed to supply the information required for decision-making. Each data item in the data warehouse is relevant to some moment in time.
A data mart has traditionally been defined as a subset of the enterprise-wide data warehouse. Many enterprises, upon realizing the complexity involved in deploying a data warehouse, will opt to deploy data marts instead. Although data marts are able to meet the immediate needs of a targeted group of users, the enterprise should shy away from deploying multiple, unrelated data marts. The presence of such islands of information will only result in data management and synchronization problems.
Like data warehouses, Operational Data Stores are integrated and subject-oriented. However, an ODS is always current and is constantly updated (ideally in real time). The Operational Data Store is the ideal data source for a data warehouse, since it already contains integrated operational data as of a given point in time.
Although data warehouses have proven to have significant returns on investment, particularly when they are meeting a specific, targeted business need, it is extremely difficult to quantify the expected benefits of a data warehouse. The costs are easier to calculate, as these break down simply into hardware, software, services, and in-house staffing costs.
PART II : PEOPLE
Although a number of people are involved in a single data warehousing project, there are three key roles that carry enormous responsibilities. Negligence in carrying out any of these three roles can easily derail a well-planned data warehousing initiative. This section of the book therefore focuses on the Project Sponsor, the Chief Information Officer, and the Project Manager and seeks to answer the questions frequently asked by individuals who have accepted the responsibilities that come with these roles.
• Project Sponsor. Every data warehouse initiative has a Project Sponsor, a high-level executive who provides strategic guidance, support, and direction to the data warehousing project. The Project Sponsor ensures that project objectives are aligned with enterprise objectives, resolves organizational issues, and usually obtains funding for the project.
• Chief Information Officer (CIO). The CIO is responsible for the effective deployment of information technology resources and staff to meet the strategic, decisional, and operational information requirements of the enterprise. Data warehousing, with its accompanying array of new technology and its dependence on operational systems, naturally makes strong demands on the physical and human resources under the jurisdiction of the CIO, not only during design and development but also during maintenance and subsequent evolution.
• Project Manager. The warehouse Project Manager is responsible for all technical activities related to implementing a data warehouse. Ideally, an IT professional from the enterprise fulfills this critical role. It is not unusual, however, for this role to be outsourced for early or pilot projects, because of the newness of warehousing technologies and techniques.
CHAPTER 3
THE PROJECT SPONSOR
Before the Project Sponsor becomes comfortable with the data warehousing effort, quite a number of his or her questions and concerns will have to be addressed. This chapter attempts to provide answers to questions frequently asked by Project Sponsors.
3.1 HOW DOES A DATA WAREHOUSE AFFECT DECISION-MAKING PROCESSES?
It is naive to expect an immediate change to the decision-making processes in an organization when a data warehouse first goes into production. End users will initially be occupied more with learning how to use the data warehouse than with changing the way they obtain information and make decisions. It is also likely that the first set of predefined reports and queries supported by the data warehouse will differ little from existing reports.
Decision-makers will experience varying levels of initial difficulty with the use of the data warehouse; proper usage assumes a level of desktop computing skills, data knowledge, and business knowledge.
Desktop Computing Skills
Not all business users are familiar and comfortable with desktop computers, and it is unrealistic to expect all the business users in an organization to make direct, personal use of the front-end warehouse tools. On the other hand, there are power users within the organization who enjoy using computers, love spreadsheets, and will quickly push the tools to the limit with their queries and reporting requirements.
Data Knowledge
It is critical that business users be familiar with the contents of the data warehouse before they make use of it. In many cases, this requirement entails extensive communication on two levels. First, the scope of the warehouse must be clearly communicated to properly manage user expectations about the type of information they can retrieve, particularly in the earlier rollouts of the warehouse. Second, business users who will have direct access to the data warehouse must be trained on the use of the selected front-end tools and on the meaning of the warehouse contents.
Business Knowledge
Warehouse users must have a good understanding of the nature of their business and of the types of business issues that they need to address. The answers that the warehouse will provide are only as good as the questions that are directed to it.
As end users gain confidence both in their own skills and in the veracity of the warehouse contents, data warehouse usage and overall support of the warehousing initiative will increase. Users will begin to "outgrow" the canned reports and will move to a more ad hoc, analytical style when using the data warehouse.
As the data scope of the warehouse increases and additional standard reports are produced from the warehouse data, decision-makers will start feeling overwhelmed by the number of standard reports that they receive. Decision-makers either gradually want to lessen their dependence on the regular reports or want to start relying on exception reporting or highlighting, and alert systems.
Exception Reporting or Highlighting
Instead of scanning regular reports one line item at a time, decision-makers want to receive exception reports that enumerate only the items that meet their definition of "exceptions". For example, instead of receiving sales reports per region for all regions within the company, a sales executive may instead prefer to receive sales reports for areas where actual sales figures are either 10 percent more or less than the budgeted figures.
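
An exception report of this kind amounts to a filter over the regular report. The sketch below is an illustration under two assumptions: the exception rule is read as "actual deviates from budget by at least 10 percent in either direction", and the figures are borrowed from Figure 2.2 in Chapter 2 purely as sample data. The names SALES and exception_report are introduced here.

```python
# Sample rows borrowed from Figure 2.2: (region, budgeted, actual), in '000s.
SALES = [
    ("Asia", 24_000, 25_550),
    ("Europe", 10_000, 12_200),
    ("North America", 8_000, 2_000),
    ("Africa", 5_600, 6_200),
]


def exception_report(rows, threshold_pct=10):
    """Keep only rows whose actual differs from budget by at least threshold_pct percent."""
    flagged = []
    for region, budget, actual in rows:
        # Integer form of |actual - budget| / budget >= threshold_pct / 100,
        # which avoids floating-point rounding at the boundary.
        if abs(actual - budget) * 100 >= threshold_pct * budget:
            flagged.append((region, budget, actual))
    return flagged
```

With the default threshold, exception_report(SALES) keeps Europe, North America, and Africa and drops Asia, whose actual figure is within 10 percent of its target. An alert system could apply the same test and send a notification instead of producing a report.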
Alert Systems
Alert systems also follow the same principle, that of highlighting or bringing to the fore areas or items that require managerial attention and action. However, instead of reports, decision-makers will receive notification of exceptions through other means, for example, an e-mail message.
As the warehouse gains acceptance, decision-making styles will evolve from the current practice of waiting for regular reports from IT or MIS to using the data warehouse to understand the current status of operations and, further, to using the data warehouse as the basis for strategic decision-making. At the most sophisticated level of usage, a data warehouse will allow senior management to understand and drive the business changes needed by the enterprise.
3.2 HOW DOES A DATA WAREHOUSE IMPROVE FINANCIAL PROCESSES?
MARKETING? OPERATIONS?
A successful enterprise-wide data warehouse effort will improve financial, marketing, and operational processes through the simple availability of integrated data views. Previously unavailable perspectives of the enterprise will increase understanding of cross-functional operations. The integration of enterprise data results in standardized terms across organizational units (e.g., a uniform definition of customer and profit). A common set of metrics for measuring performance will emerge from the data warehousing effort. Communication among these different groups will also improve.
Financial Processes
Consolidated financial reports, profitability analysis, and risk monitoring improve the financial processes of an enterprise, particularly in financial service institutions, such as banks. The very process of consolidation requires the use of a common vocabulary and increased understanding of operations across different groups in the organization.
While financial processes will improve because of the newly available information, it is important to note that the warehouse can provide information based only on available data. For example, one of the most popular banking applications for data warehousing is profitability analysis. Unfortunately, enterprises may encounter a rude shock when it becomes apparent that revenues and costs are not tracked at the same level of detail within the organization. Banks frequently track their expenses at the level of branches or organization units but wish to compute profitability on a per customer basis. With revenue figures at the customer level and costs at the branch level, there is no direct way to compute profit. As a result, enterprises may resort to formulas that allow them to compute or derive cost and revenue figures at the same level for comparison purposes.
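
One such formula, shown here only as an example, spreads a branch's cost across its customers in proportion to the revenue each customer generates. The text does not say which formula banks actually use, so the allocation rule, the function names, and the figures in the usage note are all assumptions made for illustration.

```python
def allocate_branch_cost(branch_cost, revenue_by_customer):
    """Split a branch-level cost across customers in proportion to their revenue."""
    total_revenue = sum(revenue_by_customer.values())
    if total_revenue == 0:
        raise ValueError("cannot allocate cost when total revenue is zero")
    return {
        customer: branch_cost * revenue / total_revenue
        for customer, revenue in revenue_by_customer.items()
    }


def customer_profit(branch_cost, revenue_by_customer):
    """Derive a per-customer profit figure from customer revenue and allocated cost."""
    allocated = allocate_branch_cost(branch_cost, revenue_by_customer)
    return {
        customer: revenue - allocated[customer]
        for customer, revenue in revenue_by_customer.items()
    }
```

For a hypothetical branch with a cost of 600 and three customers generating revenues of 1,000, 500, and 500, the allocated costs are 300, 150, and 150, giving derived profits of 700, 350, and 350. Other allocation bases (number of transactions, account balances) would produce different figures, which is why the chosen formula needs to be agreed on and documented.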
Marketing
Data warehousing supports marketing organizations by providing a comprehensive view of each customer and his or her many relationships with the enterprise. Over the years, marketing efforts have shifted in focus. Customers are no longer viewed as individual accounts but instead are viewed as individuals with multiple accounts. This change in perspective provides the enterprise with cross-selling opportunities.
The notion of customers as individuals also makes possible the segmentation and profiling of customers to improve target-marketing efforts. The availability of historical data makes it possible to identify trends in customer behavior, hopefully with positive results in revenue.
Operations
By providing enterprise management with decisional information, data warehouses
have the potential of greatly affecting the operations of an enterprise by highlighting both
problems and opportunities that heretofore went undetected.
Strategic or tactical decisions based on warehouse data will naturally affect the operations
of the enterprise. It is in this area that the greatest return on investment and, therefore,
greatest improvement can be found.
3.3 WHEN IS A DATA WAREHOUSE PROJECT JUSTIFIED?
As mentioned in Chapter 2, return on investment (ROI) from data warehousing projects
varies from organization to organization and is quite difficult to quantify prior to a
warehousing initiative.
However, a common list of problems encountered by enterprises can be identified as a
result of unintegrated customer data and lack of historical data. A properly deployed data
warehouse can solve the problems, as discussed below.
Lack of Information Sharing
• Divisions or departments have the same customers but do not share information
with each other.
• As a result, cross-selling opportunities are missed, and improved customer
understanding is lost. Customers are annoyed by requests for the same information
by different units within the same enterprise.
• Solving this problem results in the following benefits: better customer management
decisions can be made; customers are treated as individuals; new business opportunities can
be explored.
Different Groups Produce Conflicting Reports
• Data in different operational systems provide different information on the same
subject. The inconsistent use of terms results in different business rules for the
same item.
• Thus, facts and figures are inconsistently interpreted, and different versions of the
“truth” exist within the enterprise. Decision-makers have to rely on conflicting data
and may lose credibility with customers, suppliers, or partners.
• Solving this problem results in the following benefits: a consistent view of enterprise
operations becomes available; better informed decisions can be made.
Tedious Report Creation Process
• Critical reports take too long to produce. Data gathering is ad hoc,
inconsistent, and manually performed. There are no formal rules to govern the
creation of these reports.
• As a result, business decisions based on these reports may be bad decisions. Business
analysts within the organization spend more time collecting data than analyzing
it. Competitors with more sophisticated means of producing similar reports have
a considerable advantage.
• Solving this problem results in the following benefits: the report creation process
is dramatically streamlined, and the time required to produce the same reports is
significantly reduced. More time can be spent on analyzing the data, and
decision-makers do not have to work with “old data”.
Reports are not Dynamic, and do not Support an ad hoc Usage Style
• Managerial reports are not dynamic and often do not support the ability to drill
down for further detail.
• As a result, when a report highlights interesting or alarming figures, the
decision-maker is unable to zoom in and get more detail.
• When this problem is solved, decision-makers can obtain more detail as needed.
Analysis of trends and causal relationships becomes possible.
Reports that Require Historical Data are Difficult to Produce
• Customer, product, and financial histories are not stored in operational systems
data structures.
• As a result, decision-makers are unable to analyze trends over time. The enterprise
is unable to anticipate events and behave proactively or aggressively. Customer
demands come as a surprise, and the enterprise must scramble to react.
• When this problem is solved, decision-makers can increase or strengthen relationships
with current customers. Marketing campaigns can be predictive in nature, based on
historical data.
Apart from solving the problems above, other reasons commonly used to justify a data
warehouse initiative are the following:
Support of Enterprise Strategy
The data warehouse is a key supporting factor in the successful execution of one or
more parts of the enterprise’s strategy, including enhanced revenue or cost control initiatives.
Enterprise Emphasis on Customer and Product Profitability
Increase the focus and efficiency of the enterprise by gaining a better understanding of
its customers and products.
Perceived Need Outside the IT Group
Data warehousing is sought and supported by business users who demand integrated
data for decision-making. A true business need, not technological experimentation,
drives the initiative.
Integrated Data
The enterprise lacks a repository of integrated and historical data that are required for
decision-making.
Cost of Current Efforts
The current cost of producing standard, regular managerial reports is typically hidden
within an organization. A study of these costs can yield unexpected results that help justify
the data warehouse initiative.
The Competition does it
Just because competitors are going into data warehousing, it does not mean that an
enterprise should plunge headlong into it. However, the fact that the competition is applying
data warehousing technology should make any manager stop and see whether data
warehousing is something that his own organization needs.
3.4 WHAT EXPENSES ARE INVOLVED?
The costs associated with developing and implementing a data warehouse typically fall
into the categories described below:
Hardware
Warehousing hardware can easily account for up to 50 percent of the costs in a data
warehouse pilot project. A separate machine or server is often recommended for data
warehousing so as not to burden operational IT environments. The operational and decisional
environments may be connected via the enterprise’s network, especially if automated tools
have been scheduled to extract data from operational systems or if replication technology
is used to create copies of operational data.
Enterprises heavily dependent on mainframes for their operational systems can look to
powerful client/server platforms for their data warehouse solutions.
Hardware costs are generally higher at the start of the data warehousing initiative due
to the purchase of new hardware. However, data warehouses grow quickly, and subsequent
extensions to the warehouse may quickly require hardware upgrades.
A good way to cut down on the early hardware costs is to invest in server technology
that is highly scalable. As the warehouse grows both in volume and in scope, subsequent
investments in hardware can be made as needed.
Software
Software refers to all the tools required to create, set up, configure, populate, manage,
use, and maintain the data warehouse. The data warehousing tools currently available from
a variety of vendors span a staggering range of features and prices (Chapter 11 provides
an overview of these tools).
Each enterprise will be best served by a combination of tools, the choice of which is
determined or influenced not only by the features of the software but also by the current
computing environment of the operational system, as well as the intended computing
environment of the warehouse.
Services
Services from consultants or system integrators are often required to manage and
integrate the disparate components of the data warehouse. The use of system integrators,
in particular, is appealing to enterprises that prefer to use the “best of breed” of hardware
and software products and have no wish to assume the responsibility for integrating the
various components.
The use of consultants is also popular, particularly with early warehousing
implementations, when the enterprise is just learning about data warehousing technologies
and techniques.
Service-related costs can account for roughly 30 percent to 35 percent of the overall cost
of a pilot project but may drop as the enterprise decreases its dependence on external resources.
Internal Staff
Internal staff costs refer to costs incurred as a result of assigning enterprise staff to the
warehousing project. The staff could otherwise have been assigned to other activities.
The heaviest demands are on the time of the IT staff who have the task of planning,
designing, building, populating, and managing the warehouse. The participation of end
users, typically analysts and managers, is also crucial to a successful warehousing effort.
The Project Sponsor, the CIO, and the Project Manager will also be heavily involved
because of the nature of their roles in the warehousing initiative.
Summary of Typical Costs
The external costs of a typical data warehouse pilot project of three to six months can
range anywhere from US$ 0.8M to US$ 2M, depending on the combination of hardware,
software, and services required.
Table 3.1 provides an indicative breakdown of the external costs of a warehousing pilot
project where new hardware is purchased.
Table 3.1. Typical External Cost Breakdown for a Data Warehouse Pilot
(Amounts Expressed in US$)
Item        Min        Min. as % of Total    Max          Max. as % of Total
Hardware    400,000    49.26                 1,000,000    51.81
Software    132,000    16.26                 330,000      17.10
Services    280,000    34.48                 600,000      31.09
Totals      812,000                          1,930,000
Note that the costs listed above do not yet consider any infrastructure improvements
or upgrades (e.g., network cabling or upgrades) that may be required to properly integrate
the warehousing environment into the rest of the enterprise IT architecture.
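The percentage columns of Table 3.1 follow directly from the amounts. The short sketch below recomputes them so that the same breakdown can be redone with an enterprise's own estimates; the amounts are those of Table 3.1, while the variable and function names are merely illustrative.

```python
# Amounts from Table 3.1, in US$.
costs = {
    "Hardware": {"min": 400_000, "max": 1_000_000},
    "Software": {"min": 132_000, "max": 330_000},
    "Services": {"min": 280_000, "max": 600_000},
}


def shares(costs, column):
    """Return the column total and each item's share of it, in percent."""
    total = sum(item[column] for item in costs.values())
    percent = {
        name: round(100 * item[column] / total, 2)
        for name, item in costs.items()
    }
    return total, percent


min_total, min_pct = shares(costs, "min")
max_total, max_pct = shares(costs, "max")

for name in costs:
    print(f"{name:<10}{min_pct[name]:>7.2f}{max_pct[name]:>7.2f}")
```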
3.5 WHAT ARE THE RISKS?
The typical risks encountered on data warehousing projects fall into the following
categories:
• Organizational: These risks relate either to the project team structure and
composition or to the culture of the enterprise.
• Technological: These risks relate to the planning, selection, and use of warehousing
technologies. Technological risks also arise from the existing computing environment,
as well as the manner by which warehousing technologies are integrated into the
existing enterprise IT architecture.
• Project management: These risks are true of most technology projects but are
particularly dangerous in data warehousing because of the scale and scope of
warehousing projects.
• Data warehouse design: Data warehousing requires a new set of design techniques
that differ significantly from the well-accepted practices in OLTP system
development.
Organizational
Wrong Project Sponsor
The Project Sponsor must be a business executive, not an IT executive. Considering its
scope and scale, the warehousing initiative should be business driven; otherwise, the
organization will view the entire effort as a technology experiment.
A strong Project Sponsor is required to address and resolve organizational issues before
these have a chance to derail the project (e.g., lack of user participation, disagreements
regarding definition of data, political disputes). The Project Sponsor must be someone who
will be a user of the warehouse, someone who can publicly assume responsibility for the
warehousing initiative, and someone with sufficient clout.
This role cannot be delegated to a committee. Unfortunately, many an organization
chooses to establish a data warehouse steering committee to take on the collective
responsibility of this role. If such a committee is established, the head of the committee may
by default become the Project Sponsor.
End-User Community Not Involved
The end-user community provides the data warehouse implementation team with the
detailed business requirements. Unlike OLTP business requirements, which tend to be exact
and transaction based, data warehousing requirements are moving targets and are subject
to constant change.
Despite this, the intended warehouse end-users should be interviewed to provide an
understanding of the types of queries and reports (query profiles) they require. By talking
to the different users, the warehousing team also gains a better understanding of the IT
literacy of the users (user profiles) they will be serving and will better understand the types
of data access and retrieval tools that each user will be more likely to use. The end-user
community also provides the team with the security requirements (access profiles) of the
warehouse.
These business requirements are critical inputs to the design of the data warehouse.
Senior Management Expectations Not Managed
Because of the costs, data warehousing always requires a go-signal from senior
management, often obtained after a long, protracted ROI presentation.
In their bid to obtain senior management support, warehousing supporters must be
careful not to overstate the benefits of the data warehouse, particularly during requests for
budgets and business case presentations. Raising senior management expectations beyond
manageable levels is one sure way to court extremely embarrassing and highly visible disasters.
End-User Community Expectations not Managed
Apart from managing senior management expectations, the warehousing team must, in
the same manner, manage the expectations of their end users.
Warehouse analysts must bear in mind that the expectations of end users are immediately
raised when their requirements are first discussed. The warehousing team must constantly
manage these expectations by emphasizing the phased nature of data warehouse
implementation projects and by clearly identifying the intended scope of each data warehouse
rollout.
End users should also be reminded that the reports they will get from the warehouse
are heavily dependent on the availability and quality of the data in the enterprise’s operational
systems.
Political Issues
Attempts to produce integrated views of enterprise data are likely to raise political
issues. For example, different units have been known to wrestle for “ownership” of the
warehouse, especially in organizations where access to information is associated with political
power. In other enterprises, the various units want to have as little to do with warehousing
as possible, for fear of having warehousing costs allocated to their units.
Understandably, the unique combination of culture and politics within each enterprise
will exert its own positive and negative influences on the warehousing effort.
Logistical Overhead
A number of tasks in data warehousing require coordination with multiple parties, not
only within the enterprise, but with external suppliers and service providers as well. A
number of factors increase the logistical overhead in data warehousing. A few of these are:
• Formality. Highly formal organizations generally have higher logistical overhead
because of the need to comply with pre-established methods for getting things
done.
• Organizational hierarchies. Elaborate chains of command likewise may cause
delays or may require greater coordination efforts to achieve a given result.
• Geographical dispersion. Logistical delays also arise from geographical
distribution, as in the case of multi-branch banks, nationwide operations or
international corporations. Multiple, stand-alone applications with no centralized
data store have the same effect. Moving data from one location to another without
the benefit of a network or a transparent connection is difficult and will add to
logistical overhead.
Technological
Inappropriate Use of Warehousing Technology
A data warehouse is an inappropriate solution for enterprises that need operational
integration on a real-time, online basis. An ODS is the ideal solution to needs of this nature.
Multiple unrelated data marts are likewise not the appropriate architecture for meeting
enterprise decisional information needs. All data warehouse and data mart projects should
remain under a single architectural framework.
Poor Data Quality of Operational Systems
When the data quality of the operational systems is suspect, the team will, by necessity,
devote much of its time and effort to data scrubbing and data quality checking. Poor data
quality also adds to the difficulties of extracting, transforming, and loading data into the
warehouse.
The importance of data quality cannot be overstated. Warehouse end users will not
make use of the warehouse if the information they retrieve is wrong or of dubious quality.
The perception of lack of data quality, whether such a perception is true or not, is all that
is required to derail a data warehousing initiative.
Inappropriate End-user Tools
The wide range of end-user tools provides data warehouse users with different levels
of functionality and requires different levels of IT sophistication from the user community.
Providing senior management users with inappropriate tools is one of the quickest
ways to kill enthusiasm for the data warehouse effort. Likewise, power users will quickly
become disenchanted with simple data access and retrieval tools.
Over Dependence on Tools to Solve Data Warehousing Problems
The data warehouse solution should not be built around tools or sets of tools. Most of
the warehousing tools (e.g., extraction, transformation, migration, data quality, and metadata
tools) are far from mature at this point.
Unfortunately, enterprises are frequently on the receiving end of sales pitches that
promise to solve all the various problems (data quality/extraction/replication/loading) that
plague warehousing efforts through the selection of the right tool or, even, hardware platform.
What enterprises soon realize in their first warehousing project is that much of the
effort in a warehousing project still cannot be automated.
Manual Data Capture and Conversion Requirements
The extraction process is highly dependent on the extent to which data are available
in the appropriate electronic format. In cases where the required data simply do not exist
in any of the operational systems, a warehousing team may find itself resorting to the
strongly discouraged practice of using data capture screens to obtain data through manual
encoding operations. Unfortunately, a data warehouse quite simply cannot be filled up
through manual data entry.
Technical Architecture and Networking
Study and monitor the impact of the data warehouse development and usage on the
network infrastructure. Assumptions about batch windows, middleware, extract mechanisms,
etc., should be verified to avoid nasty surprises midway into the project.
Project Management
Defining Project Scope Inappropriately
The mantra for data warehousing should be “start small and build incrementally.”
Organizations that prefer the big-bang approach quickly find themselves on the path to
certain failure. Monolithic projects are unwieldy and difficult to manage, especially when the
warehousing team is new to the technology and techniques.
In contrast, the phased, iterative approach has consistently proven itself to be effective,
not only in data warehousing but also in most information technology initiatives. Each
phase has a manageable scope, requires a smaller team, and lends itself well to a coaching
and learning environment. The lessons learned by the team on each phase are a form of
direct feedback into subsequent phases.
Underestimating Project Time Frame
Estimates in data warehousing projects often fail to devote sufficient time to the
extraction, integration, and transformation tasks. Unfortunately, it is not unusual for this
area of the project to consume between 60 and 80 percent of a team’s time
and effort. Figure 3.1 illustrates the distribution of efforts.
The project team should therefore work on stabilizing the back-end of the warehouse as
quickly as possible. The front-end tools are useless if the warehouse itself is not yet ready for use.
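A quick sanity check of a project plan against the 60 to 80 percent figure can be sketched as follows. The function name and the 200 person-day total are hypothetical; only the percentage range comes from the text.

```python
def effort_split(total_person_days, back_end_pct=(60, 80)):
    """Split a total effort estimate into back-end and remaining effort ranges.

    back_end_pct is the share of effort consumed by extraction, integration,
    and transformation, expressed as (low, high) percentages.
    """
    low, high = back_end_pct
    back_end = (total_person_days * low / 100, total_person_days * high / 100)
    # Whatever the back-end does not consume is left for everything else.
    remainder = (total_person_days - back_end[1], total_person_days - back_end[0])
    return back_end, remainder


# Hypothetical pilot estimated at 200 person-days in total.
back_end, remainder = effort_split(200)
print(f"Back-end: {back_end[0]:.0f} to {back_end[1]:.0f} person-days")
print(f"Remainder: {remainder[0]:.0f} to {remainder[1]:.0f} person-days")
```

A plan that budgets markedly less than the lower bound for back-end work is a candidate for re-estimation.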
Underestimating Project Overhead
Time estimates in data warehousing projects often fail to consider delays due to logistics.
Keep an eye on the lead time for hardware delivery, especially if the machine is yet to be
imported into the city or country. Quickly determine the acquisition time for middleware or
warehousing tools. Watch out for logistical overhead.
Allocate sufficient time for team orientation and training prior to and during the course
of the project to ensure that everyone remains aligned. Devote sufficient time and effort to
creating and promoting effective communication within the team.
[Figure 3.1: effort distribution diagram. Top-down input: user requirements. Bottom-up
inputs: source systems and external data. Back-end (extraction, integration, QA, DW load,
aggregates, metadata): 60% to 80% of effort. Front-end (OLAP tool, canned reports, canned
queries): 20% to 40% of effort.]
Figure 3.1. Typical Effort Distribution on a Warehousing Project
Losing Focus
The data warehousing effort should be focused entirely on delivering the essential
minimal characteristics (EMCs) of each phase of the implementation. It is easy for the team
to be distracted by requests for nonessential or low-priority features (i.e., nice-to-have data
or functionality). These should be ruthlessly deferred to a later phase; otherwise, valuable
project time and effort will be frittered away on nonessential features, to the detriment of
the warehouse scope or schedule.
Not Looking Beyond the First Data Warehouse Rollout
A data warehouse needs to be strongly supported and nurtured (also known as “care
and feeding”) for at least a year after its initial launch. End users will need continuous
training and support, especially as new users are gradually granted access to the warehouse.
Collect warehouse usage and query statistics to get an idea of warehouse acceptance and to
obtain inputs for database optimization and tuning. Plan subsequent phases or rollouts of
the warehouse, taking into account the lessons learned from the first rollout. Allocate,
acquire, or train the appropriate resources for support activities.
Data Warehouse Design
Using OLTP Database Design Strategies for the Data Warehouse
Enterprises that venture into data warehousing for the first time may make the mistake
of applying OLTP database design techniques to their data warehouse. Unfortunately, data
warehousing requires design strategies that are very different from the design strategies for
transactional, operational systems.
For example, OLTP databases are fully normalized and are designed to consistently
store operational data, one transaction at a time. In direct contrast, a data warehouse
requires database designs that even business users find directly usable. Dimensional or star
schemas with highly denormalized dimension tables on relational technology require different
design techniques and different indexing strategies. Data warehousing may also require the
use of hypercubes or multidimensional database technology for certain functions and users.
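To make the contrast concrete, the sketch below builds a very small star schema in an in-memory SQLite database and runs a typical decisional query against it. The table names, column names, and data are invented for illustration, and SQLite merely stands in for whatever relational technology the warehouse actually uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Denormalized dimension tables: descriptive attributes sit side by side.
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, segment TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
-- The fact table holds the measures plus one key per dimension.
CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    date_key INTEGER REFERENCES dim_date(date_key),
    amount REAL
);
INSERT INTO dim_customer VALUES (1, 'Acme', 'Corporate'), (2, 'Lee', 'Retail');
INSERT INTO dim_date VALUES (20060101, 2006, 1), (20060201, 2006, 2);
INSERT INTO fact_sales VALUES
    (1, 20060101, 100.0), (2, 20060101, 40.0), (1, 20060201, 60.0);
""")

# A business question ("sales by segment and month") maps onto one join
# per dimension followed by a GROUP BY on the attributes of interest.
rows = conn.execute("""
    SELECT c.segment, d.month, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_customer c ON c.customer_key = f.customer_key
    JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY c.segment, d.month
    ORDER BY c.segment, d.month
""").fetchall()

for segment, month, total in rows:
    print(segment, month, total)
```

The query shape stays the same whichever dimension attributes the user chooses to group by, which is a large part of what makes such designs directly usable by business users.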
Choosing the Wrong Level of Granularity
The warehouse contains both atomic (extremely detailed) and summarized (high-level)
data. To get the most value out of the system, the most detailed data required by users
should be loaded into the data warehouse. The degree to which users can slice and dice
through the data warehouse is entirely dependent on the granularity of the facts. Too high
a grain makes detailed reports or queries impossible to produce. Too low a grain unnecessarily
increases the space requirements (and the cost) of the data warehouse.
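The asymmetry behind this trade-off can be shown in a few lines: atomic facts can always be rolled up to a coarser grain, but a summary cannot be broken back down. The fact layout and data below are made up for illustration.

```python
from collections import defaultdict

# Atomic grain: one row per day, customer, and product (made-up data).
atomic_facts = [
    ("2006-01-03", "C1", "Savings", 120.0),
    ("2006-01-03", "C2", "Loans", 80.0),
    ("2006-01-17", "C1", "Loans", 50.0),
    ("2006-02-02", "C2", "Savings", 30.0),
]


def roll_up(facts, key):
    """Summarize the amount (last field) by whatever key function is supplied."""
    totals = defaultdict(float)
    for fact in facts:
        totals[key(fact)] += fact[3]
    return dict(totals)


# From atomic data, any coarser grain can be derived on demand.
by_month = roll_up(atomic_facts, key=lambda f: f[0][:7])
by_month_product = roll_up(atomic_facts, key=lambda f: (f[0][:7], f[2]))

# Had only by_month been loaded, the per-product or per-customer figures
# could not be recovered from it; the detail is gone.
print(by_month)
print(by_month_product)
```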
Not Defining Strategies to Key Database Design Issues
The suitability of the warehouse design significantly impacts the size, performance,
integrity, future scalability, and adaptability of the warehouse. Outline (or high-level)
warehouse designs may overlook the demands of slowly changing dimensions, large
dimensions, and key generation requirements among others.
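As an illustration of why these issues deserve an explicit strategy, the sketch below handles a slowly changing customer dimension by keeping every version of a row and generating a new warehouse key for each version. This is one commonly used approach (often called a Type 2 change) and is an assumption of this example rather than a method prescribed by the text; the class and field names are hypothetical.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional


@dataclass
class CustomerRow:
    surrogate_key: int        # generated by the warehouse, not by the source
    customer_id: str          # key used by the operational system
    city: str
    valid_from: str
    valid_to: Optional[str] = None   # None marks the current version


class CustomerDimension:
    def __init__(self):
        self.rows = []
        self._next_key = count(1)    # simple key generator

    def current(self, customer_id):
        for row in self.rows:
            if row.customer_id == customer_id and row.valid_to is None:
                return row
        return None

    def apply_change(self, customer_id, city, as_of):
        """Close the current version, if it changed, and add a new one."""
        row = self.current(customer_id)
        if row is not None and row.city == city:
            return row               # nothing changed, keep the version
        if row is not None:
            row.valid_to = as_of     # history is preserved, not overwritten
        new_row = CustomerRow(next(self._next_key), customer_id, city, as_of)
        self.rows.append(new_row)
        return new_row


dim = CustomerDimension()
dim.apply_change("C1", "Delhi", "2005-01-01")
dim.apply_change("C1", "Mumbai", "2006-03-15")
```

Facts loaded before the move keep pointing at the first key and facts loaded afterwards point at the second, so historical reports remain correct. The price is a larger dimension and a key generation step in every load, which is exactly what an outline design tends to overlook.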
3.6 RISK-MITIGATING APPROACHES
The above risks are best addressed through the people and mechanisms described
below:
• The right project sponsor and project manager. Having the appropriate leaders
setting the tone, scope, and direction of a data warehousing initiative can spell the
difference between failure and success.
• Appropriate architecture. The enterprise must verify that a data warehouse is
the appropriate solution to its needs. If the need is for operational integration, then
an Operational Data Store is more appropriate.
• Phased approach. The entire data warehousing effort must be phased so that the
warehouse can be iteratively extended in a cost-justified and prioritized manner.
A number of prioritized areas should be delivered first; subsequent areas are
implemented in incremental steps. Work on non-urgent components is deferred.
• Cyclical refinement. Obtain feedback from users as each rollout or phase is
completed, and as users make use of the data warehouse and the front-end tools.
Any feedback should serve as inputs to subsequent rollouts. With each new rollout,
users are expected to specify additional requirements and gain a better
understanding of the types of queries that are now available to them.
• Evolutionary life cycle. Each phase of the project should be conducted in a
manner that promotes evolution, adaptability, and scalability. An overall data
warehouse architecture should be defined when a high-level understanding of user
needs has been obtained and the phased implementation path has
been studied.
• Completeness of data warehouse design. The data warehouse design must
address slowly changing dimensions, aggregation, key generation, heterogeneous
facts and dimensions, and minidimensions. These dimensional modeling concerns
are addressed in Chapter 12.
3.7 IS THE ORGANIZATION READY FOR A DATA WAREHOUSE?
Although there are no hard-and-fast rules for determining when your organization is
ready to launch a data warehouse initiative, the following positive signs are good clues:
Decision-Makers Feel the Need for Change
A successful data warehouse implementation will have a significant impact on the enterprise’s
decision-making processes, which in turn will have significant impact on the operations of the
enterprise. The performance measures and reward mechanisms are likely to change, and they
bring about corresponding changes to the processes and the culture of the organization.
Individuals who have an interest in preserving the status quo are likely to resist the
data warehousing initiative, once it becomes apparent that such technologies enable
organizational change.
Users Clamor for Integrated Decisional Data
A data warehouse is likely to get strong support from both the IT and user community
if there is a strong and unsatisfied demand for integrated decisional data (as opposed to
integrated operational data). It would be foolish to try using data warehousing technologies to
meet operational information needs.
IT professionals will benefit from a long-term, architected solution to users’ information
needs, and users will benefit from having information at their fingertips.
The Operational Systems are Fairly Stable
An IT department, division, or unit that continuously fights fires on unstable operational
systems will quickly deprioritize the data warehousing effort. Organizations will almost
always defer the warehousing effort in favor of operational concerns. After all, the enterprise
has survived without a data warehouse for years; another few months will not hurt.
When the operational systems are up in production and are fairly stable, there are internal
data sources for the warehouse and a data warehouse initiative will be given higher priority.
There is Adequate Funding
A data warehouse project cannot afford to fizzle out in the middle of the effort due to
a shortage of funds. Be aware of long-term funding requirements beyond the first data
warehouse rollout before starting on the pilot project.
3.8 HOW ARE THE RESULTS MEASURED?
Data warehousing results come in different forms and can, therefore, be measured in
one or more of the following ways:
New Reports/Queries Support
Results are seen clearly in the new reports and queries that are now readily available
but would have been difficult to obtain without the data warehouse.
The extent to which these reports and queries actually contribute to more informed
decisions and the translation of these informed decisions to bottom line benefits may not be
easy to trace, however.
Turnaround Time
Results are also evident in the shorter time that it takes to obtain information on
the subjects covered by the warehouse. Senior managers can also get the information they
need directly, thus improving the security and confidentiality of such information.
Turnaround time for decision-making is dramatically reduced. In the past,
decision-makers in meetings either had to make an uninformed decision or table a discussion item
because they lacked information. The ability of the data warehouse to quickly provide
needed information speeds up the decision-making process.
Timely Alerts and Exception Reporting
The data warehouse proves its worth each time it sounds an alert or highlights an
exception in enterprise operations. Early detection makes it possible to avert or correct
potentially major problems and allows decision-makers to exploit business situations with
small or immediate windows of opportunity.
Number of Active Users
The number of active users provides a concrete measure for the usage and acceptance
of the warehouse.
Frequency of Use
The number of times a user actually logs on to the data warehouse within a given time
period (e.g., weekly) shows how often the warehouse is used by any given user. Frequent
usage is a strong indication of warehouse acceptance and usability. An increase in usage
indicates that users are asking questions more frequently. Tracking the time of day when
the data warehouse is frequently used will also indicate peak usage hours.
Session Times
The length of time a user spends each time he or she logs on to the data warehouse shows
how much the data warehouse contributes to that user's job.
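The three usage measures above (active users, frequency of use, and session times) can be derived from a simple logon record. The following is a minimal sketch, assuming a hypothetical session log of (user, logon time, logoff time) entries; the names SESSIONS and usage_metrics and the sample data are illustrative only and do not come from any particular warehouse product.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical session log: (user, logon time, logoff time).
SESSIONS = [
    ("ana", datetime(2006, 3, 6, 9, 0), datetime(2006, 3, 6, 9, 30)),
    ("ana", datetime(2006, 3, 7, 14, 0), datetime(2006, 3, 7, 14, 10)),
    ("raj", datetime(2006, 3, 6, 9, 15), datetime(2006, 3, 6, 10, 15)),
]


def usage_metrics(sessions):
    """Summarize active users, logons per user, mean session length, and peak logon hour."""
    logons = defaultdict(int)      # number of logons per user
    minutes = defaultdict(float)   # total minutes spent per user
    by_hour = defaultdict(int)     # number of logons per hour of day
    for user, start, end in sessions:
        logons[user] += 1
        minutes[user] += (end - start).total_seconds() / 60
        by_hour[start.hour] += 1
    return {
        "active_users": len(logons),
        "logons_per_user": dict(logons),
        "mean_session_minutes": {u: minutes[u] / logons[u] for u in logons},
        "peak_hour": max(by_hour, key=by_hour.get) if by_hour else None,
    }


metrics = usage_metrics(SESSIONS)
```

In practice the log would cover a fixed period (for example, one week) so that the figures can be compared from one period to the next.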
Query Profiles
The number and types of queries users make provide an idea how sophisticated the
users have become. As the queries become more sophisticated, users will most likely request
additional functionality or increased data scope.
This metric also provides the Warehouse Database Administrator (DBA) with valuable
insight as to the types of stored aggregates or summaries that can further optimize query
performance. It also indicates which tables in the warehouse are frequently accessed.
Conversely, it also allows the warehouse DBA to identify tables that are hardly used and
therefore are candidates for purging.
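As an illustration of how a query profile can point out frequently and rarely accessed tables, the sketch below counts table accesses in a query log. It assumes a hypothetical log in which each query is recorded as the list of tables it read; the table names, QUERY_LOG, and the rare_threshold parameter are assumptions made for the example.

```python
from collections import Counter

# Hypothetical inputs: the tables in the warehouse and, per logged query, the tables it read.
WAREHOUSE_TABLES = ["sales_fact", "customer_dim", "product_dim", "promo_dim"]
QUERY_LOG = [
    ["sales_fact", "customer_dim"],
    ["sales_fact", "product_dim"],
    ["sales_fact", "customer_dim", "product_dim"],
]


def table_access_profile(tables, query_log, rare_threshold=1):
    """Count how many queries touched each table and flag tables below the threshold."""
    # set(query) ensures a table is counted once per query, even if listed twice.
    counts = Counter(table for query in query_log for table in set(query))
    return {
        "access_counts": {t: counts[t] for t in tables},
        "rarely_used": [t for t in tables if counts[t] < rare_threshold],
    }


profile = table_access_profile(WAREHOUSE_TABLES, QUERY_LOG)
```

Tables that appear in the rarely_used list are only candidates for purging; whether they are actually purged remains a decision for the warehouse DBA.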
Change Requests
An analysis of users' change requests can provide insight into how well users are
applying the data warehouse technology. Unlike most IT projects, a high number of data
warehouse change requests is a good sign; it implies that users are discovering more and
more how warehousing can contribute to their jobs.
Business Changes
The immediate results of data warehousing are fairly easy to quantify. However, true
warehousing ROI comes from business changes and decisions that have been made possible
by information obtained from the warehouse. These, unfortunately, are not as easy to quantify
and measure.
In Summary
The importance of the Project Sponsor in a data warehousing initiative cannot be
overstated. The project sponsor is the highest-level business representative of the warehousing
team and therefore must be a visionary, respected, and decisive leader.
At the end of the day, the Project Sponsor is responsible for the success of the data
warehousing initiative within the enterprise.
CHAPTER 4
THE CIO
The Chief Information Officer (CIO) is responsible for the effective deployment of
information technology resources and staff to meet the strategic, decisional, and operational
information requirements of the enterprise.
Data warehousing, with its accompanying array of new technologies and its dependence
on operational systems, naturally makes strong demands on the technical and human
resources under the jurisdiction of the CIO.
For this reason, it is natural for the CIO to be strongly involved in any data warehousing
effort. This chapter attempts to answer the typical questions of CIOs who participate in data
warehousing initiatives.
4.1 HOW IS THE DATA WAREHOUSE SUPPORTED?
After the data warehouse goes into production, different support services are required
to ensure that the implementation is not derailed. These support services fall into the
categories described below.
Regular Warehouse Load
The data warehouse needs to be constantly loaded with additional data. The amount of
work required to load data into the warehouse on a regular basis depends on the extent to
which the extraction, transformation, and loading processes have been automated, as well
as the load frequency required by the warehouse.
The frequency of the load depends on the user requirements, as determined during the
data warehouse design activity. The most frequent load possible with a data warehouse is
once a day, although it is not unusual to find organizations that load their warehouses once
a week, or even once a month.
The regular loading activities fall under the responsibilities of the warehouse support
team, who almost invariably report directly or indirectly to the CIO.
Applications
After the data warehouse and related data marts have been deployed, the IT department
or division may turn its attention to the development and deployment of executive systems
or custom applications that run directly against the data warehouse or the data marts.
These applications are developed or targeted to meet the needs of specific user groups.
Any in-house application development will likely be handled by internal IT staff;
otherwise, such projects should be outsourced under the watchful eye of the CIO.
Warehouse DB Optimization
Apart from the day-to-day database administration support of production systems, the
warehouse DBA must also collect and monitor new sets of query statistics with each rollout
or phase of the data warehouse.
The data structure of the warehouse is then refined or optimized on the basis of these
usage statistics, particularly in the area of stored aggregates and table indexing strategies.
User Assistance or Help Desk
As with any information system in the enterprise, a user assistance desk or help desk
can provide users with general information, assistance, and support. An analysis of the help
requests received by the help desk provides insight on possible subjects for follow-on training
with end users.
In addition, the help desk is an ideal site for publicizing the status of the system after
every successful load.
Training
Provide more training as more end users gain access to the data warehouse. Apart from
covering the standard capabilities, applications, and tools that are available to the users, the
warehouse training should also clearly convey what data are available in the warehouse.
Advanced training topics may be appropriate for more advanced users. Specialized
work groups or one-on-one training may be appropriate as follow-on training, depending on
the type of questions and help requests that the help desk receives.
Preparation for Subsequent Rollouts
All internal preparatory work for subsequent rollouts must be performed while support
activities for prior rollouts are underway. This activity may create resource contention and
therefore should be carefully managed.
4.2 HOW DOES THE DATA WAREHOUSE EVOLVE?
One of the toughest decisions any data warehouse planner has to make is to decide
when to evolve the system with new data and when to wait for the user base, IT organization,
and business to catch up with the latest release of the warehouse.
Warehouse evolution is not only a technical and management issue; it is also a political
issue. The IT organization must continually either:
• Market or sell the warehouse for continued funding and support of existing
capabilities; or
• Attempt to control the demand for new capabilities.
Each new extension of the data warehouse results in increased complexity in terms of
data scope, data management, and warehouse optimization. In addition, the rollouts of the
warehouse are likely to be in different stages and therefore have different support needs.
For example, an enterprise may find itself busy with the planning and design of the
third phase of the warehouse, while deployment and training activities are underway for the
second phase, and help desk support is available for the first phase. The CIO will undoubtedly
face the unwelcome task of making critical decisions regarding resource assignments.
In general, data warehouse evolution takes place in one or more of the following areas:
Data
Evolution in this area typically results in an increase in scope (although a decrease is
not impossible). The extraction subsystem will require modification in cases where the
source systems are modified or new operational systems are deployed.
Users
New users will be given access to the data warehouse, or existing users will be trained
on advanced features. This implies new or additional training requirements, the definition
of new users and access profiles, and the collection of new usage statistics. New security
measures may also be required.
IT Organization
New skill sets are required to build, manage, and support the data warehouse. New
types of support activities will be needed.
Business
Changes in the business result in changes in the operations, monitoring needs, and
performance measures used by the organization. The business requirements that drive the
data warehouse change as the business changes.
Application Functionality
New functionality can be added to existing OLAP tools, or new tools can be deployed
to meet end-user needs.
4.3 WHO SHOULD BE INVOLVED IN A DATA WAREHOUSE PROJECT?
Every data warehouse project has a team of people with diverse skills and roles. The
involvement of internal staff during the warehouse development is critical to the warehouse
maintenance and support tasks once the data warehouse is in production. Not all the roles
in a data warehouse project can be outsourced to third parties; internal enterprise staff
should fill the following roles:
• Steering Committee
• User Reference Group
• Warehouse Driver
• Warehouse Project Manager
• Business Analysts
• Warehouse Data Architect
• Metadata Administrator
• Warehouse DBA
• Source System DBA and System Administrator
• Project Sponsor (see Chapter 3)
Below is a list of typical roles in a data warehouse project. Note that the same person
may play more than one role.
Steering Committee
The steering committee is composed of high-level executives representing each major
type of user requiring access to the data warehouse. The project sponsor is a member of the
committee. In most cases, the sponsor heads the committee. The steering committee should
already be formed by the time data warehouse implementation starts; however, the existence
of a steering committee is not a prerequisite for data warehouse planning. During
implementation, the steering committee receives regular status reports from the project
team and intervenes to redirect project efforts whenever appropriate.
User Reference Group
Representatives from the user community (typically middle-level managers and analysts)
provide critical inputs to data warehousing projects by specifying detailed data requirements,
business rules, predefined queries, and report layouts. User representatives also test the
outputs of the data warehousing effort.
It is not unusual for end-user representatives to spend up to 80 percent of their time
on the project, particularly during the requirements analysis and data warehouse design
activities. Toward the end of a rollout, up to 80 percent of the representatives' time may be
required again for testing the usability and correctness of warehouse data.
End users also participate in regular meetings or interviews with the warehousing
team throughout the life of each rollout (up to 50 percent involvement).
Warehouse Driver
The warehouse driver reports to the steering committee, ensures that the project is
moving in the right direction, and is responsible for meeting project deadlines.
The warehouse driver is a business manager, but one who is responsible for
defining the data warehouse strategy (with the assistance of the warehouse project manager)
and for planning and managing the data warehouse implementation from the business side
of operations.
The warehouse driver also communicates the warehouse objectives to other areas of the
enterprise. This individual normally serves as the coordinator in cases where the
implementation team has cross-functional team members. It is therefore not unusual for the
warehouse driver to be the liaison to the user reference group.
Warehouse Project Manager
The project manager is usually an individual who is very well versed in technology and
in managing technology projects. This person's technical skills strongly complement the
business acumen of the warehouse driver.
The project manager normally reports to the warehouse driver and jointly defines the
data warehouse strategy. It is not unusual, however, to find organizations where the warehouse
driver and project manager jointly manage the project. In such cases, the project manager
is actually a technical manager.
The project manager is responsible for implementing the project plans and acts as
coordinator on the technology side of the project, particularly when the project involves
several vendors. The warehouse project manager keeps the warehouse driver updated on
the technical aspects of the project but isolates the warehouse driver from the technical
details.
Business Analyst(s)
The analysts act as liaisons between the user reference group and the more technical
members of the project team. Through interviews with members of the user reference group,
the analysts identify, document, and model the current business requirements and usage
scenarios.
Analysts play a critical role in managing end-user expectations, since most of the
contact between the user reference group and the warehousing team takes place through
the analysts. Analysts represent the interests of the end users in the project and therefore
have the responsibility of ensuring that the resulting implementation will meet end-user
needs.
Warehouse Data Architect
The warehouse data architect develops and maintains the warehouse's enterprise-wide
view of the data. This individual analyzes the information requirements specified by the
user community and designs the data structures of the data warehouse accordingly.
The workload of the architect is heavier at the start of each rollout, when most of the
design decisions are made. The workload tapers off as the rollout gets underway.
The warehouse data architect has an increasingly critical role as the warehouse evolves.
Each successive rollout that extends the warehouse must respect an overall integrating
architecture, and the responsibility for the integrating architecture falls squarely on the
warehouse data architect. Data mart deployments that are fed by the warehouse should
likewise be considered part of the architecture to avoid the data administration problems
created by multiple, unrelated data marts.
Metadata Administrator
The metadata administrator defines metadata standards and manages the metadata
repository of the warehouse. The workload of the metadata administrator is quite high both
at the start and toward the end of each warehouse rollout. Workload is high at the start
primarily due to metadata definition and setup work. Workload toward the end of a rollout
increases as the schema, the aggregate strategy, and the metadata repository contents are
finalized.
Metadata plays an important role in data warehousing projects and is therefore discussed
separately in Chapter 13.
Warehouse DBA
The warehouse database administrator works closely with the warehouse data architect.
The workload of the warehouse DBA is typically heavy throughout a data warehouse project.
Much of this individual's time will be devoted to setting up the warehouse schema at the
start of each rollout. As the rollout gets underway, the warehouse DBA takes on the
responsibility of loading the data, monitoring the performance of the warehouse, refining
the initial schema, and creating dummy data for testing the decision support front-end tools.
Toward the end of the rollout, the warehouse DBA will be busy with database optimization
tasks as well as aggregate table creation and population.
As expected, the warehouse DBA and the metadata administrator work closely together.
The warehouse DBA is responsible for creating and populating metadata tables within the
warehouse in compliance with the standards that have been defined by the metadata
administrator.
Source System Database Administrators (DBAs) and System Administrators (SAs)
These IT professionals play extremely critical roles in the data warehousing effort.
Among their typical responsibilities are:
• Identify best extraction mechanisms. Given their familiarity with the current
computing environment, source system DBAs and SAs are often asked to identify
the data transfer and extraction mechanisms best suited for their respective
operational systems.
• Contribute to source-to-target field mapping. These individuals are familiar
with the data structures of the operational systems and are therefore the most
qualified to contribute to or finalize the mapping of source system fields to warehouse
fields.
• Data quality assessment. In the course of their day-to-day operations, the DBAs
and SAs encounter data quality problems and are therefore in a position to highlight
areas that require special attention during data cleansing and transformation.
Depending on the status of the operational systems, these individuals may spend
the majority of their time on the above activities during the course of a rollout.
Conversion and Extraction Programmer(s)
The programmers write the extraction and conversion programs that pull data from the
operational databases. They also write programs that integrate, convert, and summarize the
data into the format required by the data warehouse. Their primary resource persons for the
extraction programs will be the source system DBAs and SAs.
If data extraction, transformation, and transportation tools are used, these individuals
are responsible for setting up and configuring the selected tools and ensuring that the
correct data records are retrieved for loading into the warehouse.
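To make the idea of a conversion program driven by a source-to-target field mapping more concrete, the following is a minimal sketch. The source field names (CUST_NO, CUST_NM, SALE_AMT), the warehouse field names, and the convert_record function are all hypothetical and chosen only for illustration; a real program would take its mapping from the work done with the source system DBAs and SAs.

```python
# Hypothetical source-to-target mapping: source field -> (warehouse field, converter).
FIELD_MAP = {
    "CUST_NO": ("customer_id", int),
    "CUST_NM": ("customer_name", str.strip),
    "SALE_AMT": ("sale_amount", float),
}


def convert_record(source_row, field_map=FIELD_MAP):
    """Convert one extracted source record; return (warehouse_row, errors)."""
    target, errors = {}, []
    for src_field, (dst_field, convert) in field_map.items():
        if src_field not in source_row:
            errors.append(f"missing source field {src_field}")
            continue
        try:
            target[dst_field] = convert(source_row[src_field])
        except (TypeError, ValueError) as exc:
            # Record the problem instead of aborting, so data quality issues can be reviewed.
            errors.append(f"{src_field}: {exc}")
    return target, errors


row, errors = convert_record(
    {"CUST_NO": "1042", "CUST_NM": "  Acme Ltd ", "SALE_AMT": "199.50"}
)
```

Collecting conversion errors per record, rather than stopping at the first one, is one possible design; it gives the team a list of data quality problems to pass back to the source system DBAs and SAs.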
Technical and Network Architect
The technical and network architect ensures that the technical architecture defined for
the data warehouse rollout is suitable for meeting the stated requirements. This individual
also ensures that the technical and network architecture of the data warehouse is consistent
with the existing enterprise infrastructure.
The network architect coordinates with the project team on the extensions to the
enterprise's network infrastructure required to support the data warehouse and constantly
monitors the warehouse's impact on network capacity and throughput.
Trainer
The trainer develops all required training materials and conducts the training courses
for the data warehousing project. The warehouse project team will require some data
warehousing training, particularly during early or pilot projects. Toward the end of each
rollout, end users of the data warehouse will also require training on the warehouse contents
and on the tools that will be used for analysis and reporting.
4.4 WHAT IS THE TEAM STRUCTURE LIKE?
Figure 4.1 illustrates a typical project team structure for a data warehouse project. Note
that there are many other viable alternative team structures. Also, unless the team is quite
large and involves many contracted parties, a formal team structure may not even be necessary.

[Organization chart. Box labels: Project Sponsor; Warehouse Driver; Project Architecture Mgt.;
Technical and Network; Metadata Administrator; Warehouse data architecture; Business Analyst;
User Representatives; Trainees; Warehouse DBA; Conversion and extract; Source System DBA
and System Admin.]
Figure 4.1. Typical Project Team Structure for Development
The team structure will evolve once the project has been completed. Day-to-day
maintenance and support of the data warehouse will call for a different organizational
structure, sometimes one that is more permanent.
Resource contention will arise when a new rollout is underway and resources are
required for both warehouse development and support.
4.5 WHAT NEW SKILLS WILL PEOPLE NEED?
IT professionals and end users will both require new but different skill sets, as described
below:
IT Professionals
Data warehousing places new demands on the IT professionals of an enterprise. New
skill sets are required, particularly in the following areas:
• New database design skills. Traditional database design principles do not work
well with a data warehouse. Dimensional modeling concepts break many of the
OLTP design rules. Also, the large size of warehouse tables requires database
optimization and indexing techniques that are appropriate for very large database
(VLDB) implementations.
• Technical capabilities. New technical skills are required, especially in enterprises
where new hardware or software is purchased (e.g., new hardware platform, new
RDBMS, etc.). System architecture skills are required for warehouse evolution, and
networking management and design skills are required to ensure the availability
of network bandwidth for warehousing requirements.
• Familiarity with tools. In many cases, data warehousing works better when tools
are purchased and integrated into one solution. IT professionals must become
familiar with the various warehousing tools that are available and must be able to
separate the wheat from the chaff. IT professionals must also learn to use, and
learn to work around, the limitations of the tools they select.
• Knowledge of the business. A thorough understanding of the business and of how
the business will utilize data is critical in a data warehouse effort. IT professionals
cannot afford to focus on technology only. Business analysts, in particular, have to
understand the business well enough to properly represent the interests of end
users. Business terms have to be standardized, and the corresponding data items
in the operational systems have to be found or derived.
• End-user support. Although IT professionals have constantly provided end-user
support to the rest of the enterprise, data warehousing puts the IT professional in
direct contact with a special kind of end user: senior management. Their successful
day-to-day use of the data warehouse (and the resulting success of the warehousing
effort) depends greatly on the end-user support that they receive.
The IT professional's focus changes from meeting operational user requirements to
helping users satisfy their own information needs.
End Users
Gone are the days when end users had to wait for the IT department to release printouts
or reports or to respond to requests for information. End users can now directly access the
data warehouse and can tap it for required information, looking up data themselves.
This advance assumes that end users have acquired the following skills:
• Desktop computing. End users must be able to use OLAP tools (under a graphical
user interface environment) to gain direct access to the warehouse. Without desktop
computing skills, end users will always rely on other parties to obtain the information
they require.
• Business knowledge. The answers that the data warehouse provides are only as good
as the questions that it receives. End users will not be able to formulate the correct
questions or queries without a sufficient understanding of their own business environment.
• Data knowledge. End users must understand the data that is available in the
warehouse and must be able to relate the warehouse data to their knowledge of the business.
Apart from the above skills, data warehousing is more likely to succeed if end users are
willing to make the warehouse an integral part of the management and decision-making
process of the organization. The warehouse support team must help end users overcome a
natural tendency to revert to "business as usual" after the warehouse is rolled out.
4.6 HOW DOES DATA WAREHOUSING FIT INTO IT ARCHITECTURE?
As discussed in Chapter 1, the data warehouse is an entirely separate architectural
component, distinct from the operational systems. Each time a new architectural component
is added or introduced, the enterprise architect must consciously study its impact on the rest
of the IT architecture and ensure that
• The IT architecture does not become brittle as a result of this new component; and
• The new architectural components are isolated from the obsolescence of legacy
applications.
A data warehouse places new demands on the technical infrastructure of the enterprise.
The following factors determine the technical environment required.
• User requirements. The user requirements largely determine the scope of the
warehouse, i.e., the requirements are the basis for identifying the source systems
and the required data items.
• Location of the source systems and the warehouse. If the source systems and
the data warehouse are not in the same location, the extraction of data from
operational systems into the warehouse may present difficulties with regard to
logistics or network communications. In actual practice, the initial extraction is
rarely 100 percent correct; some fine-tuning will be required because of errors in
the source-to-target field mapping, misunderstood requirements, or changes in
requirements. Easy, immediate access to both the source systems and the warehouse
makes it easier to modify and correct the data extraction, transformation, and loading
processes. The availability of easy access to both types of computing environments
depends on the current technical architecture of the enterprise.
• Number and location of warehouse users. The number of users that may
access the warehouse concurrently implies a certain level of network traffic. The
location of each user will also be a factor in determining how users will be granted
access to the warehouse data. For example, if the warehouse users are dispersed
over several remote locations, the enterprise may decide to use secure connections
through the public internet infrastructure to deliver the warehouse data.
• Existing enterprise IT architecture. The existing enterprise IT architecture
defines or sets the limits on what is technically feasible and practical for the data
warehouse team.
• Budget allocated to the data warehousing effort. The budget for the
warehousing effort determines how much can be done to upgrade or improve the
current technical infrastructure in preparation for the data warehouse.
It is always prudent to first study and plan the technical architecture (as part of
defining the data warehouse strategy) before the start of any warehouse implementation
project.
4.7 HOW MANY VENDORS DOES ONE NEED TO TALK TO?
A warehousing project, like any IT project, will require a combination of hardware,
software, and services, which may not all be available from one vendor. Some enterprises
choose to isolate themselves from the vendor selection and liaison process by hiring a
systems integrator, who subcontracts work to other vendors. Other enterprises prefer to
deal with each vendor directly, and therefore assume the responsibility of integrating the
various tools and services they acquire.
Vendor Categories
Although some vendors have products or services that allow them to fit in more than
one of the vendor categories below, most if not all vendors are particularly strong in only
one of them:
• Hardware or operating system vendors. Data warehouses require powerful
server platforms to store the data and to make these data available to multiple
users. All the major hardware vendors offer computing environments that can be
used for data warehousing.
• Middleware/data extraction and transformation tool vendors. These vendors
provide software products that facilitate or automate the extraction, transportation,
and transformation of operational data into the format required for the data
warehouse.
• RDBMS vendors. These vendors provide the relational database management
systems that are capable of storing up to terabytes of data for warehousing purposes.
These vendors have been introducing more and more features (e.g., advanced
indexing features) that support VLDB implementations.
• Consultancy and integration services supplier. These vendors provide services
either by taking on the responsibility of integrating all components of the
warehousing solution on behalf of the enterprise, by offering technical assistance
on specific areas of expertise, or by accepting outsourcing work for the data
warehouse development or maintenance.
• Front-end/OLAP/decision support/data access and retrieval tool vendors.
These vendors offer products that access, retrieve, and present warehouse data in
meaningful and attractive formats. Data mining tools, which actively search for
previously unrecognized patterns in the data, also fall into this category.
Enterprise Options
The number of vendors that an enterprise will work with depends on the approach the
enterprise wishes to take. There are three main alternatives for building a data warehouse. An
enterprise can:
• Build its own. The enterprise can build the data warehouse, using a custom
architecture. A "best of breed" policy is applied to the selection of warehouse
components and vendors. The data warehouse team accepts responsibility for
integrating all the distinct selected products from multiple vendors.
• Use a framework. Nearly all data warehousing vendors present a warehousing
framework to influence and guide the data warehouse market. Most of these
frameworks are similar in scope and substance, with differences greatly influenced
by the vendor's core technology or product. Vendors have also opportunistically
established alliances with one another and are offering their product combinations
as the closest thing to an "off the shelf" warehousing solution.
• Use an anchor supplier (hardware, software, or service vendor). An enterprise
may also select a supplier for a product or service as its key or anchor vendor. The
anchor supplier's products or services are then used to influence the selection of
other warehousing products and services.
4.8 WHAT SHOULD BE LOOKED FOR IN A DATA WAREHOUSE VENDOR?
The following sections provide evaluation criteria for the different components that
make up a data warehouse solution. Different weighting should be applied to each criterion,
depending on its importance to the organization.
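The weighting idea above can be sketched as a simple weighted average. The following is only an illustration: the criterion names, the 1-to-5 rating scale, the weights, and the two vendors are all hypothetical and would be set by each organization.

```python
def weighted_score(ratings, weights):
    """Return the weighted average of criterion ratings.

    Both arguments are dicts keyed by criterion name (names are hypothetical).
    """
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"no rating for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(ratings[name] * weight for name, weight in weights.items()) / total_weight


# Assumed weights: a higher number means the criterion matters more to this organization.
weights = {"scalability": 5, "front_end_independence": 3, "prior_investments": 2}

# Assumed ratings on a 1 (poor) to 5 (excellent) scale for two made-up vendors.
vendor_a = {"scalability": 4, "front_end_independence": 2, "prior_investments": 5}
vendor_b = {"scalability": 3, "front_end_independence": 5, "prior_investments": 4}

score_a = weighted_score(vendor_a, weights)
score_b = weighted_score(vendor_b, weights)
print(f"Vendor A: {score_a:.2f}, Vendor B: {score_b:.2f}")
```

Changing the weights changes the ranking, which is the point of the exercise: the same vendor ratings can lead to different choices in organizations with different priorities.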
Solution Framework
The following evaluation criteria can be applied to the overall warehousing solution:
• Relational data warehouse. The data warehouse resides on a relational DBMS.
(Multidimensional databases are not an appropriate platform for an enterprise
data warehouse, although they may be used for data marts with power-user
requirements.)
• Scalability. The warehouse solution can scale up in terms of disk space, processing
power, and warehouse design as the warehouse scope increases. This scalability is
particularly important if the warehouse is expected to grow at a rapid rate.
• Front-end independence. The design of the data warehouse is independent of
any particular front-end tool. This independence leaves the warehouse team free to
mix and match different front-end tools according to the needs and skills of
warehouse users. Enterprises can easily add more sophisticated front-ends (such as
data mining tools) at a later time.
• Architectural integrity. The proposed solution does not make the overall system
architecture of the enterprise brittle; rather, it contributes to the long-term resiliency
of the IT architecture.
• Preservation of prior investments. The solution leverages as much as possible
prior software and hardware investments and existing skill sets within the
organization.
Project and Integration Consultancy Services
The following evaluation criteria can be applied to consultants and system integrators:
• Star join schema. Warehouse designers use a dimensional modeling approach
based on a star join schema. This form of modeling results in database designs that
are navigable by business users and are resilient to evolving business requirements.
• Source data audit. A thorough physical and logical data audit is performed prior
to implementation, to identify data quality issues within source systems and propose
remedial steps. Source system quality issues are the number-one cause of data
warehouse project delays. The data audit also serves as a reality check to determine
if the required data are available in the operational systems of the enterprise.
• Decisional requirements analysis. A thorough decisional requirements
analysis is performed with the appropriate end-user representatives prior to
implementation to identify detailed decisional requirements and their priorities.
This analysis must serve as the basis for key warehouse design decisions.
• Methodology. The consultant team specializes in data warehousing and uses a
data warehousing methodology based on the current state of the art. Avoid
consultants who apply unsuitable OLTP methodologies to the development of data
warehouses.
• Appropriate fact record granularity. The fact records are stored at the lowest
granularity necessary to meet current decisional requirements without precluding
likely future requirements. The wrong choice of grain can dramatically reduce the
usefulness of the warehouse by limiting the degree to which users can slice and
dice through data.
• Operational data store. The consultant team is capable of implementing an operational
data store layer beneath the data warehouse if one is necessary for operational
integrity. The consultant team is cognizant of the key differences between operational
data store and warehousing design issues, and it designs the solution accordingly.
• Knowledge transfer. The consultant team views knowledge transfer as a key
component of a data warehousing initiative. The project environment encourages
coaching and learning for both IT staff and end users. Business users are encouraged
to share their in-depth understanding and knowledge of the business with the rest
of the warehousing team.
• Incremental rollouts. The overall project approach is driven by risks and rewards,
with clearly defined phases (rollouts) that incrementally extend the scope of the
warehouse. The scope of each rollout is strictly managed to prevent schedule slippage.
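To make the star join schema and fact granularity criteria above more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and sales figures are invented for illustration. The facts are stored at daily grain, so they can be rolled up to months; had they been stored at monthly grain, the daily view could not be recovered.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, day_date TEXT, month_name TEXT);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
    -- Fact rows are stored at daily grain: one row per product per day.
    CREATE TABLE fact_sales (date_key INTEGER, product_key INTEGER, amount REAL);
""")
conn.executemany(
    "INSERT INTO dim_date VALUES (?, ?, ?)",
    [(1, "2006-01-30", "2006-01"), (2, "2006-01-31", "2006-01"), (3, "2006-02-01", "2006-02")],
)
conn.executemany("INSERT INTO dim_product VALUES (?, ?)", [(1, "widget"), (2, "gadget")])
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?)",
    [(1, 1, 10.0), (2, 1, 15.0), (2, 2, 7.0), (3, 1, 20.0)],
)

# Only these two levels of the date dimension exist in this sketch.
LEVEL_COLUMNS = {"day": "day_date", "month": "month_name"}


def sales_by(level):
    """Join the fact table to the date dimension and total sales at the given level."""
    column = LEVEL_COLUMNS[level]  # raises KeyError for an unknown level
    rows = conn.execute(
        f"SELECT d.{column}, SUM(f.amount) "
        "FROM fact_sales f JOIN dim_date d ON d.date_key = f.date_key "
        f"GROUP BY d.{column} ORDER BY d.{column}"
    ).fetchall()
    return dict(rows)


daily = sales_by("day")      # the detailed view, available because the grain is daily
monthly = sales_by("month")  # the rolled-up view, derived from the same fact rows
print(daily)
print(monthly)
```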
Front-End/OLAP/Decision Support/Data Access and Retrieval Tools
The following evaluation criteria can be applied to front-end/OLAP/decision support/
data access and retrieval tools:
• Multidimensional views. The tool supports pivoting, drill-up and drill-down and
displays query results as spreadsheets, graphs, and charts.
• Usability. The tool works under a GUI environment and has features that make
it user-friendly (e.g., the ability to open and run an existing report with one or two
clicks of the mouse).
• Star schema aware. This criterion applies only to relational OLAP tools. The tool
recognizes a star schema database and takes advantage of the schema design.
• Tool sophistication. The tool is appropriate for the intended user. Not all users
are comfortable with desktop computing, and relational OLAP tools can meet most
user requirements. Multidimensional databases are highly sophisticated and are
more appropriate for power users.
• Delivery lead time. The product vendor can deliver the product within the required
time frame.
• Planned functionality for future releases. Since this area of data warehousing
technologies is constantly evolving and maturing, it is helpful to know the
enhancements or features that the tool will eventually have in its future releases.
The planned functionality should be consistent with the above evaluation criteria.
Middleware/Data Extraction and Transformation Tools
The following evaluation criteria can be applied to middleware and extraction and
transformation tools:
• Price/Performance. The product performs well in a price/performance/maintenance
comparison with other vendors of similar products.
• Extraction and transformation steps supported. The tool supports or automates
one or more of the basic steps in extracting and transforming data. These steps are
reading source data, transporting source data, remapping keys, creating load images,
generating or creating stored aggregates, logging load exceptions, generating indexes,
quality assurance checking, alert generation, and backup and recovery.
• Delivery lead time. The product vendor can deliver the product within the required
time frame.
Most tools in this category are very expensive. Seriously consider writing in-house
versions of these tools as an alternative, especially if source and target environments are
homogeneous.
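As a rough idea of what an in-house version of a few of these steps might look like, the sketch below remaps source keys to warehouse keys, builds a load image, and sets aside rows that cannot be mapped as load exceptions. The field names, the key map, and the sample rows are hypothetical; a real tool would also cover the remaining steps listed above, such as aggregates, indexes, and recovery.

```python
def build_load_image(source_rows, key_map):
    """Remap source keys to warehouse keys; collect unmappable rows as load exceptions."""
    load_image = []
    exceptions = []
    for row in source_rows:
        warehouse_key = key_map.get(row["customer_id"])
        if warehouse_key is None:
            exceptions.append(row)  # no warehouse key exists for this source key
            continue
        load_image.append({"customer_key": warehouse_key, "amount": row["amount"]})
    return load_image, exceptions


# Made-up rows as they might be read from a source system.
source_rows = [
    {"customer_id": "C-001", "amount": 120.0},
    {"customer_id": "C-999", "amount": 45.0},  # unknown customer, becomes an exception
    {"customer_id": "C-002", "amount": 80.0},
]
# Assumed mapping from source system keys to warehouse keys.
key_map = {"C-001": 1, "C-002": 2}

load_image, exceptions = build_load_image(source_rows, key_map)
print(f"{len(load_image)} rows ready to load, {len(exceptions)} logged as exceptions")
```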
Relational Database Management Systems
The following evaluation criteria can be applied to an RDBMS:
• Preservation of prior investments. The warehouse solution leverages as much as
possible prior software and hardware investments and existing skill sets within the
organization. However, data warehousing does require additional database
management techniques because of the size and scale of the database.
• Financial stability. The product vendor has proven to be a strong and visible
player in the relational database market, and its financial performance indicates
growth or stability.
• Data warehousing features. The product has or will have features that support data
warehousing requirements (e.g., bit-mapped indexes for large tables, aggregate navigation).
• Star schema aware. The product’s query optimizer recognizes the star schema
and optimizes the query accordingly. Note that most query optimizers strongly
support only OLTP-type queries. Unfortunately, although these optimizers are
appropriate for transactional environments, they may actually slow down the
performance of decisional queries.
• Warehouse metadata. The product supports the use of warehouse metadata for aggregate
navigation, query statistics collection, etc.
• Price/Performance. The product performs well in a price/performance comparison
with other vendors of similar products.
Hardware or Operating System Platforms
The following evaluation criteria can be applied to hardware and operating system
platforms:
• Scalability. The warehouse solution can scale up in terms of space and processing
power. This scalability is particularly important if the warehouse is projected to
grow at a rapid rate.
• Financial stability. The product vendor has proven to be a strong and visible
player in the hardware segment, and its financial performance indicates growth or
stability.
• Price/Performance. The product performs well in a price/performance comparison
with other vendors of similar products.
• Delivery lead time. The product vendor can deliver the hardware or an equivalent
service unit within the required time frame. If the unit is not readily available
within the same country, there may be delays due to importation logistics.
• Reference sites. The hardware vendor has a reference site that is using a similar
unit for the same purpose. The warehousing team can either arrange a site visit
or interview representatives from the reference site. Alternatively, an onsite test of the
unit can be conducted, especially if no reference is available.
• Availability of support. Support for the hardware and its operating system is
available, and support response times are within the acceptable down time for the
warehouse.
4.9 HOW DOES DATA WAREHOUSING AFFECT EXISTING SYSTEMS?
Existing operational systems are the source of internal warehouse data. Extractions
can take place only during the batch windows of the operational systems, typically after
office hours. If batch windows are sufficiently large, warehouse-related activities will have
little or no disruptive effects on normal, day-to-day operations.
Improvement Areas in Operational Systems
Data warehousing, however, highlights areas where improvements
can be made to operational systems, particularly in two areas:
• Missing data items. Decisional information needs often require the collection of
data that are currently outside the scope of the existing systems. If possible, the
existing systems are extended to support the collection of such data. The team will
have to study alternatives to data collection if the operational systems cannot be
modified (for example, if the operational system is an application package whose
warranties will be void if modifications are made).
• Insufficient data quality. The data warehouse efforts may also identify areas
where the data quality of the operational systems can be improved. This is especially
true for data items that are used to uniquely identify customers, such as social
security numbers.
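A simple check of the kind implied above is to look for identifier values that are missing or that are shared by more than one record. The sketch below assumes records are available as dictionaries with an "id" field; the customer names and identifier values are placeholders, not real data.

```python
from collections import defaultdict


def audit_identifier(records, field):
    """Return (ids of records missing the field, values shared by several records)."""
    missing = []
    seen = defaultdict(list)
    for record in records:
        value = (record.get(field) or "").strip()
        if value:
            seen[value].append(record["id"])
        else:
            missing.append(record["id"])
    duplicates = {value: ids for value, ids in seen.items() if len(ids) > 1}
    return missing, duplicates


# Made-up customer records; the identifier values are placeholders.
customers = [
    {"id": 1, "name": "A. Cruz", "ssn": "000-00-0001"},
    {"id": 2, "name": "Ana Cruz", "ssn": "000-00-0001"},
    {"id": 3, "name": "B. Reyes", "ssn": ""},
    {"id": 4, "name": "C. Santos", "ssn": "000-00-0002"},
]

missing, duplicates = audit_identifier(customers, "ssn")
print("Missing identifier:", missing)
print("Shared identifier:", duplicates)
```

Findings like these are the kind of constructive feedback the warehouse team can pass back to the owners of the operational systems.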
The data warehouse implementation team should continuously provide constructive
feedback regarding the operational systems. Easy improvements can be quickly implemented,
and improvements that require significant effort and resources can be prioritized during IT
planning.
By ensuring that each rollout of a data warehouse phase is always accompanied by a
review of the existing systems, the warehousing team can provide valuable inputs to plans
for enhancing operational systems.
4.10 DATA WAREHOUSING AND ITS IMPACT ON OTHER ENTERPRISE INITIATIVES
By its enterprise-wide nature, a data warehousing initiative will naturally have an
impact on other enterprise initiatives, two of which are discussed below.
How Does Data Warehousing Tie in with BPR?
Data warehousing refers to the gamut of activities that support the decisional information
requirements of the enterprise. BPR (business process reengineering) is “the radical redesign of strategic and value-added
processes and the systems, policies, and organizational structures that support them to
optimize the work flows and productivity in an organization.”
Most BPR projects have focused on the optimization of operational business processes.
Data warehousing, on the other hand, focuses on optimizing the decisional (or decision-making)
processes within the enterprise. It can be said that data warehousing is the technology
enabler for reengineering decisional processes.
The ready availability of integrated data for corporate decision-making also has
implications for the organizational structure of the enterprise. Most organizations are
structured or designed to collect, summarize, report, and direct the status of operations (i.e.,
there is an operational monitoring purpose). The availability of integrated data at different
levels of detail may encourage a flattening of the organization structure.
Data warehouses also provide the enterprise with the measures for gauging competitive
standing. The use of the warehouse leads to insights as to what drives the enterprise. These
insights may quickly lead to business process reengineering initiatives in the operational
areas.
How Does Data Warehousing Tie in with Intranets?
The term intranet refers to the use of internet technologies for internal corporate
networks. Intranets have been touted as cost-effective, client/server solutions to enterprise
computing needs. Intranets are also popular due to the universal, easy-to-learn, easy-to-use
front-end, i.e., the web browser.
A data warehouse with a web-enabled front-end therefore provides enterprises with
interesting options for intranet-based solutions.
With the introduction of technologies that enable secure connections over the public
internet infrastructure, enterprises now also have a cost-effective way of distributing
warehouse data to users in multiple locations.
4.11 WHEN IS A DATA WAREHOUSE NOT APPROPRIATE?
Not all organizations are ready for a data warehousing initiative. Below are two instances
when a data warehouse is simply inappropriate.
When the Operational Systems are not Ready
The data warehouse is populated with information primarily from the operational systems
of the enterprise. A good indicator of operational system readiness is the amount of IT effort
focused on operational systems.
A number of telltale signs indicate a lack of readiness. These include the following:
• Many new operational systems are planned for development or are in the
process of being deployed. Much of the enterprise’s IT resources will be assigned
to this effort and will therefore not be available for data warehousing projects.
• Many of the operational systems are legacy applications that require much
fire fighting. The source systems are brittle or unstable and are candidates for replacement.
IT resources are also directed at fighting operational system fires.
• Many of the operational systems require major enhancements and must be
overhauled. If the operational systems require major enhancements, chances are
that these systems do not sufficiently support the day-to-day operations of the
enterprise. Again, IT resources will be directed to enhancement or replacement
efforts. Furthermore, deficient operational systems always fail to capture all the
data required to meet the decisional information needs of business managers.
Regardless of the reason for a lack of operational system readiness, the bottom line
is simple: an enterprise-wide data warehouse is out of the question due to the lack
of adequate source systems. However, this does not preclude a phased data
warehousing initiative, as illustrated in Figure 4.2.
[Diagram labels: Data Warehouse (Rollout 1 to Rollout 5); Existing Systems (System 1 to System 5, ..., System N)]
Figure 4.2. Data Warehouse Rollout Strategy
The enterprise may opt for an interleaved deployment of systems. A series of projects
can be conducted, where a project to deploy an operational system is followed by a project
that extends the scope of the data warehouse to encompass the newly stabilized operational
system.
The main focus of the majority of IT staff remains on deploying the operational systems.
As each operational system stabilizes, however, a data warehouse scope extension project
is initiated. This project extends the data warehouse scope with data from each new operational
system.
This approach may create unrealistic end-user expectations, particularly during
earlier rollouts. The scope and strategy should therefore be communicated clearly and
consistently to all users. Most, if not all, business users will understand that enterprise-wide
views of data are not possible while most of the operational systems are not feeding the
warehouse.
When Operational Integration is Needed
Despite its ability to provide integrated data for decisional information needs, a data
warehouse does not in any way contribute to meeting the operational information needs of
the enterprise. Data warehouses are refreshed at best on a daily basis. They do not integrate
data quickly enough or often enough for operational management purposes.
If the enterprise needs operational integration, then the typical data warehouse
deployment (as shown in Figure 4.3) is insufficient.
[Diagram labels: Legacy System 1, Legacy System 2, ..., Legacy System N; Data Warehouse; Alert System, Exception Reporting, Data Mining, EIS/DSS, Report Writers, OLAP]
Figure 4.3. Traditional Data Warehouse Architecture
Instead, the enterprise needs an operational data store and its accompanying front-end
applications. As mentioned in Chapter 1, flash monitoring and reporting tools are often
likened to a dashboard that is constantly refreshed to provide operational management with
the latest information about enterprise operations. Figure 4.4 illustrates the operational
data store architecture.
When the intended users of the system are operational managers and when the
requirements are for an integrated view of constantly refreshed operational data, an
operational data store is the appropriate solution.
Enterprises that have implemented operational data stores will find it natural to use
the operational data store as one of the primary source systems for their data warehouse.
Thus, the data warehouse contains a series (i.e., layer upon layer) of ODS snapshots, where
each layer corresponds to data as of a specific point in time.
[Diagram labels: Source System 1, Source System 2, ..., Source System N; Alert System, Exception Reporting, Data Mining, Report Writers, OLAP, EIS/DSS]
Figure 4.4. The Data Warehouse and the Operational Data Store
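The idea of a warehouse built from layers of ODS snapshots, described above, can be sketched as follows. This is an illustration only: the in-memory list standing in for the warehouse, the account names, and the balances are all hypothetical.

```python
from datetime import date

# The "warehouse" here is just a list of (as_of_date, account, balance) rows.
warehouse = []


def load_snapshot(as_of, ods_rows):
    """Append one ODS snapshot as a new layer, stamped with its as-of date."""
    for account, balance in ods_rows.items():
        warehouse.append((as_of, account, balance))


def balance_as_of(account, as_of):
    """Return the balance from the latest snapshot taken on or before as_of."""
    layers = [(snap_date, balance) for snap_date, acct, balance in warehouse
              if acct == account and snap_date <= as_of]
    if not layers:
        return None
    return max(layers, key=lambda layer: layer[0])[1]


# Two made-up daily snapshots of the current state held in the ODS.
load_snapshot(date(2006, 1, 1), {"ACC-1": 100.0, "ACC-2": 250.0})
load_snapshot(date(2006, 1, 2), {"ACC-1": 80.0, "ACC-2": 250.0})

print(balance_as_of("ACC-1", date(2006, 1, 1)))
print(balance_as_of("ACC-1", date(2006, 1, 5)))
```

The ODS itself holds only the current state; it is the accumulation of dated layers that gives the warehouse its history.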
4.12 HOW TO MANAGE OR CONTROL A DATA WAREHOUSE INITIATIVE?
There are several ways to manage or control a data warehouse project. Most of the
techniques described below are useful in any technology project.
Milestones
Clearly defined milestones provide project management and the project sponsor with
regular checkpoints to track the progress of the data warehouse development effort. Milestones
should be far enough apart to show real progress, but not so far apart that senior management
becomes uneasy or loses focus and commitment. In general, one data warehouse rollout
should be treated as one project, lasting from three to six months.
Incremental Rollouts, Incremental Investments
Avoid biting off more than you can chew; projects that are gigantic leaps forward are
more likely to fail. Instead, break up the data warehouse initiative into incremental rollouts.
By doing so, you give the warehouse team manageable but ambitious targets and clearly
defined deliverables.
Applying a phased approach also has the added benefit of allowing the project sponsor
and the warehousing team to set priorities and manage end-user expectations. The benefits
of each rollout can be measured separately, and the data warehouse is justified on a
phase-per-phase basis.
A phased approach, however, requires an overall architecture so that each phase also lays
the foundation for subsequent warehousing efforts and earlier investments remain intact.
Clearly Defined Rollout Scopes
To the maximum extent possible, clearly define the scope of each rollout to set the
expectations of both senior management and warehouse end-users. Each rollout should
deliver useful functionality. As in most development projects, the project manager will be
walking the fine line between increasing the scope to better meet user needs and ruthlessly
controlling the scope to meet the rollout deadline.
Individually Cost-Justified Rollouts
The scope of each rollout determines the corresponding rollout cost. Each rollout should
be cost-justified on its own merits to ensure appropriate return on investment. However,
this practice should not preclude long-term architectural investments that do not have an
immediate return in the same rollout.
Plan to have Early Successes
Data warehousing is a long-term effort that must have early and continuous successes
that justify the length of the journey. Focus early efforts on areas that can deliver highly
visible success, and that success will increase organizational support.
Plan to be Scalable
Initial successes with the data warehouse will result in a sudden demand for increased
data scope, increased functionality, or both! The warehousing environment and design must
both be scalable to deal with increased demand as needed.
Reward your Team
Data warehousing is hard work, and teams need to know their work is appreciated.
A motivated team is always an asset in long-term initiatives.
In Summary
The Chief Information Officer (CIO) has the unenviable task of juggling the limited IT
resources of the enterprise. He or she makes the resource assignment decisions that determine
the skill sets of the various IT project teams.
Unfortunately, data warehousing is just one of the many projects on the CIO’s plate.
If the enterprise is still in the process of deploying operational systems, data warehousing
will naturally be at a lower priority.
CIOs also have the difficult responsibility of evolving the enterprise’s IT architecture.
They must ensure that the addition of each new system, and the extension of each existing
system, contributes to the stability and resiliency of the overall IT architecture.
Fortunately, data warehouse and operational data store technologies allow CIOs to
migrate reporting and analytical functionality from legacy or operational environments,
thereby creating a more robust and stable computing environment for the enterprise.
CHAPTER 5
THE PROJECT MANAGER
The warehouse Project Manager is responsible for any and all technical activities related
to planning, designing, and building a data warehouse. Under ideal circumstances, this role
is fulfilled by internal IT staff. It is not unusual, however, for this role to be outsourced,
especially for early or pilot projects, because warehousing technologies and techniques are
so new.
5.1 HOW TO ROLL OUT A DATA WAREHOUSE INITIATIVE?
To start a data warehouse initiative, there are three main things to be kept in mind.
Always start with a planning activity. Always implement a pilot project as your “proof of
concept.” And always extend the functionality of the warehouse in an iterative manner.
Start with a Data Warehouse Planning Activity
The scope of a data warehouse varies from one enterprise to another. The desired scope
and scale are typically determined by the information requirements that drive the warehouse
design and development. These requirements, in turn, are driven by the business context
of the enterprise: the industry, the fierceness of competition, and the state of the art in
industry practices.
Regardless of the industry, however, it is advisable to start a data warehouse initiative
with a short planning activity. The Project Manager should launch and manage the activities
listed below.
Decisional Requirements Analysis
Start with an analysis of the decision support needs of the organization. The warehousing
team must understand the user requirements and attempt to map these to the data sources
available. The team also designs potential queries or reports that can meet the stated
information requirements.
Unlike system development projects for OLTP applications, the information needs of
decisional users cannot be pinned down and are frequently changing. The requirements
analysis team should therefore gain enough of an understanding of the business to be able
to anticipate likely changes to end-user requirements.
Decisional Source System Audit
Conduct an audit of all potential sources of data. This crucial and very detailed task
verifies that data sources exist to meet the decisional information needs identified during
requirements analysis. There is no point in designing a warehouse schema that cannot be
populated because of a lack of source data.
Similarly, there is no point in designing reports or queries when data are not available
to generate them. Log all data items that are currently not supported or provided by the
operational systems and submit these to the CIO as inputs for IT planning.
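The core of such an audit, matching required data items against what each source system actually provides and logging the gaps, can be sketched as below. The item names and the source catalog are hypothetical; in practice the catalog would come from inspecting the operational systems themselves.

```python
def audit_sources(required_items, source_catalog):
    """Split required data items into those some source provides and those none does."""
    available = {}
    unsupported = []
    for item in required_items:
        providers = sorted(name for name, fields in source_catalog.items() if item in fields)
        if providers:
            available[item] = providers
        else:
            unsupported.append(item)  # to be logged and submitted for IT planning
    return available, unsupported


# Data items identified during requirements analysis (made-up names).
required_items = ["customer_id", "order_amount", "sales_region", "competitor_price"]

# Assumed catalog: which fields each operational system can supply.
source_catalog = {
    "order_entry": {"customer_id", "order_amount"},
    "crm": {"customer_id", "sales_region"},
}

available, unsupported = audit_sources(required_items, source_catalog)
print("Available:", available)
print("Not supported by any source:", unsupported)
```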
Logical and Physical Warehouse Schema Design (Preliminary)
The results of requirements analysis and source system audit serve as inputs to the
design of the warehouse schema. The schema details all fact and dimension tables and
fields, as well as the data sources for each warehouse field. The preliminary schema produced
as part of the warehousing planning activity will be progressively refined with each rollout
of the data warehouse.
The goal of the team is to design a data structure that will be resilient enough to meet
the constantly changing information requirements of warehouse end-users.
Other Concerns
The three tasks described above should also provide the warehousing team with an
understanding of:
• The required warehouse architecture;
• The appropriate phasing and rollout strategy; and
• The ideal scope for a pilot implementation.
The data warehouse plan must also evaluate the need for an ODS layer between the
operational systems and the data warehouse.
Additional information on the above activities is available in Part III, Process.
Implement a Proof-of-Concept Pilot
Start with a pilot implementation as the first rollout for data warehousing. Pilot projects
have the advantage of being small and manageable, thereby providing the organization with
a data warehouse “proof of concept” that has a good chance of success.
Determine the functional scope of a pilot implementation based on two factors:
• The degree of risk the enterprise is willing to take. The project difficulty
increases as the number of source systems, users, and locations increases. Politically
sensitive areas of the enterprise are also very high risk.
• The potential for leveraging the pilot project. Avoid constructing a “throwaway”
prototype for the pilot project. The pilot warehouse must have actual value to the
enterprise. Figure 5.1 is a matrix for assessing the pilot project.
Avoid high-risk projects with very low reward possibilities. Ideally, the pilot project has
low or manageable risk factors but has a highly visible impact on the way decisions are
made in the enterprise. An early and high-profile success will increase the grassroots
support of the warehousing initiative.

             Low Reward    High Reward
High Risk    AVOID         CANDIDATE
Low Risk     CANDIDATE     IDEAL
Figure 5.1. Selecting a Pilot Project: Risks vs Reward
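The risk and reward matrix of Figure 5.1 can be expressed as a small lookup. The quadrant labels come from the figure; the function name and the candidate projects listed are hypothetical, and in practice the "low" or "high" judgments would come from the planning activity.

```python
# Quadrant labels taken from Figure 5.1, keyed by (risk, reward).
QUADRANTS = {
    ("high", "low"): "AVOID",
    ("high", "high"): "CANDIDATE",
    ("low", "low"): "CANDIDATE",
    ("low", "high"): "IDEAL",
}


def assess_pilot(risk, reward):
    """Return the Figure 5.1 quadrant for a pilot rated 'low' or 'high' on each axis."""
    return QUADRANTS[(risk, reward)]


# Made-up pilot candidates with assumed risk and reward ratings.
candidates = {
    "sales analysis": ("low", "high"),
    "enterprise-wide profitability": ("high", "high"),
    "archive reporting": ("high", "low"),
}

assessments = {name: assess_pilot(risk, reward) for name, (risk, reward) in candidates.items()}
for name, verdict in assessments.items():
    print(f"{name}: {verdict}")
```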
Extend Functionality Iteratively
Once the warehouse pilot is in place and is stable, implement subsequent rollouts of the
data warehouse to continuously layer new functionality or extend existing warehousing
functionality on a cost-justifiable, prioritized basis, illustrated by the diagram in Figure 5.2.
[Diagram labels: TOP-DOWN (User Requirements); BACK-END (Extraction, Integration, QA, DW Load, Aggregates, Metadata); FRONT-END (OLAP Tool, Canned Reports, Canned Queries); BOTTOM-UP (Source Systems, External Data)]
Figure 5.2. Iterative Extension of Functionality, i.e., Evolution
Top-down
Drive all rollouts by a top-down study of user requirements. Decisional requirements
are subject to constant change; the team will never be able to fully document and understand
the requirements, simply because the requirements change as the business situation changes.
Don’t fall into the trap of wanting to analyze everything to extreme detail (i.e., analysis
paralysis).
Bottom-up
While some team members are working top-down, other team members are working
bottom-up. The results of the bottom-up study serve as the reality check for the rollout;
some of the top-down requirements will quickly become unrealistic, given the state and
contents of the intended source systems. End users should be quickly informed of limitations
imposed by source system data to properly manage their expectations.
Back-end
Each rollout or iteration extends the back-end (i.e., the server component) of the data
warehouse. Warehouse subsystems are created or extended to support a larger scope of
data. Warehouse data structures are extended to support a larger scope of data. Aggregate
records are computed and loaded. Metadata records are populated as required.
Front-end
The front-end (i.e., client component) of the warehouse is also extended by deploying
the existing data access and retrieval tools to more users and by deploying new tools (e.g.,
data mining tools, new decision support applications) to warehouse users. The availability
of more data implies that new reports and new queries can be defined.
5.2 HOW IMPORTANT IS THE HARDWARE PLATFORM?
Since many data warehouse implementations are developed into already existing
environments, many organizations tend to leverage the existing platforms and skill base to
build a data warehouse. This section looks at the hardware platform selection from an
architectural viewpoint: what platform is best to build a successful data warehouse from the
ground up.
An important consideration when choosing a data warehouse server is its capacity for
handling the volumes of data required by decision support applications, some of which may
require a significant amount of historical (e.g., up to 10 years) data. This capacity
requirement can be quite large. For example, in general, disk storage allocated for the
warehouse should be 2 to 3 times the size of the data component of the warehouse to
accommodate DSS processing, such as sorting, storing of intermediate results, summarization,
joins, and formatting. Often, the platform choice is between a mainframe and a non-MVS
(UNIX or Windows NT) server.
Of course, a number of arguments can be made for and against each of these choices.
For example, a mainframe is based on a proven technology; has large data and throughput
capacity; is reliable, available, and serviceable; and may support the legacy databases that
are used as sources for the data warehouse. The data warehouse residing on the mainframe
is best suited for situations in which large amounts of legacy data need to be stored in the
data warehouse. A mainframe system, however, is not as open and flexible as contemporary
client/server systems, and is not optimized for ad hoc query processing. A modern server
(non-mainframe) can also support large data volumes and a large number of flexible GUI-based
end-user tools, and can relieve the mainframe from ad hoc query processing. However,
in general, non-MVS servers are not as reliable as mainframes, are more difficult to manage
and integrate into the existing environment, and may require new skills and even new
organizational structures.
From the architectural viewpoint, however, the data warehouse server has to be
specialized for the tasks associated with the data warehouse, and a mainframe can be well
suited to be a data warehouse server. Let’s look at the hardware features that make a
server, whether it is mainframe, UNIX, or NT-based, an appropriate technical solution for
the data warehouse.
To begin with, the data warehouse server has to be able to support large data
volumes and complex query processing. In addition, it has to be scalable, since the data
warehouse is never finished: new user requirements, new data sources, and more
historical data are continuously incorporated into the warehouse, and the user
population of the data warehouse continues to grow. Therefore, a clear requirement for
the data warehouse server is scalable high performance for data loading and ad hoc
query processing, as well as the ability to support large databases in a reliable, efficient
fashion. Chapter 4 briefly touched on various design points to enable server specialization
for scalability in performance, throughput, user support, and very large database (VLDB)
processing.
Balanced Approach
An important design point when selecting a scalable computing platform is the right
balance between all computing components, for example, between the number of processors
in a multiprocessor system and the I/O bandwidth. Remember that a lack of balance in
a system inevitably results in a bottleneck!
Typically, when a hardware platform is sized to accommodate the data warehouse,
this sizing is frequently focused on the number and size of disks. A typical disk configuration
includes 2.5 to 3 times the amount of raw data. An important consideration: disk throughput
comes from the actual number of disks, not the total disk space. Thus, the number
of disks has a direct impact on data parallelism. To balance the system, it is very important
to allocate a correct number of processors to efficiently handle all disk I/O operations. If
this allocation is not balanced, an expensive data warehouse platform can rapidly become
CPU-bound. Indeed, since various processors have widely different performance ratings
and thus can support a different number of disks per CPU, data warehouse designers
should carefully analyze the disk I/O rates and processor capabilities to derive an efficient
system configuration. For example, if it takes a CPU rated at 10 SPECint to efficiently handle
one 3-Gbyte disk drive, then a single 30 SPECint processor in a multiprocessor system can
handle three disk drives. Knowing how much data needs to be processed should give you
an idea of how big the multiprocessor system should be. Another consideration is related
to disk controllers. A disk controller can support a certain amount of data throughput
(e.g., 20 Mbytes/s). Knowing the per-disk throughput ratio and the total number of disks
can tell you how many controllers of a given type should be configured in the system.
The idea of a balanced approach can (and should) be carefully extended to all system
components. The resulting system configuration will easily handle known workloads and
provide a balanced and scalable computing platform for future growth.
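
The sizing arithmetic described above can be sketched in a few lines of Python. This is only a rough illustration, not a vendor sizing method: the function name, its parameters, the 100-Gbyte raw data volume, and the 5 Mbytes/s per-disk throughput are assumptions made for the example, while the 2.5 space factor, the 10-rating-points-per-disk ratio, the 30-rated processor, and the 20 Mbytes/s controller come from the text.

```python
import math

def size_platform(raw_data_gb, disk_gb, disk_mb_per_s, rating_per_disk,
                  cpu_rating, controller_mb_per_s, space_factor=2.5):
    """Rough count of disks, CPUs, and controllers for a balanced server."""
    # Disk space: raw data times the working-space factor (2.5 to 3 in the text).
    total_gb = raw_data_gb * space_factor
    disks = math.ceil(total_gb / disk_gb)
    # CPUs: one CPU drives cpu_rating / rating_per_disk disks (at least one).
    disks_per_cpu = max(1, cpu_rating // rating_per_disk)
    cpus = math.ceil(disks / disks_per_cpu)
    # Controllers: aggregate disk throughput over per-controller throughput.
    controllers = math.ceil(disks * disk_mb_per_s / controller_mb_per_s)
    return {"disks": disks, "cpus": cpus, "controllers": controllers}

# Hypothetical workload: 100 Gbytes of raw data on 3-Gbyte drives.
plan = size_platform(raw_data_gb=100, disk_gb=3, disk_mb_per_s=5,
                     rating_per_disk=10, cpu_rating=30, controller_mb_per_s=20)
print(plan)  # {'disks': 84, 'cpus': 28, 'controllers': 21}
```

Changing any one input (for example, using larger disks) shifts the other two counts, which is the point of the balanced approach: fewer, larger disks reduce the CPU and controller counts but also reduce the available disk parallelism.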
Optimal Hardware Architecture for Parallel Query Scalability
An important consideration when selecting a hardware platform for a data warehouse
is that of scalability. Therefore, a frequent approach to system selection is to take advantage
of hardware parallelism that comes in the form of shared-memory symmetric multiprocessors
(SMPs), massively parallel processors (MPPs), and clusters, explained in Chapter 10. The
scalability of these systems can be seriously affected by system-architecture-induced
data skew. This architecture-induced data skew is more severe in the low-density asymmetric
connection architectures (e.g., daisy-chained, 2-D and 3-D mesh), and is virtually nonexistent
in symmetric connection architectures (e.g., cross-bar switch). Thus, when selecting a hardware
platform for a data warehouse, take into account the fact that system-architecture-induced
data skew can overpower even the best data layout for parallel query execution, and
can force an expensive parallel computing system to process queries serially.
The choice between SMP and MPP is influenced by a number of factors, including the
complexity of the query environment, the price/performance ratio, the proven processing
capacity of the hardware platform with the target RDBMS, the anticipated warehouse
applications, and the foreseen increases in warehouse size and users.
For example, complex queries that involve multiple table joins might realize better
performance with an MPP configuration. MPPs, though, are generally more expensive.
Clustered SMPs may provide a highly scalable implementation with better price/performance
benefits.
5.3 WHAT ARE THE TECHNOLOGIES INVOLVED?
Several types of technologies are used to make data warehousing possible. These
technology types are enumerated briefly below. More information is available in Part 4,
Technology.
• Source systems. The operational systems of the enterprise are the most likely
source systems for a data warehouse. The warehouse may also make use of external
data sources from third parties.
• Middleware, extraction, transportation and transformation technologies.
These tools extract and reorganize data from the various source systems. These
tools vary greatly in terms of complexity, features, and price. The ideal tools for the
enterprise are heavily dependent on the computing environment of the source
systems and the intended computing environment of the data warehouse.
• Data quality tools. These tools identify or correct data quality errors that exist
in the raw source data. Most tools of this type are used to call the warehouse team’s
attention to potential quality problems. Unfortunately, much of the data cleansing
process is still manual; it is also tedious due to the volume of data involved.
• Warehouse storage. Database management systems (DBMS) are used to store
the warehouse data. DBMS products are generally classified as relational (e.g.,
Oracle, Informix, Sybase) or multidimensional (e.g., Essbase, BrioQuery, Express
Server).
• Metadata management. These tools create, store, and manage the warehouse
metadata.
• Data access and retrieval tools. These are tools used by warehouse end users
to access, format, and disseminate warehouse data in the form of reports, query
results, charts, and graphs. Other data access and retrieval tools actively search
the data warehouse for patterns in the data (i.e., data mining tools). Decision
Support Systems and Executive Information Systems also fall into this category.
• Data modeling tools. These tools are used to prepare and maintain an information
model of both the source databases and the warehouse database.
• Warehouse management tools. These tools are used by warehouse administrators
to create and maintain the warehouse (e.g., create and modify warehouse data
structures, generate indexes).
• Data warehouse hardware. This refers to the data warehouse server platforms
and their related operating systems.
Figure 5.3 depicts the typical warehouse software components and their relationships
to one another.
[Figure: diagram with boxes labeled Source Data; Middleware; Extraction & Transformation
(Extraction, Transformation, Quality Assurance, Load Image Creation); Warehouse Technology
(Data Warehouse, Data Mart(s)); Metadata; Data Access & Retrieval (OLAP, Report Writers,
EIS/DSS, Data Mining, Alert System, Exception Reporting).]
Figure 5.3. Data Warehouse Components
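
To make the role of the data quality tools listed above more concrete, here is a minimal Python sketch of the kind of checks such a tool might run on raw source records before they are loaded. The function name, the field names, and the sample customer records are all hypothetical; real tools are far more elaborate. In keeping with the text, the sketch only flags potential problems for the warehouse team and leaves the correction to them.

```python
def find_quality_issues(records, required_fields, allowed_values, key_field):
    """Return (row_index, field, problem) tuples for the team to review."""
    issues = []
    seen_keys = set()
    for index, record in enumerate(records):
        # Mandatory fields that are missing or empty.
        for field in required_fields:
            if record.get(field) in (None, ""):
                issues.append((index, field, "missing"))
        # Values that fall outside the agreed domain of a field.
        for field, allowed in allowed_values.items():
            value = record.get(field)
            if value not in (None, "") and value not in allowed:
                issues.append((index, field, "invalid"))
        # Business keys that were already seen in an earlier record.
        key = record.get(key_field)
        if key in seen_keys:
            issues.append((index, key_field, "duplicate"))
        seen_keys.add(key)
    return issues

# Hypothetical extract from an operational customer system.
customers = [
    {"customer_id": 1, "name": "Acme", "region": "North"},
    {"customer_id": 2, "name": "", "region": "East"},
    {"customer_id": 1, "name": "Acme Ltd", "region": "Nrth"},
]
issues = find_quality_issues(
    customers,
    required_fields=["customer_id", "name"],
    allowed_values={"region": {"North", "South", "East", "West"}},
    key_field="customer_id",
)
for issue in issues:
    print(issue)
```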
5.4 ARE THE RELATIONAL DATABASES STILL USED FOR DATA WAREHOUSING?
Although there were initial doubts about the use of relational database technology in
data warehousing, experience has shown that there is actually no other appropriate database
management system for an enterprise-wide data warehouse.
MDDBs
The confusion about relational databases arises from the proliferation of OLAP products
that make use of a multidimensional database (MDDB). MDDBs store data in a “hypercube,”
i.e., a multidimensional array that is paged in and out of memory as needed, as illustrated
in Figure 5.4.
[Figure: a grid of numeric cells representing a multidimensional array (hypercube).]
Figure 5.4. MDDB Data Structures
RDBMS
In contrast, relational databases store data as tables with rows and columns that do not
map directly to the multidimensional view that users have of the data. Structured query
language (SQL) scripts are used to retrieve data from RDBMSes.
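
The difference between the two storage models can be illustrated with a small Python sketch. It is a toy model only: the dimension names, members, and sales figures are invented for the example, and a real MDDB uses far more compact paged storage than a nested list.

```python
# Relational form: one row per fact, with dimension values stored as columns.
rows = [
    ("Q1", "North", "Widgets", 120),
    ("Q1", "South", "Widgets", 80),
    ("Q2", "North", "Widgets", 150),
    ("Q2", "North", "Gadgets", 60),
]

# Dimension members, in the order used to index the array.
quarters = ["Q1", "Q2"]
regions = ["North", "South"]
products = ["Widgets", "Gadgets"]

# Multidimensional form: a dense 3-D array with one cell per combination.
cube = [[[0 for _ in products] for _ in regions] for _ in quarters]
for quarter, region, product, sales in rows:
    q = quarters.index(quarter)
    r = regions.index(region)
    p = products.index(product)
    cube[q][r][p] = sales

def cell(quarter, region, product):
    """Look up one cell of the hypercube by its dimension members."""
    return cube[quarters.index(quarter)][regions.index(region)][products.index(product)]

# Direct cell access in the cube.
print(cell("Q2", "North", "Widgets"))  # 150

# The same question against the rows needs a scan (or an index) and a filter.
print(sum(s for q, r, p, s in rows
          if (q, r, p) == ("Q2", "North", "Widgets")))  # 150
```

Note that the cube reserves a cell for every combination of members, including the combinations with no sales. This is one reason hypercubes grow quickly as dimensions are added.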
Two Rival Approaches
Although these two approaches are apparent “rivals,” the apparent competition between
MDDB and relational database (RDB) technology presents enterprises with interesting
architectural alternatives for implementing data warehousing technology. It is not unusual
to find enterprises making use of both technologies, depending on the requirements of the
user community.
From an architectural perspective, the enterprise can get the best of both worlds
through the careful use of both technologies in different parts of the warehouse architecture.
• Enterprise data warehouses. These have a tendency to grow significantly beyond
the size limit of most MDDBs and are therefore typically implemented with relational
database technology. Only relational database technology is capable of storing up to
terabytes of data while still providing acceptable load and query performance.
• Data marts. A data mart is typically a subset of the enterprise data warehouse.
These subsets are determined either by geography (i.e., one data mart per location)
or by user group. Data marts, due to their smaller size, may take advantage of
multidimensional databases for better reporting and analysis performance.
Warehousing Architectures
How the relational and multidimensional database technologies can be used together
for data warehouse and data mart implementation is presented below.
RDBMSes in Warehousing Architectures
Data warehouses are built on relational database technology. Online analytical processing
(OLAP) tools are then used to interact directly with the relational data warehouse or with
a relational data mart (see Figure 5.5). Relational OLAP (ROLAP) tools recognize the
relational nature of the database but still present their users with multidimensional views
of the data.
[Figure: ROLAP front-ends running against a relational data warehouse and a relational
data mart (RDB).]
Figure 5.5. Relational Databases
MDDBs in Warehousing Architectures
Alternatively, data is extracted from the relational data warehouse and placed in
multidimensional data structures to reflect the multidimensional nature of the data (see
Figure 5.6). Multidimensional OLAP (MOLAP) tools run against the multidimensional server,
rather than against the data warehouse.
[Figure: a data warehouse feeding a multidimensional database (MDDB).]
Figure 5.6. Multidimensional Databases
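
The contrast between the ROLAP and MOLAP arrangements can be sketched with Python's built-in sqlite3 module standing in for the relational warehouse. The table name, columns, and figures are invented for illustration, and a plain dictionary stands in for the multidimensional server; no real OLAP product works exactly this way.

```python
import sqlite3

# An in-memory SQLite database plays the part of the relational warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (quarter TEXT, region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("Q1", "North", 120), ("Q1", "South", 80),
     ("Q2", "North", 150), ("Q2", "North", 60)],
)

# ROLAP style: the multidimensional view is computed by SQL at query time,
# directly against the relational tables.
rolap_view = conn.execute(
    "SELECT quarter, region, SUM(amount) FROM sales "
    "GROUP BY quarter, region ORDER BY quarter, region"
).fetchall()
print(rolap_view)

# MOLAP style: the data is extracted once into a separate multidimensional
# structure, and later queries run against that structure instead.
molap_cube = {(quarter, region): total for quarter, region, total in rolap_view}
print(molap_cube[("Q2", "North")])  # 210

conn.close()
```

The trade-off discussed in the rest of this section follows from the sketch: the ROLAP query always sees current warehouse data but pays the aggregation cost on every request, while the extracted structure answers quickly but must be reloaded whenever the warehouse changes.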
Tiered Data Warehousing Architectures
The enterprise is free to mix and match these two database technologies, depending on
the scale and size of the data warehouse, as illustrated in Figure 5.7.
It is not unusual to find an enterprise with the following tiered data warehousing
architecture:
• ROLAP tools, which run directly against relational databases, are used whenever
the queries are fairly simple and when the administration overhead that comes
with multidimensional tools is not justified for a particular user base.
• Multidimensional databases are used for data marts, and specific multidimensional
front-end applications query the contents of the MDDB. Data marts may also use
relational database technology, in which case users make use of ROLAP front-ends.
[Figure: a relational data warehouse (RDB) with a ROLAP front-end, a relational data mart
(RDB) with a ROLAP front-end, and a multidimensional data mart (MDDB) with an MDDB
front-end.]
Figure 5.7. Tiered Data Warehousing Architecture
Trade-Offs: MDDB vs. RDBMS
Consider the following factors while choosing between the multidimensional and
relational approaches:
Size
Multidimensional databases are generally limited by size, although the size limit has
been increasing gradually over the years. In the mid-1990s, 10 gigabytes of data in a
hypercube already presented problems and unacceptable query performance. Some
multidimensional products today are able to handle up to 100 gigabytes of data. Despite this
improvement, large data warehouses are still better served by relational front-ends running
against high-performing and scalable relational databases.
Volatility of Source Data
Highly volatile data are better handled by relational technology. Multidimensional data
in hypercubes generally take long to load and update. Thus the time required to constantly
load and update the multidimensional data structure may prevent the enterprise from
loading new data as often as desired.
Aggregate Strategy
Multidimensional hypercubes support aggregations better, although this advantage will
disappear as relational databases improve their support of aggregate navigation. Drilling up
and down on RDBMSes generally takes longer than on MDDBs as well. However, due to
their size limitation, MDDBs will not be suited to warehouses or data marts with very
detailed data.
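
A small Python sketch may help show what precomputed aggregates and drilling up or down amount to. The fact layout (year, quarter, month, amount) and the function names are assumptions made for the example; they do not describe how any particular MDDB or RDBMS stores its aggregates.

```python
from collections import defaultdict

# Detailed facts: (year, quarter, month, amount). Values are illustrative.
facts = [
    (2005, "Q1", "Jan", 10), (2005, "Q1", "Feb", 20), (2005, "Q1", "Mar", 30),
    (2005, "Q2", "Apr", 40), (2005, "Q2", "May", 50),
]

def build_aggregates(facts):
    """Precompute quarter-level and year-level totals in a single pass."""
    by_quarter = defaultdict(int)
    by_year = defaultdict(int)
    for year, quarter, _month, amount in facts:
        by_quarter[(year, quarter)] += amount
        by_year[year] += amount
    return dict(by_quarter), dict(by_year)

def drill_down(year, quarter):
    """Return the month-level detail behind one quarter-level total."""
    return [(month, amount) for y, q, month, amount in facts
            if (y, q) == (year, quarter)]

by_quarter, by_year = build_aggregates(facts)

print(by_year[2005])             # 150: the drilled-up (year) view
print(by_quarter[(2005, "Q2")])  # 90: one level down
print(drill_down(2005, "Q2"))    # [('Apr', 40), ('May', 50)]: full detail
```

Reading a stored total is a single lookup, whereas the drill-down to full detail scans the facts. The more detailed the data, the larger the structure needed to keep every level at hand, which is the size limitation noted above.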
Investment Protection
Most enterprises have already made significant investments in relational technology
(e.g., RDBMS assets) and skill sets. The continued use of these tools and skills for another
purpose provides additional return on investment and lowers the technical risk for the data
warehousing effort.
Ability to Manage Complexity
A multidimensional DBMS adds a layer to the overall systems architecture of the
warehouse. Sufficient resources must be allocated to administer and maintain the MDDB
layer. If the administrative overhead is not or cannot be justified, an MDDB will not be
appropriate.
Type of Users
Power users generally prefer the range of functionality available in multidimensional
OLAP tools. Users that require broad views of enterprise data require access to the data
warehouse and therefore are best served by a relational OLAP tool.
Recently, many of the large database vendors have announced plans to integrate their
multidimensional and relational database products. In this scenario, end-users make use of
the multidimensional front-end tools for all their queries. If the query requires data that are
not available in the MDDB, the tools will retrieve the required data from the larger relational
database. Dubbed as a “drill-through” feature, this innovation will certainly introduce new
data warehousing.
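
The drill-through behavior described above can be sketched as a simple fallback: answer from the multidimensional store when the requested cell is there, and otherwise go to the relational warehouse. The function names, keys, and data below are hypothetical, and the warehouse lookup is a stand-in for what would be a SQL query in practice.

```python
def drill_through(key, cube, fetch_detail):
    """Answer from the cube when possible; otherwise query the warehouse."""
    if key in cube:
        return cube[key], "MDDB"
    return fetch_detail(key), "RDBMS"

# Summary cells held in the (hypothetical) multidimensional data mart.
cube = {("2005", "North"): 330}

# Detail rows held only in the (hypothetical) relational warehouse.
warehouse_rows = [
    ("2005", "North", "Store 1", 130),
    ("2005", "North", "Store 2", 200),
    ("2005", "South", "Store 3", 90),
]

def fetch_detail(key):
    # Stand-in for a SQL query: total the rows whose leading columns match.
    return sum(row[-1] for row in warehouse_rows if row[:len(key)] == key)

print(drill_through(("2005", "North"), cube, fetch_detail))             # (330, 'MDDB')
print(drill_through(("2005", "North", "Store 2"), cube, fetch_detail))  # (200, 'RDBMS')
```

The user sees one front-end in both cases; only the second element of the result reveals which store actually answered the query.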
5.5 HOW LONG DOES A DATA WAREHOUSING PROJECT LAST?
Data warehousing is a long, daunting task; it requires significant, prolonged effort on
the part of the enterprise and may have the unpleasant side effect of highlighting problem
areas in operational systems. Like any task of great magnitude, the data warehousing effort
must be partitioned into manageable chunks, where each piece is managed as an individual
project or rollout.
Data warehouse rollouts after the pilot warehouse must fit together within an overall
strategy. Define the strategy at the start of the data warehousing effort. Regularly update
(at least once a year) the data warehouse strategy as new requirements are understood, new
operational systems are installed, and new tools become available.
In cases where the enterprise also has an operational data store initiative, the ODS and
warehousing projects must be synchronized. The data warehouse should take advantage of
the ODS as a source system as soon as possible. Figure 5.8 depicts how the entire decisional
systems effort can be interleaved.
[Figure: timeline of interleaved projects. Data Warehouse: Rollouts 2, 4, 6, 8, through n.
Operational Data Store: Rollouts 1, 3, 5, 7, 9. Existing Systems: System 1 through System N.]
Figure 5.8. Interleaved Operational, ODS, and Warehouse Projects
Each data warehouse rollout should be scoped to last between three and six
months, with a team of about 6 to 12 people working on it full time. Part-time team
members can easily bring the total number of participants to more than 20. Sufficient
resources must be allocated to support each rollout.
5.6 HOW IS A DATA WAREHOUSE DIFFERENT FROM OTHER IT PROJECTS?
Since much of computing has focused on meeting operational information needs, IT
professionals have a natural tendency to apply the same methodologies, approaches, or
techniques to data warehousing projects. Unfortunately, data warehousing projects differ
from other IT projects in a number of ways, as discussed below.
A Data Warehouse Project is not a Package Implementation Project
A data warehouse project requires a number of tools and software utilities that are
available from multiple vendors. At present, there is still no single suite of tools that can
automate the entire data warehousing effort.
Most of the major warehouse vendors, however, are now attempting to provide off-the-shelf
solutions for warehousing projects by bundling their warehousing products with those
of other warehousing partners. This solution limits the potential tool integration problems
of the warehousing team.
A Data Warehouse Never Stops Evolving; it Changes with the Business
Unlike OLTP systems, which are subject only to changes related to the process or area
of the business they support, a data warehouse is subject to changes in the decisional
information requirements of decision-makers. In other words, it is subject to any changes
in the business context of the enterprise.
Also unlike OLTP systems, a successful data warehouse will result in more questions
from business users. Change requests for the data warehouse are a positive indication that
the warehouse is being used.
Data Warehouses are Huge
Without exaggeration, enterprise-wide data warehouses are huge. A pilot data warehouse
can easily be more than 10 gigabytes in size. A data warehouse in production for a little over
a year can easily reach 1 terabyte, depending on the granularity and the volume of data.
Databases of this size require different database optimization and tuning techniques.
Project Progress and Effort are Highly Dependent on Accessibility and Quality of Source System Data
The progress of a data warehouse project is highly dependent on where the operational
data resides. Enterprises that make use of proprietary application packages will find
themselves dealing with locked data. Enterprises with data distributed over different locations
with no easy access will also encounter difficulties.
Similarly, the quality of the existing data plays a major role in the project. Data quality
problems consistently remain at the top of the list of data warehouse issues. Unfortunately,
none of the available tools can automate away the problem of data quality. Although tools can
help identify problem areas, these problems can only be resolved manually.
5.7 WHAT ARE THE CRITICAL SUCCESS FACTORS OF A DATA WAREHOUSING
PROJECT?
A number of factors influence the progress and success of data warehousing projects.
While the list below does not claim to be complete, it highlights areas of the warehousing
project that the project manager is in a position to actively control or influence.
• Proper planning. Define a warehousing strategy and expect to review it after
each warehouse rollout. Bear in mind that IT resources are still required to manage
and administer the warehouse once it is in production. Stay coordinated with any
scheduled maintenance work on the warehouse source systems.
• Iterative development and change management. Stay away from the big-bang
approach. Divide the warehouse initiative into manageable rollouts, each lasting
between three and six months. Constantly collect feedback from users.
Identify lessons learnt at the end of each project and feed these into the next
iteration.
• Access to and involvement of key players. The Project Sponsor, the CIO, and
the Project Manager must all be actively involved in setting the direction of the
warehousing project. Work together to resolve the business and technical issues
that will arise on the project. Choose the project team members carefully, taking
care to ensure that the project team roles that must be performed by internal
resources are staffed accordingly.
• Training and communication. If data warehousing is new to the enterprise or
if new team members are joining the warehousing initiative, set aside enough time
for training the team members. The roles of each team member must be
communicated clearly to set role expectations.
• Vigilant issue management. Keep a close watch on project issues and ensure
their speedy resolution. The Project Manager should be quick to identify the
negative consequences on the project schedule if issues are left unresolved. The
Project Sponsor should intervene and ensure the proper resolution of issues,
especially if these are clearly causing delays or deal with highly politicized areas
of the business.
• Warehousing approach. One of the worst things a Project Manager can do is to
apply OLTP development approaches to a warehousing project. Apply a system
development approach that is tailored to data warehousing; avoid OLTP development
approaches that have simply been repackaged into warehousing terms.
• Demonstration with a pilot project. The ideal pilot project is a high-impact,
low-risk area of the enterprise. Use the pilot as a proof-of-concept to gain champions
within the enterprise and to refute the opposition.
• Focus on essential minimal characteristics. The scope of each project or rollout
should be ruthlessly managed to deliver the essential minimal characteristics for
that rollout. Don’t get carried away by spending time on the bells and whistles,
especially with the front-end tools.
In Summary
The Project Manager is responsible for all technical aspects of the project on a day-to-day
basis. Since up to 80 percent of any data warehousing project can be devoted to the
back-end of the warehouse, the role of Project Manager can easily be one of the busiest on the
project.
Despite the fact that business users now drive warehouse development, a huge part
of any data warehouse project is still technology centered. The critical success factors of
typical technology projects therefore still apply to data warehousing projects. Bear in mind,
however, that data warehousing projects are more susceptible to organizational and logistical
issues than the typical technology project.
PART III : PROCESS
Although there have been attempts to use traditional software development methodologies
from the OLTP arena for data warehouse development, warehousing practitioners generally
agree that an iterative development approach is more suited to warehouse development than
are traditional waterfall approaches.
This section of the book presents an iterative warehousing approach for enterprises about
to embark on a data warehousing initiative. The approach begins with the definition of a
data warehouse strategy, then proceeds to define the way to set up warehouse management
and support processes.
The latter part of the approach focuses on the tasks required to plan and implement one
rollout (i.e., one phase) of the warehouse. These tasks are repeated for each phase of the
warehouse development.
CHAPTER 6: WAREHOUSING STRATEGY
Define the warehouse strategy as part of the information technology strategy of the
enterprise. The traditional Information Strategy Plan (ISP) addresses operational computing
needs thoroughly but may not give sufficient attention to decisional information requirements.
A data warehouse strategy remedies this by focusing on the decisional needs of the enterprise.
We start this chapter by presenting the components of a data warehousing strategy.
We follow with a discussion of the tasks required to define a strategy for an enterprise.
6.1 STRATEGY COMPONENTS
At a minimum, the data warehouse strategy should include the following elements:
Preliminary Data Warehouse Rollout Plan
Not all of the user requirements can be met in one data warehouse project; such a
project would necessarily be large and dangerously unmanageable. It is more realistic to
prioritize the different user requirements and assign them to different warehouse rollouts.
Doing so allows the enterprise to divide the warehouse development into phased, successive
rollouts, where each rollout focuses on meeting an agreed set of requirements.
The iterative nature of such an approach allows the warehousing team to extend the
functionality of the warehouse in a manageable manner. The phased approach lowers the
overall risk of the data warehouse project, while delivering increasing functionality to the
users.
Preliminary Data Warehouse Architecture
Define the overall data warehouse architecture for the pilot and subsequent warehouse
rollouts to ensure the scalability of the warehouse. Whenever possible, define the initial
technical architecture of each rollout.
By consciously thinking through the data warehouse architecture, warehouse planners
can determine the various technology components (e.g., MDDB, RDBMS, tools) that are
required for each rollout.
Short-listed Data Warehouse Environment and Tools
There are a number of tools and warehousing environments from which to choose.
Create a short-list of the tools and environments that appear to meet the enterprise’s
warehousing needs. A standard set of tools will lessen tool integration problems and will
minimize the learning required for both the warehousing team and the warehouse users.
Below are the tasks required to create the enterprise’s warehousing strategy. Note that
the tasks described below can typically be completed in three to five weeks, depending on
the availability of resource persons and the size of the enterprise.
6.2 DETERMINE ORGANIZATIONAL CONTEXT
An understanding of the organization helps to establish the context of the project and
may highlight aspects of the corporate culture that may ease or complicate the warehousing
project. Answers to organizational background questions are typically obtained from the
Project Sponsor, the CIO, or the Project Manager assigned to the warehousing effort.
Typical organizational background questions include:
• Who is the Project Sponsor for this project?
The Project Sponsor sets the scope of the warehousing project. He or she also plays
a crucial role in establishing the working relationship among warehousing team
members, especially if third parties are involved. Easy access to warehousing data
may also be limited to the organizational scope that is within the control or authority
of the Project Sponsor.
• What are the IS or IT groups in the organization, which are involved in
the data warehousing effort?
Since data warehousing is very much a technology-based endeavor, the IS or IT
groups within the organization will always be involved in any warehousing effort.
It is often insightful to understand the bulk of the work currently performed within
the IS or IT departments. If the IS or IT groups are often fighting fires or are very
busy deploying operational systems, data warehousing is unlikely to be high on the
list of IT priorities.
• What are the roles and responsibilities of the individuals who have been
assigned to this effort?
It is helpful to define the roles and responsibilities of the various individuals involved
in the data warehousing project. This practice sets common, realistic expectations
and improves understanding and communication within the team. In cases where
the team is composed of external parties (especially where several vendors are
involved), a clear definition of roles becomes critical.
6.3 CONDUCT PRELIMINARY SURVEY OF REQUIREMENTS
Obtain an inventory of the requirements of business users through individual and
group interviews with the end-user community. Whenever possible, obtain layouts of the
current management reports (and their planned enhancements).
The requirements inventory represents the breadth of information that the warehouse
is expected to eventually provide. While it is important to get a clear picture of the extent
of requirements, it is not necessary to detail all the requirements in depth at this point. The
objective is to understand the user needs well enough to prioritize the requirements. This is a
critical input for identifying the scope of each data warehouse rollout.
Interview Categories and Sample Questions
The following questions, arranged by category, should be useful as a starting point for
the interviews with intended end users of the warehouse:
• Functions. What is the mission of your group or unit? How do you go about
fulfilling this mission? How do you know if you’ve been successful with your mission?
What are the key performance indicators and critical success factors?
• Customers. How do you group or classify your customers? Do these groupings
change over time? Does your grouping affect how you treat your customers? What
kind of information do you track for each type of client? What demographic information
do you use, if any? Do you need to track customer data for each customer?
• Profit. At what level do you measure profitability in your group? Per agent? Per
customer? Per product? Per region? At what level of detail are costs and revenues
tracked in your organization? How do you track costs and revenues now? What
kind of profitability reports do you use or produce now?
• Systems. What systems do you use as part of your job? What systems are you
aware of in other groups that contain information you require? What kind of manual
data transformations do you have to perform when data are unavailable?
• Time. How many months or years of data do you need to track? Do you analyze
performance across years? At what level of detail do you need to see figures? Daily?
Weekly? Monthly? Quarterly? Yearly? How soon do you
need to see data (e.g., do you need yesterday’s data today)? How soon after week-end,
month-end, quarter-end, and year-end do you need to see the previous period’s
figures?
• Queries and reports. What reports do you use now? What information do you
actually use in each of the reports you now receive? Can we obtain samples of
these reports? How often are these reports produced? Do you get them soon enough
and frequently enough? Who makes these reports for you? What reports do you produce
for other people?
• Product. What products do you sell, and how do you classify them? Do you have
a product hierarchy? Do you analyze data for all products at the same time, or do
you analyze one product type at a time? How do you handle changes in product
hierarchy and product description?
• Geography. Does your company operate in more than one location? Do you divide
your market into geographical areas? Do you track sales per geographic region?
Interviewing Tips
Many of the interviewing tips enumerated below may seem like common sense.
Nevertheless, interviewers are encouraged to keep the following points in mind:
• Avoid making commitments about warehouse scope. It will not be surprising
to find that some of the queries and reports requested by interviewees cannot be
supported by the data that currently reside in the operational databases. Interviewers
should keep this in mind and communicate this potential limitation to their
interviewees. The interviewers cannot afford to make commitments regarding the
warehouse scope at this time.
• Keep the interview objective in mind. The objective of these interviews is to
create an inventory of requirements. There is no need to get a detailed understanding
of the requirements at this point.
• Don’t overwhelm the interviewees. The interviewing team should be small; two
people are the ideal number: one to ask questions, another to take notes.
Interviewees may be intimidated if a large group of interviewers shows up.
• Record the session if the interviewee lets you. Most interviewees will not
mind if interviewers bring along a tape recorder to record the session. Transcripts
of the session may later prove helpful.
• Change the interviewing style depending on the interviewee. Middle
managers are more likely to deal with actual reports and detailed information
requirements. Senior executives are more likely to dwell on strategic information
needs. Change the interviewing style as needed by adapting the type of questions
to the type of interviewee.
• Listen carefully. Listen to what the interviewee has to say. The sample interview
questions are merely a starting point; follow-up questions have the potential of
yielding interesting and critical information.
Take note of the terms that the interviewee uses. Popular business terms such as
“profit” may have different meanings or connotations within the enterprise.
• Obtain copies of reports, whenever possible. The reports will give the
warehouse team valuable information about source systems (which system produced
the report), as well as business rules and terms. If a person manually makes the
reports, the team may benefit from talking to this person.
6.4 CONDUCT PRELIMINARY SOURCE SYSTEM AUDIT
Obtain an inventory of potential warehouse data sources through individual and group
interviews with key personnel in the IT organization. While the CIO no doubt has a broad,
high-level view of the systems in the enterprise, the best resource persons for the source
system audit are the DBAs and system administrators who maintain the operational systems.
Typical background interview questions, arranged by category, for the IT department
include:
• Current architecture. What is the current technology architecture of the
organization? What kind of systems, hardware, DBMS, network, end-user tools,
development tools, and data access tools are currently in use?
• Source system relationships. Are the source systems related in any way? Does
one system provide information to another? Are the systems integrated in any
manner? In cases where multiple systems have customer and product records,
which one serves as the “master” copy?
• Network facilities. Is it possible to use a single terminal or PC to access the
different operational systems, from all locations?
• Data quality. How much cleaning, scrubbing, de-duplication, and integration do
you suppose will be required? What areas (tables or fields) in the source systems
are currently known to have poor data quality?
• Documentation. How much documentation is available for the source systems?
How accurate and up-to-date are these manuals and reference materials? Try to
obtain the following information whenever possible: copies of manuals and reference
documents, database size, batch window, planned enhancements, typical backup
size, backup scope and backup medium, data scope of the system (e.g., important
tables and fields), system codes and their meanings, and key generation schemes.
• Possible extraction mechanisms. What extraction mechanisms are possible with
this system? What extraction mechanisms have you used before with this system?
What extraction mechanisms will not work?
6.5 IDENTIFY EXTERNAL DATA SOURCES (IF APPLICABLE)
The enterprise may also make use of external data sources to augment the data from
internal source systems. Examples of external data that can be used are:
• Data from credit agencies.
• Zip code or mail code data.
• Statistical or census data.
• Data from industry organizations.
• Data from publications and news agencies.
Although the use of external data presents opportunities for enriching the data
warehouse, it may also present difficulties because of differences in granularity. For example,
the external data may not be readily available at the level of detail required by the data
warehouse and may require some transformation or summarization.
6.6 DEFINE WAREHOUSE ROLLOUTS (PHASED IMPLEMENTATION)
Divide the data warehouse development into phased, successive rollouts. Note that
the scope of each rollout will have to be finalized as part of the planning for that rollout.
The availability and quality of source data will play a critical role in finalizing that
scope.
As stated earlier, applying a phased approach for delivering the warehouse should
lower the overall risk of the data warehouse project while delivering increasing functionality
and data to more users. It also helps manage user expectations through the clear definition
of scope for each rollout.
Figure 6.1 is a sample table listing all requirements identified during the initial round
of interviews with end users. Each requirement is assigned a priority level. An initial
complexity assessment is made, based on the estimated number of source systems, early
data quality assessments, and the computing environments of the source systems. The
intended user group is also identified.
No.  Requirement             Priority  Complexity  Users             Rollout No.
1    Customer Profitability  High      High        Customer Service  1
2    Product Market Share    High      Medium      Product Manager   1
3    Weekly Sales Trends     Medium    Low         VP, Sales         2
...  ...                     ...       ...         ...               ...
Figure 6.1. Sample Rollout Definition
More factors can be listed to help determine the appropriate rollout number for each
requirement. The rollout definition is finalized only when it is approved by the Project
Sponsor.
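The rollout table in Figure 6.1 can be kept as structured data so that a first-cut rollout number is proposed mechanically and then reviewed. The sketch below is only an illustration: the rule that maps priority directly to a rollout number is an assumption made here (the text says only that priority, complexity and other factors help determine the rollout), and the names `Requirement`, `PRIORITY_TO_ROLLOUT` and `propose_rollout` are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical rule (an assumption, not from the text): schedule by priority
# alone, so High lands in rollout 1, Medium in rollout 2, Low in rollout 3.
PRIORITY_TO_ROLLOUT = {"High": 1, "Medium": 2, "Low": 3}


@dataclass
class Requirement:
    number: int
    name: str
    priority: str
    complexity: str
    users: str


def propose_rollout(requirement: Requirement) -> int:
    """Return a first-cut rollout number for one requirement."""
    return PRIORITY_TO_ROLLOUT[requirement.priority]


# Rows taken from Figure 6.1.
requirements = [
    Requirement(1, "Customer Profitability", "High", "High", "Customer Service"),
    Requirement(2, "Product Market Share", "High", "Medium", "Product Manager"),
    Requirement(3, "Weekly Sales Trends", "Medium", "Low", "VP, Sales"),
]

proposed = {r.name: propose_rollout(r) for r in requirements}
for name, rollout in proposed.items():
    print(f"{name}: rollout {rollout}")
```

With this simple rule the proposal happens to match the rollout numbers in Figure 6.1. In practice complexity and source data quality would also be weighed, and the Project Sponsor still approves the final definition.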
6.7 DEFINE PRELIMINARY DATA WAREHOUSE ARCHITECTURE
Define the preliminary architecture of each rollout based on the approved rollout scope.
Explore the possibility of using a mix of relational and multidimensional databases and
tools, as illustrated in Figure 6.2.
At a minimum, the preliminary architecture should indicate the following:
• Data warehouses and data marts. Define the intended deployment of data
warehouses and data marts for each rollout. Indicate how the different databases
are related (i.e., how the databases feed one another). The warehouse architecture
must ensure that the different data marts are not deployed in isolation.
[Figure 6.2: diagram of front-end tools per rollout. Labels, in order of appearance:
Rollout No. 1, ROLAP Front-End (10 Users); Rollout No. 2, ROLAP Front-End (15 Users),
EIS/DSS (3 Users), Report Server (10 Users), Alert System (10 Users), Report Writer (5 Users).]
Figure 6.2. Sample Preliminary Architecture per Rollout
• Number of users. Specify the intended number of users for each data access and
retrieval tool (or front-end) for each rollout.
• Location. Specify the location of the data warehouse, the data marts, and the
intended users for each rollout. This has implications for the technical architecture
requirements of the warehousing project.
6.8 EVALUATE DEVELOPMENT AND PRODUCTION ENVIRONMENT AND TOOLS
Enterprises can choose from several environments and tools for the data warehouse
initiative. Select the combination of tools that best meets the needs of the enterprise. At
present, no single vendor provides an integrated suite of warehousing tools. There are,
however, clear leaders for each tool category.
Eliminate all unsuitable tools, and produce a short-list from which each rollout or
project will choose its tool set (see Figure 6.3). Alternatively, select and standardize on a set
of tools for all warehouse rollouts.
No.  Tool Category              Short-listed Tools  Evaluation Criteria  Weights (%)  Preliminary Scores
1    Data Access and Retrieval  Tool A              Criterion 1          30%
                                                    Criterion 2          30%          78%
                                                    Criterion 3          40%
                                Tool B              Criterion 1          30%
                                                    Criterion 2          30%          82%
                                                    Criterion 3          40%
                                Tool C              Criterion 1          30%
                                                    Criterion 2          30%          84%
                                                    Criterion 3          40%
2    RDBMS
Figure 6.3. Sample Tool Short-List
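The preliminary score in Figure 6.3 reads as a weighted sum of per-criterion scores. The following minimal sketch shows that arithmetic. The weights (30%, 30%, 40%) come from the figure; the individual criterion scores are hypothetical values chosen here only so that the totals reproduce the 78%, 82% and 84% shown, since the figure does not list them. The function name `weighted_score` is likewise an illustration.

```python
# Criterion weights in percent, as listed in Figure 6.3.
WEIGHTS = [30, 30, 40]

# Hypothetical criterion scores (0-100); NOT from the source. They are chosen
# only so that the weighted totals match the figure's preliminary scores.
criterion_scores = {
    "Tool A": [80, 80, 75],
    "Tool B": [80, 80, 85],
    "Tool C": [90, 90, 75],
}


def weighted_score(scores, weights=WEIGHTS):
    """Combine per-criterion scores into one percentage using the weights."""
    if len(scores) != len(weights):
        raise ValueError("one score is required per criterion")
    if sum(weights) != 100:
        raise ValueError("weights must add up to 100 percent")
    return sum(s * w for s, w in zip(scores, weights)) / 100


preliminary = {tool: weighted_score(s) for tool, s in criterion_scores.items()}
best_tool = max(preliminary, key=preliminary.get)

for tool, score in preliminary.items():
    print(f"{tool}: {score:.0f}%")
print("Highest preliminary score:", best_tool)
```

Keeping the weights as whole percentages avoids rounding surprises and makes it easy to check that they add up to 100.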
In Summary
A data warehouse strategy at a minimum contains:
• Preliminary data warehouse rollout plan, which indicates how the development of
the warehouse is to be phased;
• Preliminary data warehouse architecture, which indicates the likely physical
implementation of the warehouse rollout; and
• Short-listed options for the warehouse environment and tools.
The approach for arriving at these strategy components may vary from one enterprise to
another; the approach presented in this chapter is one that has consistently proven to be effective.
Expect the data warehousing strategy to be updated annually, because each warehouse rollout
provides new learning and new tools and technologies become available.
CHAPTER 7
WAREHOUSE MANAGEMENT AND SUPPORT PROCESSES
Warehouse Management and Support Processes are designed to address aspects of
planning and managing a data warehouse project that are critical to the successful
implementation and subsequent extension of the data warehouse. Unfortunately, these aspects
are all too often overlooked in initial warehousing deployments.
These processes are defined to assist the project manager and warehouse driver during
warehouse development projects.
7.1 DEFINE ISSUE TRACKING AND RESOLUTION PROCESS
During the course of a project, it is inevitable that a number of business and technical
issues will surface. The project will quickly be delayed by unresolved issues if an issue
tracking and resolution process is not in place. Of particular importance are business issues
that involve more than one group of users. These issues typically include disputes over the
definition of business terms and the financial formulas that govern the transformation of
data.
An individual on the project should be designated to track and follow up the resolution
of each issue as it arises. Extremely urgent issues (i.e., issues that may cause project delay
if left unresolved) or issues with strong political overtones can be brought to the attention
of the Project Sponsor, who must use his or her clout to expedite the resolution process.
Figure 7.1 shows a sample issue log that tracks all the issues that arise during the
course of the project.
No.  Issue Description                              Urgency  Raised By  Assigned To  Date Opened  Date Closed  Resolved By  Resolution Description
1    Conflict over definition of “Customer”         High     MWH        MCD          Feb 03       Feb 05       CEO          Use Cor Plan’s definition
2    Currency exchange rates are not tracked in GL  High     MCD        RGT          Feb 04
Figure 7.1. Sample Issue Log
The following issue tracking guidelines will prove helpful:
• Issue description. State the issue briefly in two to three sentences. Provide a
more detailed description of the issue as a separate paragraph. If there are possible
resolutions to the issue, include these in the issue description. Identify the
consequences of leaving this issue open, particularly any impact on the project
schedule.
• Urgency. Indicate the priority level of the issue: high, medium, or low. Low-priority
issues that are left unresolved may later become high priority. The team
may agree on a target resolution time depending on the urgency of the issue. For
example, the team can agree to resolve high-priority issues within three days,
medium-priority issues within a week, and low-priority issues within two weeks.
• Raised by. Identify the person who raised the issue. If the team is large or does
not meet on a regular basis, provide information on how to contact the person (e.g.,
telephone number, e-mail address). The people who are resolving the issue may
require additional information or details that only the issue originator can provide.
• Assigned to. Identify the person on the team who is responsible for resolving the
issue. Note that this person does not necessarily have the answer. However, he or she
is responsible for tracking down the person who can actually resolve the issue. He
or she also follows up on issues that have been left unresolved.
• Date opened. This is the date when the issue was first logged.
• Date closed. This is the date when the issue was finally resolved.
• Resolved by. The person who resolves the issue must have the required authority
within the organization. User representatives typically resolve business issues. The
CIO or designated representatives typically resolve technical issues, and the project
sponsor typically resolves issues related to project scope.
• Resolution description. State briefly the resolution of this issue in two or three
sentences. Provide a more detailed description of the resolution in a separate
paragraph. If subsequent actions are required to implement the resolution, these
should be stated clearly and resources should be assigned to implement them.
Identify target dates for implementation.
Issue logs formalize the issue resolution process. They also serve as a formal record of
key decisions made throughout the project.
In some cases, the team may opt to augment the log with a separate form, one
for each issue. This typically happens when the issue descriptions and resolution descriptions
are quite long. In this case, only the brief issue statement and brief resolution description
are recorded in the issue log.
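An issue log of this kind lends itself to a simple overdue check. The sketch below assumes the example resolution targets mentioned under Urgency (three days for high, one week for medium, two weeks for low) and uses the two rows of Figure 7.1. The year 2006, the dictionary layout and the helper names `due_date` and `overdue_issues` are assumptions made for illustration; the figure gives only day and month.

```python
from datetime import date, timedelta

# Example resolution targets from the Urgency guideline, in days.
RESOLUTION_TARGET_DAYS = {"High": 3, "Medium": 7, "Low": 14}


def due_date(opened: date, urgency: str) -> date:
    """Date by which an issue of the given urgency should be resolved."""
    return opened + timedelta(days=RESOLUTION_TARGET_DAYS[urgency])


def overdue_issues(issues, today: date):
    """Return the numbers of open issues whose target date has passed."""
    return [
        issue["no"]
        for issue in issues
        if issue["closed"] is None
        and today > due_date(issue["opened"], issue["urgency"])
    ]


# Rows from Figure 7.1; the year is hypothetical.
issue_log = [
    {"no": 1, "urgency": "High", "opened": date(2006, 2, 3), "closed": date(2006, 2, 5)},
    {"no": 2, "urgency": "High", "opened": date(2006, 2, 4), "closed": None},
]

late = overdue_issues(issue_log, today=date(2006, 2, 10))
print("Overdue issues:", late)
```

The person designated to track issues could run such a check before each team meeting and bring the overdue items to the Project Sponsor when needed.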
7.2 PERFORM CAPACITY PLANNING
Warehouse capacity requirements come in the following forms: space required, machine
processing power, network bandwidth, and number of concurrent users. These requirements
increase with each rollout of the data warehouse.
During the stage of defining the warehouse strategy, the team will not have the exact
information for these requirements. However, as the warehouse rollout scopes are finalized,
the capacity requirements will likewise become more defined.
Review the following capacity planning requirements, basing your review on the scope
of each rollout.
There are several aspects of the data warehouse environment that make capacity
planning for the data warehouse a unique exercise. The first factor is that the workload for
the data warehouse environment is very variable. In many ways, trying to anticipate the
DSS workload requires imagination. Unlike the operational workload, which has an air of
regularity to it, the data warehouse DSS workload is much less predictable. This factor, in
and of itself, makes capacity planning for the data warehouse a chancy exercise.
A second factor making capacity planning for the data warehouse a risky business is that
the data warehouse normally entails much more data than was ever encountered in the
operational environment. The amount of data found in the data warehouse is directly related
to the design of the data warehouse environment. The designer determines the granularity
of data, which in turn determines how much data there will be in the warehouse. The finer the
degree of granularity, the more data there is. The coarser the degree of granularity, the less
data there is. The volume of data affects not only the actual disk storage required but also
the machine resources required to manipulate the data. In very few
environments is the capacity of a system so closely linked to the design of the system.
A third factor making capacity planning for the data warehouse environment a
nontraditional exercise is that the data warehouse environment and the operational
environments do not mix under the stress of a workload of any size at all. This imbalance
of environments must be understood by all parties involved: the capacity planner, the systems
programmer, management, and the designer of the data warehouse environment.
Consider the patterns of hardware utilization as shown by Figure 7.2.
[Figure 7.2: two hardware utilization charts, each on a 0% to 100% scale, one for the
operational/transaction environment and one for the data warehouse/DSS environment.]
Figure 7.2. The Fundamentally Different Patterns of Hardware Utilization between the Data
Warehouse Environment and the Operational Environment
Figure 7.2 shows that the operational environment uses hardware in a static
fashion. There are peaks and valleys in the operational environment, but at the end of the
day hardware utilization is predictable and fairly constant. Contrast the pattern of hardware
utilization found in the operational environment with the hardware utilization found in the
data warehouse/DSS environment.
In the data warehouse, hardware is used in a binary fashion. Either the hardware is
being used constantly or the hardware is not being used at all. Furthermore, the pattern is
unpredictable. One day much processing occurs at 8:30 a.m. The next day the
bulk of processing occurs at 11:15 a.m., and so forth.
There are, then, very different and incompatible patterns of hardware utilization
associated with the operational and the data warehouse environments. These patterns apply
to all types of hardware: CPU, channels, memory, disk storage, etc.
Trying to mix the different patterns of hardware utilization leads to some basic difficulties. Figure 7.3
shows what happens when the two types of patterns of utilization are mixed in the same
machine at the same time.
[Figure 7.3: the two utilization patterns combined on the same machine.]
Figure 7.3. Trying to Mix the Two Fundamentally Different Patterns of Execution in the
Same Machine at the Same Time Leads to Some Very Basic Conflicts
The patterns are simply incompatible. Either you get good response time and a low rate
of machine utilization (at which point the financial manager is unhappy), or you get high
machine utilization and poor response time (at which point the user is unhappy). The need
to split the two environments is important to the data warehouse capacity planner because
the capacity planner needs to be aware of circumstances in which the patterns of access are
mixed. In other words, when doing capacity planning, there is a need to separate the two
environments. Trying to do capacity planning for a machine or complex of machines where
there is a mixing of the two environments is a nonsensical task. Despite these difficulties
with capacity planning, planning for machine resources in the data warehouse environment
is a worthwhile endeavor.
Time Horizons
As a rule, there are two time horizons the capacity planner should aim for: the one-year
time horizon and the five-year time horizon. Figure 7.4 shows these time horizons.
The one-year time horizon is important in that it is on the immediate requirements list
for the designer. In other words, at the rate that the data warehouse is designed and
populated, the decisions made about resources for the one-year time horizon will have to be
lived with. Hardware, and possibly software, acquisitions will have to be made. A certain
amount of “burn-in” will have to be tolerated. A learning curve for all parties will have to
be survived, all on the one-year horizon. The five-year horizon is of importance as well. It
is where the massive volume of data will show up, and it is where the maturity of the data
warehouse will occur.
[Figure 7.4: timeline running from year 1 to year 5.]
Figure 7.4. Projecting Hardware Requirements out to Year 1 and Year 5
An interesting question is: “Why not look at the ten-year horizon as well?” Certainly
projections can be made to the ten-year horizon. However, those projections are not usually
made because:
• It is very difficult to predict what the world will look like ten years from now,
• It is assumed that the organization will have much more experience handling data
warehouses five years in the future, so that design and data management will not
pose the same problems they do in the early days of the warehouse, and
• It is assumed that there will be technological advances that will change the
considerations of building and managing a data warehouse environment.
DBMS Considerations
One major factor affecting data warehouse capacity planning is what portion of the
data warehouse will be managed on disk storage and what portion will be managed on
alternative storage. This very important distinction must be made, at least in broad terms,
prior to the commencement of the data warehouse capacity planning effort.
Once the distinction is made, the next consideration is that of the technology underlying
the data warehouse. The most interesting underlying technology is that of the database
management system (DBMS). The components of the DBMS that are of interest to the capacity
planner are shown in Figure 7.5.
[Figure 7.5: diagram labeled “data warehouse” and “processor”, with numbered points 1 to 4.]
Data warehouse DBMS capacity issues:
1. access to the warehouse
2. DBMS operations and efficiency
3. indexing to the warehouse
4. storage efficiency and requirements
Figure 7.5
The capacity planner is interested in the access to the data warehouse, the DBMS
capabilities and efficiencies, the indexing of the data warehouse, and the efficiency and
operations of storage. Each of these aspects plays a large role in the throughput and operations
of the data warehouse.
Some of the relevant issues with regard to the data warehouse database management
system are:
• How much data can the DBMS handle? (Note: There is always a discrepancy
between the theoretical limits of the volume of data handled by a database
management system and the practical limits.)
• How can the data be stored? Compressed? Indexed? Encoded? How are null values
handled?
• Can locking be suppressed?
• Can requests be monitored and suppressed based on resource utilization?
• Can data be physically denormalized?
• What support is there for metadata as needed in the data warehouse?
Of course, the operating system and any teleprocessing monitoring must be factored
in as well.
Disk Storage and Processing Resources
The two most important parameters of capacity management are the measurement of
disk storage and processing resources. Figure 7.6 shows those resources.
[Figure 7.6: two candidate configurations, each pairing a data warehouse processor with
disk storage, under the question “which configuration?”]
Figure 7.6. The Two Main Aspects of Capacity Planning for the Data Warehouse Environment
are the Disk Storage and Processor Size
The question facing the capacity planner is how much of these resources will be
required on the one-year and the five-year horizons. As has been stated before, there is an
indirect (yet very real) relationship between the volume of data and the processor required.
Calculating Disk Storage
The calculations for space are almost always done exclusively for the current detailed
data in the data warehouse. (If you are not familiar with the different levels of data in the
warehouse, please refer to the Tech Topic on the description of the data warehouse.) The
other levels of data are not included in this analysis because:
• They consume much less storage than the current detailed level of data, and
• They are much harder to identify.
Therefore, the considerations of capacity planning for disk storage center around the
current detailed level. The calculations for disk storage are very straightforward. Figure 7.7
shows the elements of the calculation.
[Figure 7.7: elements of the disk storage calculation. A table row consists of a key and
attributes 1 to n; keys and attributes, rows in each table, and the number of tables
together give the bytes of disk storage.]
Figure 7.7. Estimating Disk Storage Requirements for the Data Warehouse
To calculate disk storage, first the tables that will be in the current detailed level of the
data warehouse are identified. Admittedly, when looking at the data warehouse from the
standpoint of planning, where little or no detailed design has been done, it is difficult to
devise what the tables will be. In truth, only the very largest tables need be identified.
Usually there are a finite number of those tables in even the most complex of
environments.
Once the tables are identified, the next calculation is how many rows there will be in
each table. Of course, the answer to this question depends directly on the granularity of data
found in the data warehouse. The finer the granularity (that is, the more detailed the data),
the greater the number of rows.
In some cases the number of rows can be calculated quite accurately. Where there is
a historical record to rely upon, this number is calculated. For example, where the data
warehouse will contain the number of phone calls made by a phone company’s customers
and where the business is not changing dramatically, this calculation can be made. But in
other cases it is not so easy to estimate the number of occurrences of data.
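The elements named in Figure 7.7 (keys, attributes, rows in each table, tables, bytes) suggest a simple product-and-sum estimate. The sketch below is a minimal illustration under stated assumptions: the table names, row counts and column widths are hypothetical, and the estimate covers raw row data only, with no allowance for indexes, free space or DBMS overhead.

```python
def table_bytes(rows: int, key_bytes: int, attribute_bytes: list) -> int:
    """Raw bytes for one table: rows x (key width + sum of attribute widths)."""
    return rows * (key_bytes + sum(attribute_bytes))


# Hypothetical largest tables at the current detailed level (not from the text).
tables = {
    "call_detail": {"rows": 500_000_000, "key_bytes": 16, "attribute_bytes": [8, 8, 4, 64]},
    "customer": {"rows": 2_000_000, "key_bytes": 8, "attribute_bytes": [40, 120, 32]},
}

per_table = {name: table_bytes(**spec) for name, spec in tables.items()}
total_bytes = sum(per_table.values())

for name, size in per_table.items():
    print(f"{name}: {size / 1e9:.1f} GB")
print(f"total: {total_bytes / 1e9:.1f} GB")
```

Because the row count dominates the result, the choice of granularity discussed above has far more effect on the estimate than the exact column widths.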
Predictable DSS processing is processing that is regularly done, usually on a query
or transaction basis. Predictable DSS processing may be modeled after DSS processing done
today but not in the data warehouse environment. Or predictable DSS processing may be
projected, as are other parts of the data warehouse workload.
The parameters of interest for the data warehouse designer (for both the background
processing and the predictable DSS processing) are:
• The number of times the process will be run,
• The number of I/Os the process will use,
• Whether there is an arrival peak to the processing,
• The expected response time.
These metrics can be arrived at by examining the pattern of calls made to the DBMS
and the interaction with data managed under the DBMS.
The third category of process of interest to the data warehouse capacity planner is that
of the unpredictable DSS analysis. The unpredictable process by its very nature is much less
manageable than either background processing or predictable DSS processing.
However, certain characteristics of the unpredictable process can be projected (even
for the worst-behaving process). For the unpredictable processes:
• The expected response time (in minutes, hours, or days) can be outlined,
• The total amount of I/O can be predicted, and
• Whether the system can be quiesced during the running of the request can be projected.
Once the workload of the data warehouse has been broken into these categories, the
estimation of processor resources can continue. The next decision to be made is whether
the eight-hour daily window will be the critical processing point or whether overnight processing
will be the critical point. Usually the eight-hour day from 8:00 a.m. to 5:00 p.m., while the data
warehouse is being used, is the critical point. Assuming that the eight-hour window is the critical
point in the usage of the processor, a profile of the processing workload is created.
The Workload Matrix
The workload matrix is a matrix that is created as the intersection of the tables in the
data warehouse and the processes (the background and the predictable DSS processes) that
will run in the data warehouse. Figure 7.9 shows a matrix formed by tables and processes.
[Figure 7.9: empty grid with processes a through z as rows and tables 1 through n as columns]
Figure 7.9. A Matrix Approach Helps to Organize the Activities
The workload matrix is then filled in. The first pass at filling in the matrix involves
putting in the number of calls and the resulting I/O from the calls the process would do if the
process were executed exactly once during the eight-hour window. Figure 7.10 shows a
simple form of a matrix that has been filled in for the first step.
[Figure 7.10: the same grid of processes a through z and tables 1 through n; each accessed cell holds a pair such as 2/5, labeled "number of calls" and "I/O’s per access"]
Figure 7.10. The Simplest Form of a Matrix is to Profile Individual Transactions in Terms of
Calls and I/Os
For example, in Figure 7.10 the first cell in the matrix (the cell for process a and table 1)
contains a “2/5”. The 2/5 indicates that upon execution process a makes two calls to the table and
uses a total of 5 I/Os for the calls. The next cell (the cell for process a and table 2) indicates that
process a does not access table 2. The matrix is filled in for the processing profile of the workload
as if each transaction were executed once and only once. Determining the number of I/Os per
call can be a difficult exercise. Whether an I/O will be done depends on many factors:
• The number of rows in a block,
• Whether a block is in memory at the moment it is requested,
• The number of buffers available,
• The traffic through the buffers,
• The DBMS managing the buffers,
• The indexing for the data,
• The other parts of the workload,
• The system parameters governing the workload, etc.
There are, in short, many factors affecting how many physical I/Os will be used. The
I/Os can be calculated manually or automatically (by software specializing in this task). After
the single execution profile of the workload is identified, the next step is to create the actual
workload profile. The workload profile is easy to create. Each row in the matrix is multiplied
by the number of times it will execute in a day. The calculation here is a simple one. Figure
7.11 shows an example of the total I/Os used in an eight-hour day being calculated.
[Figure 7.11: the same grid with each cell holding the total I/Os for the day, and a "total I/O" row at the bottom]
Figure 7.11. After the Individual Transactions are Profiled, the Profile is Multiplied by the
Number of Expected Executions per Day
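The multiplication of the single-execution profile by the daily execution counts can be sketched in a few lines of Python. This is only an illustration: the process names, table names, call/I/O pairs, and execution counts below are assumed values, not the contents of Figures 7.10 and 7.11.

```python
# Single-execution profile: (calls, I/Os) per process and table.
# All names and numbers here are hypothetical.
single_execution = {
    "process a": {"table 1": (2, 5), "table 3": (1, 3)},
    "process b": {"table 2": (1, 6)},
    "process c": {"table 1": (5, 25), "table 2": (3, 9)},
}

# Expected number of executions in the eight-hour window (assumed).
executions_per_day = {"process a": 500, "process b": 250, "process c": 100}


def daily_io_profile(profile, executions):
    """Multiply each process row by its daily execution count."""
    return {
        process: {table: ios * executions[process]
                  for table, (_calls, ios) in cells.items()}
        for process, cells in profile.items()
    }


def column_totals(daily):
    """Sum the daily I/Os per table (the totals row at the bottom)."""
    totals = {}
    for cells in daily.values():
        for table, ios in cells.items():
            totals[table] = totals.get(table, 0) + ios
    return totals


daily = daily_io_profile(single_execution, executions_per_day)
per_table = column_totals(daily)
total_io = sum(per_table.values())  # total eight-hour I/O requirement
```

Keeping the single-execution profile separate from the execution counts means that a revised estimate of how often a process runs only changes one number, which suits the iterative nature of capacity planning.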
At the bottom of the matrix the totals are calculated, to reach a total eight-hour I/O
requirement. After the eight-hour I/O requirement is calculated, the next step is to determine
what the hour-by-hour requirements are. If there is no hourly requirement, then it is
assumed that the queries will be arriving in the data warehouse at a flat arrival rate.
However, there usually are differences in the arrival rates. Figure 7.12 shows the arrival
rates adjusted for the different hours in the day.
[Figure 7.12: chart of activities/transactions a through g against the hours of the day, 7:00 a.m. to 6:00 p.m.]
Figure 7.12. The Different Activities are Tallied at their Processing Rates
for the Hours of the Day
Figure 7.12 shows that the processing required for the different identified queries is
calculated on an hourly basis. After the hourly calculations are done, the next step is to
identify the “high water mark.” The high water mark is that hour of the day when the most
demands will be made of the machine. Figure 7.13 shows the simple identification of the
high water mark.
[Figure 7.13: hourly chart from 7:00 a.m. to 6:00 p.m. with the high water mark indicated]
Figure 7.13. The “High Water Mark” for the Day is Determined
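A minimal sketch of the hourly tally and the high water mark follows. The activity names, hour labels, and I/O figures are assumptions made for illustration; `flat_arrival` covers the case where no hourly requirement is known.

```python
# Hour labels are the start of each hourly slot (assumed window).
HOURS = ["08:00", "09:00", "10:00", "11:00", "12:00",
         "13:00", "14:00", "15:00", "16:00"]

# Assumed I/O demand per activity for each hour (hypothetical numbers).
hourly_io = {
    "activity a": [100, 300, 500, 400, 100, 200, 400, 300, 100],
    "activity b": [50, 100, 200, 250, 50, 100, 150, 100, 50],
    "activity c": [0, 50, 100, 100, 0, 50, 100, 50, 0],
}


def flat_arrival(total_io, hours):
    """Spread a daily total evenly when no hourly requirement is known."""
    return [total_io / hours] * hours


def hourly_totals(per_activity):
    """Tally all activities hour by hour."""
    return [sum(column) for column in zip(*per_activity.values())]


def high_water_mark(totals, labels):
    """Return the (hour label, I/O demand) pair with the largest demand."""
    peak = max(range(len(totals)), key=totals.__getitem__)
    return labels[peak], totals[peak]


totals = hourly_totals(hourly_io)
peak_hour, peak_io = high_water_mark(totals, HOURS)
```

If two hours tie for the largest demand, this sketch reports the earlier one; for sizing purposes only the demand figure matters.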
After the high water mark requirements are identified, the next requirement is to scope
out the requirements for the largest unpredictable request. The largest unpredictable request
must be parameterized by:
• How many total I/Os will be required,
• The expected response time, and
• Whether other processing may or may not be quiesced.
Figure 7.14 shows the specification of the largest unpredictable request.
[Figure 7.14: next, the largest unpredictable activity is sized in terms of length of time for execution and in terms of total I/O]
After the largest unpredictable request is identified, it is merged with the high water
mark. If no quiescing is allowed, then the largest unpredictable request is simply added as
another request. If some of the workload (for instance, the predictable DSS processing) can
be quiesced, then the largest unpredictable request is added to the portion of the workload
that cannot be quiesced. If all of the workload can be quiesced, then the largest
unpredictable request is not added to anything.
The analyst then selects the largest of the following: the largest unpredictable request with
quiescing (if quiescing is allowed), the largest unpredictable request added to the portion of
the workload that cannot be quiesced, or the workload with no unpredictable processing.
The maximum of these numbers then becomes the high water mark for all DSS processing.
Figure 7.15 shows the combinations.
[Figure 7.15: three cases for the largest unpredictable request: added onto the high water mark, partially quiesced, and quiesced]
Figure 7.15. The Impact of the Largest Unpredictable Request is Estimated
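The three quiescing cases can be folded into one small function. The sketch below assumes that the unpredictable request's I/O is spread evenly over its expected response time, which the text does not state; the function and parameter names are hypothetical.

```python
def unpredictable_rate(total_io, response_hours):
    """Hourly I/O of the largest unpredictable request, assuming its
    total I/O is spread evenly over the expected response time."""
    return total_io / response_hours


def combined_high_water_mark(peak_io, request_io, quiescable_io=0.0):
    """Merge the largest unpredictable request with the peak hour.

    peak_io        hourly I/O at the predictable high water mark
    request_io     hourly I/O of the largest unpredictable request
    quiescable_io  part of peak_io that can be quiesced while the
                   request runs (0 = no quiescing, peak_io = all of it)
    """
    if not 0 <= quiescable_io <= peak_io:
        raise ValueError("quiescable_io must lie between 0 and peak_io")
    # The request is added to the portion that cannot be quiesced.
    with_request = (peak_io - quiescable_io) + request_io
    # The workload without the request is still a candidate maximum.
    return max(peak_io, with_request)


request_io = unpredictable_rate(total_io=1200, response_hours=4)
no_quiescing = combined_high_water_mark(800, request_io)
partial = combined_high_water_mark(800, request_io, quiescable_io=500)
full = combined_high_water_mark(800, request_io, quiescable_io=800)
```

With no quiescing the request simply stacks on top of the peak hour; with full quiescing the result is whichever is larger, the request alone or the ordinary peak.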
The maximum number is then compared to a chart of the MIPS required to support the
level of processing identified, as shown in Figure 7.16.
Figure 7.16 merely shows that the processing rate identified from the workload is
matched against a machine power chart.
Of course, no slack has been factored into this estimate. Many shops factor in at least ten
percent. However, factoring in an unused percentage may satisfy the user with better response
time, but it costs money in any case.
The analysis described here is a general plan for the planning of the capacity needs of
a data warehouse. It must be pointed out that the planning is usually done on an iterative
basis. In other words, after the first planning effort is done, another more refined version
soon follows. In all cases it must be recognized that the capacity planning effort is an
estimate.
[Figure 7.16: machine power chart with a scale from 0 MIPS to 100 MIPS in steps of 5 MIPS]
Figure 7.16. Matching the High Water Mark for all Processing Against the Required MIPS
Network Bandwidth
The network bandwidth must not be allowed to slow down the warehouse extraction
and warehouse performance. Verify all assumptions about the network bandwidth before
proceeding with each rollout.
7.3 DEFINE WAREHOUSE PURGING RULES
Purging rules specify when data are to be removed from the data warehouse. Keep in
mind that most companies are interested only in tracking their performance over the last
three to five years. In cases where a longer retention period is required, the end users will
quite likely require only high-level summaries for comparison purposes. They will not be
interested in the detailed or atomic data.
Define the mechanisms for archiving or removing older data from the data warehouse.
Check for any legal, regulatory, or auditing requirements that may warrant the storage of
data in other media prior to actual purging from the warehouse. Acquire the software and
devices that are required for archiving.
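A purging rule of this kind might be expressed as follows. This is a sketch under stated assumptions: a five-year retention period for detailed data, rows identified by an id and a transaction date, and archiving handled elsewhere before the purge. None of these names come from the text.

```python
from datetime import date

DETAIL_RETENTION_YEARS = 5  # assumed; the text suggests three to five years


def purge_cutoff(today, retention_years=DETAIL_RETENTION_YEARS):
    """Earliest transaction date that is still kept in the warehouse."""
    try:
        return today.replace(year=today.year - retention_years)
    except ValueError:
        # today is February 29 and the target year is not a leap year
        return today.replace(year=today.year - retention_years, day=28)


def split_for_purge(rows, today, retention_years=DETAIL_RETENTION_YEARS):
    """Separate rows to keep from rows to archive and then purge.

    rows is an iterable of (row_id, transaction_date) pairs.
    """
    cutoff = purge_cutoff(today, retention_years)
    keep, archive_then_purge = [], []
    for row_id, tx_date in rows:
        if tx_date >= cutoff:
            keep.append(row_id)
        else:
            archive_then_purge.append(row_id)
    return keep, archive_then_purge
```

The second list is named `archive_then_purge` to reflect the order recommended above: data subject to legal, regulatory, or auditing requirements are copied to other media before they are removed.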
7.4 DEFINE SECURITY MANAGEMENT
Keep the data warehouse secure to prevent the loss of competitive information either
to unforeseen disasters or to unauthorized users. Define the security measures for the data
warehouse, taking into consideration both physical security (i.e., where the data warehouse
is physically located), as well as user-access security.
Security management deals with how system integrity is maintained amid possible
man-made threats and risks, intentional or unintentional. Intentional man-made threats
include espionage, hacks, computer viruses, etc. Unintentional threats include those due to
accidents or user ignorance of the effects of their actions. Security management ranges from
identification of risks to determination of security measures and controls, detection of
violations, and analysis of security violations. This section describes the process steps involved
in security management, and discusses factors critical to the success of security management.
Determine and Evaluate IT Assets
Three types of assets must be identified:
• Physical. Computer hardware and software resources, building facilities, and
resources used to house sensitive assets or process sensitive information;
• Information. Sensitive data pertaining to the company’s operations, plans, and
strategies. Examples are marketing and sales plans, detailed financial data, trade
secrets, personnel information, IT infrastructure data, user profiles and passwords,
sensitive office correspondence, minutes of meetings, etc. Lately, there is also concern
about protecting company logos and materials posted on the public Internet; and
• People. Vital individuals holding key roles, whose incapacity or absence will impact
the business in one way or another.
After you identify company assets, the next step is to determine their security level.
Depending on the company’s requirements, assets may be classified into two, three, or more
levels of security, according to the value of the asset being protected. We recommend
having only two levels for organizations with minimal security threats: public and confidential.
A three-level security classification scheme can be implemented if security needs are greater:
public, confidential, and restricted.
Beware of having too many security levels, as this tends to dilute their importance in
the eyes of the user. A large multinational IT vendor used to have five levels of security:
public, internal use only, confidential, confidential restricted, and registered confidential.
Today, they have cut it down to three: public, internal use only, and confidential. Employees
were getting confused as to the differences between the security levels, and the procedures
associated with each one. Having too many security levels proved expensive in terms of
employee education, security facilities, and office practices; the costs were often greater
than the potential losses from a security violation.
Analyze Risk
Every effective security management system reflects a careful evaluation of how much
security is needed. Too little security means the system can easily be compromised
intentionally or unintentionally. Too much security can make the system hard to use or
degrade its performance unacceptably. Security is inversely proportional to utility: if you
want the system to be 100 percent secure, don’t let anybody use it. There will always be
risks to systems, but often these risks are accepted if they make the system more powerful
or easier to use.
Sources of risks to assets can be intentional (criminals, hackers, or terrorists; competitors;
disgruntled employees; or self-serving employees) or unintentional (careless employees; poorly
trained users and system operators; vendors and suppliers).
Acceptance of risk is central to good security management. You will never have enough
resources to secure assets 100 percent; in fact, this is virtually impossible even with unlimited
resources. Therefore, identify all risks to the system, then choose which risks to accept and
which to address via security measures. Here are a few reasons why some risks are acceptable:
• The threat is minimal;
• The possibility of compromise is unlikely;
• The value of the asset is low;
• The cost to secure the asset is greater than the value of the asset;
• The threat will soon go away; and
• Security violations can easily be detected and immediately corrected.
After the risks are identified, the next step is to determine the impact to the business
if the asset is lost or compromised. By doing this, you get a good idea of how many resources
should be assigned to protecting the asset. One user workstation almost certainly deserves
fewer resources than the company’s servers.
The risks you choose to accept should be documented and signed by all parties, not only
to protect the IT organization, but also to make everybody aware that unsecured company
assets do exist.
Define Security Practices
Define in detail the following key areas of security management:
• Asset classification practices. Guidelines for specifying security levels as discussed
above;
• Risk assessment and acceptance. As above;
• Asset ownership. Assignment of roles for handling sensitive assets;
• Asset handling responsibilities. The different tasks and procedures to be followed
by the different entities handling the asset, as identified above;
• Policies regarding mishandling of security assets;
• How security violations are reported and responded to;
• Security awareness practices. Education programs, labeling of assets; and
• Security audits. Unannounced checks of security measures put in place to find
out whether they are functioning.
Implement Security Practices
At this phase, implement the security measures defined in the preceding step. You can
do this in stages to make it easier for everybody to adapt to the new working environment.
Expect many problems at the start, especially with respect to user resistance to their security
tasks, such as using passwords. Staged implementation can be performed:
• By department, starting with the most sensitive assets. The natural first choice
would be the IT department.
• By business function or activity, starting with those that depend upon (or
create) the most sensitive assets. You might begin with all Business Planning
activities, followed by Marketing, Human Resources, etc.
• By location, especially if prioritized sensitive assets are mostly physical. This
approach is easiest to implement. However, its effectiveness is doubtful for
information assets residing in networked computer systems. You might start with
the IT data center, then gradually widen the secured area to encompass the entire
business facility.
• By people, starting with key members of the organization.
Monitor for Violations and Take Corresponding Actions
An effective security management discipline depends on adequate compliance monitoring.
Violations of security practices, whether intentional or unintentional, become more frequent
and serious if not detected and acted upon. A computer hacker who gets away with the first
system penetration will return repeatedly if he knows no one can detect his activities. Users
who get away with leaving confidential documents on their desks will get into bad habits
if not corrected quickly.
There are two major activities here: detecting security violations and responding to
them. With respect to sensitive assets, it is important to know:
• Who has the right to handle the assets (user names);
• How to authenticate those asset users (password, IDs, etc.);
• Who has tried to gain access to them;
• How to restrict access to allowed activities; and
• Who has tried to perform actions beyond those that are allowed.
Document the response to security violations, and follow up immediately after a violation
is detected. The IT organization should have a Computer Emergency Response Team to deal
with security violations. Members of this team should have access to senior management so
that severe situations can easily be escalated.
Responses can be built into your security tools or facilities to ensure that the response
to a violation is immediate. For example, a password checking utility may be designed to
lock out a user name immediately after three invalid password entries. Alarms can be
installed around the data center facility so that if any window or door is forced open,
security guards or police are immediately notified.
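The password lockout mentioned above can be sketched as a small class. The class name, the audit log format, and the decision to count consecutive failures are assumptions for illustration; the actual password check is left out and represented by a boolean.

```python
MAX_FAILED_ATTEMPTS = 3  # from the example in the text


class LoginGuard:
    """Tracks consecutive invalid password entries per user name."""

    def __init__(self, max_failures=MAX_FAILED_ATTEMPTS):
        self.max_failures = max_failures
        self.failures = {}    # user name -> consecutive failures
        self.locked = set()   # user names currently locked out
        self.audit_log = []   # (user name, event) pairs for reporting

    def attempt(self, user, password_ok):
        """Record one login attempt; return True if access is granted."""
        if user in self.locked:
            self.audit_log.append((user, "rejected: locked"))
            return False
        if password_ok:
            self.failures[user] = 0
            self.audit_log.append((user, "login"))
            return True
        self.failures[user] = self.failures.get(user, 0) + 1
        self.audit_log.append((user, "invalid password"))
        if self.failures[user] >= self.max_failures:
            self.locked.add(user)
            self.audit_log.append((user, "locked out"))
        return False
```

The audit log doubles as raw material for the management reports on violations and trends, since it records who tried to gain access and when the lockout was triggered.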
A critical part of this activity is the generation of reports for management that discuss
significant security violations and trends of minor incidents. The objective is to spot potential
major security violations before they cause serious damage.
Re-evaluate IT Assets and Risks
Security management is a discipline that never rests. Some major changes that would
require a reassessment of the security management practice are:
• Security violations are rampant
• Organizational structure or composition changes
• Business environment changes
• Technology changes
• Budget allocation decreases
Additional precautions are required if either the warehouse data or warehouse reports
are available to users through an intranet or over the public Internet infrastructure.
7.5 DEFINE BACKUP AND RECOVERY STRATEGY
Define the backup and recovery strategy for the warehouse, taking into consideration
the following factors:
• Data to be backed up. Identify the data that must be backed up on a regular
basis. This gives an indication of the regular backup size. Aside from warehouse
data and metadata, the team might also want to back up the contents of the
staging or de-duplication areas of the warehouse.
• Batch window of the warehouse. Backup mechanisms are now available to
support the backup of data even when the system is online, although these are
expensive. If the warehouse does not need to be online 24 hours a day, 7 days a
week, determine the maximum allowable down time for the warehouse (i.e.,
determine its batch window). Part of that batch window is allocated to the regular
warehouse load and, possibly, to report generation and other similar batch jobs.
Determine the maximum time period available for regular backups and backup
verification.
• Maximum acceptable time for recovery. In case of disasters that result in the
loss of warehouse data, the backups will have to be restored in the quickest way
possible. Different backup mechanisms imply different time frames for recovery.
Determine the maximum acceptable length of time for the warehouse data and
metadata to be restored, quality assured, and brought online.
• Acceptable costs for backup and recovery. Different backup mechanisms imply
different costs. The enterprise may have budgetary constraints that limit its backup
and recovery options.
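The batch window arithmetic can be sketched as below. The figures, the function names, and the assumption that throughput scales linearly with the number of parallel streams are all illustrative; real backup throughput should be measured.

```python
def backup_time_available(batch_window_h, load_h, reports_h=0.0):
    """Hours left in the batch window for backup and verification."""
    return batch_window_h - load_h - reports_h


def backup_hours_needed(size_gb, gb_per_hour, parallel_streams=1):
    """Estimated backup duration, assuming throughput scales linearly
    with the number of parallel streams (an optimistic assumption)."""
    return size_gb / (gb_per_hour * parallel_streams)


def backup_fits(size_gb, gb_per_hour, available_h, parallel_streams=1):
    """True if the estimated backup completes within the time available."""
    return backup_hours_needed(size_gb, gb_per_hour, parallel_streams) <= available_h


# Hypothetical example: 8-hour batch window, 5 hours of load, 1 hour of reports.
available = backup_time_available(batch_window_h=8, load_h=5, reports_h=1)
single_stream_ok = backup_fits(600, 100, available)
four_streams_ok = backup_fits(600, 100, available, parallel_streams=4)
```

In this example a single stream would need six hours against two available, which is the kind of gap that leads a team to consider parallel streams, incremental backups, or online backup mechanisms.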
Also consider the following when selecting the backup mechanism:
• Archive format. Use a standard archiving format to eliminate potential recovery
problems.
• Automatic backup devices. Without these, the backup media (e.g., tapes) will
have to be changed by hand each time the warehouse is backed up.
• Parallel data streams. Commercially available backup and recovery systems now
support the backup and recovery of databases through parallel streams of data into
and from multiple removable storage devices. This technology is especially helpful
for the large databases typically found in data warehouse implementations.
• Incremental backups. Some backup and recovery systems also support incremental
backups to reduce the time required to back up daily. Incremental backups archive
only new and updated data.
• Offsite backups. Remember to maintain offsite backups to prevent the loss of
data due to site disasters such as fires.
• Backup and recovery procedures. Formally define and document the backup
and recovery procedures. Perform recovery practice runs to ensure that the
procedures are clearly understood.
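The idea behind an incremental backup, archiving only new and updated data, can be illustrated with a short sketch. It assumes each record carries a last-modified timestamp, which is a hypothetical layout rather than the behavior of any particular backup product.

```python
from datetime import datetime


def incremental_selection(records, last_backup_time):
    """Return the keys of records created or updated since the last backup.

    records is an iterable of (key, last_modified) pairs.
    """
    return [key for key, modified in records if modified > last_backup_time]


# Hypothetical example data.
last_backup = datetime(2006, 3, 1, 2, 0)
records = [
    ("sales_2006_02", datetime(2006, 2, 28, 23, 30)),  # unchanged since backup
    ("sales_2006_03", datetime(2006, 3, 1, 22, 15)),   # loaded after backup
    ("customer_dim", datetime(2006, 3, 1, 22, 40)),    # updated after backup
]
to_back_up = incremental_selection(records, last_backup)
```

A full recovery from incremental backups needs the last full backup plus every increment taken since, which is worth checking against the maximum acceptable time for recovery.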
7.6 SET UP COLLECTION OF WAREHOUSE USAGE STATISTICS
Warehouse usage statistics are collected to provide the data warehouse designer with
inputs for further refining the data warehouse design and to track general usage and
acceptance of the warehouse.
Define the mechanism for collecting these statistics, and assign resources to monitor
and review these regularly.
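One possible collection mechanism is a simple tally of queries by user and by table, sketched below. The class and method names are hypothetical, and a real warehouse would more likely draw these figures from DBMS or query tool logs.

```python
from collections import Counter


class UsageStats:
    """Tallies warehouse queries by user and by table."""

    def __init__(self):
        self.by_user = Counter()
        self.by_table = Counter()

    def record(self, user, tables):
        """Record one query issued by `user` that touched `tables`."""
        self.by_user[user] += 1
        self.by_table.update(set(tables))  # count each table once per query

    def unused_tables(self, all_tables):
        """Tables that no recorded query has touched."""
        return sorted(t for t in all_tables if self.by_table[t] == 0)


stats = UsageStats()
stats.record("ana", ["sales", "sales", "customer"])
stats.record("ben", ["sales"])
never_used = stats.unused_tables(["sales", "customer", "inventory"])
```

Counts by user indicate acceptance of the warehouse, while tables that are never queried are candidates for review in the next design refinement.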
In Summary
The capacity planning process and the issue tracking and resolution process are critical
to the successful development and deployment of data warehouses, especially during early
implementations.
The other management and support processes become increasingly important as the
warehousing initiative progresses further.
CHAPTER 8
DATA WAREHOUSE PLANNING
The data warehouse planning approach presented in this chapter describes the activities
related to planning one rollout of the data warehouse. The activities discussed below build
on the results of the warehouse strategy formulation described in Chapter 6.
Data warehouse planning further details the preliminary scope of one warehouse rollout
by obtaining detailed user requirements for queries and reports, creating a preliminary
warehouse schema design to meet the user requirements, and mapping source system fields
to the warehouse schema fields. By so doing, the team gains a thorough understanding of
the effort required to implement that one rollout.
A planning project typically lasts between five and eight weeks, depending on the scope
of the rollout. The progress of the team varies, depending (among other things) on the
participation of enterprise resource persons, the availability and quality of source system
documentation, and the rate at which project issues are resolved.
Upon completion of the planning effort, the team moves into data warehouse
implementation for the planned rollout. The activities for data warehouse implementation
are discussed in Chapter 9.
8.1 ASSEMBLE AND ORIENT TEAM
Identify all parties who will be involved in the data warehouse implementation and
brief them about the project. Distribute copies of the warehouse strategy as background
material for the planning activity.
Define the team setup if a formal project team structure is required. Take the time and
effort to orient the team members on the rollout scope, and explain the role of each member
of the team. This approach allows the project team members to set realistic expectations
about skill sets, project workload, and project scope.
Assign project team members to specific roles, taking care to match skill sets to role
responsibilities. When all assignments have been completed, check for unavoidable training
requirements due to skill-role mismatches (i.e., the team member does not possess the
appropriate skill sets to properly fulfill his or her assigned role).
If required, conduct training for the team members to ensure a common understanding
of data warehousing concepts. It is easier for everyone to work together if all have a common
goal and an agreed approach for attaining it. Describe the schedule of the planning project
to the team. Identify milestones or checkpoints along the planning project timeline. Clearly
explain dependencies between the various planning tasks.
Considering the short time frame for most planning projects, conduct status meetings
at least once a week with the team and with the project sponsor. Clearly set objectives for
each week. Use the status meeting as the venue for raising and resolving issues.
8.2 CONDUCT DECISIONAL REQUIREMENTS ANALYSIS
Decisional Requirements Analysis is one of two activities that can be conducted in
parallel during Data Warehouse Planning; the other activity being Decisional Source System
Audit (described in the next section). The object of Decisional Requirements Analysis is to
gain a thorough understanding of the information needs of decision-makers.
[Figure: TOP-DOWN: User Requirements]
Decisional Requirements Analysis is Working Top-Down
Decisional requirements analysis represents the top-down aspect of data warehousing.
Use the warehouse strategy results as the starting point of the decisional requirements
analysis; a preliminary analysis should have been conducted as part of the warehouse
strategy formulation.
Review the intended scope of this warehouse rollout as documented in the warehouse
strategy document. Finalize this scope by further detailing the preliminary decisional
requirements analysis. It will be necessary to revisit the user representatives. The rollout
scope is typically expressed in terms of the queries or reports that are to be supported by
the warehouse by the end of this rollout. The project sponsor must review and approve the
scope to ensure that management expectations are set properly.
Document any known limitations about the source systems (e.g., poor data quality,
missing data items). Provide this information to source system auditors for their confirmation.
Verified limitations in source system data are used as inputs to finalizing the scope of the
rollout: if the data are not available, they cannot be loaded into the warehouse.
Take note that the scope strongly influences the implementation time frame for this
rollout. Too large a scope will make the project unmanageable. As a general rule, limit the
scope of each project or rollout so that it can be delivered in three to six months by a
full-time team of 6 to 12 team members.
Conducting Warehouse Planning Without a Warehouse Strategy
It is not unusual for enterprises to go directly into warehouse planning without previously
formulating a warehouse strategy. This typically happens when a group of users is clearly
driving the warehouse initiative and is more than ready to participate in the initial rollout
as user representatives. More often than not, these users have already taken the initiative
to list and prioritize their information requirements.
In this type of situation, a number of tasks from the strategy formulation will have to be
conducted as part of the planning for the first warehouse rollout. These tasks are as follows:
• Determine organizational context. An understanding of the organization is
always helpful in any warehousing project, especially since organizational issues
may completely derail the warehouse initiative.
• Define data warehouse rollouts. Although business users may have already
predefined the scope of the first rollout, it helps the warehouse architect to know
what lies ahead in subsequent rollouts.
• Define data warehouse architecture. Define the data warehouse architecture
for the current rollout (and if possible, for subsequent rollouts).
• Evaluate development and production environment and tools. The strategy
formulation was expected to produce a short-list of tools and computing environments
for the warehouse. This evaluation will be finalized during planning by the actual
selection of both environments and tools.
8.3 CONDUCT DECISIONAL SOURCE SYSTEM AUDIT
The decisional source system audit is a survey of all information systems that are
current or potential sources of data for the data warehouse.
A preliminary source system audit during warehouse strategy formulation should provide
a complete inventory of data sources. Identify all possible source systems for the warehouse
if this information is currently unavailable.
[Figure: BOTTOM-UP: Source Systems, External Data]
Data Sources can be Internal or External
Data sources are primarily internal. The most obvious candidates are the operational
systems that automate the day-to-day business transactions of the enterprise. Note that
aside from transactional or operational processing systems, one often-used data source is
the enterprise general ledger, especially if the reports or queries focus on profitability
measurements.
If external data sources are also available, these may be integrated into the warehouse.
DBAs and IT Support Staff are the Best Resource Persons
The best resource persons for a decisional source system audit of internal systems are
the database administrators (DBAs), system administrators and other IT staff who support
each internal system that is a potential source of data. With their intimate knowledge of the
systems, they are in the best position to gauge the suitability of each system as a warehouse
data source.
These individuals are also more likely to be familiar with any data quality problems
that exist in the source systems. Clearly document any known data quality problems, as
these have a bearing on the data extraction and cleansing processes that the warehouse
must support. Known data quality problems also provide some indication of the magnitude
of the data cleanup task.
In organizations where the production of managerial reports has already been automated
(but not through an architected data warehouse), the DBAs and IT support staff can provide
very valuable insight about the data that are presently collected. These staff members can
also provide the team with a good idea of the business rules that are used to transform the
raw data into management reports.
Conduct individual and group interviews with the IT organization to understand the
data sources that are currently available. Review all available documentation on the candidate
source systems. This is without doubt one of the most time-consuming and detailed tasks
in data warehouse planning, especially if up-to-date documentation of the existing systems
is not readily available.
As a consequence, the whole-hearted support of the IT organization greatly facilitates
this entire activity.
Obtain the following documents and information if these have not yet been collected as
part of the data warehouse strategy definition:
• Enterprise IT architecture documentation. This refers to all documentation
that provides a bird’s eye view of the IT architecture of the enterprise, including
but not limited to:
    • System architecture diagrams and documentation: A model of all the
    information systems in the enterprise and their relationships to one another.
    • Enterprise data model: A model of all data that are currently stored or maintained
    by the enterprise. This may also indicate which systems support which data
    item.
    • Network architecture: A diagram showing the layout and bandwidth of the
    enterprise network, especially for the locations of the project team and the
    user representatives participating in this rollout.
• User and technical manuals of each source system. This refers to data models
and schemas for all existing information systems that are candidate data sources.
If extraction programs are used for ad hoc reporting, obtain documentation of these
extraction programs as well. Obtain copies of all other available system
documentation, whenever possible.
• Database sizing. For each source system, identify the type of database used, the
typical backup size, as well as the backup format and medium. It is helpful also to
know what data are actually backed up on a regular basis. This is particularly
important if historical data are required in the warehouse and such data are available
only in backups.
• Batch window. Determine the batch windows for each of the operational systems.
Identify all batch jobs that are already performed during the batch window. Any
data extraction jobs required to feed the data warehouse must be completed within
the batch windows of each source system without affecting any of the existing
batch jobs already scheduled. Under no circumstances will the team want to disrupt
normal operations on the source systems.
• Future enhancements. What application development projects, enhancements,
or acquisition plans have been defined or approved for implementation in the next
6 to 12 months, for each of the source systems? Changes to the data structure will
affect the mapping of source system fields to data warehouse fields. Changes to the
operational systems may also result in the availability of new data items or the loss
of existing ones.
• Data scope. Identify the most important tables of each source system. This
information is ideally available in the system documentation. However, if definitions
of these tables are not documented, the DBAs are in the best position to provide
that information. Also required are business descriptions or definitions of each field
in each important table, for all source systems.
• System codes and keys. Each of the source systems no doubt uses a set of codes,
and each system will be implementing key generation routines as well. If these are
not documented, ask the DBAs to provide a list of all valid codes and a textual
description for each of the system codes that are used. If the system codes have
changed over time, ask the DBAs to provide all system code definitions for the
relevant time frame. All key generation routines should likewise be documented.
These include rules for assigning customer numbers, product numbers, order
numbers, invoice numbers, etc. Check whether the keys are reused (or recycled) for
new records over the years. Reused keys may cause errors during deduplication
and must therefore be thoroughly understood.
• Extraction mechanisms. Check if data can be extracted or read directly from the
production databases. Relational databases such as Oracle or Sybase are open and
should be readily accessible. Application packages with proprietary database
management software, however, may present problems, especially if the data
structures are not documented. Determine how changes made to the database are
tracked, perhaps through an audit log. Determine also if there is a way to identify
data that have been changed or updated. These are important inputs to the data
extraction process.
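The check for reused keys described under "System codes and keys" can be sketched in a few lines. This is only an illustration: the record layout (key, entity name, year created) and the sample customers are assumptions made for the example, and a real audit would compare whatever attributes the source system actually stores.

```python
from collections import defaultdict

def find_reused_keys(records):
    """Return keys that appear to identify more than one distinct entity.

    `records` is an iterable of (key, entity_name, created_year) tuples,
    a hypothetical layout for rows pulled from historical extracts.
    """
    names_by_key = defaultdict(set)
    for key, entity_name, _created_year in records:
        # Normalize case and whitespace so formatting differences
        # are not mistaken for a different entity.
        names_by_key[key].add(entity_name.strip().lower())
    return {key: sorted(names)
            for key, names in names_by_key.items()
            if len(names) > 1}

# Hypothetical sample data.
customers = [
    ("C001", "Acme Trading", 1998),
    ("C002", "Bolt Hardware", 1999),
    ("C001", "Zenith Foods", 2004),    # key C001 recycled for a new customer
    ("C002", "bolt hardware ", 2005),  # same customer, different formatting
]

reused = find_reused_keys(customers)
print(reused)  # {'C001': ['acme trading', 'zenith foods']}
```

Keys flagged this way are candidates for review with the DBAs, not proof of reuse; a customer that has simply been renamed would be flagged as well.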
8.4 DESIGN LOGICAL AND PHYSICAL WAREHOUSE SCHEMA
Design the data warehouse schema that can best meet the information requirements
of this rollout. Two main schema design techniques are available:
• Normalization. The database schema is designed using the normalization
techniques traditionally used for OLTP applications;
• Dimensional modeling. This technique produces denormalized, star schema designs
consisting of fact and dimension tables. A variation of the dimensional star schema
also exists (i.e., snowflake schema).
There are ongoing debates regarding the applicability or suitability of both these modeling
techniques for data warehouse projects, although dimensional modeling has certainly been
gaining popularity in recent years. Dimensional modeling has been used successfully in
larger data warehousing implementations across multiple industries. The popularity of this
modeling technique is also evident from the number of databases and front-end tools that
now support optimized performance with star schema designs (e.g., Oracle RDBMS 8, R/olap
XL).
A discussion of dimensional modeling techniques is provided in Chapter 12.
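To make the idea of a star schema concrete, the sketch below builds one fact table and two dimension tables in an in-memory SQLite database and runs a typical decisional query against them. The table names, column names and sample rows are illustrative assumptions only; they are not taken from any particular warehouse design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One fact table surrounded by dimension tables: the "star".
conn.executescript("""
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT
);
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,
    year     INTEGER,
    month    INTEGER
);
CREATE TABLE fact_sales (
    product_key  INTEGER REFERENCES dim_product(product_key),
    date_key     INTEGER REFERENCES dim_date(date_key),
    sales_amount REAL
);
""")

conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                 [(1, "Widget"), (2, "Gadget")])
conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                 [(20060101, 2006, 1), (20060201, 2006, 2)])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 [(1, 20060101, 100.0), (1, 20060201, 150.0), (2, 20060101, 80.0)])

# A typical query joins the fact table to its dimensions and aggregates.
rows = conn.execute("""
SELECT p.product_name, d.year, SUM(f.sales_amount)
FROM fact_sales f
JOIN dim_product p ON p.product_key = f.product_key
JOIN dim_date d ON d.date_key = f.date_key
GROUP BY p.product_name, d.year
ORDER BY p.product_name
""").fetchall()

print(rows)  # [('Gadget', 2006, 80.0), ('Widget', 2006, 250.0)]
```

A snowflake variation would further normalize a dimension, for example by moving product categories out of dim_product into a table of their own.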
8.5 PRODUCE SOURCE-TO-TARGET FIELD MAPPING
The Source-To-Target Field Mapping documents how fields in the operational (source)
systems are transformed into data warehouse fields. Under no circumstances should this
mapping be left vague or open to misinterpretation, especially for financial data. The mapping
allows non-team members to audit the data transformations implemented by the warehouse.
[Figure: BACK-END track (Extraction, Integration, QA, DW Load, Aggregates, Metadata)]
Many-to-Many Mappings
A single field in the data warehouse may be populated by data from more than one
source system. This is a natural consequence of the data warehouse’s role of integrating
data from multiple sources.
The classic examples are customer name and product name. Each operational system
will typically have its own customer and product records. A data warehouse field called
customer name or product name will therefore be populated by data from more than one
system.
Conversely, a single field in the operational systems may need to be split into several
fields in the warehouse. There are operational systems that still record addresses as lines
of text, with field names like address line 1, address line 2, etc. These can be split into
multiple address fields such as street name, city, country and mail/zip code. Other examples
are numeric figures or balances that have to be allocated correctly to two or more different
fields.
To eliminate any confusion as to how data are transformed as the data items are moved
from the source systems to the warehouse database, create a source-to-target field mapping
that maps each source field in each source system to the appropriate target field in the data
warehouse schema. Also, clearly document all business rules that govern how data values
are integrated or split up. This is required for each field in the source-to-target field mapping.
The source-to-target field mapping is critical to the successful development and
maintenance of the data warehouse. This mapping serves as the basis for the data extraction
and transformation subsystems. Figure 8.1 shows an example of this mapping.
SOURCE
No.   System   Table   Field
1     SS1      ST1     SF1
2     SS1      ST1     SF2
3     SS1      ST1     SF3
4     SS1      ST1     SF4
5     SS1      ST2     SF5
6     SS1      ST2     SF6
7     SS2      ST2     SF7
8     SS2      ST3     SF8
9     SS2      ST3     SF9
10    SS2      ST3     SF10
...   ...      ...     ...

TARGET
No.   Schema   Table   Field
1     R1       TT1     TF1
2     R1       TT1     TF2
3     R1       TT1     TF3
4     R1       TT2     TF4
5     R1       TT2     TF5
6     R1       TT2     TF6
7     R1       TT2     TF7
...   ...      ...     ...

SOURCE: SS1 = Source System 1. ST1 = Source Table 1. SF1 = Source Field 1
TARGET: R1 = Rollout 1. TT1 = Target Table 1. TF1 = Target Field 1
Figure 8.1. Sample Source-to-Target Field Mapping.
Revise the data warehouse schema on an as-needed basis if the field-to-field mapping
yields missing data items in the source systems. These missing data items may prevent the
warehouse from producing one or more of the requested queries or reports. Raise these
types of scope issues as quickly as possible to the project sponsors.
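A source-to-target field mapping can also be kept in a machine-readable form so that the documented business rule and the transformation that implements it stay together. The following is a minimal sketch under stated assumptions: the class FieldMapping, the field names (cust_name, address_line_2, and so on) and the "city, country" address format are all hypothetical and chosen only to illustrate one source field being split into two target fields.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class FieldMapping:
    source_system: str
    source_table: str
    source_field: str
    target_table: str
    target_field: str
    rule: str                        # business rule, in plain words
    transform: Callable[[str], str]  # code that applies the rule

# Assumed address format for this example: "<city>, <country>".
def city_from_address_line(line: str) -> str:
    return line.split(",")[0].strip()

def country_from_address_line(line: str) -> str:
    return line.split(",")[1].strip()

MAPPINGS = [
    FieldMapping("SS1", "ST1", "cust_name", "TT1", "customer_name",
                 "Trim and title-case", lambda v: v.strip().title()),
    # One source field split into two target fields.
    FieldMapping("SS2", "ST3", "address_line_2", "TT1", "city",
                 "Text before the comma", city_from_address_line),
    FieldMapping("SS2", "ST3", "address_line_2", "TT1", "country",
                 "Text after the comma", country_from_address_line),
]

def apply_mappings(source_system, source_table, row):
    """Build the target fields produced by one source row."""
    target = {}
    for m in MAPPINGS:
        if (m.source_system, m.source_table) == (source_system, source_table) \
                and m.source_field in row:
            target[f"{m.target_table}.{m.target_field}"] = m.transform(row[m.source_field])
    return target

result = apply_mappings("SS2", "ST3", {"address_line_2": "Manila, Philippines"})
print(result)  # {'TT1.city': 'Manila', 'TT1.country': 'Philippines'}
```

Keeping the rule text next to the transform lets non-team members audit the mapping without reading the extraction programs themselves.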
Historical Data and Evolving Data Structures
If users require the loading of historical data into the data warehouse, two things must
be determined quickly:
• Changes in schema. Determine if the schemas of all source systems have changed
over the relevant time period. For example, if the retention period of the data
warehouse is two years and data from the past two years have to be loaded into
the warehouse, the team must check for possible changes in source system schemas
over the past two years. If the schemas have changed over time, the task of extracting
the data immediately becomes more complicated. Each different schema may require
a different source-to-target field mapping.
• Availability of historical data. Determine also if historical data are available for
loading into the warehouse. Backups during the relevant time period may not
contain the required data item. Verify assumptions about the availability and
suitability of backups for historical data loads.
These two tedious tasks will be more difficult to complete if documentation is out of
date or insufficient and if none of the IT professionals in the enterprise today are familiar
with the old schemas.
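When source schemas have changed during the retention period, one way to organize the work is to keep one mapping per schema version and pick the mapping by the date of the backup being loaded. The sketch below assumes a hypothetical schema history with two versions; the dates and field names are invented for illustration.

```python
from datetime import date

# Hypothetical schema history of one source table:
# (date the schema took effect, source field -> warehouse field).
SCHEMA_VERSIONS = [
    (date(2004, 1, 1), {"custname": "customer_name"}),
    (date(2005, 7, 1), {"customer_nm": "customer_name",
                        "industry_cd": "customer_industry"}),
]

def mapping_for(backup_date):
    """Return the field mapping in force on the given backup date."""
    applicable = [v for v in SCHEMA_VERSIONS if v[0] <= backup_date]
    if not applicable:
        raise ValueError(f"no schema version documented for {backup_date}")
    # The latest version that had already taken effect applies.
    return max(applicable, key=lambda v: v[0])[1]

def to_target(row, backup_date):
    """Rename the fields of one historical row to warehouse field names."""
    mapping = mapping_for(backup_date)
    return {mapping[field]: value for field, value in row.items() if field in mapping}

old_row = to_target({"custname": "Acme"}, date(2004, 6, 30))
new_row = to_target({"customer_nm": "Acme", "industry_cd": "RETAIL"}, date(2006, 1, 15))
print(old_row)  # {'customer_name': 'Acme'}
print(new_row)  # {'customer_name': 'Acme', 'customer_industry': 'RETAIL'}
```

Note that the older rows carry no customer_industry value at all, which is exactly the kind of gap in historical data that should be verified before the load is planned.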
8.6 SELECT DEVELOPMENT AND PRODUCTION ENVIRONMENT AND TOOLS
Finalize the computing environment and tool set for this rollout based on the results
of the development and production environment and tools study during the data warehouse
strategy definition. If an exhaustive study and selection had been performed during the
strategy definition stage, this activity becomes optional.
If, on the other hand, the warehouse strategy was not formulated, the enterprise must
now evaluate and select the computing environment and tools that will be purchased for the
warehousing initiative. This activity may take some time, especially if the evaluation process
requires extensive vendor presentations and demonstrations, as well as site visits. This
activity is therefore best performed early on to allow for sufficient time to study and select
the tools. Sufficient lead times are also required for the delivery (especially if importation
is required) of the selected equipment and tools.
8.7 CREATE PROTOTYPE FOR THIS ROLLOUT
Using the short-listed or final tools and production environment, create a prototype of
the data warehouse.
A prototype is typically created and presented for one or more of the following reasons:
• To assist in the selection of front-end tools. It is sometimes possible to ask
warehousing vendors to present a prototype to the evaluators as part of the selection
process. However, such prototypes will naturally not be very specific to the actual
data and reporting requirements of the rollout.
• To verify the correctness of the schema design. The team is better served by
creating a prototype using the logical and physical warehouse schema for this
rollout. If possible, use actual data from the operational systems for the prototype
queries and reports. If the user requirements (in terms of queries and reports) can
be created using the schema, then the team has concretely verified the correctness
of the schema design.
• To verify the usability of the selected front-end tools. The warehousing team
can invite representatives from the user community to actually use the prototype
to verify the usability of the selected front-end tools.
• To obtain feedback from user representatives. The prototype is often the first
concrete output of the planning effort. It provides users with something that they
can see and touch. It allows users to experience for the first time the kind of
computing environment they will have when the warehouse is up. Such an experience
typically triggers a lot of feedback (both positive and negative) from users. It may
even cause users to articulate previously unstated requirements.
Regardless of the type of feedback, however, it is always good to hear what the
users have to say as early as possible. This provides the team more time to adjust
the approach or the design accordingly.
During the prototype presentation meeting, the following should be made clear to
the business users who will be viewing or using the prototype:
• Objective of the prototype meeting. State the objectives of the meeting clearly
to properly orient all participants. If the objective is to select a tool set, then the
attention and focus of users should be directed accordingly.
• Nature of data used. If actual data from the operational systems are used with
the prototype, make clear to all business users that the data have not yet been
quality assured. If dummy or test data are used, then this should be clearly
communicated as well. Users who are concerned with the correctness of the prototype
data have unfortunately sidetracked many prototype presentations.
• Prototype scope. If the prototype does not yet mimic all the requirements identified
for this rollout, then say so. Don’t wait for the users to explicitly ask whether the
team has considered (or forgotten!) the requirements they had specified in earlier
meetings or interviews.
8.8 CREATE IMPLEMENTATION PLAN OF THIS ROLLOUT
With the scope now fully defined and the source-to-target field mapping fully specified,
it is now possible to draft an implementation plan for this rollout. Consider the following
factors when creating the implementation plan:
• Number of source systems, and their related extraction mechanisms and
logistics. The more source systems there are, the more complex the extraction and
integration processes will be. Also, source systems with open computing environments
present fewer complications with the extraction process than do proprietary systems.
• Number of decisional business processes supported. The larger the number
of decisional business processes supported by this rollout, the more users there are
who will want to have a say about the data warehouse contents, the definition of
terms, and the business rules that must be respected.
• Number of subject areas involved. This is a strong indicator of the rollout size.
The more subject areas there are, the more fact tables will be required. This
implies more warehouse fields to map to source systems and, of course, a larger
rollout scope.
• Estimated database size. The estimated warehouse size provides an early
indication of the loading, indexing, and capacity challenges of the warehousing
effort. The database size allows the team to estimate the length of time it takes to
load the warehouse regularly (given the number of records and the average length
of time it takes to load and index each record).
• Availability and quality of source system documentation. A lot of the team’s
time will be wasted on searching for or misunderstanding the data that are available
in the source systems. The availability of good-quality documentation will
significantly improve the productivity of source system auditors and technical
analysts.
• Data quality issues and their impact on the schedule. Unfortunately, there
is no direct way to estimate the impact of data quality problems on the project
schedule. Any attempts to estimate the delays often produce unrealistically low
figures, much to the consternation of warehouse project managers. Early knowledge
and documentation of data quality issues will help the team to anticipate problems.
Also, data quality is very much a user responsibility that cannot be left to IT to
solve. Without sufficient user support, data quality problems will continually be a
thorn in the side of the warehouse team.
• Required warehouse load rate. A number of factors external to the warehousing
team (particularly batch windows of the operational systems and the average size
of each warehouse load) will affect the design and approach used by the warehouse
implementation team.
• Required warehouse availability. The warehouse itself will also have batch
windows. The maximum allowed down time for the warehouse also influences the
design and approach of the warehousing team. A fully available warehouse (24
hours × 7 days) requires an architecture that is completely different from that
required by a warehouse that is available only 12 hours a day, 5 days a week.
These different architectural requirements naturally result in differences in cost
and implementation time frame.
• Lead time for delivery and setup of selected tools, development, and
production environment. Project schedules sometimes fail to consider the length
of time required to set up the development and production environments of the
warehousing project. While some warehouse implementation tasks can proceed
while the computing environments and tools are on their way, significant progress
cannot be made until the correct environment and tool sets are available.
• Time frame required for IT infrastructure upgrades. Some IT infrastructure
upgrades (e.g., network upgrade or extension) may be required or assumed by the
warehousing project. These dependencies should be clearly marked on the project
schedule. The warehouse Project Manager must coordinate with the infrastructure
Project Manager to ensure that sufficient communication exists between all concerned
teams.
• Business sponsor support and user participation. There is no way to
overemphasize the importance of Project Sponsor support and end user participation.
No amount of planning by the warehouse Project Manager (and no amount of effort
by the warehouse project team) can make up for the lack of participation by these
two parties.
• IT support and participation. Similarly, the support and participation of the
database administrators and system administrators will make a tremendous
difference to the overall productivity of the warehousing team.
• Required vs. existing skill sets. The match (or mismatch) of personnel skill
sets and role assignments will likewise affect the productivity of the team. If this
is an early or pilot project, then training on various aspects of warehousing will
most likely be required. These training sessions should be factored into the
implementation schedule as well and, ideally, should take place before the actual
skills are required.
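The load-time estimate mentioned under "Estimated database size" is simple arithmetic: the number of records multiplied by the average time to load and index each record. The sketch below works through one case and compares the result with the time left in the warehouse's own batch window. All figures (2,000,000 records, 0.005 seconds per record, a 6-hour window with 4 hours already committed) are invented for illustration, and the check assumes the jobs run one after another.

```python
def estimated_load_hours(record_count, seconds_per_record):
    """Hours needed to load and index `record_count` records."""
    return record_count * seconds_per_record / 3600.0

def fits_batch_window(load_hours, window_hours, other_jobs_hours):
    """True if the load fits in the time left after the other batch jobs.

    Assumes the jobs run sequentially, so their durations simply add up.
    """
    return load_hours <= window_hours - other_jobs_hours

# Hypothetical figures: 2,000,000 records at 0.005 s each
# = 10,000 s, or about 2.78 hours.
hours = estimated_load_hours(record_count=2_000_000, seconds_per_record=0.005)
print(round(hours, 2))  # 2.78

# A 6-hour window with 4 hours already taken leaves only 2 hours.
print(fits_batch_window(hours, window_hours=6, other_jobs_hours=4))  # False
```

An estimate of this kind is only as good as the measured per-record time, so it is worth re-measuring on the prototype or development environment once one is available.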
8.9 WAREHOUSE PLANNING TIPS AND CAVEATS
The actual data warehouse planning activity will rarely be a straightforward exercise.
Before conducting your planning activity, read through this section for planning tips and
caveats.
Follow the Data Trail
In the absence of true decision support systems, enterprises have, over the years, been
forced to find stopgap or interim solutions for producing the managerial or decisional reports
that decision-makers require. Some of these solutions require only simple extraction programs
that are regularly run to produce the required reports. Other solutions require a complex
series of steps that combine manual data manipulation, extraction programs, conversion
formulas, and spreadsheet macros.
In the absence of a data warehouse, many of the managerial reporting requirements
are classified as ad hoc reports. As a result, most of these report generation programs and
processes are largely undocumented and are known only by the people who actually produce
the reports. This naturally leads to a lack of standards (i.e., different people may apply
different formulas and rules to the same data item), and possible inconsistencies each time
the process is executed. Fortunately, the warehouse project team will be in a position to
introduce standards and consistent ways of manipulating data.
Following the data trail (see Figure 8.2) from the current management reports, back to
their respective data sources can prove to be a very enlightening exercise for data warehouse
planners.
Figure 8.2. Follow the Data Trail (diagram labels: Data Extract 1, Data Extract 2,
Manual Transformation, Current Reports)
Through this exercise, the data warehouse planner will find:
• All data sources currently used for decisional reporting. At the very least,
these data sources should also be included in the decisional source system audit.
The team has the added benefit of knowing beforehand which fields in these
systems are considered important.
• All current extraction programs. The current extraction programs are a rich
input for the source-to-target field mapping. Also, if these programs manipulate or
transform or convert the data in any way, the transformation rules and formulas
may also prove helpful to the warehousing effort.
• All undocumented manual steps to transform the data. After the raw data
have been extracted from the operational systems, a number of manual steps may
be performed to further transform the data into the reports that enterprise managers
require. Interviews with the appropriate persons should provide the team with an
understanding of these manual conversion and transformation steps (if any).
Apart from the above items, it is also likely that the data warehouse planner will find
subtle flaws in the way reports are produced today. It is not unusual to find inconsistent
use of business terms, formulas, and business rules, depending on the person who creates
and reads the reports. This lack of standard terms and rules contributes directly to the
existence of conflicting reports from different groups in the same enterprise, i.e., the existence
of “different versions of the truth”.
Limitations Imposed by Currently Available Data
Each data item that is required to produce the reports required by decision-makers
comes from one or more of the source systems available to the enterprise. Understandably,
there will be data items that are not readily supported by the source systems.
Data limitations generally fall into one of the following types.
Missing Data Items
A data item is considered missing if no provisions were made to collect or store this
data item in any of the source systems. This omission particularly occurs with data items
that may have no bearing on the day-to-day operations of the enterprise but will have
tactical or managerial implications.
For example, not all loan systems record the industry to which each loan customer
belongs; from an operational level, such information may not necessarily be considered
critical. Unfortunately, a bank that wishes to track its risk exposure for any given industry
will not be able to produce an industry exposure report if customer industry data are not
available at the source systems.
Incomplete (optional) Data Items
A data item may be classified as “nice to have” in the operational systems, and so
provisions are made to store the data, but no rules are put in place to enforce the collection
of such data. These optional data items are available for some customers, products, accounts,
or orders but may be unavailable for others.
Returning to the above example, a loan system may have a field called customer industry,
but the application developers may have made the field optional, in recognition of the fact
that data about a customer’s industry are not readily available. In cases such as this, only
customers with actual data can be classified meaningfully in the report.
Wrong Data
Errors occur when data are stored in one or more source systems but are not accurate.
There are many potential reasons or causes for this, including the following ones:
• Data entry error. A genuine error is made during data entry. The wrong data are
stored in the database.
• Data item is mandatory but unavailable. A data item may be defined as
mandatory but it may not be readily available, and the random substitution of
other information has no direct impact on the day-to-day operations of the enterprise.
This implies that any data can be entered without adversely affecting the operational
processes.
Returning to the above example, if customer industry was a mandatory customer data
item and the person creating the customer record does not know the industry to which the
customer belongs, he or she is likely to select, at random, any of the industry codes that are
recognized by the system. Only by so doing will he or she be able to create the customer
record.
Another data item that can be randomly substituted is the social security number,
especially if these numbers are stored for reference purposes only, and not for actual
processing. Data entry personnel remain focused on the immediate task of creating the
customer record, which the system refuses to do without all the mandatory data items. Data
entry personnel are rarely in a position to see the consequences of recording the wrong data.
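Some of these limitations can be surfaced early with a simple profile of each candidate field. The sketch below computes how often a field is filled in (useful for optional items) and how large a share its most common value takes. The sample industry codes are invented, and the interpretation is an assumption: a code that dominates may hint at a convenient default being picked to satisfy a mandatory field, whereas truly random substitution would not show up this way.

```python
from collections import Counter

def profile_field(values, missing=("", None)):
    """Summarize completeness and the dominant value of one field."""
    total = len(values)
    present = [v for v in values if v not in missing]
    counts = Counter(present)
    top_value, top_count = counts.most_common(1)[0] if counts else (None, 0)
    return {
        # Share of records in which the field is filled in at all.
        "fill_rate": len(present) / total if total else 0.0,
        # Most frequent value and its share among the filled-in records.
        "top_value": top_value,
        "top_share": top_count / len(present) if present else 0.0,
    }

# Hypothetical customer industry codes from a loan system.
industry_codes = ["AGRI", "", "AGRI", "AGRI", None,
                  "RETAIL", "AGRI", "AGRI", "", "AGRI"]

report = profile_field(industry_codes)
print(report["fill_rate"])            # 0.7
print(report["top_value"])            # AGRI
print(round(report["top_share"], 2))  # 0.86
```

Figures like these do not prove that data are wrong; they point to fields worth raising with the DBAs and the business users during the audit.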
Improvements to Source Systems
From the above examples, it is easy to see how the scope of a data warehousing
initiative can be severely compromised by data limitations in the source systems. Most pilot
data warehouse projects are thus limited only to the data that are available. However,
improvements can be made to the source systems in parallel with the warehousing projects.
The team should therefore properly document any source system limitations that are
encountered. These documents can be used as inputs to upcoming maintenance projects on
the operational systems.
A decisional source system audit report may have a source system review section that
covers the following topics:
• Overview of operational systems. This section lists all operational systems
covered by the audit. A general description of the functionality and data of each
operational system is provided. A list of major tables and fields may be included
as an appendix. Current users of each of the operational systems are optionally
documented.
• Missing data items. List all the data items that are required by the data warehouse
but are currently not available in the source systems. Explain why each item is
important (e.g., cite reports or queries where these data items are required). For
each data item, identify the source system where the data item is best stored.
• Data quality improvement areas. For each operational system, list all areas
where the data quality can be improved. Suggestions as to how the data quality
improvement can be achieved can also be provided.
• Resource and effort estimate. For each operational system, it might be possible
to provide an estimate of the cost and length of time required to either add the
data item or improve the data quality for that data item.
In Summary
Data warehouse planning is conducted to clearly define the scope of one data warehouse
rollout. The combination of the top-down and bottom-up tracks gives the planning process
the best of both worlds: a requirements-driven approach that is grounded on available data.
The clear separation of the front-end and back-end tracks encourages the development
of warehouse subsystems for extracting, transporting, cleaning, and loading warehouse data
independently of the front-end tools that will be used to access the warehouse.
The four tracks converge when a prototype of the warehouse is created and when actual
warehouse implementation takes place.
Each rollout repeatedly executes the four tracks (top-down, bottom-up, back-end, and
front-end), and the scope of the data warehouse is iteratively extended as a result. Figure 8.3
illustrates the concept.
Fig. 8.3
TOP-DOWN: User Requirements
BACK-END: Extraction, Integration, QA, DW Load, Aggregates, Metadata
FRONT-END: OLAP Tool, Canned Reports, Canned Queries
BOTTOM-UP: Source Systems, External Data
CHAPTER 9
DATA WAREHOUSE IMPLEMENTATION
The data warehouse implementation approach presented in this chapter describes the
activities related to implementing one rollout of the data warehouse. The activities discussed
here are built on the results of the data warehouse planning described in the previous
chapter.
The data warehouse implementation team builds a new warehouse schema or extends an
existing one, based on the final logical schema design produced during planning. The team also
builds the warehouse subsystems that ensure a steady, regular flow of clean data from the
operational systems into the data warehouse. Other team members install and configure the
selected front-end tools to provide users with access to warehouse data.
An implementation project should be scoped to last between three and six months. The
progress of the team varies, depending (among other things) on the quality of the warehouse
design, the quality of the implementation plan, the availability and participation of enterprise
resource persons, and the rate at which project issues are resolved.
User training and warehouse testing activities take place towards the end of the
implementation project, just prior to the deployment to users. Once the warehouse has been
deployed, the day-to-day warehouse management, maintenance, and optimization tasks begin.
Some members of the implementation team may be asked to stay on and assist with the
maintenance activities to ensure continuity. The other members of the project team may be
asked to start planning the next warehouse rollout or may be released to work on other
projects.
9.1 ACQUIRE AND SET UP DEVELOPMENT ENVIRONMENT
Acquire and set up the development environment for the data warehouse implementation
project. This activity includes the following tasks, among others: install the hardware, the
operating system, and the relational database engine; install all warehousing tools; create all
necessary network connections; and create all required user IDs and user access definitions.
Note that most data warehouses reside on a machine that is physically separate from the
operational systems. In addition, the relational database management system used for data
warehousing need not be the same database management system used by the operational systems.
At the end of this task, the development environment is set up, the project team
members are trained on the (new) development environment, and all technology components
have been purchased and installed.
9.2 OBTAIN COPIES OF OPERATIONAL TABLES
There may be instances where the team has no direct access to the operational source
systems from the warehouse development environment. This is especially possible for pilot
projects, where the network connection to the warehouse development environment may not
be available.
Regardless of the reason for the lack of access, the warehousing team must establish
and document a consistent, reliable, and easy-to-follow procedure for obtaining copies of the
relevant tables from the operational systems. Copies of these tables are made available to
the warehousing team on another medium (most likely tape) and are restored on the
warehouse server. The creation of copies can also be automated through the use of replication
technology.
The warehousing team must have a mechanism for verifying the correctness and
completeness of the data that are loaded onto the warehouse server. One of the most
effective completeness checks is a set of meaningful business counts (e.g., number of
customers, number of accounts, number of transactions) that are computed on both the
source and the copy and then compared. Data quality utilities can help assess the
correctness of the data.
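The business-count check described above can be sketched as follows. This is only a minimal illustration, assuming that both the operational source and the restored copies can be queried through SQL. The table names (customers, accounts, transactions) and the helper functions are hypothetical, and in-memory SQLite databases stand in for the two environments.

```python
import sqlite3

# Hypothetical tables whose row counts serve as "business counts".
TABLES = ("customers", "accounts", "transactions")


def make_db(rows_per_table):
    """Build an in-memory database standing in for a source system or its copy."""
    conn = sqlite3.connect(":memory:")
    for table, n in rows_per_table.items():
        conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY)")
        conn.executemany(f"INSERT INTO {table} (id) VALUES (?)",
                         [(i,) for i in range(n)])
    return conn


def business_counts(conn):
    """Return {table: row count} for every table of interest."""
    return {t: conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
            for t in TABLES}


def completeness_report(source, copy):
    """List the tables whose counts differ between the source and the copy."""
    src, dst = business_counts(source), business_counts(copy)
    return {t: (src[t], dst[t]) for t in TABLES if src[t] != dst[t]}


source = make_db({"customers": 100, "accounts": 150, "transactions": 900})
# The copy is missing five transactions, e.g. because of a truncated restore.
copy = make_db({"customers": 100, "accounts": 150, "transactions": 895})

mismatches = completeness_report(source, copy)
print(mismatches)
```

An empty report suggests that the copy is complete with respect to these counts. It says nothing about correctness of individual values, which is why the text pairs the counts with data quality utilities.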
The use of copied tables as described above implies additional space requirements on
the warehouse server. This should not be a problem during the pilot project.
9.3 FINALIZE PHYSICAL WAREHOUSE SCHEMA DESIGN
Translate the detailed logical and physical warehouse design from the warehouse
planning stage into a final physical warehouse design, taking into consideration the specific,
selected database management system.
The key considerations are:
• Schema design. Finalize the physical design of the fact and dimension tables and
their respective fields. The warehouse database administrator (DBA) may opt to
divide one logical dimension (e.g., customer) into two or more separate ones (e.g.,
a customer dimension and a customer demographic dimension) to save on space
and improve query performance.
• Indexes. Identify the appropriate indexing method to use on the warehouse tables
and fields, based on the expected data volume and the anticipated nature of
warehouse queries. Verify initial assumptions made about the space required by
indexes to ensure that sufficient space has been allocated.
• Partitioning. The warehouse DBA may opt to partition fact and dimension tables,
depending on their size and on the partitioning features that are supported by the
database engine. A warehouse DBA who decides to implement partitioned views
must consider the trade-offs between degradation in query performance and
improvements in warehouse manageability and space requirements.
9.4 BUILD OR CONFIGURE EXTRACTION AND TRANSFORMATION SUBSYSTEMS
Easily 60 percent to 80 percent of a warehouse implementation project is devoted to the
back-end of the warehouse. The back-end subsystems must extract, transform, clean, and
load the operational data into the data warehouse. Understandably, the back-end subsystems
vary significantly from one enterprise to another due to differences in the computing
environments, source systems, and business requirements. For this reason, much of the
warehousing effort cannot simply be automated away by warehousing tools.
Extraction Subsystem
The first among the many subsystems on the back-end of the warehouse is the data
extraction subsystem. The term extraction refers to the process of retrieving the required
data from the operational system tables, which may be the actual tables or simply copies
that have been loaded into the warehouse server.
Actual extraction can be achieved through a wide variety of mechanisms, ranging from
sophisticated third-party tools to custom-written extraction scripts or programs developed
by in-house IT staff. Third-party extraction tools are typically able to connect to mainframe,
midrange and UNIX environments, thus freeing their users from the nightmare of handling
heterogeneous data sources. These tools also allow users to document the extraction process
(i.e., they have provisions for storing metadata about the extraction).
These tools, unfortunately, are expensive. For this reason, organizations may also turn
to writing their own extraction programs. This is a particularly viable alternative if the
source systems are on a uniform or homogeneous computing environment (e.g., all data
reside on the same RDBMS, and they make use of the same operating system).
Custom-written extraction programs, however, may be difficult to maintain, especially if these
programs are not well documented. Considering how quickly business requirements will
change in the warehousing environment, ease of maintenance is an important factor to
consider.
Transformation Subsystem
The transformation subsystem literally transforms the data in accordance with the
business rules and standards that have been established for the data warehouse.
Several types of transformations are typically implemented in data warehousing.
• Format changes. Each of the data fields in the operational systems may store
data in different formats and data types. These individual data items are modified
during the transformation process to respect a standard set of formats. For example,
all date formats may be changed to respect a standard format, or a standard data
type is used for character fields such as names and addresses.
• De-duplication. Records from multiple sources are compared to identify duplicate
records based on matching field values. Duplicates are merged to create a single
record of a customer, a product, an employee, or a transaction. Potential duplicates
are logged as exceptions that are manually resolved. Duplicate records with
conflicting data values are also logged for manual correction if there is no system
of record to provide the “master” or “correct” value.
• Splitting up fields. A data item in the source system may need to be split up into
two or more fields in the warehouse. One of the most commonly encountered
problems of this nature deals with customer addresses that have simply been stored
as several lines of text. These textual values may be split up into distinct fields:
street number, street name, building name, city, mail or zip code, country, etc.
• Integrating fields. The opposite of splitting up fields is integration. Two or more
fields in the operational systems may be integrated to populate one warehouse
field.
• Replacement of values. Values that are used in operational systems may not be
comprehensible to warehouse users. For example, system codes that have specific
meanings in operational systems are meaningless to decision-makers. The
transformation subsystem replaces the original values with new values that have a
business meaning to warehouse users.
• Derived values. Balances, ratios, and other derived values can be computed using
agreed formulas. By pre-computing and loading these values into the warehouse,
the possibility of miscomputation by individual users is reduced. A typical example
of a pre-computed value is the average daily balance of bank accounts. This figure
is computed using the base data and is loaded as it is into the warehouse.
• Aggregates. Aggregates can also be pre-computed for loading into the warehouse.
This is an alternative to loading only atomic (base-level) data in the warehouse and
creating in the warehouse the aggregate records based on the atomic warehouse
data.
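Several of the transformation types listed above can be sketched on a single source record. The example below is only an illustration under stated assumptions: the field names, the DD/MM/YYYY source date format, the three-line address layout, and the account type code table are all hypothetical and would differ in any real source system.

```python
from datetime import datetime

# Hypothetical lookup that replaces system codes with business terms.
ACCOUNT_TYPE_NAMES = {"01": "Savings", "02": "Checking"}


def transform(record):
    """Apply a few typical warehouse transformations to one source record."""
    out = {}
    # Format change: the source stores dates as DD/MM/YYYY text (an assumption);
    # the warehouse standard chosen here is ISO 8601 (YYYY-MM-DD).
    parsed = datetime.strptime(record["open_date"], "%d/%m/%Y")
    out["open_date"] = parsed.date().isoformat()
    # Integrating fields: two source fields populate one warehouse field.
    first, last = record["first_name"].strip(), record["last_name"].strip()
    out["customer_name"] = f"{first} {last}"
    # Splitting up fields: an address kept as lines of text becomes distinct
    # fields. Real addresses are rarely this regular, so automated splitting
    # is error-prone and should be checked.
    street, city, postal_code = [p.strip() for p in record["address"].split("\n")]
    out.update(street=street, city=city, postal_code=postal_code)
    # Replacement of values: unknown codes stay visible rather than vanish.
    code = record["account_type"]
    out["account_type"] = ACCOUNT_TYPE_NAMES.get(code, f"UNKNOWN({code})")
    # Derived value: average daily balance, computed once with the agreed formula.
    balances = record["daily_balances"]
    out["avg_daily_balance"] = round(sum(balances) / len(balances), 2)
    return out


record = {
    "open_date": "06/03/1998",
    "first_name": " Maria ",
    "last_name": "Santos",
    "address": "12 Main Street\nSpringfield\n12345",
    "account_type": "01",
    "daily_balances": [100.0, 200.0, 300.0],
}
row = transform(record)
print(row)
```

Keeping each rule in one place like this is one way to make the agreed formulas and code mappings easy to document, which matters given how often business rules change.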
The extraction and transformation subsystems (see Figure 9.1) create load images, i.e.,
tables and fields populated with the data that are to be loaded into the warehouse. The load
images are typically stored in tables that have the same schema as the warehouse itself. By
doing so, the extraction and transformation subsystems greatly simplify the load process.
[Figure: source system tables (or copies) feed the extraction and transformation
subsystem, which produces a load image consisting of a FACTS table and the dimensions
Dim 1 to Dim 4.]
Figure 9.1. Extraction and Transformation Subsystems
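The de-duplication step named in the list of transformations can be sketched as follows. This is a simplified illustration, not a production matching algorithm: the matching rule (same normalized name and birth date) and the field names are assumptions, and real de-duplication usually needs fuzzier comparisons.

```python
from collections import defaultdict


def match_key(record):
    """Hypothetical matching rule: same normalized name and birth date."""
    return (record["name"].strip().lower(), record["birth_date"])


def deduplicate(records):
    """Merge matching records; log groups with conflicting values as exceptions."""
    groups = defaultdict(list)
    for rec in records:
        groups[match_key(rec)].append(rec)

    merged, exceptions = [], []
    for group in groups.values():
        combined, conflict = dict(group[0]), False
        for rec in group[1:]:
            for field, value in rec.items():
                if field == "name":
                    continue  # names already matched after normalization
                if combined.get(field) in (None, ""):
                    combined[field] = value      # fill a gap from the duplicate
                elif value not in (None, "") and value != combined[field]:
                    conflict = True              # no system of record to decide
        if conflict:
            exceptions.append(group)             # left for manual resolution
        else:
            merged.append(combined)
    return merged, exceptions


records = [
    {"name": "Ann Lee", "birth_date": "1970-01-01", "phone": "555-0100"},
    {"name": "ann lee ", "birth_date": "1970-01-01", "phone": ""},
    {"name": "Bob Cruz", "birth_date": "1980-05-05", "phone": "555-0111"},
    {"name": "Bob Cruz", "birth_date": "1980-05-05", "phone": "555-0199"},
    {"name": "Cy Tan", "birth_date": "1990-09-09", "phone": None},
]
merged, exceptions = deduplicate(records)
print(len(merged), "merged records;", len(exceptions), "exception group(s)")
```

The two Ann Lee records merge cleanly, while the two Bob Cruz records carry different phone numbers and are therefore logged for manual correction instead of being merged.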
9.5 BUILD OR CONFIGURE DATA QUALITY SUBSYSTEM
Data quality problems are not always apparent at the start of the implementation
project, when the team is more concerned with moving massive amounts of data than
with the actual individual data values that are being moved. However, data quality (or to
be more precise, the lack of it) will quickly become a major, show-stopping problem if it is
not addressed directly.
One of the quickest ways to inhibit user acceptance is to have poor data quality in the
warehouse. Furthermore, the perception of data quality is in some ways just as important
as the actual quality of the data warehouse. Data warehouse users will make use of the
warehouse only if they believe that the information they will retrieve from it is correct.
Without user confidence in the data quality, a warehouse initiative will soon lose support
and eventually die off.
A data quality subsystem on the back-end of the warehouse is therefore a critical component
of the overall warehouse architecture.
Causes of Data Errors
An understanding of the causes of data errors makes these errors easier to find. Since
most data errors originate from the source systems, source system database administrators
and system administrators, with their day-to-day experience of working with the source
systems, are critical to the data quality effort.
Data errors typically result from one or more of the following causes:
• Missing values. Values are missing in the source systems due to either incomplete
records or optional data fields.
• Lack of referential integrity. Referential integrity in source systems may not be
enforced because of inconsistent system codes or codes whose meanings have changed
over time.
• Different units of measure. The use of different currencies and units of measure
in different source systems may lead to data errors in the warehouse if figures or
amounts are not first converted to a uniform currency or unit of measure prior to
further computations or data transformation.
• Duplicates. De-duplication is performed on source system data prior to the
warehouse load. However, the de-duplication process depends on comparison of
data values to find matches. If the data are not available to start with, the quality
of the de-duplication may be compromised. Duplicate records may therefore be
loaded into the warehouse.
• Fields to be split up. As mentioned earlier, there are times when a single field
in the source system has to be split up to populate multiple warehouse fields.
Unfortunately, it is not possible to manually split up the fields one at a time
because of the volume of the data. The team often resorts to some automated form
of field splitting, which may not be 100 percent correct.
• Multiple hierarchies. Many warehouse dimensions will have multiple hierarchies
for analysis purposes. For example, the time dimension typically has a
day-month-quarter-year hierarchy. This same time dimension may also have a day-week
hierarchy and a day, fiscal month, fiscal quarter, fiscal year hierarchy. Lack of
understanding of these multiple hierarchies in the different dimensions may result
in erroneous warehouse loads.
• Conflicting or inconsistent terms and rules. The conflicting or inconsistent
use of business terms and business rules may mislead warehouse planners into
loading two distinctly different data items into the same warehouse field, or vice
versa. Inconsistent business rules may also cause the misuse of formulas during
data transformation.
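The multiple hierarchies of the time dimension mentioned in the list above can be made concrete with a short sketch. It derives the calendar, week, and fiscal attributes for one day. The fiscal calendar used here (fiscal year starting on 1 April and labeled by the calendar year in which it starts) and the use of ISO week numbering are assumptions chosen for illustration; each enterprise defines its own.

```python
from datetime import date

FISCAL_YEAR_START_MONTH = 4  # assumption: the fiscal year begins on 1 April


def time_dimension_row(d):
    """Derive the attributes of three parallel hierarchies for one day."""
    iso_year, iso_week, _ = d.isocalendar()
    # Months elapsed since the start of the fiscal year (0 to 11).
    fiscal_offset = (d.month - FISCAL_YEAR_START_MONTH) % 12
    fiscal_year = d.year if d.month >= FISCAL_YEAR_START_MONTH else d.year - 1
    return {
        "day": d.isoformat(),
        # day -> month -> quarter -> year
        "month": d.month,
        "quarter": (d.month - 1) // 3 + 1,
        "year": d.year,
        # day -> week (an ISO week can belong to a different year than the date)
        "iso_week": iso_week,
        "iso_week_year": iso_year,
        # day -> fiscal month -> fiscal quarter -> fiscal year
        "fiscal_month": fiscal_offset + 1,
        "fiscal_quarter": fiscal_offset // 3 + 1,
        "fiscal_year": fiscal_year,
    }


row = time_dimension_row(date(1998, 3, 6))
for name, value in row.items():
    print(f"{name:15} {value}")
```

Under these assumptions the same day falls in calendar quarter 1 of 1998 but in fiscal quarter 4 of fiscal year 1997, which is exactly the kind of difference that leads to erroneous loads when the hierarchies are confused.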
Data Quality Improvement Approach
Below is an approach for improving the overall data quality of the enterprise.
• Assess current level of data quality. Determine the current data quality level
of each of the warehouse source systems. While the enterprise may have a data
quality initiative that is independent of the warehousing project, it is best to focus
the data quality efforts on warehouse source systems; these systems obviously
contain data that are of interest to enterprise decision-makers.
• Identify key data items. Set the priorities of the data quality team by identifying
the key data items in each of the warehouse source systems. Key data items, by
definition, are the data items that must achieve and maintain a high level of data
quality. By prioritizing data items in this manner, the team can target its efforts
on the more critical data areas and therefore provide greater value to the enterprise.
• Define cleansing tactics for key data items. For each key data item with poor
data quality, define an approach or tactic for cleaning or raising the quality of that
data item. Whenever possible, the cleansing approach should target the source
systems first, so that errors are corrected at the source and not propagated to other
systems.
• Define error-prevention tactics for key data items. The enterprise should not
stop at error-correction activities. The best way to eliminate data errors is to
prevent them from happening in the first place. If error-producing operational
processes are not corrected, they will continue to populate enterprise databases
with erroneous data. Operational and data-entry staff must be made aware of the
cost of poor data quality. Reward mechanisms within the organization may have to
be modified to create a working environment that focuses on preventing data errors
at the source.
• Implement quality improvement and error-prevention processes. Obtain
the resources and tools to execute the quality improvement and error-prevention
processes. After some time, another assessment may be conducted, and a new set
of key data items may be targeted for quality improvement.
Data Quality Assessment and Improvements
Data quality assessments can be conducted at any time at different points along the
warehouse back-end. As shown in Figure 9.2, assessments can be conducted on the data
while it is in the source systems, in warehouse load images, or in the data warehouse itself.
Note that while data quality products assist in the assessment and improvement of
data quality, it is unrealistic to expect any single program or data quality product to find
and correct all data quality errors in the operational systems or in the data warehouse. Nor
is it realistic to expect data quality improvements to be completed in a matter of months. It is
unlikely that an enterprise will ever bring its databases to a state that is 100 percent error free.
Despite the long-term nature of the effort, however, the absolute worst thing that any
warehouse Project Manager can do is to ignore the data quality problem in the vain hope
that it will disappear. The enterprise must be willing and prepared to devote time and effort
to the tedious task of cleaning up data errors rather than sweeping the problem under
the rug.
[Figure: source system tables (or copies) pass through extraction, transformation, and
cleansing into the load image, which the warehouse load then moves into the data
warehouse. Both the load image and the data warehouse consist of a FACTS table and the
dimensions Dim 1 to Dim 4. Data quality assessment applies at each of these points and
looks for invalid values, incomplete data, lack of referential integrity, invalid patterns
of values, and duplicate records.]
Figure 9.2. Data Quality Assessments at the Warehouse Back-End
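The five kinds of findings named in Figure 9.2 can be illustrated with a small assessment routine. This is a minimal sketch, not a substitute for a data quality product: the record layout, the set of valid status codes, and the phone number pattern are hypothetical rules invented for the example.

```python
import re
from collections import Counter

# Hypothetical rules for a customer-account extract.
VALID_STATUS = {"A", "C"}                      # active, closed
PHONE_PATTERN = re.compile(r"^\d{3}-\d{4}$")   # assumed standard phone format


def assess(accounts, customer_ids):
    """Count the data quality problems found in a batch of account records."""
    findings = Counter()
    seen = set()
    for rec in accounts:
        if rec["status"] not in VALID_STATUS:
            findings["invalid values"] += 1
        if not rec["phone"]:
            findings["incomplete data"] += 1
        elif not PHONE_PATTERN.match(rec["phone"]):
            findings["invalid patterns of values"] += 1
        if rec["customer_id"] not in customer_ids:
            findings["lack of referential integrity"] += 1
        if rec["account_id"] in seen:
            findings["duplicate records"] += 1
        seen.add(rec["account_id"])
    return dict(findings)


customer_ids = {"C1", "C2"}
accounts = [
    {"account_id": "A1", "customer_id": "C1", "status": "A", "phone": "555-0100"},
    {"account_id": "A2", "customer_id": "C9", "status": "A", "phone": "555-0101"},
    {"account_id": "A3", "customer_id": "C2", "status": "X", "phone": ""},
    {"account_id": "A4", "customer_id": "C2", "status": "C", "phone": "5550102"},
    {"account_id": "A1", "customer_id": "C1", "status": "A", "phone": "555-0100"},
]
findings = assess(accounts, customer_ids)
for problem, count in sorted(findings.items()):
    print(f"{problem}: {count}")
```

The same routine could in principle be pointed at source tables, load images, or warehouse tables, which mirrors the three assessment points shown in the figure.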
Correcting Data Errors at the Source
All data errors found are, under ideal circumstances, corrected at the source, i.e., the
operational system database is updated with the correct values. This practice ensures that
subsequent data users at both the operational and decisional levels will benefit from clean data.
Experience has shown, however, that correcting data at the source may prove difficult
to implement for the following reasons:
• Operational responsibility. The responsibility for updating the source system
data will naturally fall into the hands of operational staff, who may not be so
inclined to accept the additional responsibility of tracking down and correcting past
data-entry errors.
• Correct data are unknown. Even if the people in operations know that the data
in a given record are wrong, there may be no easy way to determine the correct
data. This is particularly true of customer data (e.g., a customer’s social security
number). The people in operations have no other recourse but to approach the
customers one at a time to obtain the correct data. This is tedious, time-consuming,
and potentially irritating to customers.
Other Considerations
Many of the available warehousing tools have features that automate different areas
of the warehouse extraction, transformation, and data quality subsystems.
The more data sources there are, the higher the likelihood of data quality problems.
Likewise, the larger the data volume, the higher the number of data errors to correct.
The inclusion of historical data in the warehouse will also present problems due to
changes (over time) in system codes, data structures, and business rules.
9.6 BUILD WAREHOUSE LOAD SUBSYSTEM
The warehouse load subsystem takes the load images created by the extraction and
transformation subsystems and loads these images directly into the data warehouse. As
mentioned earlier, the data to be loaded are stored in tables that have the same schema
design as the warehouse itself. The load process is therefore fairly straightforward from a
data standpoint.
Basic Features of a Load Subsystem
The load subsystem should be able to perform the following:
• Drop indexes on the warehouse. When new records are inserted into an indexed
table, the relational database management system immediately updates the index
of the table in response. In the context of a data warehouse load, where up to
hundreds of thousands of records are inserted in rapid succession into one single
table, the immediate re-indexing of the table after each insert results in a significant
processing overhead. As a consequence, the load process slows down dramatically.
To avoid this problem, drop the indexes on the relevant warehouse table prior to
each load.
• Load dimension records. In the source systems each record of a customer, product,
or transaction is uniquely identified through a key. Likewise, the customers, products,
and transactions in the warehouse must be identifiable through a key value. Source
system keys are often inappropriate as warehouse keys, and a key generation
approach is therefore used during the load process. Insert new dimension records,
or update existing records based on the load images.
• Load fact records. The primary key of a Fact table is the concatenation of the
keys of its related dimension records. Each fact record therefore makes use of the
generated keys of the dimension records. Dimension records are loaded prior to the
fact records to allow the enforcement of referential integrity checks. The load
subsystem therefore inserts new fact records or updates old records based on the
load images. Since the data warehouse is essentially a time series, most of the
records in the Fact table will be new records.
• Compute aggregate records, using base fact and dimension records. After
the successful load of atomic or base-level data into the warehouse, the load
subsystem may now compute aggregate records by using the base-level fact and
dimension records. This step is performed only if the aggregates are not pre-computed
for direct loading into the warehouse.
• Rebuild or regenerate indexes. Once all loads have been completed, the indexes
on the relevant tables are rebuilt or regenerated to improve query performance.
• Log load exceptions. Log all referential integrity violations during the load
process as load exceptions. There are two types of referential integrity violations:
(a) missing key values, where one of the key fields of the fact record does not have a
value; and (b) wrong key values, where the key fields have values, but one or more of
them do not have a corresponding dimension record. In both cases, the warehousing team
has the option of (a) not loading the record until the correct key values are found or
(b) loading the record, but replacing the missing or wrong key values with
hard-coded values that users can recognize as a load exception.
The load subsystem, as described above, assumes that the load images do not yet
make use of warehouse keys; i.e., the load images contain only source system keys.
The warehouse keys are therefore generated as part of the load process.
Warehousing teams may opt to separate the key generation routines from the load
process. In this scenario, the key generation routine is applied on the initial load
images (i.e., the load images created by the extraction and transformation subsystems).
The final load images (with warehouse keys) are then loaded into the warehouse.
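The load steps described above can be sketched end to end with an in-memory SQLite database. This is an illustrative sketch under several assumptions, not a real load subsystem: the table and index names, the single customer dimension, and the `load` function are hypothetical, aggregate computation is left out, and the hard-coded value follows the "INULL" marker used in the figures of this chapter.

```python
import sqlite3

UNKNOWN = "INULL"  # hard-coded source key marking a referential integrity violation

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY AUTOINCREMENT,
                               source_id TEXT UNIQUE, name TEXT);
    CREATE TABLE sales_fact (customer_key INTEGER, sale_date TEXT, amount REAL);
    CREATE INDEX ix_sales_customer ON sales_fact (customer_key);
    -- Special record that dirty fact rows will point to.
    INSERT INTO customer_dim (source_id, name) VALUES ('INULL', 'Unknown customer');
""")


def load(conn, dim_image, fact_image):
    """Load one set of load images; return the logged load exceptions."""
    exceptions = []
    # 1. Drop indexes so that bulk inserts do not pay for index maintenance.
    conn.execute("DROP INDEX IF EXISTS ix_sales_customer")
    # 2. Load dimension records; the warehouse key is generated on insert.
    for source_id, name in dim_image:
        updated = conn.execute(
            "UPDATE customer_dim SET name = ? WHERE source_id = ?",
            (name, source_id)).rowcount
        if not updated:
            conn.execute("INSERT INTO customer_dim (source_id, name) VALUES (?, ?)",
                         (source_id, name))
    keys = dict(conn.execute("SELECT source_id, customer_key FROM customer_dim"))
    # 3. Load fact records, translating source keys into warehouse keys.
    for source_id, sale_date, amount in fact_image:
        if source_id not in keys:
            reason = "missing key" if source_id is None else "wrong key"
            exceptions.append((reason, source_id, sale_date, amount))
            source_id = UNKNOWN  # load the record, but mark it as dirty
        conn.execute("INSERT INTO sales_fact VALUES (?, ?, ?)",
                     (keys[source_id], sale_date, amount))
    # 4. Rebuild the index once, after all inserts.
    conn.execute("CREATE INDEX ix_sales_customer ON sales_fact (customer_key)")
    conn.commit()
    return exceptions


dim_image = [("C001", "Customer A"), ("C002", "Customer B")]
fact_image = [
    ("C001", "1998-03-06", 10000.0),
    ("C002", "1998-03-06", 8000.0),
    (None, "1998-03-06", 1000.0),     # missing key value
    ("C777", "1998-03-06", 250.0),    # key with no matching dimension record
]
exceptions = load(conn, dim_image, fact_image)
for exc in exceptions:
    print(exc)
```

This sketch takes option (b) from the text and loads dirty records under the hard-coded key. Taking option (a) instead would mean skipping the insert for any record that lands in the exception log.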
Loading Dirty Data
There are ongoing debates about loading dirty data (i.e., data that fail referential
integrity checks) into the warehouse. Some teams prefer to load only clean data into
the warehouse, arguing that dirty data can mislead and misinform. Others prefer to
load all data, both clean and dirty, provided that the dirty data are clearly marked as
dirty. Depending on the extent of data errors, the use of only clean data in the warehouse
can be as dangerous as, or more dangerous than, relying on a mix of clean and dirty data.
If more than 20 percent of the data are dirty and only the clean remainder (less than
80 percent) is loaded into the warehouse, the warehouse users will be making decisions
based on an incomplete picture.
The use of hard-coded values to identify warehouse data with referential integrity
violations on one dimension allows warehouse users to still make use of the warehouse data
on clean dimensions.
Consider the example in Figure 9.3. If a Sales Fact record is dependent on Customer,
Date (Time dimension) and Product and if the Customer key is missing, then a “Sales per
Product” report from the warehouse will still produce the correct information.
[Figure: a SALES FACT record with the keys Customer ID = “INULL”, Date = March 6, 1998,
and Product ID = Px00001, and the facts Sales Amount = $1,000 and Quantity Sold = 1. The
record references the TIME dimension (key: Date = March 6, 1998), the PRODUCT dimension
(key: Product ID = Px00001), and the CUSTOMER dimension (key: Customer ID = “INULL”).
The referential integrity violation is handled through hard-coded values (i.e., “INULL”).]
Figure 9.3. Loading Dirty Data
When a “sales per customer” report is produced (as shown in Figure 9.4), the
hard-coded value that signifies a referential integrity violation will be listed as a customer ID,
and the user is aware that the corresponding sales amount cannot be attributed to a valid
customer.
SALES PER CUSTOMER
Date: March 6, 1998

Customer        Sales Amount
INULL                  1,000
Customer A            10,000
Customer B             8,000
...                      ...

Hard-coded values clearly identify dirty data.
Figure 9.4. Sample Report with Dirty Data Identified Through Hard-Coded Value.
By handling a referential integrity violation during warehouse loads in the manner
described above, the users get the full picture of the facts on clean dimensions and are
clearly aware when dirty dimensions are used.
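The effect described above can be checked with a few lines of arithmetic. The sketch below uses the sales amounts from the sample report; the product assigned to each customer row and the helper `total_by` are assumptions made for the example.

```python
from collections import defaultdict

# Fact rows as (customer_id, product_id, sales_amount). "INULL" is the
# hard-coded value that marks a missing customer key. The product IDs on
# the Customer A and Customer B rows are invented for illustration.
sales_facts = [
    ("INULL", "Px00001", 1000),
    ("Customer A", "Px00001", 10000),
    ("Customer B", "Px00002", 8000),
]


def total_by(facts, position):
    """Sum the sales amount grouped by the key at the given tuple position."""
    totals = defaultdict(int)
    for row in facts:
        totals[row[position]] += row[2]
    return dict(totals)


# Report on a dirty dimension: the hard-coded value shows up as its own line.
sales_per_customer = total_by(sales_facts, 0)
# Report on a clean dimension: totals are complete despite the dirty customer key.
sales_per_product = total_by(sales_facts, 1)
# What the same report would show if the dirty record had not been loaded.
clean_only = total_by([f for f in sales_facts if f[0] != "INULL"], 1)

print(sales_per_customer)
print(sales_per_product)
print(clean_only)
```

Leaving the dirty record out understates the sales of product Px00001 by 1,000, while loading it under the hard-coded key keeps the product totals complete and makes the unattributed amount visible in the customer report.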
The Need for Load Optimization
The time required for a regular warehouse load is often of great concern to warehouse
designers and project managers. Unless the warehouse was designed and architected to be
fully available 24 hours a day, the warehouse will be offline and unavailable to its users
during the load period.
Much of the challenge in building the load subsystem therefore lies in optimizing the
load process to reduce the total time required. For this reason, parallel load features in later
releases of relational database management systems and parallel processing capabilities in
SMP and MPP machines are especially welcome in data warehousing implementations.
Test Loads
The team may like to test the accuracy and performance of the warehouse load subsystem
on dummy data before attempting a real load with actual load images. The team should
know as early as possible how much load optimization work is still required.
Also, by using dummy data, the warehousing team does not have to wait for the
completion of the extraction and transformation subsystems to start testing the warehouse
load subsystem.
Warehouse load subsystem testing, of course, is possible only if the data warehouse
schema is already up and available.
Set Up Data Warehouse Schema
Create the data warehouse schema in the development environment while the team is
constructing or configuring the warehouse back-end subsystems (i.e., the data extraction
and transformation subsystems, the data quality subsystem, and the warehouse load
subsystem).
As part of the schema setup, the warehouse DBA must do the following:
• Create warehouse tables. Implement the physical warehouse database design by
creating all base-level fact and dimension tables, core and custom tables, and
aggregate tables.
• Build indexes. Build the required indexes on the tables according to the physical
warehouse database design.
• Populate special referential tables and records. The data warehouse may require
special referential tables or records that are not created through regular warehouse
loads. For example, if the warehouse team uses hard-coded values to handle loads
with referential integrity violations, the warehouse dimension tables must have records
that use the appropriate hard-coded value to identify fact records that have referential
integrity violations.
It is usually helpful to populate the data warehouse with test data as soon as possible.
This provides the front-end team with the opportunity to test the data access and
retrieval tools, even while actual warehouse data are not available. Figure 9.5 presents
a typical data warehouse schema.
[Figure: a star schema with the SALES fact table at the center, linked to the Client,
Time, Product, and Organization dimensions.]
Figure 9.5. Sample Warehouse Schema
9.7 SET UP WAREHOUSE METADATA
Metadata have traditionally been defined as “data about data.” While such a statement
does not seem very helpful, it is actually quite appropriate as a definition: metadata describe
the contents of the data warehouse, indicate where the warehouse data originally came
from, and document the business rules that govern the transformation of the data.
Warehousing tools also use metadata as the basis for automating certain aspects of the
warehousing project. Chapter 13 in the Technology section of this book discusses metadata in
depth.
9.8 SET UP DATA ACCESS AND RETRIEVAL TOOLS
The data access and retrieval tools are equivalent to the tip of the warehousing iceberg.
While they may represent as little as 10 percent of the entire warehousing effort, they are
all that users see of the warehouse. As a result, these tools are critical to the acceptance
and usability of the warehouse.
Acquire and Install Data Access and Retrieval Tools
Acquire and install the selected data access tools in the appropriate environments and
machines. The front-end team will find it prudent to first install the selected data access
tools on a test machine that has access to the warehouse. The test machine should be loaded
with the software typically used by the enterprise. Through this activity, the front-end team
may identify unforeseen conflicts between the various software programs without causing
inconvenience to anyone.
Verify that the data access and retrieval tools can establish and hold connections to the
data warehouse over the corporate network. In the absence of actual warehouse data, the team
may opt to use test data in the data warehouse schema to test the installation of front-end tools.
Build Predefined Reports and Queries
The prototype initially developed during warehouse planning is refined by incorporating
user feedback and by building all predefined reports and queries that have been agreed on
with the end-users.
Different front-end tools have different requirements for the efficient distribution of
predefined reports and queries to all users. The front-end team should therefore perform the
appropriate administration tasks as required by the selected front-end tools.
By building these predefined reports and queries, the data warehouse implementation
team is assured that the warehouse schema meets the decisional information requirements
of the users.
Support staff who will eventually manage the warehouse Help Desk should participate
in this activity, since this participation provides excellent learning opportunities.
Set Up Role or Group Profiles
Define role or group profiles on the database management system. The major database
management systems provide the use of a role or a group to define the access rights of
multiple users through one role definition.
The warehousing team must determine the appropriate role definitions for the warehouse.
The following roles are recommended as a minimum:
• Warehouse user. The typical warehouse user is granted Select rights on the
production warehouse tables. Updates are granted on the warehouse dimension
records.
• Warehouse administrator. This role is assigned to users strictly for the direct
update of data warehouse dimension records. Select and Update rights are granted
on the warehouse dimension records.
• Warehouse developer. This role applies to any warehouse implementation team
member who works on the back-end of the warehouse. Users with this role can
create development warehouse objects but cannot modify or update the structure
and content of production warehouse tables.
Set Up User Profiles and Map These to Role Profiles
Define user profiles for each warehouse user and assign one or more roles to each user
profile to grant the user access to the warehouse. While it is possible for multiple users to
use the same user profile, this practice is greatly discouraged for the following reasons:
• Collection of warehouse statistics. Warehouse statistics are collected as part
of the warehouse maintenance activities. The team will benefit from knowing
(a) how many users have access to the warehouse, (b) which users are actually
making use of the warehouse, and (c) how often a particular user makes use of
the warehouse.
• Warehouse security. The warehousing team must be able to track the use of the
warehouse to a specific individual, not just to a group of individuals. Users may
also be less careful with IDs and passwords if they know these are shared. Unique
user IDs are also required should the warehouse start restricting or granting access
based on record values in warehouse tables (e.g., a branch manager can see only
the records related to his or her branch).
• Audit Trail. If one or more users have Update access to the warehouse, distinct
user IDs will allow the warehouse team to track down the warehouse user
responsible for each update.
• Query performance complaints. In cases where a query on the warehouse server
is slow or has stalled, the warehouse administrator will be better able to identify
the slow or stalled query when each user has a distinct user ID.
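The three statistics listed under "Collection of warehouse statistics" can only be computed when every query is attributable to one user ID. The following is a minimal sketch of that idea, assuming a hypothetical query log of (user ID, query text) pairs and a hypothetical list of registered users; neither structure is taken from any particular database product.

```python
from collections import Counter

def usage_statistics(registered_users, query_log):
    """Summarize warehouse usage from a log of (user_id, query) pairs.

    Both inputs are hypothetical structures used for illustration only.
    """
    queries_per_user = Counter(user_id for user_id, _ in query_log)
    return {
        "users_with_access": len(set(registered_users)),  # statistic (a)
        "active_users": sorted(queries_per_user),          # statistic (b)
        "queries_per_user": dict(queries_per_user),        # statistic (c)
    }

registered = ["amy", "ben", "cho"]
log = [("amy", "SELECT ..."), ("cho", "SELECT ..."), ("amy", "SELECT ...")]
stats = usage_statistics(registered, log)
```

If several people shared one profile, the log would carry a single user ID for all of them, and statistics (b) and (c) could no longer be separated by individual.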
9.9 PERFORM THE PRODUCTION WAREHOUSE LOAD
The production data warehouse load can be performed only when the load images are
ready and both the warehouse schema and metadata are set up.
Prior to the actual production warehouse load, it is a good practice to conduct partial
loads to get some indication of the total load time. Also, since the data warehouse schema
design may require refinement, particularly when the front-end tools are first set up, it will
be easier and quicker to make changes to the data warehouse schema when very little data
have been loaded. Only when the end users have had a chance to provide positive feedback
should large volumes of data be loaded into the warehouse.
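A partial load gives a rough estimate of the total load time by simple proportion. The sketch below assumes that load time grows linearly with row count, which is an assumption of this example and not a claim from the text; index maintenance and contention can make real loads slower than the estimate. The function name and figures are hypothetical.

```python
def estimate_total_load_hours(sample_rows, sample_hours, total_rows):
    """Linearly extrapolate the full load time from one partial load."""
    if sample_rows <= 0:
        raise ValueError("sample_rows must be positive")
    return sample_hours * total_rows / sample_rows

# Hypothetical figures: 2 million rows loaded in half an hour,
# 40 million rows expected in the full production load.
estimated_hours = estimate_total_load_hours(2_000_000, 0.5, 40_000_000)
```

An estimate well beyond the nightly or weekend load window is an early signal to revisit the load design before large volumes are loaded.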
Data warehouses are typically not refreshed more than once every 24 hours. If the user
requirements call for up-to-the-minute information for operational monitoring purposes,
then a data warehouse is not the solution; these requirements should be addressed through
an Operational Data Store.
The warehouse is typically available to end-users during the working day. For this
reason, warehouse loads typically take place at night or over a weekend.
If the retention period of the warehouse is several years, the warehouse team should
first load the data for the current time period and verify the correctness of the load. Only
when the load is successful should the team start loading historical data into the warehouse.
Due to potential changes to the schemas of the source systems over the past few years, it
is natural for the warehouse team to start from the most current period and work in reverse
chronological order when loading historical data.
9.10 CONDUCT USER TRAINING
The IT organization is encouraged to fully take over the responsibility of conducting
user training, contrary to the popular practice of having product vendors or warehouse
consultants assist in the preparation of the first warehousing classes. Doing so will enable
the warehousing team to conduct future training courses independently.
Scope of User Training
Conduct training for all intended users of this rollout of the data warehouse. Prepare
training materials if required. The training should cover the following topics:
• What is a warehouse? Different people have different expectations of what a data
warehouse is. Start the training with a warehouse definition.
• Warehouse scope. All users must know the contents of the warehouse. The
training should therefore clearly state what is not supported by the current
warehouse rollout. Trainers might need to know what functionality has been deferred
to later phases, and why.
• Use of front-end tools. The users should learn how to use the front-end tools.
Highly usable front-ends should require fairly little training. Distribute all relevant
user documentation to training participants.
• Load timing and publication. Users should be informed of the schedule for
warehouse loads (e.g., “the warehouse is loaded with sales data on a weekly basis,
and a special month-end load is performed for the GL expense data”). Users should
also know how the warehouse team intends to publish the results of each warehouse
load.
• Warehouse support structure. Users should know how to get additional help
from the warehousing team. Distribute Help Desk phone numbers, etc.
Who Should Attend the Training?
Training should be conducted for all intended end users of the data warehouse. Some
senior managers, particularly those who do not use computers every day, may ask their
assistants or secretaries to attend the training in their place. In this scenario, the senior
manager should be requested to attend at least the portion of the training that deals with
the warehouse scope.
Different Users Have Different Training Needs
An understanding of the users’ computing literacy provides insight into the type and pace
of training required. If the user base is large enough, it may be helpful to divide the trainees
into two groups: a basic class and an advanced class. Power users will otherwise quickly
become bored in a basic class, and beginners will feel overwhelmed if they are in an advanced
class. Attempting to meet the training needs of both types of users in one class may prove
to be difficult and frustrating.
At the end of the training, it is a good practice to identify training participants who
would require post-training follow up and support from the warehouse implementation
team. Also ask participants to evaluate the warehouse training; constructive criticism will
allow the trainers to deliver better training in the future.
Training as a Prerequisite to Testing
A subset of the users will be requested to test the warehouse. This user subset may
have to undergo user training earlier than others, since user training is a prerequisite to
user testing. Users cannot adequately test the warehouse if they do not know what is in it
or how to use it.
9.11 CONDUCT USER TESTING AND ACCEPTANCE
The data warehouse, like any system, must undergo user testing and acceptance. Some
considerations are discussed below:
Conduct Warehouse Trials
Representatives of the end-user community are requested to test this warehouse rollout.
In general, the following aspects should be tested.
• Support of specified queries and reports. Users test the correctness of
the queries and reports of this warehouse rollout. In many cases, this is
achieved by preparing the same report manually (or through existing
mechanisms) and comparing this report to the one produced by the warehouse.
All discrepancies are accounted for, and the appropriate corrections are made.
The team should not discount the possibility that the errors are in the
manually prepared report.
• Performance/response time. Under the most ideal circumstances, each warehouse
query will be executed in less than one or two seconds. However, this may not be
realistically achievable, depending on the warehouse size (number of rows and
dimensions) and the selected tools. Warehouse optimization at the hardware and
database levels can be used to improve response times. The use of stored aggregates
will likewise improve warehouse performance.
• Usability of client front-end. Training on the front-end tools is required, but the
tools must be usable for the majority of the users.
• Ability to meet report frequency requirements. The warehouse must be able
to provide the specified queries and reports at the frequency (i.e., daily, weekly,
monthly, quarterly, or yearly) required by the users.
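The report comparison described under "Support of specified queries and reports" can be partly mechanized. The sketch below assumes that both the manually prepared report and the warehouse report have been reduced to a mapping from a line label to a numeric value; the function name, the tolerance, and the sample figures are all hypothetical.

```python
def report_discrepancies(manual, warehouse, tolerance=0.005):
    """Return {label: (manual_value, warehouse_value)} for each differing line.

    A label missing from one report is listed with None on that side.
    """
    discrepancies = {}
    for label in sorted(set(manual) | set(warehouse)):
        m, w = manual.get(label), warehouse.get(label)
        if m is None or w is None or abs(m - w) > tolerance:
            discrepancies[label] = (m, w)
    return discrepancies

manual_report = {"North": 1200.0, "South": 950.0, "East": 400.0}
warehouse_report = {"North": 1200.0, "South": 905.0, "West": 310.0}
diffs = report_discrepancies(manual_report, warehouse_report)
```

The result only lists where the two reports disagree. It does not say which side is wrong, so each entry still has to be traced to its source, including the manually prepared figures.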
Acceptance
The rollout is considered accepted when the testing for this rollout is completed to the
satisfaction of the user community. A concrete set of acceptance criteria can be developed
at the start of the warehouse rollout for use later as the basis for acceptance. The acceptance
criteria are helpful to users because they know exactly what to test. They are likewise helpful
to the warehousing team because they know what must be delivered.
In Summary
Data warehouse implementation is without question the most challenging part of data
warehousing. Not only will the team have to resolve the technical difficulties of moving,
integrating, and cleaning data, they will also face the more difficult task of addressing policy
issues, resolving organizational conflicts, and untangling logistical delays.
In general, the following areas present the most problems during warehouse implementation
and bear the closest watching.
• Dirty data. The identification and cleanup of dirty data can easily consume more
resources than the project can afford.
• Underestimated logistics. The logistics involved in warehousing typically require
more time than originally expected. Tasks such as installing the development
environment, collecting source data, transporting data, and loading data are
generally beleaguered by logistical problems. The time required to learn and
configure warehousing tools likewise contributes to delays.
• Policies and political issues. The progress of the team can slow to a crawl if a
key project issue remains unresolved for too long.
• Wrong warehouse design. The wrong warehouse design results in unmet user
requirements or inflexible implementations. It also creates rework for the schema
as well as all the back-end subsystems: extraction and transformation, quality
assurance, and loading.
At the end of the project, however, a successful team has the satisfaction of meeting the
information needs of key decision-makers in a manner that is unprecedented in the enterprise.
PART IV : TECHNOLOGY
A quick browse through the Process section of this book makes
it quite clear that a data warehouse project requires a wide
array of technologies and tools. The data warehousing products
market (particularly the software segment) is a rapidly
growing one; new vendors continuously announce the
availability of new products, while existing vendors add
warehousing-specific features to their existing product lines.
Understandably, the gamut of tools makes tool selection quite
confusing. This section of the book aims to lend order to the
warehousing tools market by classifying these tools and
technologies. The two main categories are:
• Hardware and Operating Systems. These refer
primarily to the warehouse servers and their related
operating systems. Key issues include database size, storage
options, and backup and recovery technology.
• Software. This refers primarily to the tools that are used
to extract, clean, integrate, populate, store, access,
distribute, and present warehouse data. Also included in
this category are metadata repositories that document the
data warehouse. The major database vendors have all
jumped on the data warehousing bandwagon and have
introduced, or are planning to introduce, features that allow
their database products to better support data warehouse
implementations.
In addition, this section of the book focuses on two key
technology issues in data warehousing:
• Warehouse Schema Design. We present an overview of
the dimensional modeling techniques, as popularized by
Ralph Kimball, to produce database designs that are
particularly suited for warehousing implementations.
• Warehouse Metadata. We provide a quick look at
warehouse metadata: what it is, what it should encompass,
and why it is critical in data warehousing.
CHAPTER 10
HARDWARE AND OPERATING SYSTEMS
The terms hardware and operating systems refer to the server platforms and operating
systems that serve as the computing environment of the data warehouse. Warehousing
environments are typically separate from the operational computing environments (i.e., a
different machine is used) to avoid potential resource contention between operational and
decisional processing.
Enterprises are correctly wary of computing solutions that may compromise the
performance levels of mission-critical operational systems.
The major hardware vendors have established data warehousing initiatives or
partnership programs with other firms in a bid to provide comprehensive data warehousing
solution frameworks to their customers. This is very consistent with the solution integrator
role that major hardware vendors typically play on large computing projects.
10.1 PARALLEL HARDWARE TECHNOLOGY
As we mentioned, the two primary categories of parallel hardware used for data
warehousing are the symmetric multiprocessing (SMP) machines and massively parallel
processing (MPP) machines.
Symmetric Multiprocessing
These systems consist of anywhere from a pair to as many as 64 processors that share memory
and disk I/O resources equally under the control of one copy of the operating system, as
shown in Figure 10.1. Because system resources are equally shared between processors,
they can be managed more effectively. SMP systems make use of extremely high-speed
interconnections to allow each processor to share memory on an equal basis. These
interconnections are two orders of magnitude faster than those found in MPP systems, and
range from the 2.6 GB/sec interconnect on Enterprise 4000, 5000, and 6000 systems to the
12.8 GB/sec aggregate throughput of the Enterprise 10000 server with the Gigaplane-XB™
architecture.
In addition to high bandwidth, low communication latency is also important if the
system is to show good scalability. This is because common data warehouse database
operations such as index lookups and joins involve communication of small data packets.
The Enterprise 10000 (Starfire) server provides very low latency interconnects. At only
400 nsec for local access, the latency of the Starfire server’s Gigaplane-XB interconnect is
200 times lower than the 80,000 nsec latency of the IBM SP2 interconnect. Of all the systems
discussed here, the lowest latency is on the Enterprise 6000, with a uniform latency of only
300 nsec.
When the amount of data contained in each message is small, low latency is
paramount.
[Diagram: CPU 1, CPU 2, and CPU 3 sharing one Memory and one Disk]
Figure 10.1. SMP Architectures Consist of Multiple Processors Sharing Memory and I/O
Resources Through a High-Speed Interconnect, and Use a Single Copy of the Operating System
Massively Parallel Processor Systems
Massively parallel processor systems use a large number of nodes, each accessed using
an interconnect mechanism that supports message-based data transfers typically on the
order of 13 to 38 MB/sec. Figure 10.2 illustrates that each node is a self-sufficient processor
complex consisting of CPU, memory, and disk subsystems. In the vernacular, an MPP is
considered a “shared nothing” system because memory and I/O resources are independent,
with each node even supporting its own copy of the operating system. MPP systems promise
unlimited scalability, with a growth path that allows as many processors as necessary to be
added to a system.
MPP architectures provide excellent performance in situations where a problem can be
partitioned so that all nodes can run in parallel with little or no inter-node communication.
In reality, the true ad hoc queries typical of data warehouses can only rarely be so well
partitioned, which limits the performance that MPP architectures actually deliver. When
either data skew or high inter-node communication requirements prevail, the scalability of
MPP architectures is severely limited. Reliability is a concern with MPP systems because
the loss of a node does not merely reduce the processing power available to the whole
system; it can make any database objects that are wholly or partially located on the failed
node unavailable.
An interesting industry trend is for MPP vendors to augment single-processor “thin”
nodes with multiprocessor “fat” nodes using many processors in an SMP configuration within
each node. If the trend continues, MPP systems will have fewer nodes, each with an
increasing number of processors, and the architecture will begin to resemble clusters of SMP
systems, discussed below.
[Diagram: two independent nodes, each with its own Memory, CPU, and Disk]
Figure 10.2. MPP Architectures Consist of a Set of Independent Shared-Nothing
Nodes, Each Containing Processor, Memory, and Local Disks, Connected with a
Message-Based Interconnect.
SMP vs MPP Hardware Configuration
[Diagram: an SMP node (CPU 1, CPU 2, and CPU 3 sharing Memory and Disk) beside MPP nodes (each with its own Memory, CPU, and Disk)]
SMP
• One Node
• Many Processors per Node
• Scale Up by Adding CPUs or by Clustering
MPP
• Many Nodes
• One/More Processors per Node
• Each Node has its Own Memory
• Scale Up by Adding a Node
SMP is the architecture yielding the best reliability and performance, and has worked
through the difficult engineering problems to develop servers that can scale nearly linearly
with the incremental addition of processors. Ironically, the special shared-nothing versions
of merchant database products that run on MPP systems also run with excellent performance
on SMP architectures, although the reverse is not true.
Clustered Systems
When data warehouses must be scaled beyond the number of processors available in
current SMP configurations, or when the high availability (HA) characteristics of a
multiple-system complex are desirable, clusters provide an excellent growth path. High-availability
software can enable the other nodes in a cluster to take over the functions of a failed node,
ensuring around-the-clock availability of enterprise-critical data warehouses.
A cluster of four SMP servers is illustrated in Figure 10.3. The same database
management software that exploits multiple processors in an SMP architecture and distinct
processing nodes in MPP architecture can create query execution plans that utilize all
servers in the cluster.
[Diagram: four SMP nodes linked by a high-speed interconnect]
Figure 10.3. Cluster of Four SMP Systems Connected via a High Speed Interconnect Mechanism.
Clusters Enjoy the Advantages of SMP Scalability with the High Availability
Characteristics of Redundant Systems.
With their inherently superior scalability, SMP systems provide the best building blocks
for clustered systems. These systems are configured with multi-ported disk arrays so that
the nodes which have direct disk access enjoy the same disk I/O rates as stand-alone SMP
systems. Nodes not having direct access to disk data must use the high-speed cluster
interconnect mechanism.
In data warehouses deployed with clusters, the database management system or
load-balancing software is responsible for distributing the various threads of the DSS queries
across the multiple nodes for maximum efficiency. As with MPP systems, the more effectively
the query can be partitioned across the nodes in the cluster, and the less inter-node
communication is necessary, the more scalable the solution will be. This leads to the
conclusion that clusters should be configured with as few nodes as possible, with each SMP
node scaled up as much as possible before additional nodes are added to the cluster.
Currently, clustering of 30-processor Enterprise 6000 platforms is supported, with the
ability to cluster 64-processor Enterprise 10000 systems to be available in the future.
Beware that the larger the number of nodes in a cluster, the more the cluster looks like
an MPP system, and the database administrator may need to deal with the issues of large
numbers of nodes much sooner than they would with clusters of more powerful SMP systems.
These are the familiar issues of non-uniformity of data access, partitioning of database
tables across nodes, and the performance limits imposed by the high bandwidth interconnects.
A new direction being taken with clustered systems is to develop software that allows a
single system image to be executed across a clustered architecture, increasing the ease of
management far beyond that of today’s clusters.
10.2 THE DATA PARTITIONING ISSUE
The range of architectural choices from MPP to SMP offers a complex decision space for
organizations deploying data warehouses. Given that companies tend to make architectural
choices early, and then invest up to hundreds of millions of dollars as they grow, the result
of choosing an architecture that presents increasingly intractable partitioning problems and
the likelihood of idle nodes can have consequences that measure in the millions of dollars.
A system that scales well exploits all of its processors and makes best use of the investment
in computing infrastructure.
Regardless of environment, the single largest factor influencing the scalability of a
decision support system is how the data is partitioned across the disk subsystem. Systems
that do not scale well may have entire batch queries waiting for a single node to complete
an operation. On the other hand, for throughput environments with multiple concurrent
jobs running, these underutilized nodes may be exploited to accomplish other work.
There are typically three approaches to partitioning database records:
Range Partitioning
Range partitioning places specific ranges of table entries on different disks. For example,
records having “name” as a key may have names beginning with A-B in one partition,
C-D in the next, and so on. Likewise, a DSS managing monthly operations might partition
each month onto a different set of disks. In cases where only a portion of the data is used
in a query (the C-D range, for example), the database can avoid examining the other sets
of data in what is known as partition elimination. This can dramatically reduce the time to
complete a query.
The difficulty with range partitioning is that the quantity of data may vary significantly
from one partition to another, and the frequency of data access may vary as well. For
example, as the data accumulates, it may turn out that a larger number of customer names
fall into the M-N range than the A-B range. Likewise, mail-order catalogs find their December
sales to far outweigh the sales in any other month.
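Range partitioning and partition elimination can be sketched in a few lines. The example assumes a hypothetical three-partition layout keyed on the first letter of a name; the ranges, names, and function names are illustrative only and do not come from any DBMS.

```python
# Hypothetical partition layout keyed on the first letter of "name".
RANGES = [("A", "B"), ("C", "D"), ("E", "Z")]

def partition_for(name):
    """Return the index of the range partition that holds this name."""
    first = name[0].upper()
    for index, (low, high) in enumerate(RANGES):
        if low <= first <= high:
            return index
    raise ValueError(f"no partition covers {name!r}")

def partitions_to_scan(low, high):
    """Partition elimination: keep only partitions overlapping [low, high]."""
    return [index for index, (p_low, p_high) in enumerate(RANGES)
            if p_low <= high and low <= p_high]

partitions = [[] for _ in RANGES]
names = ["Adams", "Baker", "Clark", "Diaz", "Evans", "Moore", "Nolan", "Young"]
for name in names:
    partitions[partition_for(name)].append(name)

scanned = partitions_to_scan("C", "D")   # only the C-D partition is examined
sizes = [len(p) for p in partitions]     # uneven sizes illustrate the skew risk
```

A query on the C-D range touches one partition out of three, while the uneven partition sizes show the data skew that the text warns about.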
Round-Robin Partitioning
Round-robin partitioning evenly distributes records across all disks that compose a
logical space for the table, without regard to the data values being stored. This permits even
workload distribution for subsequent table scans. Disk striping accomplishes the same result,
spreading read operations across multiple spindles, but with the logical volume manager,
not the DBMS, managing the striping. One difficulty with round-robin partitioning is that
performance cannot be enhanced with partition elimination, even when elimination would be
appropriate for the query.
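A minimal sketch of round-robin placement, assuming records arrive as a simple list and disks are modeled as in-memory lists (both are assumptions of this example):

```python
def round_robin_partition(records, disk_count):
    """Assign record i to disk i mod disk_count, ignoring the record's value."""
    disks = [[] for _ in range(disk_count)]
    for position, record in enumerate(records):
        disks[position % disk_count].append(record)
    return disks

names = ["Adams", "Baker", "Clark", "Diaz", "Evans", "Moore", "Nolan"]
disks = round_robin_partition(names, 3)
sizes = [len(d) for d in disks]
```

Disk sizes never differ by more than one record, but because placement ignores the data values, a query for names in the C-D range has no way to rule out any disk.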
Hash Partitioning
Hash partitioning is a third method of distributing DBMS data evenly across the set
of disk spindles. A hash function is applied to one or more database keys, and the records
are distributed across the disk subsystem accordingly.
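The following sketch illustrates hash placement. It uses CRC-32 from the Python standard library as a stand-in hash function; a real DBMS uses its own internal hash, and the record layout and function names here are hypothetical.

```python
import zlib

def hash_partition(key, disk_count):
    """Map a key to a disk number using a deterministic hash (CRC-32 here)."""
    return zlib.crc32(str(key).encode("utf-8")) % disk_count

def distribute(records, key_of, disk_count):
    """Place each record on the disk chosen by hashing its key."""
    disks = [[] for _ in range(disk_count)]
    for record in records:
        disks[hash_partition(key_of(record), disk_count)].append(record)
    return disks

orders = [{"order_id": n, "amount": 10 * n} for n in range(1, 1001)]
disks = distribute(orders, lambda record: record["order_id"], 4)
```

A lookup by exact key needs only the one disk that the hash selects. A range of keys, however, is scattered across all disks, so a range query must visit every one of them.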
Again, a drawback of hash partitioning is that partition elimination may not be possible
for those queries whose performance could be improved with this technique.
For symmetric multiprocessors, the main reason for data partitioning is to avoid “hot spots”
on the disks, where records on one spindle may be frequently accessed, causing the disk to
become a bottleneck. These problems can usually be avoided by combinations of database
partitioning and the use of disk arrays with striping. Because all processors have equal access
to memory and disks, the layout of data does not significantly affect processor utilization.
For massively parallel processors, improper data partitioning can degrade performance by
an order of magnitude or more. Because the processors do not share memory and disk
resources equally, the choice of which node holds which data has a significant impact on
query performance.
The choice of partition key is a critical, fixed decision that has extreme consequences
over the life of an MPP-based data warehouse. Each database object can be partitioned once
and only once without re-distributing the data for that object. This decision determines long
into the future whether MPP processors are evenly utilized, or whether many nodes sit idle
while only a few are able to efficiently process database records. Unfortunately, because the
very purpose of data warehouses is to answer ad hoc queries that have never been foreseen,
a correct choice of partition key is one that is, by its very definition, impossible to make.
This is a key reason why database administrators who wish to minimize risks tend to
recommend SMP architectures, where the choice of partition strategy and keys has
significantly less impact.
Architectural Impacts on Query Performance
Three fundamental operations composing the steps of query execution plans are table
scans, joins, and index lookup operations. Since decision support system performance depends
on how well each of these operations is executed, it’s important to consider how their
performance varies between SMP and MPP architectures.
Table Scans
On MPP systems where the database records happen to be uniformly partitioned across
nodes, good performance on single-user batch jobs can be achieved because each
node/memory/disk combination can be fully utilized in parallel table scans, and each node has an
equivalent amount of work to complete the scan. When the data is not evenly distributed,
or less than the full set of data is accessed, load skew can occur, causing some nodes to
finish their scanning quickly and remain idle until the processor having the largest number
of records to process is finished. Because the database is statically partitioned, and the cost
of eliminating data skew by moving parts of tables across the interconnect is prohibitively
high, table scans may or may not equally utilize all processors depending on the uniformity
of the data layout. Thus the impact of database partitioning on an MPP can allow it to
perform as well as an SMP, or significantly less well, depending on the nature of the query.
On an SMP system, all processors have equal access to the database tables, so consistent
performance is achieved regardless of the database partitioning. The database query coordinator
simply allocates a set of processes to the table scan based on the number of processors
available and the current load on the DBMS. Table scans can be parallelized by dividing up
the table’s records between processes and having each process examine an equal number of
records, avoiding the problems of load skew that can cripple MPP architectures.
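The effect of load skew on an MPP table scan can be put into numbers. The sketch below assumes that every node scans at the same fixed rate and that the scan finishes when the busiest node finishes; the rate and the record counts are hypothetical.

```python
def scan_time(records_per_node, records_per_second):
    """Elapsed time of a parallel scan: the busiest node sets the pace."""
    return max(records_per_node) / records_per_second

def utilization(records_per_node):
    """Fraction of total node-time spent scanning rather than waiting."""
    busiest = max(records_per_node)
    return sum(records_per_node) / (busiest * len(records_per_node))

rate = 50_000  # hypothetical records scanned per second per node
even = [250_000, 250_000, 250_000, 250_000]
skewed = [700_000, 100_000, 100_000, 100_000]

even_seconds = scan_time(even, rate)
skewed_seconds = scan_time(skewed, rate)
skewed_utilization = utilization(skewed)
```

Both layouts hold one million records, yet under these assumptions the skewed layout takes 14 seconds instead of 5, and the four nodes together are busy for only about 36 percent of the elapsed node-time.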
Join Operations
Consi der a database joi n operati on i n whi ch the tabl es to be joi ned are equal l y di stri buted
across the nodes i n an MPP archi tecture. A joi n of thi s data may have very good or very poor
performance dependi ng on the rel ati onshi p between the parti ti on key and the joi n key:
• I f the partition key is equal to the join key, a process on each of the MPP nodes can
perform the joi n operati on on i ts l ocal data, most effecti vel y uti l i zi ng the processor
compl ex.
• I f the partition key is not equal to the join key, each record on each node has the
potenti al to joi n wi th matchi ng records on al l of the other nodes.
HARDWARE AND OPERATI NG SYSTEMS 151
When the MPP hosts N nodes, the join operation requires each of the N nodes to
transmit each record to the remaining N-1 nodes, increasing communication overhead and
reducing join performance. The problem gets even more complex when real-world data
having an uneven distribution is analyzed. Unfortunately, with ad hoc queries predominating
in decision support systems, the case of partition key not equal to the join key can be quite
common.
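The communication cost described above can be sketched as a simple count of records sent over the
interconnect. This is a simplified model of the worst case stated in the text, not a description of
how any particular MPP database plans a join; the function name `records_transmitted` and the row
counts are assumptions made for illustration.

```python
def records_transmitted(rows_per_node, keys_match):
    """Count records sent over the interconnect for one join.

    Simplified model from the text: when the partition key differs from the
    join key, every node sends each of its records to the other N-1 nodes.
    """
    if keys_match:
        return 0  # each node joins its local data only
    other_nodes = len(rows_per_node) - 1
    return sum(rows * other_nodes for rows in rows_per_node)


rows_per_node = [1_000, 1_000, 1_000, 1_000]  # hypothetical, evenly distributed
local_join = records_transmitted(rows_per_node, keys_match=True)
shuffled_join = records_transmitted(rows_per_node, keys_match=False)
print(local_join, shuffled_join)
```

With four nodes holding 1,000 records each, the mismatched case moves 12,000 records across the
interconnect, while the matched case moves none.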
To make matters worse, MPP partitioning decisions become more complicated when
joins among multiple tables are required. For example, consider Figure 10.4, where the DBA
must decide how to physically partition three tables: Supplier, PartSupp, and Part. It is
likely that queries will involve joins between Supplier and PartSupp, as well as between
PartSupp and Part. If the DBA decides to partition PartSupp across MPP nodes on the
Supplier key, then joins to Supplier will proceed optimally and with minimum inter-node
traffic. But then joins between Part and PartSupp could require high inter-node
communication, as explained above. The situation is similar if instead the DBA partitions
PartSupp on the Part key.
[Figure: the Supplier, PartSupp, and Part tables partitioned across Node 1, Node 2, and Node 3.
MPP: costly inter-node communication required. SMP: all communication at memory speeds.]
Figure 10.4. The Database Logical Schema May Make It Impossible to Partition All Tables on
the Optimum Join Keys. Partitioning Causes Uneven Performance on MPP, while Performance on
SMP Is Optimal.
For an SMP, the records selected for a join operation are communicated through the
shared memory area. Each process that the query coordinator allocates to the join operation
has equal access to the database records, and when communication is required between
processes it is accomplished at memory speeds that are two orders of magnitude faster than
MPP interconnect speeds. Again, an SMP has consistently good performance independent of
database partitioning decisions.
Index Lookups
The query optimizer chooses index lookups when the number of records to retrieve is
a small (significantly less than one percent) portion of the table size. During an index
lookup, the table is accessed through the relevant index, thus avoiding a full table scan. In
cases where the desired attributes can be found in the index itself, the query optimizer will
access the index alone, perhaps through a parallel full-index scan, not needing to examine
the base table at all. For example, assume that the index is partitioned evenly across all
nodes of an MPP, using the same partition key as used for the data table. All nodes can be
equally involved in satisfying the query to the extent that matching data rows are evenly
distributed across all nodes. If a global index (one not partitioned across nodes) is used,
then the workload distribution is likely to be uneven and scalability low.
On SMP architectures, performance is consistent regardless of the placement of the
index. Index lookups are easily parallelized on SMPs because each processor can be assigned
to access its portion of the index in a large shared memory area. All processors can be
involved in every index lookup, and the higher interconnect bandwidth can cause SMPs to
outperform MPPs even in the case where data is also evenly partitioned across the MPP
architecture.
10.3 HARDWARE SELECTION CRITERIA
The following criteria are recommended for hardware selection:
• Scalability. The warehouse solution can be scaled up in terms of space and processing
power. This is particularly important if the warehouse is projected to grow at a
rapid rate.
• Financial stability. The product vendor has proven to be a strong and visible player
in the hardware segment, and its financial performance indicates growth or stability.
• Price/performance. The product performs well in a price/performance comparison
with other vendors of similar products.
• Delivery lead time. The product vendor can deliver the hardware or an equivalent
service unit within the required time frame. If the unit is not readily available
within the same country, there may be delays due to importation logistics.
• Reference sites. The hardware vendor has a reference site that is using a similar
unit for the same purpose. The warehousing team can either arrange a site visit
or interview representatives from the site. Alternatively, an onsite test of the
unit can be conducted, especially if no reference is available.
• Availability of support. Support for the hardware and its operating system is
available, and support response times are within the acceptable down time for the
warehouse.
Examples of hardware and operating system platforms are provided below for
reference purposes only and are by no means an attempt to provide a complete list
of companies with warehousing platforms.
The platforms are listed in alphabetical order by company name; the sequence does not
imply any form of ranking.
• Digital. 64-bit AlphaServers and Digital UNIX or OpenVMS. Both SMP and MPP
configurations are available.
• HP. HP 9000 Enterprise Parallel Server.
• IBM. RS/6000 and the AIX operating system have been positioned for data
warehousing. The AS/400 has been used for data mart implementations.
• Microsoft. The Windows NT operating system has been positioned quite successfully
for data mart deployments.
• Sequent. Sequent NUMA-Q and the DYNIX operating system.
In Summary
Major hardware vendors have understandably established data warehousing initiatives
or partnership programs with both software vendors and consulting firms in a bid to provide
comprehensive data warehousing solutions to their customers.
Due to the potential size explosion of data warehouses, an enterprise is best served
by a powerful hardware platform that is scalable both in terms of processing power and disk
capacity. If the warehouse achieves a mission-critical status in the enterprise, then the
reliability, availability, and security of the computing platform become key evaluation criteria.
The clear separation of operational and decisional platforms also gives enterprises the
opportunity to use a different computing platform for the warehouse (i.e., different from the
operational systems).
CHAPTER 11
WAREHOUSING SOFTWARE
A warehousing team will require several different types of tools during the course of
a warehousing project. These software products generally fall into one or more of the categories
illustrated in Figure 11.1 and described below.
[Figure: Source Data feeds Extraction & Transformation (Middleware, Extraction, Transformation,
Quality Assurance, Load Image Creation), which feeds Warehouse Technology (Metadata, Data
Warehousing, Data Mart(s)), which feeds Data Access & Retrieval (OLAP, Report Writers, EIS/DSS,
Data Mining, Alert System, Exception Reporting).]
Figure 11.1. Data Warehouse Software Components
• Extraction and transformation. As part of the data extraction and transformation
process, the warehouse team requires tools that can extract, transform, integrate,
clean, and load data from source systems into one or more data warehouse databases.
Middleware and gateway products may be required for warehouses that extract
data from host-based source systems.
• Warehouse storage. Software products are also required to store warehouse data
and their accompanying metadata. Relational database management systems in
particular are well suited to large and growing warehouses.
• Data access and retrieval. Different types of software are required to access,
retrieve, distribute, and present warehouse data to its end users.
Tool examples listed throughout this chapter are provided for reference purposes only
and are by no means an attempt to provide a complete list of vendors and tools. The tools
are listed in alphabetical order by company name; the sequence does not imply any form of
ranking.
Also, many of the sample tools listed automate more than one aspect of the warehouse
back-end process. Thus, a tool listed in the extraction category may also have features that
fit into the transformation or data quality categories.
11.1 MIDDLEWARE AND CONNECTIVITY TOOLS
Connectivity tools provide transparent access to source systems in heterogeneous computing
environments. Such tools are expensive but quite often prove to be invaluable because they
provide transparent access to databases of different types, residing on different platforms.
Examples of commercial middleware and connectivity tools include:
• IBM: Data Joiner
• Oracle: Transparent Gateway
• SAS: SAS/Connect
• Sybase: Enterprise Connect
11.2 EXTRACTION TOOLS
There are now quite a number of extraction tools available, making tool selection a
potentially complicated process.
Tool Selection
Warehouse teams have many options when it comes to extraction tools. In general, the
choice of tool depends greatly on the following factors:
• The source system platform and database. Extraction and transformation tools
cannot access all types of data sources on all types of computing platforms. Unless
the team is willing to invest in middleware, the tool options are limited to those
that can work with the enterprise’s source systems.
• Built-in extraction or duplication functionality. The source systems may have
built-in extraction or duplication features, either through application code or through
database technology. The availability of these built-in tools may help reduce the
technical difficulties inherent in the data extraction process.
• The batch windows of the operational systems. Some extraction mechanisms
are faster or more efficient than others. The batch windows of the operational
systems determine the available time frame for the extraction process and therefore
may limit the team to a certain set of tools or extraction techniques.
The enterprise may opt to use simple custom-programmed extraction scripts for open
homogeneous computing environments, although without a disciplined approach to
documentation, such an approach may create an extraction system that is difficult to maintain.
Sophisticated extraction tools are a better choice for source systems in proprietary,
heterogeneous environments, although these tools are quite expensive.
Extraction Methods
There are two primary methods for extracting data from source systems (see Figure
11.2):
[Figure: two diagrams, each showing data moving from a Source System to the Data Warehouse,
one labeled Bulk Extractions and the other Change-Based Replication.]
Figure 11.2. Extraction Options
• Bulk extractions. The entire data warehouse is refreshed periodically by extractions
from the source systems. All applicable data are extracted from the source systems for
loading into the warehouse. This approach heavily taxes the network connection between
source and target databases, but such warehouses are easier to set up and maintain.
• Change-based replication. Only data that have been newly inserted or updated
in the source systems are extracted and loaded into the warehouse. This approach
places less stress on the network (due to the smaller volume of data to be transported)
but requires more complex programming to determine when a new warehouse
record must be inserted or when an existing warehouse record must be updated.
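The extra programming that change-based replication requires can be sketched as follows. This is a
minimal illustration, assuming each source row carries a `modified_on` date and a `customer_id` key;
those field names, the sample rows, and the functions `changed_since` and `apply_changes` are
hypothetical and are not taken from any extraction tool.

```python
from datetime import date


def changed_since(source_rows, last_extract):
    """Select only the rows inserted or updated after the previous extraction."""
    return [row for row in source_rows if row["modified_on"] > last_extract]


def apply_changes(warehouse, changed_rows, key="customer_id"):
    """Insert new warehouse records and update existing ones; return both counts."""
    inserted = updated = 0
    for row in changed_rows:
        if row[key] in warehouse:
            warehouse[row[key]].update(row)   # existing record: update in place
            updated += 1
        else:
            warehouse[row[key]] = dict(row)   # unseen key: insert a new record
            inserted += 1
    return inserted, updated


source = [
    {"customer_id": 1, "city": "Pune", "modified_on": date(1998, 8, 1)},
    {"customer_id": 2, "city": "Delhi", "modified_on": date(1998, 8, 6)},
    {"customer_id": 3, "city": "Agra", "modified_on": date(1998, 8, 7)},
]
warehouse = {
    1: {"customer_id": 1, "city": "Pune", "modified_on": date(1998, 8, 1)},
    2: {"customer_id": 2, "city": "Mumbai", "modified_on": date(1998, 7, 1)},
}

changes = changed_since(source, last_extract=date(1998, 8, 5))
counts = apply_changes(warehouse, changes)
print(counts)
```

A bulk extraction, by contrast, would simply replace the whole `warehouse` dictionary with the full
contents of `source` on every run, which is simpler to write but moves every row each time.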
Examples of extraction tools include:
• Apertus Carleton: Passport
• Evolutionary Technologies: ETI Extract
• Platinum: InfoPump
11.3 TRANSFORMATION TOOLS
Transformation tools are aptly named; they transform extracted data into the appropriate
format, data structure, and values that are required by the data warehouse.
Most transformation tools provide the features illustrated in Figure 11.3 and described
below.
Field Splitting
  Source system: Address Field: #123 ABC Street, XYZ City 1000, Republic of MN
  Data warehouse: No: 123; Street: ABC Street; City: XYZ City; Country: Republic of MN;
  Postal Code: 1000
Field Consolidation
  Source system: System A, Customer Title: President; System B, Customer Title: CEO
  Data warehouse: Customer Title: President and CEO
Standardization
  Source system: Order Date: August, 1998 05; Order Date: 08-08-1998
  Data warehouse: Order Date: August 05, 1998; Order Date: August 08, 1998
Deduplication
  Source system: System A, Customer Name: John W. Smith; System B, Customer Name:
  John William Smith
  Data warehouse: Customer Name: John William Smith
Figure 11.3. Data Transformations
• Field splitting and consolidation. Several logical data items may be implemented
as a single physical field in the source systems, resulting in the need to split up
a single source field into more than one target warehouse field. At the same time,
there will be many instances when several source system fields must be consolidated
and stored in one single warehouse field. This is especially true when the same
field can be found in more than one source system.
• Standardization. Standards and conventions for abbreviations, date formats, data
types, character formats, etc., are applied to individual data items to improve
uniformity in both format and content. Different naming conventions for different
warehouse object types are also defined and implemented as part of the
transformation process.
• De-duplication. Rules are defined to identify duplicate records of customers or
products. In many cases, the lack of data makes it difficult to determine whether
two records actually refer to the same customer or product. When a duplicate is
identified, two or more records are merged to form one warehouse record. Potential
duplicates can be identified and logged for further verification.
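The transformations in the list above can be sketched in a few lines, using the sample values from
Figure 11.3. This is an illustration only: the function names, the assumption that an address
arrives as three lines, and the assumption that `08-08-1998` is in month-day-year order are all
hypothetical choices, and a real transformation tool would be driven by rules rather than
hard-coded logic.

```python
import re
from datetime import datetime


def split_address(address):
    """Field splitting: one physical address field becomes five warehouse fields."""
    street_line, city_line, country = [part.strip() for part in address.splitlines()]
    number, street = re.match(r"#(\d+)\s+(.*)", street_line).groups()
    city, postal_code = city_line.rsplit(" ", 1)
    return {"no": number, "street": street, "city": city,
            "country": country, "postal_code": postal_code}


def consolidate_titles(titles):
    """Field consolidation: the same field from several systems becomes one value."""
    return " and ".join(titles)


def standardize_date(text, source_format):
    """Standardization: rewrite a source date in the warehouse's single date format."""
    return datetime.strptime(text, source_format).strftime("%B %d, %Y")


address = split_address("#123 ABC Street\nXYZ City 1000\nRepublic of MN")
title = consolidate_titles(["President", "CEO"])
order_date = standardize_date("08-08-1998", "%m-%d-%Y")
print(address, title, order_date)
```

De-duplication is left out of the sketch on purpose: deciding that "John W. Smith" and "John William
Smith" are the same customer depends on matching rules and supporting data, which is why potential
duplicates are often logged for verification instead of merged automatically.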
Warehouse load images (i.e., records to be loaded into the warehouse) are created
towards the end of the transformation process. Depending on the team’s key generation
approach, these load images may or may not yet have warehouse keys.
Examples of transformation tools include:
• Apertus Carleton: Enterprise/Integrator
• Data Mirror: Transformation Server
• Informatica: Power Mart Designer
11.4 DATA QUALITY TOOLS
Data quality tools assist warehousing teams with the task of locating and correcting
data errors that exist in the source system or in the data warehouse. Experience has shown
that easily up to 15 percent of the raw data extracted from operational systems are
inconsistent or incorrect. A higher percentage of data are likely to be in the wrong format.
Variations in naming conventions, abbreviations, and formats result in inconsistencies
that increase the difficulty of locating duplicate records. For example, “14/F,” “14th Floor,”
and “14th Flr.” all mean the same thing to operational staff but may not be recognized as
equivalent during the warehouse load.
Erroneous spellings of names, addresses, etc., due to homonyms likewise cause
inconsistencies. Updates (e.g., change of address) in one system that are not propagated to
other source systems also cause data quality problems.
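The floor example above can be made concrete with a small normalization routine. It is a sketch
under narrow assumptions: the pattern covers only the three spellings mentioned in the text, and the
canonical form `Floor 14` and the name `normalize_floor` are hypothetical choices, not the behavior
of any data quality product.

```python
import re

# Accepts "14/F", "14th Floor", and "14th Flr." (case-insensitive); nothing else.
FLOOR_PATTERN = re.compile(
    r"^\s*(\d+)\s*(?:/\s*F|(?:st|nd|rd|th)\s+(?:floor|flr))\.?\s*$",
    re.IGNORECASE,
)


def normalize_floor(text):
    """Return one canonical spelling, or None when the value is not recognized."""
    match = FLOOR_PATTERN.match(text)
    return f"Floor {int(match.group(1))}" if match else None


variants = ["14/F", "14th Floor", "14th Flr."]
normalized = {normalize_floor(value) for value in variants}
print(normalized)
```

Returning `None` for unrecognized values, instead of guessing, leaves those records available for
logging and manual review.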
Data quality tools can help identify and correct data errors, ideally at the source
systems. If corrections at the source are not possible, data quality tools can also be used on
the warehouse load images or on the warehouse data itself. However, this practice will
introduce inconsistencies between the source systems and the warehouse data; the warehouse
team may inadvertently create data synchronization problems.
It is interesting to note that while dirty data continue to be one of the biggest issues
for data warehousing initiatives, research indicates that data quality investments consistently
receive but a small percentage of total warehouse spending.
Examples of data quality tools include the following:
• Data Flux: Data Quality Workbench
• Pine Cone Systems: Content Tracker
• Prism: Quality Manager
• Vality Technology: Integrity Data Reengineering
11.5 DATA LOADERS
Data loaders load transformed data (i.e., load images) into the data warehouse. If load
images are available on the same RDBMS engine as the warehouse, then stored procedures
can be used to handle the warehouse loading.
If the load images do not have warehouse keys, then data loaders must generate the
appropriate warehouse keys as part of the load process.
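Key generation during the load can be sketched as a lookup from each source identifier to a
warehouse key, with a new key issued for identifiers not seen before. The sketch assumes simple
sequential integer keys and load images that carry `source_system` and `source_id` fields; these
names and the function `assign_warehouse_keys` are hypothetical, and an actual loader may generate
keys quite differently.

```python
from itertools import count


def assign_warehouse_keys(load_images, key_map, next_key):
    """Attach a warehouse key to every load image, reusing keys already issued."""
    loaded = []
    for image in load_images:
        natural_key = (image["source_system"], image["source_id"])
        if natural_key not in key_map:
            key_map[natural_key] = next(next_key)  # first sighting: issue a new key
        loaded.append({**image, "warehouse_key": key_map[natural_key]})
    return loaded


key_map = {}            # persists between loads in a real warehouse
next_key = count(1)     # hypothetical sequential key generator
images = [
    {"source_system": "A", "source_id": 100, "name": "John William Smith"},
    {"source_system": "B", "source_id": 7, "name": "Jane Doe"},
    {"source_system": "A", "source_id": 100, "name": "John William Smith"},
]
loaded = assign_warehouse_keys(images, key_map, next_key)
print([row["warehouse_key"] for row in loaded])
```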
11.6 DATABASE MANAGEMENT SYSTEMS
A database management system is required to store the cleansed and integrated data
for easy retrieval by business users. Two flavors of database management systems are
currently popular: Relational Databases and Multidimensional Databases.
Relational Database Management Systems (RDBMS)
All major relational database vendors have already announced the availability or
upcoming availability of data warehousing related features in their products. These features
aim to make the respective RDBMSes particularly suitable for very large database (VLDB)
implementations. Examples of such features are bit-mapped indexes and parallel query
capabilities.
Examples of these products include:
• IBM: DB2
• Informix: Informix RDBMS
• Microsoft: SQL Server
• Oracle: Oracle RDBMS
• Red Brick Systems: Red Brick Warehouse
• Sybase: RDBMS Engine-System II
Multidimensional Databases (MDDBs)
Multidimensional database engines store data in hypercubes, i.e., pages of numbers
that are paged in and out of memory on an as-needed basis, depending on the scope and
type of query. This approach is in contrast to the use of tables and fields in relational
databases.
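The contrast with tables and fields can be illustrated with a toy hypercube held entirely in memory.
This is a conceptual sketch only: a real MDDB engine pages cube data in and out of memory as
described above, whereas here the cube is a plain dictionary, and the dimension names, sample facts,
and functions `build_cube` and `roll_up` are hypothetical.

```python
from collections import defaultdict


def build_cube(facts, dimensions, measure):
    """Aggregate fact rows into cells addressed by one value per dimension."""
    cube = defaultdict(int)
    for fact in facts:
        cell = tuple(fact[d] for d in dimensions)
        cube[cell] += fact[measure]
    return dict(cube)


def roll_up(cube, dimensions, keep):
    """Sum the cube over every dimension that is not listed in `keep`."""
    positions = [dimensions.index(d) for d in keep]
    result = defaultdict(int)
    for cell, value in cube.items():
        result[tuple(cell[i] for i in positions)] += value
    return dict(result)


dimensions = ["product", "region", "month"]
facts = [
    {"product": "P1", "region": "North", "month": "Jan", "sales": 100},
    {"product": "P1", "region": "North", "month": "Jan", "sales": 50},
    {"product": "P1", "region": "South", "month": "Jan", "sales": 70},
    {"product": "P2", "region": "North", "month": "Feb", "sales": 30},
]
cube = build_cube(facts, dimensions, "sales")
by_product = roll_up(cube, dimensions, keep=["product"])
print(cube[("P1", "North", "Jan")], by_product)
```

A query addresses a cell directly by its coordinates instead of filtering and joining rows, which is
one intuition for why multidimensional engines respond quickly to the queries they were built for.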
Examples of these products include:
• Arbor: Essbase
• BrioQuery: Enterprise
• Dimensional Insight: DI-Diver
• Oracle: Express Server
Convergence of RDBMSes and MDDBs
Many relational database vendors have announced plans to integrate multidimensional
capabilities into their RDBMSes. This integration will be achieved by caching SQL query
results in a multidimensional hypercube in the database. Such Database OLAP technology
(sometimes referred to as DOLAP) aims to provide warehousing teams with the best of both
OLAP worlds.
11.7 METADATA REPOSITORY
Although there is a current lack of metadata repository standards, there is a
consciousness that the metadata repository should support the documentation of source
system data structures, transformation business rules, the extraction and transformation
programs that move the data, and the data structure definitions of the warehouse or data
marts. In addition, the metadata repository should also support aggregate navigation, query
statistic collection, and end-user help for warehouse contents.
Metadata repository products are also referred to as information catalogs and business
information directories. Examples of metadata repositories include:
• Apertus Carleton: Warehouse Control Center
• Informatica: Power Mart Repository
• Intellidex: Warehouse Control Center
• Prism: Prism Warehouse Directory
11.8 DATA ACCESS AND RETRIEVAL TOOLS
Data warehouse users derive and obtain information using these types of tools. Data access
and retrieval tools are currently classified into the subcategories below:
Online Analytical Processing (OLAP) Tools
OLAP tools allow users to make ad hoc queries or generate canned queries against the
warehouse database. The OLAP category has since divided further into the multidimensional
OLAP (MOLAP) and relational OLAP (ROLAP) markets.
MOLAP products run against a multidimensional database (MDDB). These products
provide exceptional responses to queries and typically have additional functionality or features,
such as budgeting and forecasting capabilities. Some of the tools also have built-in statistical
functions. MOLAP tools are better suited to power users in the enterprise.
ROLAP products, in contrast, run directly against warehouses in relational databases
(RDBMS). While the products provide slower response time than their MOLAP counterparts,
ROLAP products are simpler and easier to use and are therefore suitable to the typical
warehouse user. Also, since ROLAP products run directly against relational databases, they
can be used directly with large enterprise warehouses.
Examples of OLAP tools include:
• Arbor Software: Essbase OLAP
• Cognos: PowerPlay
• Intranet Business Systems: R/olapXL
Reporting Tools
These tools allow users to produce canned, graphic-intensive, sophisticated reports
based on the warehouse data. There are two main classifications of reporting tools: report
writers and report servers.
Report writers allow users to create parameterized reports that can be run by users on
an as-needed basis. These typically require some initial programming to create the report
template. Once the template has been defined, however, generating a report can be as easy
as clicking a button or two.
Report servers are similar to report writers but have additional capabilities that allow
their users to schedule when a report is to be run. This feature is particularly helpful if
the warehouse team prefers to schedule report generation processing during the night, after
a successful warehouse load. By scheduling the report run for the evening, the warehouse
team effectively removes some of the processing from the daytime, leaving the warehouse
free for ad hoc queries from online users. Some report servers also come with automated
report distribution capabilities. For example, a report server can e-mail a newly generated
report to a specified user or generate a web page that users can access on the enterprise
intranet. Report servers can also store copies of reports for easy retrieval by users over a
network on an as-needed basis.
Examples of reporting tools include:
• IQ Software: IQ/Smart Server
• Seagate Software: Crystal Reports
Executive Information Systems (EIS)
EIS systems and other Decision Support Systems (DSS) are packaged applications that
run against warehouse data. These provide different executive reporting features, including
“what if” or scenario-based analysis capabilities and support for the enterprise budgeting
process.
Examples of these tools include:
• Comshare: Decision
• Oracle: Oracle Financial Analyzer
While there are packages that provide decisional reporting capabilities, there are also EIS
and DSS development tools that enable the rapid development and maintenance of custom-
made decisional systems.
Examples include:
• Microstrategy: DSS Executive
• Oracle: Express Objects
Data Mining
Data mining tools search for inconspicuous patterns in transaction-grained data to shed
new light on the operations of the enterprise. Different data mining products support different
data mining algorithms or techniques (e.g., market basket analysis, clustering), and the
selection of a data mining tool is often influenced by the number and type of algorithms
supported.
Regardless of the mining techniques, however, the objective of these tools remains the
same: crunching through large volumes of data to identify actionable patterns that would
otherwise have remained undetected.
Data mining tools work best with transaction-grained data. For this reason, the
deployment of data mining tools may result in a dramatic increase in warehouse size. Due
to disk costs, the warehousing team may find itself having to make the painful compromise
of storing transaction-grained data for only a subset of its customers. Other teams may
compromise by storing transaction-grained data for a short time on a first-in-first-out basis
(e.g., transactions for all customers, but for the last six months only).
One last important note about data mining: Since these tools infer relationships and
patterns in warehouse data, a clean data warehouse will always produce better results than
a dirty warehouse. Dirty data may mislead both the data mining tools and their users by
producing erroneous conclusions.
Examples of data mining products include:
• ANGOSS: Knowledge STUDIO
• Data Distilleries: Data Surveyor
• Hyper Parallel: Discovery
• IBM: Intelligent Miner
• Integral Solutions: Clementine
• Magnify: Pattern
• Neo Vista Software: Decision Series
• Syllogic: Syllogic Data Mining Tool
Exception Reporting and Alert Systems
These systems highlight or call an end user’s attention to data, or to a set of conditions
about data, that are defined as exceptions. An enterprise typically implements three types
of alerts.
• Operational alerts from individual operational systems. These have long
been used in OLTP applications and are typically used to highlight exceptions
relating to transactions in the operational system. However, these types of alerts
are limited by the data scope of the OLTP application concerned.
• Operational alerts from the operational data store. These alerts require
integrated operational data and therefore are possible only on the Operational
Data Store. For example, a bank branch manager may wish to be alerted when a
bank customer who has missed a loan payment has made a large withdrawal from
his deposit account.
• Decisional alerts from the data warehouse. These alerts require comparisons
with historical values and therefore are possible only on the data warehouse. For
example, a sales manager may wish to be alerted when the sales for the current
month are found to be at least 8 percent less than sales for the same month last year.
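The decisional alert in the last example reduces to a comparison against a historical value. The
sketch below assumes the two monthly totals have already been retrieved from the warehouse; the
function name `sales_alert`, its parameters, and the sample figures are hypothetical and are not
taken from any alert product.

```python
def sales_alert(current_sales, same_month_last_year, threshold=0.08):
    """Return True when sales fell by at least `threshold` versus last year."""
    if same_month_last_year <= 0:
        return False  # no meaningful baseline to compare against
    decline = (same_month_last_year - current_sales) / same_month_last_year
    return decline >= threshold


# Hypothetical monthly totals: a 10 percent drop raises the alert, 5 percent does not.
raised = sales_alert(current_sales=90_000, same_month_last_year=100_000)
quiet = sales_alert(current_sales=95_000, same_month_last_year=100_000)
print(raised, quiet)
```

The historical baseline is what ties this kind of alert to the warehouse: an OLTP application
holding only current transactions would have nothing to pass as `same_month_last_year`.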
Products that can be used as exception reporting or alert systems include:
• Compulogic: Dynamic Query Messenger
• Pine Cone Systems: Activator Module (Content Tracker)
Web-Enabled Products
Front-end tools belonging to the above categories have gradually been adding web-publishing
features. This development is spurred by the growing interest in intranet technology as a
cost-effective alternative for sharing and delivering information within the enterprise.
11.9 DATA MODELING TOOLS
Data modeling tools allow users to prepare and maintain an information model of both
the source database and the target database. Some of these tools also generate the data
structures based on the models that are stored or are able to create models by reverse-
engineering existing databases. IT organizations that have enterprise data models will
quite likely have documented these models using a data modeling tool. While these tools are
nice to have, they are not a prerequisite for a successful data warehouse project.
As an aside, some enterprises make the mistake of adding the enterprise data model
to the list of data warehouse planning deliverables. While an enterprise data model is
helpful to warehousing, particularly during the source system audit, it is definitely not a
prerequisite of the warehousing project. Making the enterprise model a prerequisite or a
deliverable of the project will only serve to divert the team’s attention from building a
warehouse to documenting what data currently exists.
Examples include:
• Cayenne Software: Terrain
• Relational Matters: Syntagma Designer
• Sybase: Power Designer Warehouse Architect
11.10 WAREHOUSE MANAGEMENT TOOLS
These tools assist warehouse administrators in the day-to-day management and
administration of the warehouse. Different warehouse management tools support or automate
aspects of the warehouse administration and management tasks.
For example, some tools focus on the load process and therefore track the load histories
of the warehouse. Other tools track the types of queries that users direct to the warehouse
and identify which data are not used and therefore are candidates for removal.
Examples include:
• Pine Cone Systems: Usage Tracker, Refreshment Tracker
• Red Brick Systems: Enterprise Control and Coordination
11.11 SOURCE SYSTEMS
Data warehouses would not be possible without source systems, i.e., the operational
systems of the enterprise that serve as the primary source of warehouse data. Although
strictly speaking, the source systems are not data warehousing software products, they do
influence the selection of these tools or products.
The computing environments of the source systems generally determine the complexity
of extracting operational data. As can be expected, heterogeneous computing environments
increase the difficulties that a data warehouse team may encounter with data extraction
and transformation.
Application packages (e.g., integrated banking or integrated manufacturing and
distribution systems) with proprietary database structures will also pose data access problems.
External data sources may also be used. Examples include Bloomberg News, Lundberg,
A.C. Nielsen, Dun and Bradstreet, Mail code or Zip code data, Dow Jones news service,
Lexis, New York Times Services, and Nexis.
In Summary
Quite a number of technology vendors are supplying warehousing products in more
than one category and a clear trend towards the integration of different warehousing products
is evidenced by efforts to share metadata across different products and by the many
partnerships and alliances formed between warehousing vendors.
Despite this, there is still no clear market leader for an integrated suite of data
warehousing products. Warehousing teams are still forced to take on the responsibility of
integrating disparate products, tools, and environments or to rely on the services of a
solution integrator. Until this situation changes, enterprises should carefully evaluate the
form of the tools they eventually select for different aspects of their warehousing initiative.
The integration problems posed by the source system data are difficult enough without
adding tool integration problems to the project.
CHAPTER 12
WAREHOUSE SCHEMA DESIGN
Dimensional modeling is a term used to refer to a set of data modeling techniques that
have gained popularity and acceptance for data warehouse implementations. The
acknowledged guru of dimensional modeling is Ralph Kimball, and the most thorough
literature currently available on dimensional modeling is his book entitled ‘The Data
Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses’,
published by John Wiley & Sons (ISBN: 0-471-15337-0).
This chapter introduces dimensional modeling as one of the key techniques in data
warehousing and is not intended as a replacement for Ralph Kimball’s book.
12.1 OLTP SYSTEMS USE NORMALIZED DATA STRUCTURES
Most IT professionals are quite familiar with normalized database structures, since
normalization is the standard database design technique for the relational databases of
Online Transactional Processing (OLTP) systems. Normalized database structures make it
possible for operational systems to consistently record hundreds of discrete, individual
transactions, with minimal risk of data loss or data error.
Although normalized databases are appropriate for OLTP systems, they quickly create
problems when used with decisional systems.
Users Find Normalized Data Structures Difficult to Understand
Any IT professional who has asked a business user to review a fully normalized entity
relationship diagram has first-hand experience of the problem. Normalized data structures
simply do not map to the natural thinking processes of business users. It is unrealistic to
expect business users to navigate through such data structures.
If business users are expected to perform queries against the warehouse database on
an ad hoc basis and if IT professionals want to remove themselves from the report-
creation loop, then users must be provided with data structures that are simple and easy
to understand. Normalized data structures do not provide the required level of simplicity
and friendliness.
Normalized Data Structures Require Knowledge of SQL
To create even the most basic of queries and reports against a normalized data structure,
one requires knowledge of SQL (Structured Query Language), something that should not
be expected of business users, especially decision-makers. Senior executives should not have
to learn how to write programming code, and even if they knew how, their time is better
spent on non-programming activities.
Unsurprisingly, the use of normalized data structures results in many hours of IT
resources devoted to writing reports for operational and decisional managers.
Normalized Data Structures are not Optimized to Support Decisional Queries
By their very nature, decisional queries require the summation of hundreds to thousands
of figures stored in perhaps many rows in the database. Such processing on a fully normalized
data structure is slow and cumbersome. Consider the sample data structure in Figure 12.1.
[Diagram: entity-relationship model with the entities Customer, Individual Customer,
Corporate Customer, Account, Account Type, Order, Order Line Item, Product, and Product Type]
Figure 12.1. Example of a Normalized Data Structure
If a business manager requires a Product Sales per Customer report (see Figure 12.2),
the program code must access the Customer, Account, Account Type, Order Line Item, and
Product tables to compute the totals. The WHERE clause of the SQL statement will be
straightforward but long; records of the different tables have to be related to one another
to produce the correct result.
PRODUCT SALES PER CUSTOMER, Date: March 6, 1998
Customer      Product Name    Sales Amount
Customer A    Product X              1,000
              Product Y             10,000
Customer B    Product X              8,000
...           ...                      ...
Figure 12.2. Product Sales per Customer Sample Report
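The long WHERE clause mentioned above can be sketched as follows. This is only an illustration using Python's built-in sqlite3 module; the table and column names are hypothetical simplifications of Figure 12.1 (the Account Type table is left out), and the sample rows are invented so that the totals match Figure 12.2.

```python
import sqlite3

# Hypothetical, simplified versions of some tables from Figure 12.1.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE account (account_id INTEGER PRIMARY KEY, customer_id INTEGER);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, account_id INTEGER);
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE order_line_item (order_id INTEGER, product_id INTEGER, amount INTEGER);

INSERT INTO customer VALUES (1, 'Customer A'), (2, 'Customer B');
INSERT INTO account VALUES (10, 1), (20, 2);
INSERT INTO orders VALUES (100, 10), (101, 10), (200, 20);
INSERT INTO product VALUES (1, 'Product X'), (2, 'Product Y');
INSERT INTO order_line_item VALUES
    (100, 1, 1000), (100, 2, 4000), (101, 2, 6000), (200, 1, 8000);
""")

# Every table on the path from customer to product must be related
# to its neighbor in the WHERE clause before the totals can be computed.
query = """
SELECT c.name, p.name, SUM(li.amount)
FROM customer c, account a, orders o, order_line_item li, product p
WHERE a.customer_id = c.customer_id
  AND o.account_id = a.account_id
  AND li.order_id = o.order_id
  AND p.product_id = li.product_id
GROUP BY c.name, p.name
ORDER BY c.name, p.name
"""
report = conn.execute(query).fetchall()
for row in report:
    print(row)
```

Each additional normalized table on the path adds another join condition, which is why such queries grow long even when each condition is simple.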
12.2 DIMENSIONAL MODELING FOR DECISIONAL SYSTEMS
Dimensional modeling provides a number of techniques or principles for denormalizing
the database structure to create schemas that are suitable for supporting decisional processing.
These modeling principles are discussed in the following sections.
Two Types of Tables : Facts and Dimensions
Two types of tables are used in dimensional modeling: Fact tables and Dimension
tables.
Fact Tables
Fact tables are used to record actual facts or measures in the business. Facts are the
numeric data items that are of interest to the business.
Below are examples of facts for different industries:
• Retail. Number of units sold, sales amount.
• Telecommunications. Length of call in minutes, average number of calls.
• Banking. Average daily balance, transaction amount.
• Insurance. Claims amounts.
• Airline. Ticket cost, baggage weight.
Facts are the numbers that users analyze and summarize to gain a better understanding
of the business.
Dimension Tables
Dimension tables, on the other hand, establish the context of the facts. Dimension
tables store fields that describe the facts.
Below are examples of dimensions for the same industries:
• Retail. Store name, store zip, product name, product category, day of week.
• Telecommunications. Call origin, call destination.
• Banking. Customer name, account number, date, branch, account officer.
• Insurance. Policy type, insured policy.
• Airline. Flight number, flight destination, airfare class.
Facts and Dimensions in Reports
When a manager requires a report showing the revenue for Store X, at Month Y, for
Product Z, the manager is using the Store dimension, the Time dimension, and the Product
dimension to describe the context of the revenue (fact).
Thus, for the sample report in Figure 12.3, sales region and country are dimensional
attributes, and “2Q, 1997” is a dimensional value. These data items establish the context
and lend meaning to the facts in the report: sales targets and sales actuals.
For 2Q, 1997
Sales Region     Country          Target (in ‘000s)    Actuals (in ‘000s)
Asia             Philippines                 14,000                15,050
                 Hong Kong                   10,000                10,500
Europe           France                       4,000                 4,050
                 Italy                        6,000                 8,150
North America    United States                1,000                 1,500
                 Canada                       7,000                   500
Africa           Egypt                        5,600                 6,200
Figure 12.3. Second Quarter Sales Sample Report
12.3 STAR SCHEMA
The multidimensional view of data that is expressed using relational database semantics
is provided by the database schema design called star schema. The basic premise of star
schemas is that information can be classified into two groups: facts and dimensions. Facts
are the core data elements being analyzed. For example, units of individual items sold are
facts, while dimensions are attributes about the facts. For example, dimensions are the
product types purchased and the date of purchase.
Visually, a dimensional schema looks very much like a star, hence the use of the term
star schema to describe dimensional models. Fact tables reside at the center of the schema,
and their dimensions are typically drawn around them, as shown in Figure 12.4.
[Diagram: the SALES fact table at the center, surrounded by the Client, Time, Product,
and Organization dimensions]
Figure 12.4. Dimensional Star Schema Example
In Figure 12.4, the dimensions are Client, Time, Product and Organization. The fields
in these tables are used to describe the facts in the sales fact table.
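A star schema along the lines of Figure 12.4 might be declared as in the sketch below. It uses Python's built-in sqlite3 module; all table names, column names, and sample rows are assumptions made for illustration and are not taken from the book.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one table per dimension (columns are hypothetical).
CREATE TABLE client (client_key INTEGER PRIMARY KEY, client_name TEXT);
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
CREATE TABLE product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE organization (org_key INTEGER PRIMARY KEY, branch TEXT, region TEXT);

-- Fact table at the center: dimension keys plus a numeric fact.
CREATE TABLE sales (
    client_key INTEGER REFERENCES client(client_key),
    time_key INTEGER REFERENCES time_dim(time_key),
    product_key INTEGER REFERENCES product(product_key),
    org_key INTEGER REFERENCES organization(org_key),
    amount_sold INTEGER
);

INSERT INTO client VALUES (1, 'Client A');
INSERT INTO time_dim VALUES (1, 1997, 'Q2');
INSERT INTO product VALUES (1, 'Product X'), (2, 'Product Y');
INSERT INTO organization VALUES (1, 'Branch 1', 'Asia'), (2, 'Branch 2', 'Europe');
INSERT INTO sales VALUES (1, 1, 1, 1, 100), (1, 1, 2, 1, 50), (1, 1, 1, 2, 70);
""")

# A decisional query joins the fact table directly to the dimensions it
# needs; no chain of intermediate tables has to be traversed.
sales_by_region = conn.execute("""
    SELECT o.region, SUM(s.amount_sold)
    FROM sales s
    JOIN organization o ON o.org_key = s.org_key
    JOIN time_dim t ON t.time_key = s.time_key
    WHERE t.year = 1997 AND t.quarter = 'Q2'
    GROUP BY o.region
    ORDER BY o.region
""").fetchall()
print(sales_by_region)
```

Every dimension is exactly one join away from the fact table, which is what gives the schema its star shape.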
Facts are Fully Normalized, Dimensions are Denormalized
One of the key principles of dimensional modeling is the use of fully normalized Fact
tables together with fully denormalized Dimension tables. Unlike dimensional schemas, a
fully normalized database schema no doubt would implement some of these dimensions as
many logical (and physical) tables.
In Figure 12.4, note that because the Dimension tables are denormalized, the schema
shows no outlying tables beyond the four dimension tables. A fully normalized product
dimension, in contrast, may have the additional tables shown in Figure 12.5.
[Diagram: the linked tables Product Group, Product Subgroup, Product Category, and Product]
Figure 12.5. Normalized Product Tables
It is the use of these additional normalized tables that decreases the friendliness and
navigability of the schema. By denormalizing the dimensions, one makes available to the
user all relevant attributes in one table.
12.4 DIMENSIONAL HIERARCHIES AND HIERARCHICAL DRILLING
As a result of denormalization of the dimensions, each dimension will quite likely have
hierarchies that imply the grouping and structure.
The easiest example can be found in the Time dimension. As shown in Figure 12.6, the
Time dimension has a Day-Month-Quarter-Year hierarchy. Similarly, the Store dimension
may have a City-Country-Region-All Stores hierarchy. The Product dimension may have a
Product-Product Category-Product Department-All Products hierarchy.
Time:    Year, Quarter, Month, Day
Store:   All Stores, Region, Country, City
Product: All Products, Product Department, Product Category, Product
Figure 12.6. Dimensional Hierarchies
When warehouse users drill up and down, they typically move along
these dimensional hierarchies to obtain more or less detail about the business.
For example, a user may initially have a sales report showing the total sales for all
regions for the year. Figure 12.7 relates the hierarchies to the sales report.
[Diagram: the Time, Store, and Product hierarchies of Figure 12.6, related to the report below]
Product Sales
Year    Region     Sales
1996    Asia       1,000
        Europe     50,000
        America    20,000
1997    Asia       1,5000
        ...        ...
Figure 12.7. Dimensional Hierarchies and the Corresponding Report Sample
For such a report, the business user is at (1) the Year level of the Time hierarchy; (2)
the Region level of the Store hierarchy; and (3) the All Products level of the Product
hierarchy.
A drill-down along any of the dimensions can be achieved either by adding a new
column or by replacing an existing column in the report. For example, drilling down the
Time dimension can be achieved by adding Quarter as a second column in the report, as
shown in Figure 12.8.
[Diagram: the Time, Store, and Product hierarchies of Figure 12.6, related to the report below]
Product Sales
Year    Quarter    Region    Sales
1996    Q1         Asia      200
        Q2         Asia      200
        Q3         Asia      250
        Q4         Asia      350
        Q1         Europe    10,000
...     ...        ...       ...
Figure 12.8. Drilling Down Dimensional Hierarchies
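Drilling down by adding a column can be sketched as grouping the same facts by one more hierarchy level. The snippet below is a minimal illustration in plain Python; the function name `summarize` and the row layout are assumptions, and the Asia figures are the ones shown in Figures 12.7 and 12.8.

```python
from collections import defaultdict

# Quarterly sales facts for Asia in 1996, as listed in Figure 12.8.
facts = [
    {"year": 1996, "quarter": "Q1", "region": "Asia", "sales": 200},
    {"year": 1996, "quarter": "Q2", "region": "Asia", "sales": 200},
    {"year": 1996, "quarter": "Q3", "region": "Asia", "sales": 250},
    {"year": 1996, "quarter": "Q4", "region": "Asia", "sales": 350},
]

def summarize(rows, levels):
    """Total the sales for each combination of the given hierarchy levels."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[level] for level in levels)
        totals[key] += row["sales"]
    return dict(totals)

# Year level of the Time hierarchy (the report in Figure 12.7).
by_year = summarize(facts, ["year", "region"])
# Drill down: add Quarter as a second column (the report in Figure 12.8).
by_quarter = summarize(facts, ["year", "quarter", "region"])

print(by_year)
print(by_quarter)
```

Drilling up is the reverse operation: dropping the Quarter level collapses the four quarterly rows back into the single yearly total.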
12.5 THE GRANULARITY OF THE FACT TABLE
The term granularity is used to indicate the level of detail stored in the table. The
granularity of the Fact table follows naturally from the level of detail of its related dimensions.
For example, if each Time record represents a day, each Product record represents a
product, and each Organization record represents one branch, then the grain of a sales Fact
table with these dimensions would likely be: sales per product per day per branch.
Proper identification of the granularity of each schema is crucial to the usefulness and
cost of the warehouse. Granularity at too high a level severely limits the ability of users to
obtain additional detail. For example, if each time record represented an entire year, there
will be one sales fact record for each year, and it would not be possible to obtain sales figures
on a monthly or daily basis.
In contrast, granularity at too low a level results in a steep increase in the size
requirements of the warehouse. For example, if each time record represented an hour, there
will be one sales fact record for each hour of the day (or 8,760 sales fact records for a year
with 365 days) for each combination of Product, Client, and Organization. If daily sales facts
are all that are required, the number of records in the database can be reduced dramatically.
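The effect of the time grain on the size of the fact table is simple arithmetic, sketched below in Python. The function name and the `combinations` parameter (the number of Product, Client, and Organization combinations that record sales) are hypothetical names used only for this illustration.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365

def fact_records_per_year(time_records_per_year, combinations=1):
    """Upper bound on yearly fact records: one per time record per combination."""
    return time_records_per_year * combinations

daily_grain = fact_records_per_year(DAYS_PER_YEAR)
hourly_grain = fact_records_per_year(DAYS_PER_YEAR * HOURS_PER_DAY)

print(daily_grain)                  # 365
print(hourly_grain)                 # 8760
print(hourly_grain // daily_grain)  # 24
```

Moving from an hourly to a daily grain therefore cuts the maximum number of fact records by a factor of 24 for every combination of the other dimensions.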
The Fact Table Key Concatenates Dimension Keys
Since the granularity of the fact table determines the level of detail of the dimensions
that surround it, it follows that the key of the fact table is actually a concatenation of the
keys of each of its dimensions.
Table 12.1. Properties of Fact and Dimension Tables
Property         Client Table    Product Table    Time Table       Sales Table
Table Type       Dimension       Dimension        Dimension        Fact
One Record is    One Client      One Product      One Day          Sales per Client per
                                                                   Product per Day
Key              Client Key      Product Key      Time Key         Client Key +
                                                                   Product Key +
                                                                   Time Key
Sample Fields    First Name      Product Name     Date             Amount Sold
or Attributes    Last Name       Color            Month            Quantity Sold
                 Gender          Size             Year
                 City            Product Class    Day of Month
                 Weight          Product Group    Day of Week
                 Country                          Week Number
                                                  Weekday Flag
                                                  Holiday Flag
Thus, if the granularity of the sales schema is sales per client per product per day, the
sales fact table key is actually the concatenation of the client key, the product key and the
time key (Day), as presented in Table 12.1.
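A concatenated fact table key of this kind can be sketched as a composite primary key. The example below uses Python's built-in sqlite3 module; the column names and sample values are assumptions chosen to mirror Table 12.1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales (
        client_key INTEGER NOT NULL,
        product_key INTEGER NOT NULL,
        time_key INTEGER NOT NULL,
        amount_sold INTEGER,
        quantity_sold INTEGER,
        -- The fact key is the concatenation of the dimension keys.
        PRIMARY KEY (client_key, product_key, time_key)
    )
""")

conn.execute("INSERT INTO sales VALUES (1, 7, 19980216, 500, 5)")

# A second record for the same client, product, and day violates the
# grain of the table and is rejected by the composite key.
try:
    conn.execute("INSERT INTO sales VALUES (1, 7, 19980216, 300, 3)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

record_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(duplicate_rejected, record_count)
```

Declaring the key this way makes the grain explicit: at most one fact record can exist per client per product per day.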
12.6 AGGREGATES OR SUMMARIES
Aggregates or Summaries are one of the most powerful concepts in data warehousing.
The proper use of aggregates dramatically improves the performance of the data warehouse
in terms of query response times, and therefore improves the overall performance and
usability of the warehouse.
Computation of Aggregates is Based on Base-Level Schemas
An aggregate is a precalculated summary stored within the warehouse, usually in a
separate schema. Aggregates are typically computed based on records at the most detailed
(or base) level (see Figure 12.9). They are used to improve the performance of the warehouse
for those queries that require only high-level or summarized data.
[Diagram: the Time (Year, Quarter, Month, Day), Store (All Stores, Region, Country, City), and
Product (All Products, Product Department, Product Category, Product) hierarchies]
Figure 12.9. Base-Level Schemas Use the Bottom Level of Dimensional Hierarchies
Aggregates are merely summaries of the base-level data at higher points along the
dimensional hierarchies, as illustrated in Figure 12.10.
[Diagram: the same Time, Store, and Product hierarchies as in Figure 12.9]
Figure 12.10. Aggregate Schemas are Higher along the Dimensional Hierarchies
Rather than running a high-level query against base-level or detailed data, users can
run the query against aggregated data. Aggregates provide dramatic improvements in
performance because of the significantly smaller number of records.
Aggregates have Fewer Records than Do Base-Level Schemas
Consider the schema in Figure 12.11 with the following characteristics
[Diagram: the Sales Transaction fact table surrounded by the Time, Store, and Product dimensions]
Figure 12.11. Sample Schema
With the assumptions outlined above, it is possible to compute the number of fact
records required for different types of queries.
If a Query Involves...              Then it Must Retrieve or Summarize
1 Product, 1 Store, and 1 Week      Only 1 record from the Schema
1 Product, All Stores, 1 Week       10 records from the Schema
1 Brand, 1 Store, 1 Week            100 records from the Schema
1 Brand, All Stores, 1 Year         52,000 records from the Schema
If aggregates had been pre-calculated and stored so that each aggregate record provides
facts for a brand per store per week, the third query above (1 Brand, 1 Store, 1 Week) would
require only 1 record, instead of 100 records. Similarly, the fourth query above (1 Brand, All
Stores, 1 Year) would require only 520 records instead of 52,000. The resulting improvements
in query response times are obvious.
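The record counts above can be reproduced with a few multiplications. The sketch below assumes 10 stores, 100 products per brand, 52 weeks per year, and a base grain of one record per product per store per week; these values are not stated in the text and are inferred here only because they are consistent with the counts in the table.

```python
# Assumed characteristics (inferred from the record counts, not stated in the text).
STORES = 10
PRODUCTS_PER_BRAND = 100
WEEKS_PER_YEAR = 52

def records_needed(products, stores, weeks):
    """Records read when there is one record per product per store per week."""
    return products * stores * weeks

# Queries against the base-level schema.
one_brand_one_store_one_week = records_needed(PRODUCTS_PER_BRAND, 1, 1)
one_brand_all_stores_one_year = records_needed(PRODUCTS_PER_BRAND, STORES, WEEKS_PER_YEAR)

# The same queries against an aggregate holding one record per brand per
# store per week: the product level collapses to a single brand record.
agg_one_store_one_week = records_needed(1, 1, 1)
agg_all_stores_one_year = records_needed(1, STORES, WEEKS_PER_YEAR)

print(one_brand_one_store_one_week, agg_one_store_one_week)
print(one_brand_all_stores_one_year, agg_all_stores_one_year)
```

Under these assumptions the brand-level aggregate reduces the records read by a factor equal to the number of products per brand.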
12.7 DIMENSIONAL ATTRIBUTES
Dimensional attributes play a very critical role in dimensional star schemas. The attribute
values are used to establish the context of the facts.
For example, a fact table record may have the following keys: Date, Store ID, Product
ID (with the corresponding Time, Store, and Product dimensions). If the key fields have the
values “February 16, 1998,” “101,” and “ABC” respectively, then the dimensional attributes in
the Time, Store, and Product dimensions can be used to establish the context of the facts
in the Fact Record as follows.
[Diagram: the Sales Fact table linked to the Store, Time, and Product dimensions, with the
attribute values below]
Store dimension
  Store Name:          One Stop Shop
  Store City:          Evanston
  Store Zip:           60201
  Store Size:          500 sq ft.
  Store Layout:        Classic
  ...etc.
Product dimension
  Product Name:        Joy Tissue Paper
  Product Category:    Paper Products
  Product Department:  Groceries
  Product Color:       White
  ...etc.
Time dimension
  Date:                February 16
  Day of Week:         Monday
  Short Day of Week:   Mon
  Month Name:          February
  Short Month Name:    Feb
  Quarter in Year:     1
  ...etc.
Figure 12.12. An Example
From Figure 12.12, it can be quickly understood that one of the sales records in the Fact
table refers to the sale of Joy Tissue Paper at the One Stop Shop on the day of February 16.
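The way dimension attributes lend context to a fact record can be sketched with simple lookups. In the snippet below the dictionaries stand in for dimension tables; the key format "1998-02-16", the `quantity_sold` value, and the function name `describe` are assumptions for illustration, while the attribute values come from Figure 12.12.

```python
# Dimension tables keyed by their dimension keys (attribute values from Figure 12.12).
time_dim = {
    "1998-02-16": {"date": "February 16", "day_of_week": "Monday", "quarter_in_year": 1},
}
store_dim = {
    "101": {"store_name": "One Stop Shop", "store_city": "Evanston", "store_zip": "60201"},
}
product_dim = {
    "ABC": {"product_name": "Joy Tissue Paper", "product_category": "Paper Products"},
}

# A fact record holds only keys and numeric facts (quantity_sold is hypothetical).
fact_record = {"date": "1998-02-16", "store_id": "101", "product_id": "ABC", "quantity_sold": 3}

def describe(fact):
    """Resolve the keys of a fact record into readable dimension attributes."""
    t = time_dim[fact["date"]]
    s = store_dim[fact["store_id"]]
    p = product_dim[fact["product_id"]]
    return f'{p["product_name"]} sold at {s["store_name"]} on {t["day_of_week"]}, {t["date"]}'

context = describe(fact_record)
print(context)
```

The fact record itself stays compact; everything a user needs to read it comes from the attributes of the surrounding dimensions.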
12.8 MULTIPLE STAR SCHEMAS
A data warehouse will most likely have multiple star schemas, i.e., many Fact tables.
Each schema is designed to meet a specific set of information needs. Multiple schemas, each
focusing on a different aspect of the business, are natural in a dimensional warehouse.
Equally normal is the use of the same Dimension table in more than one schema. The
classic example of this is the Time dimension. An enterprise can reuse the Time dimension
in all warehouse schemas, provided that the level of detail is appropriate.
For example, a retail company that has one star schema to track profitability per store
may make use of the same Time dimension table in the star schema that tracks profitability
by product.
Core and Custom Tables
There are many instances when distinct products within the enterprise are similar enough
that these can share the same data structure in the warehouse. For example, banks that offer
both current accounts and savings accounts will treat these two types of products differently,
but the facts that are stored are fairly similar and can share the same data structure.
Unfortunately, there are also many instances when different products will have different
characteristics and different interesting facts. Still within the banking example, a credit
card product will have facts that are quite different from the current account or savings
account. In this scenario, the bank has heterogeneous products that require the use of Core
and Custom tables in the warehouse schema design.
Core Fact and Dimension tables store facts that are common to all types of products,
and Custom Fact and Dimension tables store facts that are specific to each distinct
heterogeneous product.
Thus, if warehouse users wish to analyze data across all products, they will make use
of the Core Fact and Dimension tables. If users wish to analyze data specific to one type of
product, they will make use of the appropriate Custom Fact and Dimension tables.
Note that the keys in the Custom tables are identical to those in the Core tables. Each
Custom Dimension table is a subset of the Core Dimension table, with the Custom tables
containing additional attributes specific to each heterogeneous product.
12.9 ADVANTAGES OF DIMENSIONAL MODELING
Dimensional modeling presents warehousing teams with simple but powerful concepts
for designing large-scale data warehouses using relational database technology.
• Dimensional modeling is simple. Dimensional modeling techniques make it
possible for warehouse designers to create database schemas that business users
can easily grasp and comprehend. There is no need for extensive training on how
to read diagrams, and there are no confusing relationships between different data
items. The dimensions mimic perfectly the multidimensional view that users have
of the business.
• Dimensional modeling promotes data quality. By its very nature, the star
schema allows warehouse administrators to enforce referential integrity checks on
the warehouse. Since the fact record key is a concatenation of the keys of its related
dimensions, a fact record is successfully loaded only if the corresponding dimension
records are duly defined and also exist in the database.
By enforcing foreign key constraints as a form of referential integrity check,
warehouse DBAs add a line of defense against corrupted warehouse data.
• Performance optimization is possible through aggregates. As the size of the
warehouse increases, performance optimization becomes a pressing concern. Users
who have to wait hours to get a response to a query will quickly become discouraged
with the warehouse. Aggregates are one of the most manageable ways by which
query performance can be optimized.
• Dimensional modeling makes use of relational database technology. With
dimensional modeling, business users are able to work with multidimensional views
without having to use multidimensional database (MDDB) structures. Although
MDDBs are useful and have their place in the warehousing architecture, they have
severe size limitations.
Dimensional modeling allows IT professionals to rely on highly scalable relational
database technology for their large-scale warehousing implementation, without
compromising on the usability of the warehouse schema.
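The referential integrity check described under the data quality point above can be sketched with a foreign key constraint. The example uses Python's built-in sqlite3 module, where foreign key enforcement has to be switched on per connection; the table names, column names, and the helper `load_fact` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE sales (
    product_key INTEGER NOT NULL REFERENCES product(product_key),
    amount_sold INTEGER
);
INSERT INTO product VALUES (1, 'Product X');
""")

def load_fact(product_key, amount_sold):
    """Try to load one fact record; report whether the database accepted it."""
    try:
        conn.execute("INSERT INTO sales VALUES (?, ?)", (product_key, amount_sold))
        return True
    except sqlite3.IntegrityError:
        return False

loaded_known = load_fact(1, 1000)   # dimension record exists
loaded_orphan = load_fact(99, 500)  # no such product in the dimension

fact_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(loaded_known, loaded_orphan, fact_count)
```

A fact record whose dimension key has no matching dimension record is rejected at load time rather than discovered later in a report.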
In Summary
Dimensional modeling provides a number of techniques or principles for denormalizing
the database structure to create schemas that are suitable for supporting decisional processing.
The multidimensional view of data that is expressed using relational database semantics
is provided by the database schema design called star schema. The basic premise of star
schemas is that information can be classified into two groups: facts and dimensions. The
fully normalized fact table resides at the center of the star schema and is surrounded
by fully denormalized dimension tables, and the key of the fact table is actually a
concatenation of the keys of each of its dimensions.
CHAPTER 13
WAREHOUSE METADATA
Metadata, or the information about the enterprise data, is emerging as a critical element
in effective data management, especially in the data warehousing arena. Vendors as well as
users have begun to appreciate the value of metadata. At the same time, the rapid proliferation
of data manipulation and management tools has resulted in almost as many different “flavors”
and treatments of metadata as there are tools.
Thus, this chapter discusses metadata as one of the most important components of a
data warehouse, as well as the infrastructure that enables its storage, management, and
integration with other components of a data warehouse. This infrastructure is known as the
metadata repository, and it is discussed through the examples of several repository implementations,
including the repository products from Platinum Technologies, Prism Solution, and R&O.
13.1 METADATA DEFINED
Metadata is one of the most important aspects of data warehousing. It is data about
data stored in the warehouse and its users. At a minimum, metadata contains:
• The location and description of warehouse system and data components (warehouse
objects).
• Names, definition, structure, and content of the data warehouse and end-user
views.
• Identification of authoritative data sources (systems of record).
• Integration and transformation rules used to populate the data warehouse; these
include the mapping method from operational databases into the warehouse, and
algorithms used to convert, enhance, or transform data.
• Integration and transformation rules used to deliver data to end-user analytical
tools.
• Subscription information for the information delivery to the analysis subscribers.
• Data warehouse operational information, which includes a history of warehouse
updates, refreshments, snapshots, versions, ownership authorizations, and extract
audit trail.
• Metrics used to analyze warehouse usage and performance vis-à-vis end-user usage
patterns.
• Security authorizations, access control lists, etc.
13.2 METADATA ARE A FORM OF ABSTRACTION
It is fairly easy to apply abstraction to concrete, tangible items. Information technology
professionals do this all the time when they design operational systems. A concrete product
is abstracted and described by its properties (i.e., data attributes), for example, name, color,
weight, size, price. A person can also be abstracted and described through his name, age,
gender, occupation, etc.
Abstraction complexity increases when the item that is abstracted is not as concrete;
however, such abstraction is still routinely performed in operational systems. For example,
a banking transaction can be described by the transaction amount, transaction currency,
transaction type (e.g., withdrawal) and the date and time when the transaction took
place.
Figure 13.1 and Figure 13.2 present two metadata examples for data warehouses. The
first example provides sample metadata for warehouse fields. The second provides sample
metadata for warehouse dimensions. These metadata are supported by the Warehouse
Designer software product that accompanies this book.
Metadata Example for Warehouse Fields
• Field Name. The name of the field, as it will be used in the physical table.
• Caption. The name of the field, as it should appear to users.
• Data Type. The appropriate data type for the field, as
supported by the target RDBMS.
• Index Type. The type of index to be used on this attribute.
• Key? Whether this is a Key field in the Dimension table.
• Format. The format of this field.
• Description. A description or definition of this field.
Figure 13.1. Metadata Example for Warehouse Fields
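The field metadata listed in Figure 13.1 can be pictured as one small record per warehouse field. The sketch below uses a Python dataclass whose attributes follow the figure; the class name `FieldMetadata` and the sample values for a store zip field are assumptions and do not describe the Warehouse Designer product.

```python
from dataclasses import dataclass, asdict

@dataclass
class FieldMetadata:
    """One metadata record per warehouse field, following Figure 13.1."""
    field_name: str   # name used in the physical table
    caption: str      # name shown to users
    data_type: str    # data type supported by the target RDBMS
    index_type: str   # type of index used on this attribute
    is_key: bool      # whether this is a Key field in the Dimension table
    format: str       # format of this field
    description: str  # description or definition of this field

# Hypothetical metadata for a store zip code field.
store_zip = FieldMetadata(
    field_name="STORE_ZIP",
    caption="Store Zip",
    data_type="CHAR(5)",
    index_type="None",
    is_key=False,
    format="99999",
    description="Postal code of the store location.",
)

print(asdict(store_zip))
```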
In data warehousing, abstraction is applied to the data sources, extraction and
transformation rules and programs, data structure, and contents of the data warehouse
itself. Since the data warehouse is a repository of data, the results of such an abstraction,
the metadata, can be described as “data about data”.
Metadata Example for Warehouse Dimension Table
• Name. The physical table name to be used in the database.
• Caption. The business or logical name of the dimension; used
by warehouse users to refer to the dimension.
• Description. The standard definition of the dimension.
• Usage Type. Indicates if a warehouse object is used as a
fact or as a dimension.
• Key Option. Indicates how keys are to be managed for this
dimension. Valid values are Overwrite, Generate New Delhi, and
Create Version.
• Source. Indicates the primary data source for this dimension.
This field is provided for documentation purposes only.
• Online? Indicates whether the physical table is actually
populated correctly and is available for use by the users
of the data warehouse.
Figure 13.2. Metadata Example for Warehouse Dimensions
13.3 IMPORTANCE OF METADATA
Metadata are important to a data warehouse for several reasons. To explain why, we
examine the different uses of metadata.
Metadata Establish the Context of the Warehouse Data
Metadata help warehouse administrators and users locate and understand data items,
both in the source systems and in the warehouse data structures. For example, the data
value 02/05/1998 may mean different dates depending on the date convention used. The
same set of numbers can be interpreted as February 5, 1998 or as May 2, 1998. If metadata
describing the format of this data field were available, the definite and unambiguous meaning
of the data item could be easily determined.
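The date ambiguity can be made concrete with a short Python sketch. The mapping `DATE_FORMATS` plays the role of format metadata; the source names "us_source" and "eu_source" and the function name `interpret` are hypothetical.

```python
from datetime import date, datetime

# Format metadata: which date convention each (hypothetical) source uses.
DATE_FORMATS = {
    "us_source": "%m/%d/%Y",  # month/day/year
    "eu_source": "%d/%m/%Y",  # day/month/year
}

def interpret(raw_value, source):
    """Parse a raw date string using the format recorded for its source."""
    return datetime.strptime(raw_value, DATE_FORMATS[source]).date()

as_us = interpret("02/05/1998", "us_source")
as_eu = interpret("02/05/1998", "eu_source")

print(as_us)  # 1998-02-05
print(as_eu)  # 1998-05-02
```

Without the format metadata, both readings are equally valid; with it, the same raw value resolves to exactly one date.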
In operational systems, software developers and database administrators deal with
metadata every day. All technical documentation of source systems is metadata in one
form or another. Metadata, however, remain for the most part transparent to the end users
of operational systems. They perceive the operational system as a black box and interact
only with the user interface.
This practice is in direct contrast to data warehousing, where the users of decisional
systems actively browse through the contents of the data warehouse and must first understand
the warehouse contents before they can make effective use of the data.
Metadata Facilitate the Analysis Process
Consider the typical process that business analysts follow as part of their work. Enterprise
analysts must go through the process of locating data, retrieving data, interpreting and
analyzing data to yield information, presenting the information, and then recommending
courses of action.
To make the data warehouse useful to enterprise analysts, the metadata must provide
warehouse end users with the information they need to easily perform the analysis steps.
Thus, metadata should allow users to quickly locate data that are in the warehouse. The
metadata should also allow analysts to interpret data correctly by providing information
about data formats (as in the above date example) and data definitions.
As a concrete example, when a data item in the warehouse Fact table is labeled “Profit”,
the user should be able to consult the warehouse metadata to learn how the Profit data item
is computed.
Metadata are a Form of Audit Trail for Data Transformation
Metadata document the transformation of source data into warehouse data. Warehouse
metadata must be able to explain how a particular piece of warehouse data was derived
from the operational systems. All business rules that govern the transformation of data to
new values or new formats are also documented as metadata.
This form of audit trail is required if users are to gain confidence in the veracity and
quality of warehouse data. It is also essential to the user’s understanding of warehouse data
to know where they came from.
In addition, some warehousing products use this type of metadata to generate extraction
and transformation scripts for use on the warehouse back-end.
Metadata Improve or Maintain Data Quality
Metadata can improve or maintain warehouse data quality through the definition of
valid values for individual warehouse data items. Prior to actual loading into the warehouse,
the warehouse load images can be reviewed by a data quality tool to check for compliance
with valid values for key data items. Data errors are quickly highlighted for correction.
Metadata can even be used as the basis for any error-correction processing that should
be done if a data error is found. Error-correction rules are documented in the metadata
repository and executed by program code on an as-needed basis.
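A valid-values check driven by metadata might look like the sketch below. It is a minimal illustration in plain Python; the field names, the allowed values, and the function name `find_invalid` are assumptions, not features of any particular data quality tool.

```python
# Valid-values metadata for key data items (hypothetical fields and values).
VALID_VALUES = {
    "gender": {"M", "F"},
    "currency": {"USD", "PHP"},
}

def find_invalid(load_image):
    """Return (row number, field, value) for every value not allowed by the metadata."""
    errors = []
    for row_number, row in enumerate(load_image):
        for field, allowed in VALID_VALUES.items():
            if row.get(field) not in allowed:
                errors.append((row_number, field, row.get(field)))
    return errors

# A load image reviewed before it is loaded into the warehouse.
load_image = [
    {"gender": "M", "currency": "USD"},
    {"gender": "X", "currency": "PHP"},
    {"gender": "F", "currency": "usd"},
]

errors = find_invalid(load_image)
print(errors)
```

Because the rules live in the metadata rather than in the checking code, adding a new valid value or a new checked field requires no change to the program.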
13.4 TYPES OF METADATA
Although there are still ongoing discussions and debates regarding standards for
metadata repositories, it is generally agreed that a metadata repository must consider the
metadata types described in the next subsections.
Administrative Metadata
Administrative metadata contain descriptions of the source databases and their contents,
the data warehouse objects, and the business rules used to transform data from the sources
into the data warehouse.
• Data sources. These are descriptions of all data sources used by the warehouse,
including information about the data ownership. Each record and each data item
is defined to ensure a uniform understanding by all warehousing team members
and warehouse users. Any relationships between different data sources (e.g., one
provides data to another) are also documented.
• Source-to-target field mapping. The mapping of source fields (in operational
systems) to target fields (in the data warehouse) explains what fields are used to
populate the data warehouse. It also documents the transformations and formatting
changes that were applied to the original, raw data to derive the warehouse data.
• Warehouse schema design. This model of the data warehouse describes the
warehouse’s servers, databases, database tables, fields, and any hierarchies that
may exist in the data. All referential tables, system codes, etc., are also documented.
• Warehouse back-end data structure. This is a model of the back-end of the
warehouse, including staging tables, load image tables, and any other temporary
data structures that are used during the data transformation process.
• Warehouse back-end tools or programs. A definition of each extraction,
transformation, and quality assurance program or tool that is used to build or
refresh the data warehouse. This definition includes how often the programs are
run, in what sequence, what parameters are expected, and the actual source code
of the programs (if applicable). If these programs are generated, the name of the
tool and the date and time when the programs were generated should also be
included.
• Warehouse architecture. If the warehouse architecture is one where an enterprise
warehouse feeds many departmental or vertical data marts, the warehouse
architecture should be documented as well. If the data mart contains a logical
subset of the warehouse contents, this subset should also be defined.
• Business rules and policies. All applicable business rules and policies are
documented. Examples include business formulas for computing costs or profits.
• Access and security rules. Rules governing the security and access rights of
users should likewise be defined.
• Units of measure. All units of measurement and conversion rates used between
different units should also be documented, especially if conversion formulas and
rates change over time.
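As an illustration of source-to-target field mapping, the sketch below records, for each target field, which source field feeds it and which transformation is applied. The field names (CUST_NM, SGN_DT, CR_LIM), the date format, and the transform function are assumptions made for this example only.

```python
from datetime import datetime

# Hypothetical mapping metadata: target field -> (source field, transformation).
FIELD_MAPPING = {
    "customer_name": ("CUST_NM", str.strip),
    "signup_date": (
        "SGN_DT",
        lambda s: datetime.strptime(s, "%d/%m/%Y").date().isoformat(),
    ),
    "credit_limit": ("CR_LIM", float),
}


def transform(source_record):
    """Derive one warehouse record from one source record using the mapping."""
    return {
        target: convert(source_record[source])
        for target, (source, convert) in FIELD_MAPPING.items()
    }


row = transform({"CUST_NM": "  Acme Ltd ", "SGN_DT": "05/03/2004", "CR_LIM": "2500"})
print(row)
```

Because the mapping is data rather than code, the same structure can answer the audit question of where a warehouse field came from and how it was reformatted.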
End-user Metadata
End-user metadata help users create their queries and interpret the results. Users may
also need to know the definitions of the warehouse data, their descriptions, and any hierarchies
that may exist within the various dimensions.
• Warehouse contents. Metadata must describe the data structure and contents
of the data warehouse in user-friendly terms. The volume of data in various schemas
should likewise be presented. Any aliases that are used for data items are
documented as well. Rules used to create summaries and other pre-computed totals
are also documented.
• Predefined queries and reports. Queries and reports that have been predefined
and that are readily available to users should be documented to avoid duplication
of effort. If a report server is used, the schedule for generating new reports should
be made known.
• Business rules and policies. All business rules applicable to the warehouse data
should be documented in business terms. Any changes to business rules over time
should also be documented in the same manner.
• Hierarchy definitions. Descriptions of the hierarchies in warehouse dimensions
are also documented in end-user metadata. Hierarchy definitions are particularly
important to support drilling up and down warehouse dimensions.
• Status information. Different rollouts of the data warehouse will be in different
stages of development. Status information is required to inform warehouse users
of the warehouse status at any point in time. Status information may also vary at
the table level. For example, the base-level schemas of the warehouse may already
be available and online to users while the aggregates are being computed.
• Data quality. Any known data quality problems in the warehouse should be clearly
documented for the users. This will prompt users to make careful use of warehouse
data.
• A history of all warehouse loads, including data volume, data errors encountered,
and load time frame. This should be synchronized with the warehouse status
information. The load schedule should also be available, since users need to know when
new data will be available.
• Warehouse purging rules. The rules that determine when data is removed from
the warehouse should also be published for the benefit of warehouse end-users.
Users need this information to understand when data will become unavailable.
Optimization Metadata
Metadata are maintained to aid in the optimization of the data warehouse design and
performance. Examples of such metadata include:
• Aggregate definitions. All warehouse aggregates should also be documented in
the metadata repository. Warehouse front-end tools with aggregate navigation
capabilities rely on this type of metadata to work properly.
• Collection of query statistics. It is helpful to track the types of queries that are
made against the warehouse. This information serves as an input to the warehouse
administrator for database optimization and tuning. It also helps to identify
warehouse data that are largely unused.
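To make the role of aggregate definitions concrete, here is a minimal sketch of aggregate navigation. It assumes that each aggregate is described only by the set of dimensions it retains and that fewer retained dimensions means a smaller table; the table names and the choose_table function are hypothetical, and real front-end tools use richer metadata than this.

```python
# Hypothetical aggregate definitions: table name -> dimensions the table retains.
AGGREGATES = {
    "sales_by_month_region": {"month", "region"},
    "sales_by_month": {"month"},
}
BASE_TABLE = "sales_fact"


def choose_table(query_dimensions):
    """Pick the coarsest aggregate that still covers the dimensions a query needs."""
    needed = set(query_dimensions)
    candidates = [
        (len(dims), name) for name, dims in AGGREGATES.items() if needed <= dims
    ]
    # Fall back to the base fact table when no aggregate can answer the query.
    return min(candidates)[1] if candidates else BASE_TABLE


print(choose_table(["month"]))
print(choose_table(["region"]))
print(choose_table(["product"]))
```

Counting how often each table is chosen would also give a simple query statistic of the kind described above.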
13.5 METADATA MANAGEMENT
A frequently occurring problem in data warehousing is the inability to communicate to
the end user what information resides in the data warehouse and how it can be accessed.
The key to providing users and applications with a roadmap to the information stored in the
warehouse is the metadata. It can define all data elements and their attributes, data sources
and timing, and the rules that govern data use and data transformations. Metadata needs
to be collected as the warehouse is designed and built. Since metadata describes the
information in the warehouse from multiple viewpoints (input, sources, transformation,
access, etc.), it is imperative that the same metadata or its consistent replicas be available
to all tools selected for the warehouse implementation, thus enforcing the integrity and
accuracy of the warehouse information. The metadata also has to be available to all warehouse
users in order to guide them as they use the warehouse. Even though there are a number
of tools available to help users understand and use the warehouse, these tools need to be
carefully evaluated before any purchasing decision is made. In other words, a
well-thought-through strategy for collecting, maintaining, and distributing metadata is
needed for a successful data warehouse implementation.
13.6 METADATA AS THE BASIS FOR AUTOMATING WAREHOUSING TASKS
Although metadata have traditionally been used as a form of after-the-fact
documentation, there is a clear trend in data warehousing toward metadata taking on a
more active role. Almost all the major data warehouse products or tools allow their users
to record and maintain metadata about the warehouse, and make use of the metadata as
a basis for automating one or more aspects of the back-end warehouse process.
For example:
• Extraction and transformation. Users of extraction and transformation tools
can specify source-to-target field mappings and enter all business rules that govern
the transformation of data from the source to the target. The mapping (which is a
form of metadata) serves as the basis for generating scripts that automate the
extraction and transformation process.
• Data quality. Users of data quality tools can specify valid values for different data
items in either the source system, load image, or the warehouse itself. These data
quality tools use such metadata as the basis for identifying and correcting data
errors.
• Schema generation. Similarly, users of Warehouse Designer use the tool to record
metadata relating to the data structure of a dimensional data warehouse or data
mart. Warehouse Designer then uses the metadata as the basis for
generating the SQL Data Definition Language (DDL) statements that create data
warehouse tables, fields, indexes, aggregates, etc.
• Front-end tools. Front-end tools also make use of metadata to gain access to the
warehouse database. R/OLAPXL (the ROLAP front-end tool) makes use of metadata
to display warehouse tables and fields and to redirect queries to summary tables
(i.e., aggregate navigation).
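The idea of generating DDL from recorded metadata can be sketched as follows. This is not how Warehouse Designer or any other product actually works; the schema dictionary, the table and column names, and the generate_ddl function are assumptions chosen to keep the example small.

```python
# Hypothetical schema metadata: table name -> list of (column name, SQL type).
SCHEMA = {
    "dim_product": [
        ("product_key", "INTEGER PRIMARY KEY"),
        ("product_name", "VARCHAR(100)"),
    ],
    "fact_sales": [
        ("product_key", "INTEGER"),
        ("sale_date", "DATE"),
        ("amount", "DECIMAL(12,2)"),
    ],
}


def generate_ddl(schema):
    """Turn the schema metadata into one CREATE TABLE statement per table."""
    statements = []
    for table, columns in schema.items():
        cols = ",\n".join(f"    {name} {sqltype}" for name, sqltype in columns)
        statements.append(f"CREATE TABLE {table} (\n{cols}\n);")
    return statements


ddl = generate_ddl(SCHEMA)
print("\n\n".join(ddl))
```

A change to the warehouse design then becomes a change to the metadata, after which the statements are regenerated rather than edited by hand.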
13.7 METADATA TRENDS
One of the clearly observable trends in the data warehouse arena is the increase in
requirements to incorporate external data within the data warehouse. This is necessary in
order to reduce costs and to increase competitiveness and business ability. However, the
process of integrating external and internal data into the warehouse faces a number of
challenges:
• Inconsistent data formats
• Missing or invalid data
• Different levels of aggregation
• Semantic inconsistency (e.g., different codes may mean different things from different
suppliers of data)
• Unknown or questionable data quality and timeliness
All these issues put an additional burden on the collection and management of the
common metadata definitions. Some of this burden is being addressed by standards initiatives
like the Metadata Coalition’s Metadata Interchange Specification, described in Sec. 11.2. But
even with the standards in place, the metadata repository implementations will have to be
sufficiently robust and flexible to adapt rapidly to handling a new data source, and to be able
to overcome the semantic differences as well as potential differences in low-level data formats,
media types, and communication protocols, to name just a few.
Moreover, as data warehouses are beginning to integrate various data types in addition
to traditional alphanumeric data types, the metadata and its repository should be able to
handle the new enriched data content as easily as the simple data types before. For example,
including text, voice, image, full-motion video, and even Web pages in HTML format in the
data warehouse may require a new way of presenting and managing the information about
these new data types. But the rich data types mentioned above are not limited to just these
new media types. Many organizations are beginning to seriously consider storing, navigating,
and otherwise managing data describing the organizational structures. This is especially
true when we look at data warehouses dealing with human resources on a large scale (e.g.,
the entire organization, or an entity like a state, or a whole country). And of course, we can
always further complicate the issue by adding time and space dimensions to the data
warehouse. Storing and managing temporal and spatial data, and data about them, is a new
challenge for metadata tool vendors and standards bodies alike.
In Summary
Although quite a lot has been written or said about the importance of metadata, there
is yet to be a consistent and reliable implementation of warehouse metadata and metadata
repositories on an industry-wide scale.
To address this industry-wide issue, an organization called the Meta Data Coalition
was formed to define and support the ongoing evolution of a metadata interchange format.
The coalition has released a metadata interchange specification that aims to be the standard
for sharing metadata among different types of products. At least 30 warehousing vendors
are currently members of this organization.
Until a clear metadata standard is established, enterprises have no choice but to identify
the type of metadata required by their respective warehouse initiatives, then acquire the
necessary tools to support their metadata requirements.
CHAPTER 14: WAREHOUSING APPLICATIONS
The successful implementation of data warehousing technologies creates new possibilities
for enterprises. Applications that previously were not feasible due to the lack of integrated
data are now possible. In this chapter, we take a quick look at the different types of
enterprises that implement data warehouses and the types of applications that they have
deployed.
14.1 THE EARLY ADOPTERS
Among the early adopters of warehousing technologies were the telecommunications,
banking, and retail sectors.
Thus, most early warehousing applications can be found in these industries. For example:
• Telecommunication: Companies were interested in analyzing (among other things)
network utilization, the calling patterns of their clients, and the profitability of
their product offerings. Such information was and still is required for formulating,
modifying, and offering different subscription packages with special rates and
incentives to different customers.
• Banks: Banks were and still are interested in effectively managing their asset and
liability portfolios, analyzing product and customer profitability, and profiling
customers and households as a means of identifying target marketing and
cross-selling opportunities.
• Retail: The retail sector was interested in sales trends, particularly buying patterns
that are influenced by changing seasons, sales promotions, holidays, and competitor
activities. With the introduction of customer discount cards, the retail sector was
able to attribute previously anonymous purchases to individual customers. Individual
buying habits and likes are now used as inputs to formulating sales promotions and
guiding direct marketing activities.
14.2 TYPES OF WAREHOUSING APPLICATIONS
Although warehousing found its early use in different industries with different information
requirements, it is still possible to categorize the different warehousing applications into the
following types and tasks:
Sales and Marketing
• Performance trend analysis. Since a data warehouse is designed to store historical
data, it is an ideal technology for analyzing performance trends within an
organization. Warehouse users can produce reports that compare current
performance to historical figures. Their analysis may highlight trends that reveal
a major opportunity or confirm a suspected problem. Such performance trend analysis
capabilities are crucial to the success of planning activities (e.g., sales forecasting).
• Cross-selling. A data warehouse provides an integrated view of the enterprise’s
many relationships with its customers.
By obtaining a clearer picture of customers and the services that they avail
themselves of, the enterprise can identify opportunities for cross-selling additional
products and services to existing customers.
• Customer profiling and target marketing. Internal enterprise data can be
integrated with census and demographic data to analyze and derive customer profiles.
These profiles consider factors such as age, gender, marital status, income brackets,
purchasing history, and number of dependents. Through these profiles, the enterprise
can, with some accuracy, estimate how appealing customers will find a particular
product or product mix. By modeling customers in this manner, the enterprise has
better inputs to target marketing efforts.
• Promotions and product bundling. The data warehouse allows enterprises to analyze
their customers’ purchasing histories as an input to promotions and product bundling.
This is particularly helpful in the retail sector, where related products from different
vendors can be bundled together and offered at a more attractive price. The success
of different promotions can be evaluated through the warehouse data as well.
• Sales tracking and reporting. Although enterprises have long been able to track
and report on their sales performance, the ready availability of data in the warehouse
dramatically simplifies this task.
14.3 FINANCIAL ANALYSIS AND MANAGEMENT
• Risk analysis and management. Integrated warehouse data allow enterprises to
analyze their risk exposure. For example, banks want to effectively manage their
mix of assets and liabilities. Loan departments want to manage their risk exposure
to sectors or industries that are not performing well. Insurance companies want to
identify customer profiles and individual customers who have consistently proven
to be unprofitable and to adjust their pricing and product offerings accordingly.
• Profitability analysis. If operating costs and revenues are tracked or allocated at
a sufficiently detailed level in operational systems, a data warehouse can be used
for profitability analysis. Users can slice and dice through warehouse data to produce
reports that analyze the enterprise’s profitability by customer, agent or salesman,
product, time period, geography, organizational unit, and any other business
dimension that the user requires.
General Reporting
• Exception reporting. Through the use of exception reporting or alert systems,
enterprise managers are made aware of important or significant events (e.g., more
than x% drop in sales for the current month of the current year vs. the same month
last year). Managers can define the exceptions that are of interest to them. Through
exceptions or alerts, enterprise managers learn about business situations before
they escalate into major problems. Similarly, managers learn about situations that
can be exploited while the window of opportunity is still open.
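The sales-drop exception mentioned above can be expressed as a small rule. The sketch below assumes that monthly sales are already available per product, keyed by (year, month); the data layout, the sample figures, and the function name sales_drop_alerts are illustrative assumptions rather than features of any alert system.

```python
def sales_drop_alerts(monthly_sales, year, month, threshold_pct):
    """Flag products whose sales fell by more than threshold_pct
    compared with the same month of the previous year."""
    alerts = []
    for product, sales in monthly_sales.items():
        current = sales.get((year, month))
        previous = sales.get((year - 1, month))
        if current is None or not previous:
            continue  # no basis for a year-on-year comparison
        change_pct = (current - previous) / previous * 100
        if change_pct < -threshold_pct:
            alerts.append((product, round(change_pct, 1)))
    return alerts


# Hypothetical figures: product -> {(year, month): sales amount}.
monthly_sales = {
    "widgets": {(2005, 6): 1000.0, (2006, 6): 700.0},
    "gadgets": {(2005, 6): 500.0, (2006, 6): 480.0},
}
alerts = sales_drop_alerts(monthly_sales, 2006, 6, threshold_pct=10)
print(alerts)
```

The threshold is a parameter so that each manager can define the exception that matters to them, as the text suggests.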
Customer Care and Service
• Customer relationship management. Warehouse data can also be used as the
basis for managing the enterprise’s relationship with its many customers. Customers
will be far from pleased if different groups in the same enterprise ask them for the
same information more than once. Customers appreciate enterprises that never
forget special instructions, preferences, or requests. Integrated customer data can
serve as the basis for improving and growing the enterprise’s relationships with
each of its customers and are therefore critical to effective customer relationship
management.
14.4 SPECIALIZED APPLICATIONS OF WAREHOUSING TECHNOLOGY
Data warehousing technology can be used to develop highly specialized applications, as
discussed below.
Call Center Integration
Many organizations, particularly those in the banking, financial services, and
telecommunications industries, are looking into Call Center applications to improve
their customer relationships. As with any Operational Data Store or data warehouse
implementation, Call Center applications face the daunting task of integrating data from
many disparate sources to form an integrated picture of the customer’s relationship with the
enterprise.
What has not readily been apparent to implementers of call centers is that Operational
Data Store and data warehouse technologies are the appropriate IT architecture components
to support call center applications. Consider Figure 14.1.
[Figure: diagram with the components Call Center Application, CTI, Middleware, Workflow
Technology, Operational Data Store, Data Warehouse, Data Access and Retrieval, and
Operational Systems (System 1, System 2, System 3, ... System N).]
Figure 14.1. Call Center Architecture using Operational Data Store and Data Warehouse Technologies
• Data from multiple sources are integrated into an Operational Data Store to provide
a current, integrated view of the enterprise operations.
• The Call Center application uses the Operational Data Store as its primary source
of customer information. The Call Center also extends the contents of the Operational
Data Store by directly updating the ODS.
• Workflow technology facilities, used in conjunction with the appropriate middleware,
are integrated with both the Operational Data Store and the Call Center applications.
• At regular intervals, the Operational Data Store feeds the enterprise data warehouse.
The data warehouse has its own set of data access and retrieval technologies to
provide decisional information and reports.
Credit Bureau Systems
Credit bureaus for the banking, telecommunications, and utility companies can benefit
from the use of warehousing technologies for integrating negative customer data from many
different enterprises. Data are integrated, then stored in a repository that can be accessed
by all authorized users, either directly or through a network connection.
For this process to work smoothly, the credit bureau must set standard formats and
definitions for all the data items it will receive. Data providers extract data from their
respective operational systems and submit these data, using standard data storage media.
The credit bureau transforms, integrates, de-duplicates, cleans, and loads the data into
a warehouse that is designed specifically to meet the querying requirements of both the
credit bureau and its customers.
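A minimal sketch of the integration and de-duplication step follows. It assumes, purely for illustration, that every provider submits a customer identifier, a name, and an overdue amount, and that records are matched on the identifier alone; the field names and the functions normalize and integrate are hypothetical, and real matching rules are usually more elaborate.

```python
def normalize(record):
    """Bring one provider record into a (hypothetical) standard format."""
    return {
        "national_id": record["national_id"].replace("-", "").strip(),
        "name": " ".join(record["name"].upper().split()),
        "provider": record["provider"],
        "amount_overdue": float(record["amount_overdue"]),
    }


def integrate(submissions):
    """Merge records for the same customer, keeping one entry per provider."""
    customers = {}
    for raw in submissions:
        rec = normalize(raw)
        entry = customers.setdefault(
            rec["national_id"], {"name": rec["name"], "overdue_by_provider": {}}
        )
        # A repeated submission from the same provider overwrites the earlier one.
        entry["overdue_by_provider"][rec["provider"]] = rec["amount_overdue"]
    return customers


submissions = [
    {"national_id": "123-45-678", "name": "juan  dela cruz",
     "provider": "BankA", "amount_overdue": "1500"},
    {"national_id": "12345678", "name": "Juan Dela Cruz",
     "provider": "TelcoB", "amount_overdue": "300.50"},
    {"national_id": "123-45-678", "name": "juan  dela cruz",
     "provider": "BankA", "amount_overdue": "1500"},
]
integrated = integrate(submissions)
print(integrated)
```

Normalizing before matching is what lets the two differently formatted identifiers above resolve to a single customer.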
The credit bureau can also use data warehousing technologies to mine and analyze the
credit data to produce industry-specific and cross-industry reports. Patterns within the
customer database can be identified through statistical analysis (e.g., the typical profile of a
blacklisted customer) and can be made available to credit bureau customers. Warehouse
management and administration modules, such as those that track and analyze queries, can
be used as the basis for billing credit bureau customers.
In Summary
The bottom line of any data warehousing investment rests on its ability to provide
enterprises with genuine business value. Data warehousing technology is merely an enabler;
the true value comes from the improvements that enterprises make to decisional and
operational business processes, improvements that translate to better customer service,
higher-quality products, reduced costs, or faster delivery times.
Data warehousing applications, as described in this chapter, enable enterprises to
capitalize on the availability of clean, integrated data. Warehouse users are able to transform
data into information and to use that information to contribute to the enterprise’s bottom
line.
PART V : MAINTENANCE, EVOLUTION AND TRENDS
After the initial data warehouse project is completed, it may
seem that the bulk of the work is done. In reality, however,
the warehousing team has taken just the first step of a long
journey.
This section of the book explores the next steps by considering
the following:
• Warehouse Maintenance and Evolution. This chapter
presents the major considerations for maintaining and
evolving the warehouse.
• Warehousing Trends. This chapter looks at trends in
data warehousing projects.
CHAPTER 15: WAREHOUSE MAINTENANCE AND EVOLUTION
With the data warehouse in production, the warehousing team will face a new set of
challenges: the maintenance and evolution of the warehouse.
15.1 REGULAR WAREHOUSE LOADS
New or updated data must be loaded regularly from the source systems into the data
warehouse to ensure that the latest data are available to warehouse users. This loading is
typically conducted during the evenings, when the operational systems can be taken offline.
Each step in the back-end process (extract, transform, quality assure, and load) must be
performed for each warehouse load.
New warehouse loads imply the need to calculate and populate aggregate tables with
new records. In cases where the data warehouse feeds one or more data marts, the warehouse
loading is not complete until the data marts are loaded with the latest data.
15.2 WAREHOUSE STATISTICS COLLECTION
Warehouse usage statistics should be collected on a regular basis to monitor the
performance and utilization of the warehouse. The following types of statistics will prove to
be insightful:
• Queries per day. The number of queries the warehouse responds to on any given
day, categorized into levels of complexity whenever possible. Queries against
summary tables also indicate the usefulness of these stored aggregates.
• Query response times. The time a query takes to execute.
• Alerts per day. The number of alerts or exceptions that are triggered
by the warehouse on any given day, if an alert system is in place.
• Valid users. The number of users who have access to the warehouse.
• Users per day. The number of users who actually make use of the warehouse
on any given day. This number can be compared to the number of valid users.
• Frequency of use. The number of times a user actually logs on to the data
warehouse within a given time frame. This statistic indicates how much the
warehouse supports the user’s day-to-day activities.
• Session length. The length of time a user stays online each time he logs on
to the data warehouse.
• Time of day, day of week, day of month. The statistics of the time of day, day
of week, and day of month when each query is executed may highlight periods
where there is constant, heavy usage of warehouse data.
• Subject areas. This identifies which of the subject areas in the warehouse are
more frequently used. This information also serves as a guide for subject areas that
are candidates for removal.
• Warehouse size. The number of records of data for each warehouse table after
each warehouse load. This statistic is a useful indicator of the growth rate of the
warehouse.
• Warehouse contents profile. Statistics about the warehouse contents
(e.g., total number of customers or accounts, number of employees, number of
unique products, etc.). This information provides interesting metrics about the
business growth.
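Several of the statistics listed above can be derived from a simple query log. The sketch below assumes a log of (date, user, response time in seconds) tuples and a known set of valid users; the log layout, the sample entries, and the function name daily_statistics are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical query log: (date, user, response time in seconds).
QUERY_LOG = [
    ("2006-03-01", "ana", 2.0),
    ("2006-03-01", "ben", 8.0),
    ("2006-03-01", "ana", 5.0),
    ("2006-03-02", "ana", 3.0),
]
VALID_USERS = {"ana", "ben", "carl", "dina"}


def daily_statistics(log, valid_users):
    """Compute queries per day, average response time, and users per day."""
    by_day = defaultdict(list)
    for day, user, seconds in log:
        by_day[day].append((user, seconds))
    stats = {}
    for day, entries in by_day.items():
        users = {user for user, _ in entries}
        stats[day] = {
            "queries": len(entries),
            "avg_response_s": sum(s for _, s in entries) / len(entries),
            "users": len(users),
            # Users per day compared with the number of valid users.
            "active_share": len(users) / len(valid_users),
        }
    return stats


stats = daily_statistics(QUERY_LOG, VALID_USERS)
print(stats)
```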
15.3 WAREHOUSE USER PROFILES
As more users access the warehouse, the usability of the data access and retrieval tools
becomes critical. The majority of users will not have the patience to learn a whole new set
of tools and will simply continue the current and convenient practice of submitting requests
to the IT department.
The warehouse team must therefore evaluate the profiles of each of the intended
warehouse users. This user evaluation can also be used as input to tool selection and to
determine the number of licenses required for each data access and retrieval tool.
In general, there are three types of warehouse end users, and their preferred method
for interacting with the data warehouse varies accordingly. These users are:
• Senior and executive management. These end users generally prefer to view
information through predefined reports with built-in hierarchical drilling capabilities.
They prefer reports that use graphical presentation media, such as charts and
models, to quickly convey information.
• Middle management and senior analysts. These individuals prefer to create
their own queries and reports, using the available tools. They create information
in an ad hoc style, based on the information needs of senior and executive
management. However, their interest is often limited to a specific product group,
a specific geographical area, or a specific aspect of the enterprise’s performance.
The preferred interfaces for users of this type are spreadsheets and front-ends that
provide budgeting and forecasting capabilities.
• Business analysts and IT support. These individuals are among the heaviest
users of warehouse data and are the ones who perform actual data collection and
analysis. They create the charts and reports that are required to present their
findings to senior management. They also prefer to work with tools that allow them
to create their own queries and reports.
The above categories describe the typical user profiles. The actual preference of individual
users may vary, depending on individual IT literacy and working style.
15.4 SECURITY AND ACCESS PROFILES
A data warehouse contains critical information in a readily accessible format. It is
therefore important to keep secure not only the warehouse data but also the information
that is distilled from the warehouse.
OLTP approaches to security, such as the restriction of access to critical tables, will not
work with a data warehouse because of the exploratory fashion by which warehouse data
are used. Most analysts will use the warehouse in an ad hoc manner and will not necessarily
know at the outset what subject areas they will be exploring or even what range of queries
they will be creating. By restricting user access to certain tables, the warehouse security
may inadvertently inhibit analysts and other warehouse users from discovering critical and
meaningful information.
Initial warehouse rollouts typically require fairly low security because of the small and
targeted set of users intended for the initial rollouts. There will therefore be a need to revisit
the security and access profiles of users as each rollout is deployed.
When users leave an organization, their corresponding user profiles should be removed
to prevent the unauthorized retrieval and use of warehouse data.
Also, if the warehouse data are made available to users over the public internet
infrastructure, the appropriate security measures should be put in place.
15.5 DATA QUALITY
Data quality (or the lack thereof) will continue to plague warehousing efforts in the
years to come. The enterprise will need to determine how data errors will be handled in the
warehouse. There are two general approaches to data quality problems:
• Only clean data gets in. Only data that are certified 100 percent correct are
loaded into the warehouse. Users are confident that the warehouse contains correct
data and can take decisive action based on the information it provides. Unfortunately,
data errors may take a long time to identify, and even longer to fix, before the
completion of a full warehouse load. Also, a vast majority of queries (e.g., who are
our top-10 customers? how many product combinations are we selling?) will not be
meaningful if a warehouse load is incomplete.
• Clean as we go. All data are loaded into the warehouse, but mechanisms are
defined and implemented to identify and correct data errors. Although such an
approach allows warehouse loads to take place, the quality of the data is suspect
and may result in misleading information and ill-informed decisions. The
questionable data quality may also cause problems with user acceptance: users
will be less inclined to use the warehouse if they do not believe the information it
provides.
It is unrealistic to expect that all data quality errors will be corrected during the course
of one warehouse rollout. However, acceptance of this reality does not mean that data
quality efforts are for naught and can be abandoned.
Whenever possible, correct the data in the source systems so that cleaner data are
provided in the next warehouse load. Provide mechanisms for clearly identifying dirty
warehouse data. If users know which parts of the warehouse are suspect, they will be
able to find value in the data that are correct.
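One way to picture the "clean as we go" approach with clearly identified dirty data is the minimal Python sketch below. It loads every row but attaches a list of detected problems to each one, so that users can tell suspect rows from trusted ones. The row layout and the two validation rules (`customer_id` present, `amount` non-negative) are hypothetical and serve only as an illustration; they are not taken from the text.

```python
from dataclasses import dataclass, field


@dataclass
class LoadedRow:
    """A warehouse row plus the quality problems found while loading it."""
    data: dict
    issues: list = field(default_factory=list)

    @property
    def is_suspect(self) -> bool:
        return bool(self.issues)


def check_row(row: dict) -> list:
    """Return descriptions of the problems found in one row.

    Both rules are hypothetical examples of data integrity checks.
    """
    issues = []
    if not row.get("customer_id"):
        issues.append("missing customer_id")
    amount = row.get("amount")
    if amount is None or amount < 0:
        issues.append("amount missing or negative")
    return issues


def load_clean_as_we_go(rows: list) -> list:
    """Load every row, tagging suspect rows instead of rejecting them."""
    return [LoadedRow(data=row, issues=check_row(row)) for row in rows]


def suspect_share(loaded: list) -> float:
    """Fraction of loaded rows that carry at least one quality issue."""
    if not loaded:
        return 0.0
    return sum(r.is_suspect for r in loaded) / len(loaded)
```

A report built on such rows could filter on `is_suspect`, or publish `suspect_share` per subject area, so that users know which parts of the warehouse to treat with caution.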
It is an unfortunate fact that older enterprises have larger data volumes and,
consequently, a larger volume of data errors.
15.6 DATA GROWTH
Initial warehouse deployments may not face space or capacity problems, but as time
passes and the warehouse size grows with each new data load, the proper management of
data growth becomes increasingly important.
There are several ways to handle data growth, including:
• Use of aggregates. The use of stored aggregates significantly reduces the space
required by the data, especially if the data are required only at a highly summarized
level. The detailed data can be deleted or archived after aggregates have been
created. However, the removal of detailed data implies the loss of the ability to drill
down for more detail. Also, new summaries at other levels may not be derivable
from the current portfolio of aggregate schemas.
• Limiting the time frame. Although users want the warehouse to store as much
data for as long as possible, there may be a need to compromise by limiting the
length of historical data in the warehouse.
• Removing unused data. Using query statistics gathered over time, it is possible
for warehouse administrators to identify rarely used data in the warehouse. These
records are ideal candidates for removal since their storage results in costs with
very little business value.
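Two of the options above can be sketched in a few lines of Python. The first function rolls daily detail rows up to monthly aggregates, which shows both the space saving and why drill-down to the day is lost once the detail is removed. The second uses query counts to list rarely used tables. The row layout, the table names and the `min_queries` threshold are assumptions made for this illustration only.

```python
from collections import defaultdict


def monthly_aggregates(detail_rows):
    """Roll daily sales rows up to (month, product) totals.

    Each detail row is a (date 'YYYY-MM-DD', product, amount) tuple.
    Once only these totals are kept, individual days cannot be recovered.
    """
    totals = defaultdict(float)
    for day, product, amount in detail_rows:
        month = day[:7]  # keep only 'YYYY-MM'
        totals[(month, product)] += amount
    return dict(totals)


def rarely_used(query_counts, min_queries):
    """Return the names of tables queried fewer than min_queries times."""
    return sorted(t for t, n in query_counts.items() if n < min_queries)
```

For example, four daily rows covering two products and two months collapse into three aggregate rows, and a table with zero recorded queries appears in the `rarely_used` list as a candidate for archiving.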
15.7 UPDATES TO WAREHOUSE SUBSYSTEMS
As time passes, a number of conditions will necessitate changes to the data structure
of the warehouse, its staging areas, its back-end subsystems and, consequently, its metadata.
We describe some of these conditions in the following subsections.
Source System Evolution
As the source systems evolve, so by necessity does the data warehouse. It is therefore
critical that any plans to change the scope, functionality, and availability of the source
systems also consider any possible impact on the data warehouse. The CIO is in the best
position to ensure that the project efforts are coordinated across multiple projects.
• Changes in scope. Scope changes in operational systems typically imply one or
more of the following: the availability of new data in an existing system, the removal
of previously available data in an existing system, or the migration of currently
available data to a new or different computing environment. An example of the
latter is the deployment of a new system to replace an existing one.
• Change in functionality. There are times when the data structure already existing
in the operational systems remains the same but the processing logic and business
rules governing the input of future data are changed. Such changes require updates
to data integrity rules and metadata used for quality assurance. All quality assurance
programs should likewise be updated.
• Change in availability. Additional demands on the operational system may affect
the availability of the source system (e.g., smaller batch windows). The batch windows
may affect the schedule of regular warehouse extractions and may place new efficiency
and performance demands on the warehouse extraction and transformation subsystems.
Use of New or Additional External Data
Some data are commercially available for purchase and can be integrated into the data
warehouse as the business needs evolve. The use of external data presents its own set of
difficulties due to the likelihood of incompatible formats or levels of detail.
The use of new or additional external data has the same impact on the warehouse
back-end subsystems as changes to internal data sources do.
15.8 DATABASE OPTIMIZATION AND TUNING
As query statistics are collected and the user base increases, there will be a need to perform
database optimization and tuning tasks to maintain an acceptable level of warehouse performance.
To avoid or control the impact of nasty surprises, users should be informed whenever
changes are made to the production database. Also, any changes to the database should first
be tested in a safe environment.
Databases can be tuned through a number of approaches, including but not limited to
the following:
• Use of parallel query options. Some of the major database management systems
offer options that will split up a large query into several smaller queries that can
be run in parallel. The results of the smaller queries are then combined and presented
to users as a single result set. While such options have costs, their implementation
is transparent to users, who notice only the improvements in response time.
• Indexing strategies. As very large database (VLDB) implementations are becoming
more popular, database vendors are offering indexing options or strategies to improve
the response times to queries against very large tables.
• Dropping of referential integrity checking. While debates still exist as to
whether or not referential integrity checking should be left on during warehouse
loading, it is an undeniable fact that when referential integrity is turned off, the
loading of warehouse data becomes faster. Some parties reason that since data are
checked prior to warehouse loading, there will be no need to enforce referential
integrity constraints.
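The split-and-combine idea behind parallel query options can be illustrated with a small Python sketch. It is not how any particular database implements the feature. It assumes the fact data are already divided into partitions, runs one "smaller query" (a sum) per partition on a thread pool, and combines the partial results into a single answer. The function names and the partition layout are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_total(partition):
    """The 'smaller query': sum the amount column of one partition."""
    return sum(amount for _region, amount in partition)


def parallel_total(partitions, max_workers=4):
    """Run one sub-query per partition in parallel and combine the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(partial_total, partitions))
    # The caller sees only the combined result, not the individual pieces.
    return sum(partials)
```

The caller receives the same total regardless of how many workers are used, which mirrors the point that the splitting is transparent to users. In a real database the partitions would typically sit on separate disks or nodes, and that is where the response-time gain comes from.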
15.9 DATA WAREHOUSE STAFFING
Not all organizations with a data warehouse choose to create a permanent unit to
administer and maintain it. Each organization will have to decide if a permanent unit is
required to maintain the data warehouse.
A permanent unit has the advantage of focusing the warehouse staff formally on the
care and feeding of the data warehouse. A permanent unit also increases the continuity in
staff assignments by decreasing the possibility of losing staff to other IT projects or systems
in the enterprise.
The use of matrix organizations in place of permanent units has also proven to be
effective, provided that roles and responsibilities are clearly defined and that the IT division
is not undermanned.
If the warehouse development was partially or completely outsourced to third parties
because of a shortage of internal IT resources, the enterprise may find it necessary to staff
up at the end of the warehouse rollout. As the project draws to a close, the consultants or
contractors will be turning over the day-to-day operations of the warehouse to internal IT
staff. The lack of internal IT resources may result in haphazard turnovers. Alternatively,
the enterprise may have to outsource the maintenance of the warehouse.
15.10 WAREHOUSE STAFF AND USER TRAINING
The enterprise may find it helpful to establish a training program for both technology
staff and end users.
User Training
• Warehousing overview. Half-day overviews can be prepared for executive or
senior management to manage expectations.
• User roles. User training should also cover general data warehousing concepts
and explain how users are involved during data warehouse planning, design, and
construction activities.
• Warehouse contents and metadata. Once a data warehouse has been deployed,
the user training should focus strongly on the contents of the warehouse. Users
must understand the data that are now available to them and must also understand
the limitations imposed by the scope of the warehouse. The contents and usage
of business metadata should also be explained.
• Data access and retrieval tools. User training should also focus on the selected
end-user tools. If users find the tools difficult to use, the IT staff will quickly find themselves
saddled with the unwelcome task of creating reports and queries for end users.
Warehouse Staff Training
Warehouse staff require training on a number of topics covering the planning, design,
implementation, management, and maintenance of data warehouses. Depending on their
project roles, the staff will need to specialize or focus on different areas or different aspects
of the warehousing life cycle. For example, the metadata administrator needs specialized
courses on metadata repository management, whereas the warehouse DBA needs dimensional
modeling training.
15.11 SUBSEQUENT WAREHOUSE ROLLOUTS
The data warehouse is extended continuously. Data warehouse design and construction
skills will always be needed as long as end-user requirements and business situations
continue to evolve. Each subsequent rollout is designed to extend the functionality and scope
of the warehouse.
As new user requirements are studied and subsequent warehouse rollouts get underway,
the overall data warehouse architecture is revisited and modified as needed.
Data marts are deployed as needed within the integrating framework of the warehouse.
Multiple, unrelated data marts are to be avoided because these will merely create unnecessary
data management and administration problems.
15.12 CHARGEBACK SCHEMES
It may be necessary at some point for the IT Department to start charging user groups
for warehouse usage, as a way of obtaining continuous funding for the data warehouse
initiative.
Note that chargeback schemes will work only if there are reliable mechanisms to track
and monitor usage of the warehouse per user. They also put the warehouse to the test:
users will have to feel that they are getting their money’s worth each time they use the
warehouse. Warehouse usage will drop if users feel that the warehouse has no value.
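As a rough illustration of why per-user usage tracking is a precondition for chargeback, the Python sketch below splits a warehouse operating cost across user groups in proportion to the query time each group consumed. The text does not prescribe a charging formula. Proportional allocation by query seconds, the log format and the function names are all assumptions chosen for simplicity.

```python
from collections import Counter


def usage_by_group(query_log):
    """Total query seconds per user group.

    query_log is an iterable of (user_group, query_seconds) records.
    """
    seconds = Counter()
    for group, query_seconds in query_log:
        seconds[group] += query_seconds
    return seconds


def allocate_cost(query_log, total_cost):
    """Split total_cost across groups in proportion to query seconds used."""
    seconds = usage_by_group(query_log)
    total_seconds = sum(seconds.values())
    if total_seconds == 0:
        return {}  # nothing was tracked, so nothing can be charged
    return {g: total_cost * s / total_seconds for g, s in seconds.items()}
```

With an empty or unreliable log the function has nothing to allocate, which is the practical meaning of the warning above: without trustworthy usage data there is no defensible basis for a charge.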
15.13 DISASTER RECOVERY
The challenges of deploying new technology may cause warehouse administrators to
place a lower priority on disaster recovery. As time passes and more users come to depend
on the data warehouse, however, the warehouse achieves mission-critical status. The
appropriate disaster recovery procedures are therefore required to safeguard the continuous
availability and reliability of the warehouse.
Ideally, the warehouse team conducts a dry run of the disaster recovery procedure at
least once prior to the deployment of the warehouse. Disaster recovery drills on a regular
basis will also prove helpful.
Some disasters may require the reinstallation of operating systems and database
management systems apart from the reloading of warehouse data and population of aggregate
tables. The recovery procedures should consider this possibility.
A final note: Review the disaster recovery plan at the end of each rollout. The plan may
have to be updated in light of changes to the architecture, scope, and size of the warehouse.
In Summary
A data warehouse initiative does not stop with one successful warehouse deployment.
The warehousing team must sustain the initial momentum by maintaining and evolving the
data warehouse.
Unfortunately, maintenance activities remain very much in the background, often
unseen and unappreciated until something goes wrong. Many people do not realize that
evolving a warehouse can be trickier than the initial deployment. The warehousing team
has to meet a new set of information requirements without compromising the performance
of the initial deployment and without limiting the warehouse’s ability to meet future
requirements.
CHAPTER 16: WAREHOUSING TRENDS
This chapter takes a look at trends in the data warehousing industry and their possible
implications for future warehousing projects.
16.1 CONTINUED GROWTH OF THE DATA WAREHOUSE INDUSTRY
The data warehousing industry continues to grow in terms of spending, product
availability and projects. The number of data warehouse vendors continues to increase, as
does the number of available warehousing products. Such a trend, however, may abate in
the face of market consolidation, which began in the mid-1990s and continues to this day.
Small companies with compatible products can be seen merging (e.g., the November
1997 merger of Apertus Technologies and Carleton Corporation) to create larger, more
competitive warehouse players. Larger, established corporations have set objectives of
becoming end-to-end warehouse solution providers and are acquiring technologies from
niche players to fulfill these goals (e.g., the February 1998 acquisition of Intellidex Systems
by Sybase and the March 1998 acquisition of Logic Works by Platinum Technologies).
Partnerships and alliances between vendors continue to be popular. The increasing
maturity of the warehousing software market is inevitably turning warehouse software into
off-the-shelf packages that can be pieced together. Already companies are positioning groups
of products (their own or a combination of products from multiple vendors) as integrated
warehousing solutions.
16.2 INCREASED ADOPTION OF WAREHOUSING TECHNOLOGY BY MORE
INDUSTRIES
The earliest industries to adopt data warehousing technologies have been the
telecommunications, banking, and retail vertical markets. The impetus for their early adoption
of warehousing technologies has been attributed largely to government deregulation and
increased competition among industry players, conditions that heightened the need for
integrated information.
Over the past few years, however, other industries have begun investing strongly in
data warehousing technologies. These include, but are not limited to, companies in financial
services, healthcare, insurance, manufacturing, petrochemical, pharmaceutical, transportation
and distribution, as well as utilities.
Despite the increasing adoption of warehousing technologies by other industries, however,
our research indicates that the telecommunications and banking industries continue to lead
in warehouse-related spending, with as much as 15 percent of their technology budgets
allocated to warehouse-related purchases and projects.
16.3 INCREASED MATURITY OF DATA MINING TECHNOLOGIES
Data mining tools will continue to mature, and more organizations will adopt this type
of warehousing technology. Lessons learned from data mining applications will become more
widely available in the trade press and other commercial publications, thereby increasing the
chances of data mining success for late adopters.
Data mining initiatives are typically driven by marketing and sales departments and
are understandably more popular in large companies with very large databases. Since these
tools work best with detailed data at the transaction grain, the popularity of data mining
tools will naturally coincide with a boom in very large (terabyte-size) data warehouses. Data
mining projects will also further underscore the importance of data quality in warehouse
implementations.
16.4 EMERGENCE AND USE OF METADATA INTERCHANGE STANDARDS
There is currently no metadata repository that is a clear industry leader for warehouse
implementations. Each product vendor has defined its own set of metadata repository
standards as required by its respective products or product suite.
Efforts have long been underway to define an industry-wide set of metadata interchange
standards, and a Metadata Interchange Specification is available from the Meta Data
Coalition, which has at least 30 vendor companies as members.
16.5 INCREASED AVAILABILITY OF WEB-ENABLED SOLUTIONS
Data warehousing technologies continue to be affected by the increased popularity of
intranets and intranet-based solutions. As a result, more and more data access and retrieval
tools are becoming web enabled, while more organizations are specifying web-enabled features
as a requirement for their data access and retrieval tools.
Some organizations have started using the Internet as a cost-effective mechanism for
providing remote users with access to warehouse data. Understandably, organizations are
concerned about the security requirements of such a setup. The warehouse no doubt contains
the most integrated and cleanest data in the entire enterprise. Such highly critical and
sensitive data may fall into the wrong hands if the appropriate security measures are not
implemented.
16.6 POPULARITY OF WINDOWS NT FOR DATA MART PROJECTS
The Windows NT operating system will continue to gain popularity as a data mart
operating system. The operating system is frequently bundled with hardware features that
are candidates for base-level or low-end warehousing platforms.
16.7 AVAILABILITY OF WAREHOUSING MODULES FOR APPLICATION
PACKAGES
Companies that develop and market major operational application packages will soon
be offering warehousing modules as part of their suite of applications. These application
packages include SAP, Baan, and PeopleSoft. Companies that offer these application packages
are in a position to capitalize on the popularity of data warehousing by creating warehousing
modules that make use of data in their applications.
These companies are familiar with the data structures of their respective applications
and they can therefore offer configurable warehouse back-ends to extract, transform, quality
assure, and load operational data into a separate decisional data structure designed to meet
the basic decisional reporting requirements of their customers.
Understandably, each enterprise requires the ability to customize these basic warehousing
modules to meet its specific requirements; customization is definitely possible with the
right people using the right development tools.
16.8 MORE MERGERS AND ACQUISITIONS AMONG WAREHOUSE PLAYERS
Mergers and acquisitions will continue in the data warehouse market, driven by large
corporations acquiring niche specialties, and small companies merging to create a larger
warehouse player.
Examples include:
• The acquisition of Stanford Technology Group by Informix.
• The acquisition of IRI Software’s OLAP technologies by Oracle Corporation.
• The acquisition of Panorama Software Systems (and their OLAP technology) by
Microsoft Corporation.
• The merger of Carleton Corporation and Apertus Technologies.
• The acquisition of HP Intelligent Warehouse by Platinum Technologies.
• The acquisition of Intellidex Systems by Sybase Inc., for the former’s metadata
repository product.
• The acquisition of Logic Works, Inc., by Platinum Technologies.
In Summary
In all respects, the data warehousing industry shows signs of continued growth at
an impressive rate. Enterprises can expect more mature products in almost all software
segments, especially with the availability of second- or third-generation products.
Improvements in price/performance ratios will continue in the hardware market.
Some vendor consolidation can be expected, although new companies and products will
continue to appear. More partnerships and alliances among different vendors can also be
expected.
PART VI: ON-LINE ANALYTICAL
PROCESSING
The main topics that are covered in this part are:
• Definition. OLAP was originally defined by the late Dr. Codd in
terms of 12 rules, later extended to the less well-known 18
‘features’. All 18 are analyzed, along with the preferred ‘FASMI’
test.
• Origin. Despite the recent hype, OLAP products go back
much further than many people think. This section reflects
on the lessons that can be learned from the 30+ years of
multidimensional analysis.
• Market Analysis. A fast way to segment products based on
their architectures, plus the OLAP architectural square, used
to ensure shortlists are rational.
• Architecture. Confusion abounds in discussions about OLAP
architectures, with terms like ROLAP, MOLAP, HOLAP and
even DOLAP proliferating. This section explains the
differences.
• Multi Dimensional Data Structures. There is a lot of
confusing terminology used to describe multidimensional
structures. This section clarifies our use of terms like
hypercubes and multicubes.
• Applications. Describes the main OLAP applications, and
some of the issues that arise with them.
CHAPTER 17: INTRODUCTION
17.1 WHAT IS OLAP?
The term, of course, stands for ‘On-Line Analytical Processing’. But that is not a definition; it’s not
even a clear description of what OLAP means. It certainly gives no indication of why you use an OLAP
tool, or even what an OLAP tool actually does. And it gives you no help in deciding if a product is an
OLAP tool or not.
This problem arose as soon as research on OLAP started in late 1994, as we needed
to decide which products fell into the category. Deciding what is an OLAP product has not
become easier since then, as more and more vendors claim to have ‘OLAP compliant’ products,
whatever that may mean (often they don’t even know). It is not possible to rely on the
vendors’ own descriptions, and membership of the long-defunct OLAP Council was not a
reliable indicator of whether or not a company produces OLAP products. For example,
several significant OLAP vendors were never members or resigned, and several members
were not OLAP vendors. Membership of the instantly moribund replacement, the Analytical
Solutions Forum, was even less of a guide, as it was intended to include non-OLAP vendors.
The Codd rules also turned out to be an unsuitable way of detecting ‘OLAP compliance’,
so researchers were forced to create their own definition. It had to be simple, memorable
and product-independent, and the resulting definition is the ‘FASMI’ test. The key thing
that all OLAP products have in common is multidimensionality, but that is not the only
requirement for an OLAP product.
Here, we wanted to define the characteristics of an OLAP application in a specific way,
without dictating how it should be implemented. As research has shown, there are many
ways of implementing OLAP compliant applications, and no single piece of technology should
be officially required, or even recommended. Of course, we have studied the technologies
used in commercial OLAP products and this report provides many such details. We have
suggested in which circumstances one approach or another might be preferred, and have
also identified areas where we feel that all the products currently fall short of what we
regard as a technology ideal.
Here the definition is designed to be short and easy to remember – 12 rules or 18
features are far too many for most people to carry in their heads. By contrast, it is easy
to summarize the OLAP definition in just five key words: Fast Analysis of Shared
Multidimensional Information – or, FASMI for short.
Fast Analysis of Shared Multidimensional Information
FAST means that the system is targeted to deliver most responses to users within
about five seconds, with the simplest analyses taking no more than one second and very few
taking more than 20 seconds. Independent research in The Netherlands has shown that
end-users assume that a process has failed if results are not received within 30 seconds, and
they are apt to hit ‘Alt+Ctrl+Delete’ unless the system warns them that the report will take
longer. Even if they have been warned that it will take significantly longer, users
are likely to get distracted and lose their chain of thought, so the quality of analysis suffers.
This speed is not easy to achieve with large amounts of data, particularly if on-the-fly and
ad hoc calculations are required. Vendors resort to a wide variety of techniques to achieve
this goal, including specialized forms of data storage, extensive pre-calculations and specific
hardware requirements, but we do not think any products are yet fully optimized, so we
expect this to be an area of developing technology. In particular, the full pre-calculation
approach fails with very large, sparse applications as the databases simply get too large,
whereas doing everything on-the-fly is much too slow with large databases, even if exotic
hardware is used. Even though it may seem miraculous at first if reports that previously
took days now take only minutes, users soon get bored of waiting, and the project will be
much less successful than if it had delivered a near instantaneous response, even at the cost
of less detailed analysis. The OLAP Survey has found that slow query response is
consistently the most often-cited technical problem with OLAP products, so too many
deployments are clearly still failing to pass this test.
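The response-time figures quoted above (1, 5, 20 and 30 seconds) can be turned into a simple check against measured query times. The Python sketch below is only an illustration: the thresholds come from the paragraph, while the category labels, the function names and the choice of summary statistics are assumptions.

```python
def classify_response(seconds):
    """Label one response time using the thresholds quoted in the text."""
    if seconds <= 1:
        return "simple-analysis target"
    if seconds <= 5:
        return "typical target"
    if seconds <= 20:
        return "tolerable"
    if seconds <= 30:
        return "rare exception"
    return "assumed failed"  # users tend to assume failure after 30 seconds


def fast_summary(response_times):
    """Summarize how a set of measured response times fares against FAST."""
    if not response_times:
        raise ValueError("need at least one measurement")
    n = len(response_times)
    return {
        "share_within_5s": sum(t <= 5 for t in response_times) / n,
        "share_over_20s": sum(t > 20 for t in response_times) / n,
        "worst": classify_response(max(response_times)),
    }
```

A deployment would pass in the spirit of the test if `share_within_5s` is high and `share_over_20s` is close to zero. The text gives "most" and "very few" rather than exact percentages, so any numeric cut-off is a local decision.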
ANALYSIS means that the system can cope with any business logic and statistical
analysis that is relevant for the application and the user, and keep it easy enough for the
target user. Although some pre-programming may be needed, we do not think it acceptable
if all application definitions have to be done using a professional 4GL. It is certainly necessary
to allow the user to define new ad hoc calculations as part of the analysis and to report on
the data in any desired way, without having to program, so we exclude products (like Oracle
Discoverer) that do not allow adequate end-user oriented calculation flexibility. We do not
mind whether this analysis is done in the vendor’s own tools or in a linked external product
such as a spreadsheet, simply that all the required analysis functionality be provided in an
intuitive manner for the target users. This could include specific features like time series
analysis, cost allocations, currency translation, goal seeking, ad hoc multidimensional
structural changes, non-procedural modeling, exception alerting, data mining and other
application dependent features. These capabilities differ widely between products, depending
on their target markets.
SHARED means that the system implements all the security requirements for
confidentiality (possibly down to cell level) and, if multiple write access is needed, concurrent
update locking at an appropriate level. Not all applications need users to write data back,
but for the growing numbers that do, the system should be able to handle multiple updates
in a timely, secure manner. This is a major area of weakness in many OLAP products, which
tend to assume that all OLAP applications will be read-only, with simplistic security controls.
Even products with multi-user read-write often have crude security models; an example is
Microsoft OLAP Services.
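The two SHARED requirements, cell-level confidentiality and locking for concurrent write-back, can be illustrated with a toy Python class. It is a deliberately simplified sketch and not a description of any product: the class name, the per-user set of readable cells and the single coarse lock are all assumptions, and write permissions are omitted for brevity.

```python
import threading


class SharedCube:
    """Toy cube with cell-level read permissions and locked write-back."""

    def __init__(self, cells, readable_by):
        self._cells = dict(cells)
        # readable_by maps a user name to the cell keys that user may read.
        self._readable_by = {u: set(keys) for u, keys in readable_by.items()}
        # One coarse lock guards all updates; a real system would lock
        # at a finer, application-appropriate level.
        self._lock = threading.Lock()

    def read(self, user, key):
        """Return a cell value, or raise if the user may not see that cell."""
        if key not in self._readable_by.get(user, set()):
            raise PermissionError(f"{user} may not read {key}")
        with self._lock:
            return self._cells[key]

    def add(self, key, delta):
        """Write back an adjustment; the lock keeps read-modify-write atomic."""
        with self._lock:
            self._cells[key] = self._cells[key] + delta
```

If several threads call `add` on the same cell, the lock ensures that no update is lost, while `read` refuses users who have not been granted the cell in question.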
MULTIDIMENSIONAL is our key requirement. If we had to pick a one-word definition
of OLAP, this is it. The system must provide a multidimensional conceptual view of the
data, including full support for hierarchies and multiple hierarchies, as this is certainly the
most logical way to analyze businesses and organizations. We are not setting up a specific
minimum number of dimensions that must be handled as it is too application dependent and
most products seem to have enough for their target markets. Again, we do not specify what
underlying database technology should be used, provided that the user gets a truly
multidimensional conceptual view.
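A multidimensional conceptual view with a hierarchy can be pictured with a very small Python sketch. The cube below has three dimensions (city, product, month) and one hierarchy on geography (city to country). The data, the member names and the two helper functions are hypothetical, and a plain dictionary stands in for whatever storage technology a real product would use.

```python
from collections import defaultdict

# Hypothetical hierarchy for the geography dimension: city -> country.
CITY_TO_COUNTRY = {"Delhi": "India", "Mumbai": "India", "Paris": "France"}

# Cells of a tiny cube, keyed by (city, product, month).
CELLS = {
    ("Delhi", "tea", "Jan"): 100,
    ("Delhi", "coffee", "Jan"): 40,
    ("Mumbai", "tea", "Jan"): 60,
    ("Paris", "tea", "Feb"): 30,
}


def slice_cube(cells, city=None, product=None, month=None):
    """Fix zero or more dimension members and return the matching cells."""
    wanted = (city, product, month)
    return {
        key: value
        for key, value in cells.items()
        if all(w is None or w == k for w, k in zip(wanted, key))
    }


def roll_up_to_country(cells, city_to_country):
    """Aggregate the city level of the geography hierarchy up to country."""
    totals = defaultdict(int)
    for (city, product, month), value in cells.items():
        totals[(city_to_country[city], product, month)] += value
    return dict(totals)
```

Slicing on `product="tea"` and `month="Jan"` returns the Delhi and Mumbai cells, and rolling up to country combines them into a single India cell. The user works with dimensions and levels throughout, independently of how the cells are stored.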
INFORMATION is all of the data and derived information needed, wherever it is and
however much is relevant for the application. We are measuring the capacity of various
products in terms of how much input data they can handle, not how many gigabytes they
take to store it. The capacities of the products differ greatly – the largest OLAP products
can hold at least a thousand times as much data as the smallest. There are many
considerations here, including data duplication, RAM required, disk space utilization,
performance, integration with data warehouses and the like.
We think that the FASMI test is a reasonable and understandable definition of the
goals OLAP is meant to achieve. Researchers encourage users and vendors to adopt this
definition, which we hope will avoid the controversies of previous attempts.
The techniques used to achieve it include many flavors of client/server architecture,
time series analysis, object-orientation, optimized proprietary data storage, multithreading
and various patented ideas that vendors are so proud of. We have views on these as well,
but we would not want any such technologies to become part of the definition of OLAP.
Vendors who are covered in this report had every chance to tell us about their technologies,
but it is their ability to achieve OLAP goals for their chosen application areas that impressed
us most.
17.2 THE CODD RULES AND FEATURES
In 1993, E.F. Codd & Associates published a white paper, commissioned by Arbor
Software (now Hyperion Solutions), entitled ‘Providing OLAP (On-Line Analytical Processing)
to User-Analysts: An IT Mandate’. The late Dr. Codd was very well known as a respected
database researcher from the 1960s till the late 1980s and is credited with being the
inventor of the relational database model in 1969. Unfortunately, his OLAP rules proved to
be controversial due to being vendor-sponsored, rather than mathematically based.
The OLAP white paper included 12 rules, which are now well known. They were followed
by another six (much less well known) rules in 1995 and Dr. Codd also restructured the
rules into four groups, calling them ‘features’. The features are briefly described and evaluated
here, but they are now rarely quoted and little used.
Basic Features
1. Multidimensional Conceptual View (Original Rule 1). Few would argue with
this feature; like Dr. Codd, we believe this to be the central core of OLAP. Dr. Codd
included ‘slice and dice’ as part of this requirement.
2. Intuitive Data Manipulation (Original Rule 10). Dr. Codd preferred data
manipulation to be done through direct actions on cells in the view, without recourse
to menus or multiple actions. One assumes that this is by using a mouse (or
equivalent), but Dr. Codd did not actually say so. Many products fail on this,
because they do not necessarily support double clicking or drag and drop. The
vendors, of course, all claim otherwise. In our view, this feature adds little value
to the evaluation process. We think that products should offer a choice of modes (at
all times), because not all users like the same approach.
3. Accessibility: OLAP as a Mediator (Original Rule 3). In this rule, Dr. Codd
essentially described OLAP engines as middleware, sitting between heterogeneous
data sources and an OLAP front-end. Most products can achieve this, but often
with more data staging and batching than vendors like to admit.
4. Batch Extraction vs Interpretive (New). This rule effectively required that
products offer both their own staging database for OLAP data and live access
to external data. We agree with Dr. Codd on this feature and are
disappointed that only a minority of OLAP products properly comply with it, and
even those products do not often make it easy or automatic. In effect, Dr. Codd was
endorsing multidimensional data staging plus partial pre-calculation of large
multidimensional databases, with transparent reach-through to underlying detail.
Today, this would be regarded as the definition of a hybrid OLAP, which is indeed
becoming a popular architecture, so Dr. Codd has proved to be very perceptive in
this area.
5. OLAP Analysis Models (New). Dr. Codd required that OLAP products should
support all four analysis models that he described in his white paper (Categorical,
Exegetical, Contemplative and Formulaic). We hesitate to simplify Dr. Codd’s erudite
phraseology, but we would describe these as parameterized static reporting, slicing
and dicing with drill down, ‘what if?’ analysis and goal seeking models, respectively.
All OLAP tools in this Report support the first two (but some other claimants do
not fully support the second), most support the third to some degree (but probably
less than Dr. Codd would have liked) and a few support the fourth to any usable
extent. Perhaps Dr. Codd was anticipating data mining in this rule?
6. Client/Server Architecture (Original Rule 5). Dr. Codd required not only that
the product should be client/server but also that the server component of an OLAP
product should be sufficiently intelligent that various clients could be attached with
minimum effort and programming for integration. This is a much tougher test than
simple client/server, and relatively few products qualify. We would argue that this
test is probably tougher than it needs to be, and we prefer not to dictate architectures.
However, if you do agree with the feature, then you should be aware that most
vendors who claim compliance do so wrongly. In effect, this is also an indirect
requirement for openness on the desktop. Perhaps Dr. Codd, without ever using the
term, was thinking of what the Web would one day deliver? Or perhaps he was
anticipating a widely accepted API standard, which still does not really exist.
Perhaps, one day, XML for Analysis will fill this gap.
7. Transparency (Original Rule 2). This test was also a tough but valid one. Full
compliance means that a user of, say, a spreadsheet should be able to get full value
from an OLAP engine and not even be aware of where the data ultimately comes
from. To do this, products must allow live access to heterogeneous data sources
from a full function spreadsheet add-in, with the OLAP server engine in between.
Although all vendors claimed compliance, many did so by outrageously rewriting
Dr. Codd’s words. Even Dr. Codd’s own vendor-sponsored analyses of Essbase and
(then) TM/1 ignore part of the test. In fact, there are a few products that do fully
comply with the test, including Analysis Services, Express, and Holos, but neither
Essbase nor TM1 (because they do not support live, transparent access to external
data), in spite of Dr. Codd’s apparent endorsement. Most products fail to give either
full spreadsheet access or live access to heterogeneous data sources. Like the previous
feature, this is a tough test for openness.
8. Multi-User Support (Original Rule 8). Dr. Codd recognized that OLAP applications
were not all read-only and said that, to be regarded as strategic, OLAP tools must
provide concurrent access (retrieval and update), integrity and security. We agree
with Dr. Codd, but also note that many OLAP applications are still read-only.
Again, all the vendors claim compliance but, on a strict interpretation of Dr. Codd’s
words, few are justified in so doing.
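The hybrid approach described under Feature 4 (partial pre-calculation with reach-through to underlying detail) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the `HybridCube` class, its method names and the sample rows are all hypothetical, and no real product is implemented this way.

```python
# Hypothetical detail rows as they might sit in a source system:
# (region, product, amount). All values are invented.
DETAIL_ROWS = [
    ("north", "cola", 10), ("north", "nuts", 5),
    ("south", "cola", 7), ("south", "nuts", 3),
]


class HybridCube:
    """Serve pre-calculated totals when present, else reach through to detail."""

    def __init__(self, detail_rows):
        self.detail_rows = detail_rows
        # Partial pre-calculation: only per-region totals are stored up front.
        self.precalculated = {}
        for region, _product, amount in detail_rows:
            key = (region, None)
            self.precalculated[key] = self.precalculated.get(key, 0) + amount
        self.reach_through_count = 0

    def total(self, region=None, product=None):
        key = (region, product)
        if key in self.precalculated:
            return self.precalculated[key]
        # Reach-through: scan the underlying detail for this combination.
        self.reach_through_count += 1
        return sum(amount for r, p, amount in self.detail_rows
                   if region in (None, r) and product in (None, p))


cube = HybridCube(DETAIL_ROWS)
north_total = cube.total(region="north")   # answered from the pre-calculated store
cola_total = cube.total(product="cola")    # answered by reach-through
print(north_total, cola_total, cube.reach_through_count)
```

The caller uses one `total` method in both cases and cannot tell which path answered the query, which is the "transparent" part of the requirement.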
Special Features
9. Treatment of Non-Normalized Data (New). This refers to the integration between
an OLAP engine and denormalized source data. Dr. Codd pointed out that any data
updates performed in the OLAP environment should not be allowed to alter stored
denormalized data in feeder systems. He could also be interpreted as saying that
data changes should not be allowed in what are normally regarded as calculated
cells within the OLAP database. For example, if Essbase had allowed this, Dr. Codd
would perhaps have disapproved.
10. Storing OLAP Results: Keeping them Separate from Source Data (New).
This is really an implementation rather than a product issue, but few would disagree
with it. In effect, Dr. Codd was endorsing the widely-held view that read-write
OLAP applications should not be implemented directly on live transaction data,
and OLAP data changes should be kept distinct from transaction data. The method
of data write-back used in Microsoft Analysis Services is the best implementation
of this, as it allows the effects of data changes even within the OLAP environment
to be kept segregated from the base data.
11. Extraction of Missing Values (New). All missing values are cast in the uniform
representation defined by the Relational Model Version 2. We interpret this to
mean that missing values are to be distinguished from zero values. In fact, in the
interests of storing sparse data more compactly, a few OLAP tools such as TM1 do
break this rule, without great loss of function.
12. Treatment of Missing Values (New). All missing values are to be ignored by the
OLAP analyzer regardless of their source. This relates to Feature 11, and is probably
an almost inevitable consequence of how multidimensional engines treat all data.
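The difference between a missing value and a zero value (Features 11 and 12) matters most when averaging. The short Python sketch below assumes a hypothetical sparse store in which only populated cells are kept; the cell keys and figures are invented, and the functions are illustrative rather than any product's API.

```python
# Hypothetical sparse cube: only populated cells are stored.
# Keys are (product, month); all figures are invented.
cells = {
    ("cola", "jan"): 100.0,
    ("cola", "feb"): 0.0,   # a real zero: on sale, but nothing sold
    # ("cola", "mar") is absent: the value is missing, not zero
}


def get(product, month):
    """Return the stored value, or None when the cell is missing."""
    return cells.get((product, month))


def average(product, months):
    """Average over populated cells only, ignoring missing values."""
    present = [cells[(product, m)] for m in months if (product, m) in cells]
    return sum(present) / len(present) if present else None


months = ["jan", "feb", "mar"]
avg_ignoring_missing = average("cola", months)
# For contrast: what happens if missing cells are silently treated as zero.
avg_missing_as_zero = sum(get("cola", m) or 0.0 for m in months) / len(months)
print(avg_ignoring_missing, avg_missing_as_zero)
```

A tool that stores missing cells as zeros to save space would report the second, lower average, which is the loss of function that breaking Feature 11 implies.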
Reporting Features
13. Flexible Reporting (Original Rule 11). Dr. Codd required that the dimensions
can be laid out in any way that the user requires in reports. We would agree that
most products are capable of this in their formal report writers. Dr. Codd did not
explicitly state whether he expected the same flexibility in the interactive viewers,
perhaps because he was not aware of the distinction between the two. We prefer
that it is available, but note that relatively few viewers are capable of it. This is
one of the reasons that we prefer that analysis and reporting facilities be combined
in one module.
14. Uniform Reporting Performance (Original Rule 4). Dr. Codd required that
reporting performance not be significantly degraded by increasing the number of
dimensions or database size. Curiously, nowhere did he mention that the performance
must be fast, merely that it be consistent. In fact, our experience suggests that
merely increasing the number of dimensions or database size does not affect
performance significantly in fully pre-calculated databases, so Dr. Codd could be
interpreted as endorsing this approach, which may not be a surprise given that
Arbor Software sponsored the paper. However, reports with more content or more
on-the-fly calculations usually take longer (in the good products, performance is
almost linearly dependent on the number of cells used to produce the report, which
may be more than appear in the finished report) and some dimensional layouts will
be slower than others, because more disk blocks will have to be read. There are
differences between products, but the principal factor that affects performance is
the degree to which the calculations are performed in advance and where live
calculations are done (client, multidimensional server engine or RDBMS). This is
far more important than database size, number of dimensions or report complexity.
15. Automatic Adjustment of Physical Level (Supersedes Original Rule 7). Dr.
Codd wanted the OLAP system to adjust its physical schema automatically to adapt
to the type of model, data volumes and sparsity. We agree with him, but are
disappointed that most vendors fall far short of this noble ideal. We would like to
see more progress in this area and also in the related area of determining the
degree to which models should be pre-calculated (a major issue that Dr. Codd
ignores). The Panorama technology, acquired by Microsoft in October 1996, broke
new ground here, and users can now benefit from it in Microsoft Analysis Services.
Dimension Control
16. Generic Dimensionality (Original Rule 6). Dr. Codd took the purist view that
each dimension must be equivalent in both its structure and operational capabilities.
This may not be unconnected with the fact that this is an Essbase characteristic.
However, he did allow additional operational capabilities to be granted to selected
dimensions (presumably including time), but he insisted that such additional
functions should be granted to any dimension. He did not want the basic data
structures, formulae or reporting formats to be biased towards any one dimension.
This has proven to be one of the most controversial of all the original 12 rules.
Technology focused products tend to largely comply with it, so the vendors of such
products support it. Application focused products usually make no effort to comply,
and their vendors bitterly attack the rule. With a strictly purist interpretation, few
products fully comply. We would suggest that if you are purchasing a tool for
general purpose, multiple application use, then you want to consider this rule, but
even then with a lower priority. If you are buying a product for a specific application,
you may safely ignore the rule.
17. Unlimited Dimensions & Aggregation Levels (Original Rule 12). Technically,
no product can possibly comply with this feature, because there is no such thing as
an unlimited entity on a limited computer. In any case, few applications need more
than about eight or ten dimensions, and few hierarchies have more than about six
consolidation levels. Dr. Codd suggested that if a maximum must be accepted, it
should be at least 15 and preferably 20; we believe that this is too arbitrary and
takes no account of usage. You should ensure that any product you buy has limits
that are greater than you need, but there are many other limiting factors in OLAP
products that are liable to trouble you more than this one. In practice, therefore,
you can probably ignore this requirement.
18. Unrestricted Cross-dimensional Operations (Original Rule 9). Dr. Codd
asserted, and we agree, that all forms of calculation must be allowed across all
dimensions, not just the ‘measures’ dimension. In fact, many products which use
only relational storage are weak in this area. Most products, such as Essbase, with
a multidimensional database are strong. These types of calculations are important
if you are doing complex calculations, not just cross tabulations, and are particularly
relevant in applications that analyze profitability.
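As an illustration of Feature 18, the following Python sketch applies calculations along three different dimensions of a small cube: the measures dimension (profit), a scenario dimension (variance against budget) and the product dimension (share of total). The cube layout, member names and figures are assumptions made for this example only.

```python
# Hypothetical cube keyed by (scenario, product, measure); invented figures.
cube = {
    ("actual", "cola", "revenue"): 120.0, ("actual", "cola", "cost"): 70.0,
    ("budget", "cola", "revenue"): 100.0, ("budget", "cola", "cost"): 65.0,
    ("actual", "nuts", "revenue"): 40.0, ("actual", "nuts", "cost"): 30.0,
    ("budget", "nuts", "revenue"): 50.0, ("budget", "nuts", "cost"): 35.0,
}
products = ["cola", "nuts"]


def profit(scenario, product):
    """Calculation along the measures dimension: revenue minus cost."""
    return cube[(scenario, product, "revenue")] - cube[(scenario, product, "cost")]


def variance(product):
    """Calculation along the scenario dimension: actual minus budget profit."""
    return profit("actual", product) - profit("budget", product)


def profit_share(product):
    """Calculation along the product dimension: share of total actual profit."""
    total = sum(profit("actual", p) for p in products)
    return profit("actual", product) / total


cola_variance = variance("cola")
cola_share = profit_share("cola")
print(cola_variance, round(cola_share, 3))
```

Profitability analysis typically chains such calculations, which is why weakness in any one dimension limits the whole model.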
17.3 THE ORIGINS OF TODAY’S OLAP PRODUCTS
The OLAP term dates back to 1993, but the ideas, technology and even some of the
products have origins long before then.
APL
Multidimensional analysis, the basis for OLAP, is not new. In fact, it goes back to 1962,
with the publication of Ken Iverson’s book, A Programming Language. The first computer
implementation of the APL language was in the late 1960s, by IBM. APL is a mathematically
defined language with multidimensional variables and elegant, if rather abstract, processing
operators. It was originally intended more as a way of defining multidimensional
transformations than as a practical programming language, so it did not pay attention to
mundane concepts like files and printers. In the interests of a succinct notation, the operators
were Greek symbols. In fact, the resulting programs were so succinct that few could predict
what an APL program would do. It became known as a ‘Write Only Language’ (WOL),
because it was easier to rewrite a program that needed maintenance than to fix it.
Unfortunately, this was all long before the days of high-resolution GUI screens and
laser printers, so APL’s Greek symbols needed special screens, keyboards and printers.
Later, English words were sometimes used as substitutes for the Greek operators, but
purists took a dim view of this attempted popularization of their elite language. APL also
devoured machine resources, partly because early APL systems were interpreted rather
than being compiled. This was in the days of very costly, under-powered mainframes, so
applications that did APL justice were slow to process and very expensive to run. APL also
had a, perhaps undeserved, reputation for being particularly hungry for memory, as arrays
were processed in RAM.
However, in spite of these inauspicious beginnings, APL did not go away. It was used in
many 1970s and 1980s business applications that had similar functions to today’s OLAP
systems. Indeed, IBM developed an entire mainframe operating system for APL, called VSPC,
and some people regarded it as the personal productivity environment of choice long before
the spreadsheet made an appearance. One of these APL-based mainframe products was
originally called Frango, and later Fregi. It was developed by IBM in the UK, and was used
for interactive, top-down planning. A PC-based descendant of Frango surfaced in the early
1990s as KPS, and the product remains on sale today as the Analyst module in Cognos Planning.
This is one of several APL-based products that Cognos has built or acquired since 1999.
However, APL was simply too elitist to catch on with a larger audience, even if the
hardware problems were eventually to be solved or become irrelevant. It did make an
appearance on PCs in the 1980s (and is still used, sometimes in a revamped form called “J”)
but it ceased to have any market significance after about 1980. Although it was possible to
program multidimensional applications using arrays in other languages, it was too hard for
any but professional programmers to do so, and even technical end-users had to wait for a
new generation of multidimensional products.
Express
By 1970, a more application-oriented multidimensional product, with academic origins,
had made its first appearance: Express. This, in a completely rewritten form and with a
modern code-base, became a widely used contemporary OLAP offering, but the original
1970s concepts still lie just below the surface. Even after 30 years, Express remains one of
the major OLAP technologies, although Oracle struggled and failed to keep it up-to-date
with the many newer client/server products. Oracle announced in late 2000 that it would
build OLAP server capabilities into Oracle9i starting in mid 2001. The second release of the
Oracle9i OLAP Option included both a version of the Express engine, called the Analytic
Workspace, and a new ROLAP engine. In fact, delays with the new OLAP technology and
applications mean that Oracle is still selling Express and applications based on it in 2005,
some 35 years after Express was first released. This means that the Express engine has
lived on well into its fourth decade, and if the OLAP Option eventually becomes successful,
the renamed Express engine could be on sale into its fifth decade, making it possibly one
of the longest-lived software products ever.
More multidimensional products appeared in the 1980s. Early in the decade, Stratagem
appeared, and in its eventual guise of Acumate (now owned by Lucent), this too was still
marketed to a limited extent till the mid 1990s. However, although it was a distant cousin
of Express, it never had Express’ market share, and has been discontinued. Along the way,
Stratagem was owned by CA, which was later to acquire two ROLAPs, the former Prodea
Beacon and the former Information Advantage Decision Suite, both of which soon died.
System W
Comshare’s System W was a different style of multidimensional product. Introduced in
1981, it was the first to have a hypercube approach and was much more oriented to
end-user development of financial applications. It brought in many concepts that are still
not widely adopted, like full non-procedural rules, full screen multidimensional viewing and
data editing, automatic recalculation and (batch) integration with relational data. However,
it too was heavy on hardware and was less programmable than the other products of its day,
and so was less popular with IT professionals. It is also still used, but is no longer sold and
no enhancements are likely. Although it was released on Unix, it was not a true client/
server product and was never promoted by the vendor as an OLAP offering.
In the late 1980s, Comshare’s DOS One-Up and later, Windows-based Commander
Prism (later called Comshare Planning) products used similar concepts to the host-based
System W. Hyperion Solutions’ Essbase product, though not a direct descendant of System
W, was also clearly influenced by its financially oriented, fully pre-calculated hypercube
approach, which causes database explosion (so a design decision made by Comshare in 1980
lingered on in Essbase until 2004). Ironically, Comshare subsequently licensed Essbase
(rather than using any of its own engines) for the engine in some of its modern OLAP
products, though this relationship was not to last. Comshare (now Geac) later switched to
the Microsoft Analysis Services OLAP engine instead.
Metaphor
Another creative product of the early 1980s was Metaphor. This was aimed at marketing
professionals in consumer goods companies. This too introduced many new concepts that
became popular only in the 1990s, like client/server computing, multidimensional processing
on relational data, workgroup processing and object-oriented development. Unfortunately,
the standard PC hardware of the day was not capable of delivering the response and human
factors that Metaphor required, so the vendor was forced to create totally proprietary PCs
and network technology. Subsequently, Metaphor struggled to get the product to work
successfully on non-proprietary hardware and right to the end it never used a standard
GUI.
Eventually, Metaphor formed a marketing alliance with IBM, which went on to acquire
the company. By mid 1994, IBM had decided to integrate Metaphor’s unique technology
(renamed DIS) with future IBM technology and to disband the subsidiary, although customer
protests led to the continuing support for the product. The product continues to be supported
for its remaining loyal customers, and IBM relaunched it under the IDS name but hardly
promoted it. However, Metaphor’s creative concepts have not gone away, and the former Information
Advantage, Brio, Sagent, MicroStrategy and Gentia are examples of vendors covered in The
OLAP Report that have obviously been influenced by it.
Another surviving Metaphor tradition is the unprofitability of independent ROLAP
vendors: no ROLAP vendor has ever made a cumulative profit, as demonstrated by Metaphor,
MicroStrategy, MineShare, WhiteLight, STG, IA and Prodea. The natural market for ROLAPs
seems to be just too small, and the deployments too labor intensive, for there to be a
sustainable business model for more than one or two ROLAP vendors. MicroStrategy may
be the only sizeable long-term ROLAP survivor.
EIS
By the mid 1980s, the term EIS (Executive Information System) had been born. The
idea was to provide relevant and timely management information using a new, much simpler
user interface than had previously been available. This used what was then a revolutionary
concept, a graphical user interface running on DOS PCs, and using touch screens or a mouse.
For executive use, the PCs were often hidden away in custom cabinets and desks, as few
senior executives of the day wanted to be seen as nerdy PC users.
The first explicit EIS product was Pilot’s Command Center, though there had been EIS
applications implemented by IRI and Comshare earlier in the decade. This was a cooperative
processing product, an architecture that would now be called client/server. Because of the
limited power of mid 1980s PCs, it was very server-centric, but that approach came back
into fashion again with products like EUREKA: Strategy and Holos and the Web. Command
Center is no longer on sale, but it introduced many concepts that are recognizable in today’s
OLAP products, including automatic time series handling, multidimensional client/server
processing and simplified human factors (suitable for touch screen or mouse use). Some of
these concepts were re-implemented in Pilot’s Analysis Server product, which is now also in
the autumn of its life, as is Pilot, which changed hands again in August 2000, when it was
bought by Accrue Software. However, rather unexpectedly, it reappeared as an independent
company in mid 2002, though it has kept a low profile since.
Spreadsheets and OLAP
By the late 1980s, the spreadsheet was already becoming dominant in end-user analysis,
so the first multidimensional spreadsheet appeared in the form of Compete. This was originally
marketed as a very expensive specialist tool, but the vendor could not generate the volumes
to stay in business, and Computer Associates acquired it, along with a number of other
spreadsheet products including SuperCalc and 20/20. The main effect of CA’s acquisition of
Compete was that the price was slashed, the copy protection removed and the product was
heavily promoted. However, it was still not a success, a trend that was to be repeated with
CA’s other OLAP acquisitions. For a few years, the old Compete was still occasionally found,
bundled into a heavily discounted bargain pack. Later, Compete formed the basis for CA’s
version 5 of SuperCalc, but the multidimensionality aspect of it was not promoted.
Lotus was the next to attempt to enter the multidimensional spreadsheet market with
Improv. Bravely, this was launched on the NeXT machine. This at least guaranteed that it
could not take sales away from 1-2-3, but when it was eventually ported to Windows, Excel
was already too big a threat to 1-2-3 for Improv’s sales to make any difference. Lotus, like
CA with Compete, moved Improv down market, but this was still not enough for market
success, and new development was soon discontinued. It seems that personal computer
users liked their spreadsheets to be supersets of the original 1-2-3, and were not interested
in new multidimensional replacements if these were not also fully compatible with their old,
macro driven worksheets. Also, the concept of a small multidimensional spreadsheet, sold
as a personal productivity application, clearly does not fit in with the real business world.
Microsoft went this way, by adding PivotTables to Excel. Although only a small minority of
Excel users take advantage of the feature, this is probably the single most widely used
multidimensional analysis capability in the world, simply because there are so many users
of Excel. Excel 2000 included a more sophisticated version of PivotTables, capable of acting
as both a desktop OLAP, and as a client to Microsoft Analysis Services. However, the OLAP
features even in Excel 2003 are inferior to those in OLAP add-ins, so there is still a good
opportunity for third-party options as well.
By the late 1980s, Sinper had entered the multidimensional spreadsheet world, originally
with a proprietary DOS spreadsheet, and then by linking to DOS 1-2-3. It entered the
Windows era by turning its (then named) TM/1 product into a multidimensional back-end
server for standard Excel and 1-2-3. Slightly later, Arbor did the same thing, although its
new Essbase product could then only work in client/server mode, whereas Sinper’s could
also work on a stand-alone PC. This approach to bringing multidimensionality to spreadsheet
users has been far more popular with users. So much so, in fact, that traditional vendors
of proprietary front-ends have been forced to follow suit, and products like Express, Holos,
Gentia, MineShare, PowerPlay, MetaCube and WhiteLight all proudly offered highly
integrated spreadsheet access to their OLAP servers. Ironically, for its first six months,
Microsoft OLAP Services was one of the few OLAP servers not to have a vendor-developed
spreadsheet client, as Microsoft’s (very basic) offering only appeared in June 1999 in Excel
2000. However, the (then) OLAP@Work Excel add-in filled the gap, and still (under its new
snappy name, BusinessQuery MD for Excel) provides much better exploitation of the server
than does Microsoft’s own Excel interface. Since then there have been at least ten other
third party Excel add-ins developed for Microsoft Analysis Services, all offering capabilities
not available even in Excel 2003. However, Business Objects’ acquisition of Crystal Decisions
has led to the phasing out of BusinessQuery MD for Excel, to be replaced by technology from
Crystal.
There was a rush of new OLAP Excel add-ins in 2004 from Business Objects, Cognos,
Microsoft, MicroStrategy and Oracle. Perhaps with users disillusioned by disappointing Web
capabilities, the vendors rediscovered that many numerate users would rather have their BI
data displayed via a flexible Excel-based interface rather than in a dumb Web page or PDF.
ROLAP and DOLAP
A few users demanded multidimensional applications that were much too large to be
handled in multidimensional databases, and the relational OLAP tools evolved to meet this
need. These presented the usual multidimensional view to users, sometimes even including
a spreadsheet front-end, even though all the data was stored in an RDBMS. These have a
much higher cost per user, and lower performance than specialized multidimensional tools,
but they are a way of providing this popular form of analysis even to data not stored in a
multidimensional structure. Most have not survived.
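The relational OLAP idea described above, a multidimensional view presented over data held in an RDBMS, can be sketched with Python's standard sqlite3 module. The table name, columns and figures are invented for the example, and real ROLAP tools generate far more elaborate SQL than this.

```python
import sqlite3

# In-memory relational store standing in for the RDBMS (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("north", "cola", 10), ("north", "nuts", 5),
     ("south", "cola", 7), ("south", "nuts", 3)],
)

# The relational engine does the aggregation...
rows = conn.execute(
    "SELECT region, product, SUM(amount) FROM sales "
    "GROUP BY region, product ORDER BY region, product"
).fetchall()

# ...and the tool pivots the flat result into a region-by-product grid.
grid = {}
for region, product, total in rows:
    grid.setdefault(region, {})[product] = total

for region, by_product in grid.items():
    print(region, by_product)
conn.close()
```

Every new view of the data means another round trip of generated SQL, which goes some way to explaining the lower performance noted above.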
Other vendors expanded into what is sometimes called desktop OLAP: small cubes,
generated from large databases, but downloaded to PCs for processing (even though, in Web
implementations, the cubes usually reside on the server). These have proved very successful
indeed, and the one vendor that sells both a relational query tool and a multidimensional
analysis tool (Cognos, with Impromptu and PowerPlay) reports that the latter is much more
popular with end-users than is the former.
Now, even the relational database vendors have embraced multidimensional analysis,
with Oracle, IBM, Microsoft, the former Informix, CA and Sybase all developing or marketing
products in this area. Ironically, having largely ignored multidimensionality for so many
years, it seemed for a while that Oracle, Microsoft and IBM might be the new ‘OLAP triad’,
with large OLAP market shares, based on selling multidimensional products they did not
invent. In the event, Oracle and IBM failed to achieve this status, but Microsoft is now the
largest OLAP vendor.
Lessons
So, what lessons can we draw from these 35 years of history?
• Multidimensionality is here to stay. Even hard to use, expensive, slow and elitist
multidimensional products survive in limited niches; when these restrictions are
removed, it booms. We are about to see the biggest-ever growth of multidimensional
applications.
• End-users will not give up their general-purpose spreadsheets. Even when accessing
multidimensional databases, spreadsheets are the most popular client platform,
and there are numerous third-party Excel add-ins for Microsoft Analysis Services
to fill the gaps in Microsoft’s own offering. Most other BI vendors now also offer
Excel add-ins as alternative front-ends. Stand-alone multidimensional spreadsheets
are not successful unless they can provide full upwards compatibility with traditional
spreadsheets, something that Improv and Compete failed to do.
• Most peopl e fi nd i t easy to use mul ti di mensi onal appl i cati ons, but bui l di ng and
mai ntai ni ng them takes a parti cul ar apti tude – whi ch has stopped them from
becomi ng mass-market products. But, usi ng a combi nati on of si mpl i ci ty, pri ci ng
and bundl i ng, Mi crosoft now seems determi ned to prove that i t can make OLAP
servers al most as wi del y used as rel ati onal databases.
• Mul ti di mensi onal appl i cati ons are often qui te l arge and are usual l y sui tabl e for
workgroups, rather than i ndi vi dual s. Al though there i s a rol e for pure si ngl e-user
mul ti di mensi onal products, the most successful i nstal l ati ons are mul ti -user, cl i ent/
server appl i cati ons, wi th the bul k of the data downl oaded from feeder systems once
rather than many ti mes. There usual l y needs to be some I T support for thi s, even
i f the appl i cati on i s dri ven by end-users.
• Si mpl e, cheap OLAP products are much more successful than powerful , compl ex,
expensi ve products. Buyers general l y opt for the l owest cost, si mpl est product that
wi l l meet most of thei r needs; i f necessary, they often compromi se thei r requi rements.
Projects usi ng compl ex products al so have a hi gher fai l ure rate, probabl y because
there i s more opportuni ty for thi ngs to go wrong.
OLAP Milestones
Year | Event | Comment
1962 | Publication of A Programming Language by Ken Iverson | First multidimensional language; used Greek symbols for operators. Became available on IBM mainframes in the late 1960s and still used to a limited extent today. APL would not count as a modern OLAP tool, but many of its ideas live on in today’s altogether less elitist products, and some applications (e.g. Cognos Planning Analyst and Cognos Consolidation, the former Lex 2000) still use APL internally.
1970 | Express available on timesharing (followed by in-house versions later in the 1970s) | First multidimensional tool aimed at marketing applications; now owned by Oracle, and still one of the market leaders (after several rewrites and two changes of ownership). Although the code is much changed, the concepts and the data model are not. The modern version of this engine is now shipping as the MOLAP engine in Oracle9i Release 2 OLAP Option.
1982 | Comshare System W available on timesharing (and in-house mainframes the following year) | First OLAP tool aimed at financial applications. No longer marketed, but still in limited use on IBM mainframes; its Windows descendant is marketed as the planning component of Comshare MPC. The later Essbase product used many of the same concepts, and like System W, suffers from database explosion.
1984 | Metaphor launched | First ROLAP. Sales of this Mac cousin were disappointing, partly because of proprietary hardware and high prices (the start-up cost for an eight-workstation system, including 72Mb file server, database server and software was $64,000). But, like Mac users, Metaphor users remained fiercely loyal.
1985 | Pilot Command Center launched | First client/server EIS style OLAP; used a time-series approach running on VAX servers and standard PCs as clients.
1990 | Cognos PowerPlay launched | This became both the first desktop and the first Windows OLAP and now leads the “desktop” sector. Though we still class this as a desktop OLAP on functional grounds, most customers now implement the much more scalable client/server and Web versions.
1991 | IBM acquired Metaphor | The first of many OLAP products to change hands. Metaphor became part of the doomed Apple-IBM Taligent joint venture and was renamed IDS, but there are unlikely to be any remaining sites.
1992 | Essbase launched | First well-marketed OLAP product, which went on to become the market leading OLAP server by 1997.
1993 | Codd white paper coined the OLAP term | This white paper, commissioned by Arbor Software, brought multidimensional analysis to the attention of many more people than ever before. However, the Codd OLAP rules were soon forgotten (unlike his influential and respected relational rules).
1994 | MicroStrategy DSS Agent launched | First ROLAP to do without a multidimensional engine, with almost all processing being performed by multi-pass SQL – an appropriate approach for very large databases, or those with very large dimensions, but one that suffers from a severe performance penalty. The modern MicroStrategy 7i has a more conventional three-tier hybrid OLAP architecture.
1995 | Holos 4.0 released | First hybrid OLAP, allowing a single application to access both relational and multidimensional databases simultaneously. Many other OLAP tools are now using this approach. Holos was acquired by Crystal Decisions in 1996, but has now been discontinued.
1995 | Oracle acquired Express | First important OLAP takeover. Arguably, it was this event that put OLAP on the map, and it almost certainly triggered the entry of the other database vendors. Express has now become a hybrid OLAP and competes with both multidimensional and relational OLAP tools. Oracle soon promised that Express would be fully integrated into the rest of its product line but, almost ten years later, has still failed to deliver on this promise.
1996 | Business Objects 4.0 launched | First tool to provide seamless multidimensional and relational reporting from desktop cubes dynamically built from relational data. Early releases had problems, now largely resolved, but Business Objects has always struggled to deliver a true Web version of this desktop OLAP architecture. It is expected finally to achieve this by using the former Crystal Enterprise as the base.
1997 | Microsoft announced OLE DB for OLAP | This project was code-named Tensor, and became the ‘industry standard’ OLAP API before even a single product supporting it shipped. Many third-party products now support this API, which is evolving into the more modern XML for Analysis.
1998 | IBM DB2 OLAP Server released | This version of Essbase stored all data in a form of relational star schema, in DB2 or other relational databases, but it was more like a slow MOLAP than a scalable ROLAP. IBM later abandoned its “enhancements”, and now ships the standard version of Essbase as DB2 OLAP Server. Despite the name, it remains non-integrated with DB2.
1998 | Hyperion Solutions formed | Arbor and Hyperion Software ‘merged’ in the first large consolidation in the OLAP market. Despite the name, this was more of a takeover of Hyperion by Arbor than a merger, and was probably initiated by fears of Microsoft’s entry to the OLAP market. Like most other OLAP acquisitions, this went badly. Not until 2002 did the merged company begin to perform competitively.
1999 | Microsoft OLAP Services shipped | This project was code-named Plato and then named Decision Support Services in early pre-release versions, before being renamed OLAP Services on release. It used technology acquired from Panorama Software Systems in 1996. This soon became the OLAP server volume market leader through ease of deployment, sophisticated storage architecture (ROLAP/MOLAP/Hybrid), huge third-party support, low prices and the Microsoft marketing machine.
1999 | CA starts mopping up failed OLAP servers | CA acquired the former Prodea Beacon, via Platinum, in 1999 and renamed it DecisionBase. In 2000 it also acquired the former IA Eureka, via Sterling. This collection of failed OLAPs seems destined to grow, though the products are soon snuffed out under CA’s hard-nosed ownership, and as The OLAP Survey 2 found, the remaining CA Cleverpath OLAP customers were very unhappy by 2002. By The OLAP Survey 4 in 2004, there seemed to be no remaining users of the product.
2000 | Microsoft renames OLAP Services to Analysis Services | Microsoft renamed the second release of its OLAP server for no good reason, thus confusing much of the market. Of course, many references to the two previous names remain within the product.
2000 | XML for Analysis announced | This initiative for a multi-vendor, cross-platform, XML-based OLAP API is led by Microsoft (later joined by Hyperion and then SAS Institute). It is, in effect, an XML implementation of OLE DB for OLAP. XML for Analysis usage is unlikely to take off until 2006.
2001 | Oracle begins shipping the successor to Express | Six years after acquiring Express, which has been in use since 1970, Oracle began shipping Oracle9i OLAP, expected eventually to succeed Express. However, the first release of the new generation Oracle OLAP was incomplete and unusable. The full replacement of the technology and applications is not expected until well into 2003, some eight years after Oracle acquired Express.
2001 | MicroStrategy abandons Strategy.com | Strategy.com was part of MicroStrategy’s grand strategy to become the next Microsoft. Instead, it very nearly bankrupted the company, which finally shut the subsidiary down in late 2001.
2002 | Oracle ships integrated OLAP server | Oracle9i Release 2 OLAP Option shipped in mid 2002, with a MOLAP server (a modernized Express), called the Analytical Workspace, integrated within the database. This was the closest integration yet between a MOLAP server and an RDBMS. But it is still not a complete solution, lacking competitive front-end tools and applications.
2003 | The year of consolidation | Business Objects purchases Crystal Decisions, Hyperion Solutions Brio Software, Cognos Adaytum, and Geac Comshare.
2004 | Excel add-ins go mainstream | Business Objects, Cognos, Microsoft, MicroStrategy and Oracle all release new Excel add-ins for accessing OLAP data, while Sage buys one of the leading Analysis Services Excel add-in vendors, IntelligentApps.
2004 | Essbase database explosion curbed | Hyperion releases Essbase 7X which included the results of Project Ukraine: the Aggregate Storage Option. This finally cured Essbase’s notorious database explosion syndrome, making the product suitable for marketing, as well as financial, applications.
2004 | Cognos buys its second Frango | Cognos buys Frango, the Swedish consolidation system. Less well known is the fact that Adaytum, which Cognos bought in the previous year, had its origins in IBM’s Frango project from the early 1980s.
2005 | Microsoft to ship the much-delayed SQL Server 2005? | Originally planned for release in 2003, Microsoft may just manage to ship the major ‘Yukon’ version before the end of 2005.
17.4 WHAT’S IN A NAME?
Everyone wants to be (or at least look like) a winner, and this is at least as true of the
dozens of OLAP vendors, as it would be for any other competitive group – so they all try
to be perceived as leaders or pioneers of their chosen niches. They work hard to come up
with distinct positioning statements that can make them appear to dominate significant
portions of the market. This requires them to define ingeniously subtle variations on common
themes, because there aren’t enough suitable descriptive words to go round. This self-
generated classification system makes the vendors’ marketing people feel good, but the
results can be very confusing for buyers.
The phrases business intelligence and decision support crop up most often, but
performance management seems to have replaced the short-lived EBI or e-BI fashion. OLAP
is used much less often than might be expected, and the old EIS term appears almost to
have disappeared. The most popular self-descriptions are pioneer and world/global leading
(or leader). Some even claim to be the “only” providers of their type of offering. Ironically,
however, few of the largest vendors use grandiose words like these to describe themselves.
However, one change since researchers began monitoring slogans is that vendors have
become more cautious in calling themselves “world market leaders”: fewer of the obviously
small companies now make this outlandish claim (in fact, few of these vendors even make
it into our market shares table). Presumably lawyers now check more press releases,
particularly in the public companies?
Someone who attempted to classify the products into competing subgroups would have
real trouble doing so if their main source of information was vendors’ press releases. For
example, Brio, Business Objects and Cognos are all direct competitors (and cannot all be the
leader of the same segment), but this is hardly apparent from their favored descriptions of
themselves.
Another problem is that industry watchers all have their own contrived categories and
each refuses to use classifications coined by others. The smaller vendors often try to curry
favor with whichever industry analyst they currently regard as most influential and therefore
adopt essentially meaningless descriptions like “enterprise business intelligence”. Several
quite different products that are in no way competitive with each other can therefore
apparently fall into the same category, whereas direct competitors might opt for quite
different analyst groupings – none of which helps the buyer.
The result is that someone who wishes to select suitable OLAP products cannot even
begin the search by using vendors’ own classifications of themselves. If, for example, a site
wants to implement typical OLAP applications like a management reporting system, a
budgeting and planning system or front-end software for a data warehouse, a text search
of press releases and Web sites looking for these words is likely to produce a very unreliable
list of suppliers.
Company | Self-description
Adaytum | “the leading provider of enterprise business planning (EBP) solutions”
AlphaBlox | “a leading global provider of customer analytics and business planning software that rapidly transforms information into knowledge”
Applix TM1 | “the leading business intelligence analytical engine that powers a suite of planning, reporting and analysis solutions used by Global 2000 enterprises”
Brio Software | “the leading provider of next-generation business intelligence tools that help Global 3000 companies achieve breakthrough business performance”
Business Objects | “the world’s leading provider of business intelligence (BI) solutions”
Cartesis | “a leading provider of world-class strategic financial software”
Cognos | “the world leader in business intelligence (BI)”
Comshare | “a leading provider of software that helps companies implement and execute strategy”
CorVu | “a global provider of Enterprise Business Performance Management, e-Business Intelligence and Balanced Scorecard Solutions”
Crystal Decisions | “is one of the world’s leading information management software companies”
Dimensional Insight | “a pioneer in developing and marketing multidimensional data visualization, analysis, and reporting solutions that put you in command of your business”
Gentia Software | “a leading supplier of intelligent analytical applications for enterprise performance management and customer relationship management”
Hummingbird Communications | “a fully integrated, scalable enterprise business intelligence solution”
Hyperion Solutions | “a global leader in business performance management software”
Longview Solutions | “a leading provider of fully-integrated financial analysis and decision support solutions”
MicroStrategy | “a leading worldwide provider of business intelligence software”
Oracle Express | “a powerful, multidimensional OLAP analysis environment”
ProClarity | “delivers analytic software and services that accelerate the speed organizations make informed decisions to optimize business performance”
Sagent Technology | “provides a complete software platform for business intelligence”
SAS Institute | “the market leader in business intelligence and decision support”
ShowCase | “provides a complete spectrum of business intelligence (BI) solutions”
Speedware Corporation | “a global provider of Internet tools and services, developing innovative solutions for a wide variety of platforms and markets, including business intelligence solutions”
White Light Systems | “the leading provider of next-generation analytic applications”
17.5 MARKET ANALYSIS
Due to the consolidation in the market, this analysis is now less relevant than it once was.
There are now fewer significant suppliers, and the mature products have expanded to the extent
that they are harder to classify neatly than they once were.
Although there are many OLAP suppliers, they are not all direct competitors. In
fact, they can be grouped into four principal categories, with relatively little overlap between
them. It is convenient to show the categories in the form of a square, because this neatly
encapsulates the relationships they have with each other. Where there are similarities, they
are with vendors on adjacent sides of the square, and not with those on the opposite side.
Vendors can be expected to be quite well informed – though not complimentary – about
their direct competitors in the same segment (although they are sometimes alarmingly
misinformed), and somewhat less knowledgeable about those in the adjacent sides of the
square. They are usually relatively ignorant, and often derisive, about vendors on the opposite
side of the square. More surprisingly, the same is often true of hardware suppliers, consultants
and systems integrators, whose expertise is usually concentrated on a few products. This
means that it can be misleading to assume that someone who clearly has expert knowledge
about a couple of OLAP products will be equally well-informed about others, particularly if
they are in different segments.
This is more of a vendor than a product categorization, although in most cases it
amounts to the same thing. Even when vendors choose to occupy positions on more than one
side of the square (with different products or by unbundling components), their attitudes
and style often anchor them to one position. The architectural differences between the
products are described separately in The OLAP Report.
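[Figure: the four OLAP sectors drawn as the sides of a square, with ROLAP and MOLAP on opposite sides and Application OLAP and Desktop OLAP on the other two opposite sides.]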
A potential buyer should create a shortlist of OLAP vendors for detailed consideration
that fall largely into a single one of the four categories. There is something wrong with a
shortlist that includes products from opposite sides of the square.
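The shortlist rule above can be expressed as a small check. The following Python sketch is only an illustration: the function name, the product names and the way sectors are assigned to products are assumptions made for the example, while the pairing of opposite sides follows the square described in this section.

```python
# Opposite sides of the square, as described in the text.
# Sector labels are the four categories named in this section.
OPPOSITE = {
    "ROLAP": "MOLAP",
    "MOLAP": "ROLAP",
    "Application OLAP": "Desktop OLAP",
    "Desktop OLAP": "Application OLAP",
}


def shortlist_warnings(shortlist):
    """Return warnings for a shortlist given as {product_name: sector}.

    Hypothetical helper: it flags a shortlist that spans several
    sectors, and separately flags sectors on opposite sides.
    """
    sectors = set(shortlist.values())
    unknown = sectors - set(OPPOSITE)
    if unknown:
        raise ValueError(f"unknown sectors: {sorted(unknown)}")

    warnings = []
    if len(sectors) > 1:
        warnings.append(
            "shortlist spans more than one sector: " + ", ".join(sorted(sectors))
        )
    for sector in sorted(sectors):
        other = OPPOSITE[sector]
        # Report each opposite pair once (alphabetically first sector only).
        if other in sectors and sector < other:
            warnings.append(f"opposite sides of the square: {sector} and {other}")
    return warnings


if __name__ == "__main__":
    # Product names are placeholders, not real products.
    print(shortlist_warnings({"Product A": "ROLAP", "Product B": "MOLAP"}))
```

A shortlist drawn from adjacent sides produces only the first, milder warning; one drawn from opposite sides produces both.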
Briefly, the four sectors can be described as:
1. Application OLAP is the longest established sector (which existed two decades
before the OLAP term was coined), and includes many vendors. Their products are
sold either as complete applications, or as very functional, complete toolkits from
which complex applications can be built. Nearly all application OLAP products
include a multidimensional database, although a few also work as hybrid or relational
OLAPs. Sometimes the bundled multidimensional engines are not especially
competitive in performance terms, and there is now a tendency for vendors in this
sector to use third-party multidimensional engines. The typical strengths include
integration and high functionality while the weaknesses may be complexity and
high cost per user. Vendors used to include Oracle, Hyperion Solutions, Comshare,
Adaytum, the former Seagate Software, Pilot Software, Gentia Software, SAS Institute,
WhiteLight, Sagent, Speedware, Kenan and Information Builders – but many of
these have been acquired, and some have disappeared entirely. Numerous specialist
vendors build applications on OLAP servers from Hyperion, Microsoft, MicroStrategy
and Applix.
Because many of these vendors and their applications are aimed at particular
vertical (e.g., retail, manufacturing, and banking) or horizontal (e.g., budgeting,
financial consolidation, sales analysis) markets, there is room for many vendors in
this segment as most will not be competing directly with each other. There may be
room only for two or three in each narrow niche, however.
2. MOLAP (Multidimensional Database OLAP) has been identified since the mid
1990s, although the sector has existed since the late 1980s. It consists of products
that can be bought as unbundled, high performance multidimensional or hybrid
databases. These products often include multi-user data updating. In some cases,
the vendors are specialists, who do nothing else, while others also sell other
application components. As these are sold as best of breed solutions, they have to
deliver high performance, sophisticated multidimensional calculation functionality
and openness, and are measured in terms of the range of complementary products
available from third-parties. Generally speaking, these products do not handle
applications as large as those that are possible in the ROLAP products, although
this is changing as these products evolve into hybrid OLAPs. There are two specialist
MOLAP vendors, Hyperion (Essbase) and Applix (TM1), who provide unbundled
high-performance engines, but are open to their users purchasing add-on tools
from partners.
Microsoft joined this sector with its OLAP Services module in SQL Server 7.0,
which is driving other OLAP technology vendors into the Application OLAP segment;
Arbor acquired Hyperion Software to become Hyperion Solutions and Applix is also
moving into the applications business. Vendors in this segment are competing
directly with each other, and it is unlikely that more than two or three vendors can
survive with this strategy in the long term. With SQL Server 2000 Analysis Services,
released in September 2000, Microsoft is a formidable competitor in this segment.
3. DOLAP (Desktop OLAP) has existed since 1990, but only became recognized
in the mid 1990s. These are client-based OLAP products that are easy to deploy
and have a low cost per seat. They normally have good database links, often to both
relational and multidimensional servers, as well as local PC files. It is not
normally necessary to build an application. They usually have very limited
functionality and capacity compared to the more specialized OLAP products. Cognos
(with PowerPlay) is the leader, but Business Objects, Brio Technology, Crystal
Decisions and Hummingbird are also contenders. Oracle is aiming at this sector
(and some of its people insist that it is already in it) with Discoverer, although it
has far too little functionality for it to be a contender (Discoverer also has less Web
functionality than the real desktop OLAPs). It lacks the ability to work off-line or
to perform even quite trivial multidimensional calculations, both of which are
prerequisites; it is also unable to access OLAP servers, even including Oracle’s own
Express Server. Crystal Enterprise also aims at this sector, but it too falls short of
the full functionality of the mature desktop OLAPs.
The Web versions of desktop OLAPs include a mid-tier server that replaces some
or all of the client functionality. This blurs the distinction between desktop OLAPs
and other server based OLAPs, but the differing functionality still distinguishes
them.
Vendors in this sector are direct competitors, and once the market growth slows
down, a shake-out seems inevitable. Successful desktop OLAP vendors are
characterized by huge numbers of alliances and reseller deals, and many of their
sales are made through OEMs and VARs who bundle low cost desktop OLAP
products with their applications as a way of delivering integrated multidimensional
analysis capabilities. As a result, the leaders in this sector aim to have hundreds
of thousands of ‘seats’, but most users probably only do very simple analyses. There
is also a lot of ‘shelfware’ in the large desktop OLAP sites, because buyers are
encouraged to purchase far more seats than they need.
4. ROLAP (Relational OLAP) is another sector that has existed since Metaphor in
the early 1980s (so none of the current vendors can honestly claim to have invented
it, but that doesn’t stop them trying), but it was recognized only in 1994. It is by far the
smallest of the OLAP sectors, but has had a great deal of publicity thanks to some
astute marketing. Despite confident predictions to the contrary, ROLAP products
have completely failed to dominate the OLAP market, and not one ROLAP vendor
has made a profit to date.
Products in this sector hold all their data and metadata in a standard RDBMS,
with none being stored in any external files. They are capable of dealing with very
large data volumes, but are complex and expensive to implement, have a slow
query performance and are incapable of performing complex financial calculations.
In operation, they work more as batch report writers than interactive analysis tools
(typically, responses to complex queries are measured in minutes or even hours
rather than seconds). They are suitable for read-only reporting applications, and
are most often used for sales analysis in the retail, consumer goods, telecom and
financial services industries (usually for companies which have very large numbers
of customers, products or both, and wish to report on and sometimes analyze sales
in detail).
Probably because the niche is so small, no ROLAP products have succeeded in
almost 20 years of trying. The pioneer ROLAP, Metaphor, failed to build a sound
business and was eventually acquired by IBM, which calls it IDS. Now MicroStrategy
dominates what remains of this sector, having defeated Information Advantage in
the marketplace. Sterling acquired the failing IA in 1999, and CA acquired Sterling
in 2000, and the former IA ROLAP server has, under CA’s ownership, effectively
disappeared from the market. The former Prodea was acquired by Platinum
technology, which was itself acquired by CA in 1999, and the Beacon product (now
renamed DecisionBase) has also disappeared from the market. Informix has also
failed with its MetaCube product, which it acquired in 1995, but abandoned before
the end of 1999. The loss-making WhiteLight also competes in this area, although
its architecture is closer to hybrid OLAP, and it positions itself as an applications
platform rather than a ROLAP server; it also has a tiny market share. MineShare
and Sagent failed to make any inroads at all into the ROLAP market. Even
MicroStrategy, the most vociferous promoter of the ROLAP concept, has moved to more
of a hybrid OLAP architecture with its long-delayed and much improved 7.0 version,
which finally shipped in June 2000. Despite being the most successful ROLAP
vendor by far, MicroStrategy has also made by far the largest losses ever in the
business intelligence industry.
Similarly, customers of hybrid products that allow both ROLAP and MOLAP modes,
like Microsoft Analysis Services, Oracle Express and Crystal Holos (now
discontinued), almost always choose the MOLAP mode.
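The contrast between the ROLAP and MOLAP modes can be illustrated with a small Python sketch. It is not how any of the products named above is implemented: the table name, column names, sample figures and function names are all assumptions made for the example. The relational style aggregates with SQL each time a query is asked, while the multidimensional style precomputes every total once and then answers by lookup.

```python
import sqlite3
from itertools import product

# Illustrative sample data: (region, item, amount). Invented for the example.
ROWS = [
    ("North", "Widgets", 120.0),
    ("North", "Gadgets", 80.0),
    ("South", "Widgets", 200.0),
    ("South", "Gadgets", 50.0),
]


def make_connection(rows):
    """Load the sample rows into an in-memory relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, item TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    return conn


def rolap_total(conn, region=None, item=None):
    """Relational style: aggregate at query time with SQL."""
    clauses, params = [], []
    if region is not None:
        clauses.append("region = ?")
        params.append(region)
    if item is not None:
        clauses.append("item = ?")
        params.append(item)
    where = (" WHERE " + " AND ".join(clauses)) if clauses else ""
    sql = "SELECT COALESCE(SUM(amount), 0) FROM sales" + where
    return conn.execute(sql, params).fetchone()[0]


def build_cube(rows):
    """Multidimensional style: precompute every total once.

    None in a key position stands for 'all members' of that dimension.
    """
    cube = {}
    for region, item, amount in rows:
        for key in product((region, None), (item, None)):
            cube[key] = cube.get(key, 0.0) + amount
    return cube


if __name__ == "__main__":
    conn = make_connection(ROWS)
    cube = build_cube(ROWS)
    print(rolap_total(conn, region="North"), cube[("North", None)])
```

Note that the precomputed cube holds more cells than there are source rows, even in this tiny case; with many sparse dimensions that growth is the root of the database explosion mentioned earlier in this chapter.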
17.6 OLAP ARCHITECTURES
Much confusion, some of it deliberate, abounds about OLAP architectures, with terms
like ROLAP, HOLAP, MOLAP and DOLAP (with more than one definition) proliferating at
one stage, though the last of these is used less these days. In fact, there are a number of
options for where OLAP data could be stored, and where it could be processed. Most vendors
only offer a subset of these, and some then go on to attempt to ‘prove’ that their approach
is the only sensible one. This is, of course, nonsense. However, quite a few products can
operate in more than one mode, and vendors of such products tend to be less strident in
their architectural arguments.
There are many subtle variations, but in principle, there are only three places where
the data can be stored and three where the majority of the multidimensional calculations
can be performed. This means that, in theory, there are nine possible basic architectures,
although only six make any sense.
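The count of nine is simply three storage locations multiplied by three processing locations. The Python sketch below enumerates the combinations. The labels for the storage and processing locations are assumptions chosen for illustration, since the paragraph above does not name them, and the sketch makes no attempt to decide which six combinations are the sensible ones.

```python
from itertools import product

# Assumed labels, for illustration only.
STORAGE = (
    "relational database",
    "multidimensional database on a server",
    "client-based files",
)
PROCESSING = (
    "SQL in the relational database",
    "multidimensional engine on a server",
    "multidimensional engine on the client",
)


def candidate_architectures():
    """Return every storage/processing pairing as a list of dicts."""
    return [
        {"storage": storage, "processing": processing}
        for storage, processing in product(STORAGE, PROCESSING)
    ]


if __name__ == "__main__":
    for number, arch in enumerate(candidate_architectures(), start=1):
        print(number, arch["storage"], "+", arch["processing"])
```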
Data Staging
Most data i n OLAP appl i cati ons or i gi nates i n other systems. However , i n some
appl i cati ons (such as pl anni ng and budgeti ng), the data mi ght be captured di rectl y by the
OLAP appl i cati on. When the data comes from other appl i cati ons, i t i s usual l y necessary for
the acti ve data to be stored i n a separate, dupl i cated form for the OLAP appl i cati on. Thi s
may be referred to as a data warehouse or, more commonl y today, as a data mart. For those
not fami l i ar wi th the reasons for thi s dupl i cati on, thi s i s a summary of the mai n reasons:
• Performance: OLAP appl i cati ons are often l arge, but are neverthel ess used for
unpredi ctabl e i nteracti ve anal ysi s. Thi s requi res that the data be accessed very
rapi dl y, whi ch usual l y di ctates that i t i s kept i n a separate, opti mi zed structure
whi ch can be accessed wi thout damagi ng the response from the operati onal systems.
• Multiple Data Sources: Most OLAP appl i cati ons requi re data sourced from mul ti pl e
feeder systems, possi bl y i ncl udi ng external sources and even desktop appl i cati ons.
The process of mergi ng these mul ti pl e data feeds can be very compl ex, because the
under l yi ng systems pr obabl y use di ffer ent codi ng systems and may al so have
di fferent peri odi ci ti es. For exampl e, i n a mul ti nati onal company, i t i s rare for
subsi di ari es i n di fferent countri es to use the same codi ng system for suppl i ers and
customers, and they may wel l al so use di fferent ERP systems, parti cul arl y i f the
group has grown by acqui si ti on.
• Cleansing Data: It is depressingly common for transaction systems to be full of
erroneous data which needs to be ‘cleansed’ before it is ready to be analyzed. Apart
from the small percentage of accidentally mis-coded data, there will also be examples
of optional fields that have not been completed. For example, many companies
would like to analyze their business in terms of their customers’ vertical markets.
This requires that each customer (or even each sale) be assigned an industry code;
however, this takes a certain amount of effort on the part of those entering the
data, for which they get little return, so they are likely, at the very least, to cut
corners. There may even be deliberate distortion of the data if sales people are
rewarded more for some sales than others: they will certainly respond to this direct
temptation by ‘adjusting’ (i.e. distorting) the data to their own advantage if they
think they can get away with it.
• Adjusting Data: There are many reasons why data may need adjusting before it
can be used for analysis. In order that this can be done without affecting the
transaction systems, the OLAP data needs to be kept separate. Examples of reasons
for adjusting the data include:
    • Foreign subsidiaries may operate under different accounting conventions or
    have different year-ends, so the data may need modifying before it can be
    used.
    • The source data may be in multiple currencies that must be translated.
    • The management, operational and legal structures of a company may be
    different.
    • The source applications may use different codes for products and customers.
    • Inter-company trading effects may need to be eliminated, perhaps to measure
    true added value at each stage of trading.
    • Some data may need obscuring or changing for reasons of confidentiality.
    • There may be analysis dimensions that are not part of the operational data
    such as vertical markets, television advertising regions or demographic
    characteristics.
• Timing: If the data in an OLAP application comes from multiple feeder systems,
it is very likely that they are updated on different cycles. At any one time, therefore,
the feeder applications may be at different stages of update. For example, the
month-end updates may be complete in one system, but not in another, and a third
system may be updated on a weekly cycle. In order that the analysis is based on
consistent data, the data needs to be staged, within a data warehouse or directly
in an OLAP database.
• History: The majority of OLAP applications include time as a dimension, and
many useful results are obtained from time series analysis. But for this to be useful
it may be necessary to hold several years’ data on-line in this way, something that
the operational systems feeding the OLAP application are very unlikely to do. This
requires an initial effort to locate the historical data, and usually to adjust it
because of changes in organizational and product structures. The resulting data is
then held in the OLAP database.
• Summaries: Operational data is necessarily very detailed, but most decision-making
activities require a much higher level view. In the interests of efficiency, it is
usually necessary to store merged, adjusted information at summary level, and this
would not be feasible in a transaction processing system.
• Data Updating: If the application allows users to alter or input data, it is obviously
essential that the application has its own separate database that does not
overwrite the ‘official’ operational data.
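The merging, cleansing and adjusting steps listed above can be pictured with a small staging routine. The following Python sketch is illustrative only: the lookup tables, the exchange rates, the field names and the `stage` function are all hypothetical, and a real staging process would load its reference data from maintained sources rather than hard-coding it.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical reference data: local customer codes mapped to one group-wide code.
CUSTOMER_CODE_MAP = {
    ("UK", "C-001"): "CUST0001",
    ("DE", "K17"): "CUST0001",   # same customer, different local coding system
    ("DE", "K42"): "CUST0002",
}
# Made-up translation rates into a single reporting currency.
RATES_TO_USD = {"GBP": 1.25, "EUR": 1.10, "USD": 1.0}


@dataclass(frozen=True)
class SourceRow:
    subsidiary: str
    local_customer: str
    currency: str
    amount: float
    industry_code: Optional[str]  # optional field, often left blank at entry


def stage(rows):
    """Return (clean, rejects): clean rows use common codes and one currency."""
    clean, rejects = [], []
    for row in rows:
        key = (row.subsidiary, row.local_customer)
        if key not in CUSTOMER_CODE_MAP or row.currency not in RATES_TO_USD:
            rejects.append(row)  # needs manual cleansing before analysis
            continue
        clean.append({
            "customer": CUSTOMER_CODE_MAP[key],
            "amount_usd": round(row.amount * RATES_TO_USD[row.currency], 2),
            "industry": row.industry_code or "UNASSIGNED",
        })
    return clean, rejects


rows = [
    SourceRow("UK", "C-001", "GBP", 100.0, "RETAIL"),
    SourceRow("DE", "K17", "EUR", 200.0, None),
    SourceRow("FR", "X9", "EUR", 50.0, None),  # unknown local code, so rejected
]
clean, rejects = stage(rows)
```

Because the staged rows live in their own structure, none of these adjustments touch the feeder systems, which is the point made under Adjusting Data and Data Updating.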
Storing Active OLAP Data
Given the necessity to store active OLAP data in an efficient, duplicated form, there are
essentially three options. Many products can use more than one of these, sometimes
simultaneously. Note that ‘store’ in this context means holding the data in a persistent form
(for at least the duration of a session, and often shared between users), not simply for the
time required to process a single query.
• Relational Database: This is an obvious choice, particularly if the data is sourced
from an RDBMS (either because a data warehouse has been implemented using an
RDBMS or because the operational systems themselves hold their data in an
RDBMS). In most cases, the data would be stored in a denormalized structure such
as a star schema, or one of its variants, such as snowflake; a normalized database
would not be appropriate for performance and other reasons. Often, summary data
will be held in aggregate tables.
• Multidimensional Database: In this case, the active data is stored in a
multidimensional database on a server. It may include data extracted and
summarized from legacy systems or relational databases and from end-users. In
most cases, the database is stored on disk, but some products allow RAM based
multidimensional data structures for greater performance. It is usually possible
(and sometimes compulsory) for aggregates and other calculated items to be
pre-computed and the results stored in some form of array structure. In a few cases,
the multidimensional database allows concurrent multi-user read-write access, but
this is unusual; many products allow single-write/multi-read access, while the rest
are limited to read-only access.
• Client-based Files: In this case, relatively small extracts of data are held on
client machines. They may be distributed in advance, or created on demand (possibly
via the Web). As with multidimensional databases on the server, active data may
be held on disk or in RAM, and some products allow only read access.
These three locations have different capacities, and they are arranged in descending
order. They also have different performance characteristics, with relational databases being
a great deal slower than the other two options.
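To make the relational option more concrete, the sketch below builds a tiny star schema and one aggregate table using Python's standard `sqlite3` module. The table names, column names and figures are invented for illustration; they are not taken from any product mentioned in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables describe the things being analyzed.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_period  (period_id  INTEGER PRIMARY KEY, month TEXT, year INTEGER);

    -- The single fact table sits at the centre of the star.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        period_id  INTEGER REFERENCES dim_period(period_id),
        revenue    REAL
    );

    INSERT INTO dim_product VALUES
        (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware'), (3, 'Manual', 'Books');
    INSERT INTO dim_period VALUES (1, '2005-01', 2005), (2, '2005-02', 2005);
    INSERT INTO fact_sales VALUES (1, 1, 100.0), (2, 1, 50.0), (3, 1, 20.0), (1, 2, 80.0);

    -- Aggregate table: summary data pre-computed at category/year level.
    CREATE TABLE agg_sales_category_year AS
        SELECT p.category, t.year, SUM(f.revenue) AS revenue
        FROM fact_sales f
        JOIN dim_product p ON p.product_id = f.product_id
        JOIN dim_period  t ON t.period_id  = f.period_id
        GROUP BY p.category, t.year;
""")

# Queries at summary level can now read the small aggregate table directly.
summary = {
    category: revenue
    for category, _year, revenue in conn.execute(
        "SELECT category, year, revenue FROM agg_sales_category_year"
    )
}
```

The aggregate table trades storage and load time for query speed, which is the usual motivation for holding summary data alongside the detailed fact table.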
Processing OLAP Data
Just as there are three possible locations for OLAP data, exactly the same three
options are available for processing the data. As will be seen, the multidimensional calculations
do not need to occur in the place where the data is stored.
• SQL: This is far from being an obvious choice to perform complex multidimensional
calculations, even if the live OLAP data is stored in an RDBMS. SQL does not have
the ability to perform multidimensional calculations in single statements, and
complex multi-pass SQL is necessary to achieve more than the most trivial
multidimensional functionality. Nevertheless, this has not stopped vendors from
trying. In most cases, they do a limited range of suitable calculations in SQL, with
the results then being used as input by a multidimensional engine, which does
most of the work, either on the client or in a mid-tier server. There may also be
a RAM resident cache which can hold data used in more than one query: this
improves response dramatically.
• Multidimensional Server Engine: This is an obvious and popular place to perform
multidimensional calculations in client/server OLAP applications, and it is used in
many products. Performance is usually good, because the engine and the database
can be optimized to work together, and the availability of plenty of memory on a
server can mean that large scale array calculations can be performed very efficiently.
• Client Multidimensional Engine: On the assumption that most users have
relatively powerful PCs, many vendors aim to take advantage of this power to
perform some, or most, of the multidimensional calculations. With the expected rise
in popularity of thin clients, vendors with this architecture have to move most of
the client based processing to new Web application servers.
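The multi-pass SQL pattern mentioned in the first bullet can be sketched as follows, again with Python's `sqlite3` module and invented data. The first pass writes an intermediate result to a temporary table; the second pass joins the detail rows back to it. Later SQL standards added features such as window functions that shorten some calculations of this kind, so this is a picture of the pattern itself rather than of what any particular vendor's generated SQL looks like.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, product TEXT, revenue REAL);
    INSERT INTO sales VALUES
        ('North', 'Widget', 60.0), ('North', 'Gadget', 40.0),
        ('South', 'Widget', 30.0), ('South', 'Gadget', 90.0);
""")

# Pass 1: aggregate to region level and keep the result in a temporary table.
conn.execute("""
    CREATE TEMP TABLE region_total AS
    SELECT region, SUM(revenue) AS total FROM sales GROUP BY region
""")

# Pass 2: join the detail rows to the pass-1 result to obtain each
# product's share of its region's revenue.
shares = {
    (region, product): share
    for region, product, share in conn.execute("""
        SELECT s.region, s.product, s.revenue / t.total
        FROM sales s JOIN region_total t ON t.region = s.region
    """)
}
```

Each additional level of comparison (share of region, share of country, change against prior period) tends to add another pass, which is why the heavier calculations are usually handed to a multidimensional engine.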
The OLAP Architectural Matrix
Three places to store multidimensional data, and the same three locations for
multidimensional engines: combining these gives nine possible storage/processing options.
But some of these are nonsensical: it would be absurd to store data in a multidimensional
database, but do multidimensional processing in an RDBMS, so only the six options on or
below the diagonal make sense.
Multidimensional    Multidimensional data storage options
processing
options             RDBMS                    Multidimensional          Client files
                                             database server

Multi-pass SQL      1
                    Cartesis Magnitude
                    MicroStrategy

Multidimensional    2                        4
server engine       Crystal Holos            SAS CFO Vision
                      (ROLAP mode)           Crystal Holos
                    Hyperion Essbase         Geac MPC
                    Longview Khalix          Hyperion Essbase
                    Speedware Media/MR       Oracle Express
                    Microsoft Analysis       Oracle OLAP Option AW
                      Services               Microsoft Analysis
                    Oracle Express             Services
                      (ROLAP mode)           PowerPlay Enterprise
                    Oracle OLAP Option         Server
                      (ROLAP mode)           Pilot Analysis Server
                    Pilot Analysis Server    Applix TM1
                    WhiteLight

Client              3                        5                         6
multidimensional    Oracle Discoverer        Comshare FDC              Hyperion Intelligence
engine                                       Dimensional Insight       Business Objects
                                             Hyperion Enterprise       Cognos PowerPlay
                                             Hyperion Pillar           Personal Express
                                                                       TM1 Perspectives
The widely used (and misused) nomenclature is not particularly helpful, but roughly
speaking:
• Relational OLAP (ROLAP) products are in squares 1, 2 and 3
• MDB (also known as MOLAP) products are in squares 4 and 5
• Desktop OLAP products are in square 6
• Hybrid OLAP products are those that are in both squares 2 and 4 (shown in
italics)
The fact that several products are in the same square, and therefore have similar
architectures, does not mean that they are necessarily very similar products. For instance,
DB2 OLAP Server and Eureka are quite different products that just happen to share certain
storage and processing characteristics.
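The numbering of the six sensible squares, and the rough nomenclature attached to them, can be reproduced mechanically. The short Python sketch below is only an illustration of the "on or below the diagonal" rule and of the square-to-category mapping given above; the function names are our own.

```python
STORAGE = ["RDBMS", "Multidimensional database server", "Client files"]
PROCESSING = [
    "Multi-pass SQL",
    "Multidimensional server engine",
    "Client multidimensional engine",
]


def sensible_squares():
    """Number the on-or-below-diagonal cells column by column, as in the matrix."""
    squares = {}
    number = 0
    for s, storage in enumerate(STORAGE):
        for p, processing in enumerate(PROCESSING):
            if p >= s:  # processing row is on or below the diagonal
                number += 1
                squares[number] = (processing, storage)
    return squares


def category(square):
    """Rough nomenclature from the text; hybrids appear in both squares 2 and 4."""
    if square in (1, 2, 3):
        return "ROLAP"
    if square in (4, 5):
        return "MOLAP"
    if square == 6:
        return "Desktop OLAP"
    raise ValueError(f"no such square: {square}")


squares = sensible_squares()
```

Of the nine combinations, the three that are skipped are those where the processing option sits above the diagonal, such as multi-pass SQL over a multidimensional database.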
Is there an ideal choice?
Each of these options has its own strengths and weaknesses, and there is no single
optimum choice. It is perfectly reasonable for sites to use products from more than one of
the squares, and even more than one from a single square if they are specialized products
used for different applications. As might be expected, the squares containing the most
products are also the most widely used architectures, and vice versa. The choice of architecture
does affect the performance, capacity, functionality and particularly the scalability of an
OLAP solution and this is discussed elsewhere in The OLAP Report.
17.7 DIMENSIONAL DATA STRUCTURES
This is one of the older analyses in The OLAP Report, and it refers to many products
that are no longer on the market. For historical accuracy, we have left in these references.
The simplest view of the data for a multidimensional application is that it exists in a
large Cartesian space bounded by all the dimensions of the application. Some might call this
a data space, emphasizing that it is a logical rather than a physical concept. While it would
be perfectly possible to build a product that worked like this, with all the data held in a
simple uncompressed array, no large-scale product could be as basic as this.
In fact, as shown in Figure 17.1, multidimensional data is always sparse, and often
‘clumpy’: that is, data cells are not distributed evenly through the multidimensional space.
Pockets of data cells are clustered together in places and time periods when a business event
has occurred.
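A small calculation shows why a simple uncompressed array is impractical. In the Python sketch below, the dimension sizes and cell values are hypothetical; the point is only the ratio between the logical Cartesian space and the number of cells that actually hold data.

```python
from math import prod

# Hypothetical dimension sizes for a modest sales application.
dimension_sizes = {"product": 1000, "outlet": 200, "month": 36, "measure": 5}

# Only cells where a business event occurred are stored, keyed by coordinates.
cells = {
    ("P001", "Outlet-7", "2005-01", "units"): 12.0,
    ("P001", "Outlet-7", "2005-01", "revenue"): 240.0,
    ("P001", "Outlet-7", "2005-02", "units"): 9.0,
}

logical_cells = prod(dimension_sizes.values())  # size of the full data space
density = len(cells) / logical_cells            # fraction of cells populated


def lookup(coords):
    """Empty cells are simply absent from the store and cost no space."""
    return cells.get(coords)
```

Note that the three stored cells share a product and an outlet and fall in adjacent months: a miniature version of the clustering described above.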
The designers of multidimensional products all know this, and they adopt a variety of
technical strategies for dealing with this sparse, but clustered data. They then choose one
of two principal ways of presenting this multidimensional data to the user. These are not
just viewing options: they also determine how the data is processed, and, in particular, how
many calculations are permitted.
Figure 17.1. Data in Multidimensional Applications Tends to be Clustered into Relatively Dense
Blocks, with Large Gaps in Between, Rather Like Planets and Stars in Space.
Hypercube
Some large scale products do present a simple, single-cube logical structure, even though
they use a different, more sophisticated, underlying model to compress the sparse data.
They usually allow data values to be entered for every combination of dimension members,
and all parts of the data space have identical dimensionality. In this report, we call this data
structure a hypercube, without any suggestion that the dimensions are all equal sized or
that the number of dimensions is limited to any particular value. We use this term in a
specific way, and not as a general term for multidimensional structures with more than
three dimensions. However, the use of the term does not mean that data is stored in any
particular format, and it can apply equally to both multidimensional databases and relational
OLAPs. It is also independent of the degree to which the product pre-calculates the data.
Purveyors of hypercube products emphasize their greater simplicity for the end-user. Of
course, the apparently simple hypercube is not how the data is usually stored and there are
extensive technical strategies, discussed elsewhere in The OLAP Report, to manage the
sparsity efficiently. Essbase (at least up to version 4.1) is an example of a modern product
that used the hypercube approach. It is not surprising that Comshare chose to work with
Arbor in using Essbase, as Comshare has adopted the hypercube approach in several previous
generations of less sophisticated multidimensional products going back to System W in
1981. In turn, Thorn-EMI Computer Software, the company that Arbor’s founders previously
worked for, adopted the identical approach in FCS-Multi, so Essbase could be described as
a third generation hypercube product, with logical (though not technical) roots in the System
W family.
Many of the simpler products also use hypercubes and this is particularly true of the
ROLAP applications that use a single fact table star schema: it behaves as a hypercube. In
practice, therefore, most multidimensional query tools look at one hypercube at a time.
Some examples of hypercube products include Essbase (and therefore Analyzer and
Comshare/Geac Decision), Executive Viewer, CFO Vision, BI/Analyze (the former
PaBLO) and PowerPlay.
There is a variant of the hypercube that we call the fringed hypercube. This is a
dense hypercube, with a small number of dimensions, to which additional analysis dimensions
can be added for parts of the structure. The most obvious products in this report to have
this structure are Hyperion Enterprise and Financial Management, CLIME and Comshare
(now Geac) FDC.
Multicubes
The other, much more common, approach is what we call the multicube structure. In
multicube products, the application designer segments the database into a set of
multidimensional structures each of which is composed of a subset of the overall number of
dimensions in the database. In this report, we call these smaller structures subcubes; the
names used by the various products include variables (Express, Pilot and Acumate), structures
(Holos), cubes (TM1 and Microsoft OLAP Services) and indicators (Media). They might be,
for example, a set of variables or accounts, each dimensioned by just the dimensions that
apply to that variable. Exponents of these systems emphasize their greater versatility and
potentially greater efficiency (particularly with sparse data), dismissing hypercubes as simply
a subset of their own approach. This dismissal is unfounded, as hypercube products also
break up the database into subcubes under the surface, thus achieving many of the efficiencies
of multicubes, without the user-visible complexities. The original example of the multicube
approach was Express, but most of the newer products also use modern variants of this
approach.
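The idea of subcubes, each dimensioned by just the dimensions that apply, can be sketched in a few lines of Python. The variable names, dimension names and the `put` helper below are hypothetical and do not reflect the storage format of any product named above.

```python
# Each subcube (variable) is dimensioned only by the dimensions that apply to it.
multicube = {
    "sales":     {"dims": ("product", "region", "month"), "cells": {}},
    "headcount": {"dims": ("region", "month"),            "cells": {}},
    "fx_rate":   {"dims": ("currency", "month"),          "cells": {}},
}


def put(variable, value, **coords):
    """Store a value, insisting on exactly the subcube's own dimensions."""
    dims = multicube[variable]["dims"]
    if set(coords) != set(dims):
        raise KeyError(f"{variable} is dimensioned by {dims}, got {tuple(coords)}")
    key = tuple(coords[d] for d in dims)
    multicube[variable]["cells"][key] = value


put("sales", 120.0, product="Widget", region="North", month="2005-01")
put("headcount", 14, region="North", month="2005-01")

# A hypercube would instead give every variable the union of all dimensions.
hypercube_dims = sorted({d for cube in multicube.values() for d in cube["dims"]})
```

In the hypercube view, headcount would logically have product and currency coordinates as well, most of which could never hold data; the multicube avoids defining that empty space in the first place.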
Because Express has always used multicubes, this can be regarded as the longest
established OLAP structure, predating hypercubes by at least a decade. In fact, the multicube
approach goes back even further, to the original multidimensional product, APL from the
1960s, which worked in precisely this way.
ROLAP products can also be multicubes if they can handle multiple base fact tables,
each with different dimensionality. Most serious ROLAPs, like those from Informix, CA
and MicroStrategy, have this capability. Note that the only large-scale ROLAPs still on
sale are MicroStrategy and SAP BW.
It is also possible to identify two main types of multicube: the block multicube (as
used by Business Objects, Gentia, Holos, Microsoft Analysis Services and TM1) and
the series multicube (as used by Acumate, Express, Media and Pilot). Note that several
of these products have now been discontinued.
Block multicubes use orthogonal dimensions, so there are no special dimensions at the
data level. A cube may consist of any number of the defined dimensions, and both measures
and time are treated as ordinary dimensions, just like any other. Series multicubes treat
each variable as a separate cube (often a time series), with its own set of other dimensions.
However, these distinctions are not hard and fast, and the various multicube implementations
do not necessarily fall cleanly into one type or the other.
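One way to picture the difference between the two types is to hold the same figures in both shapes. The Python sketch below is a simplified illustration under our own assumptions (a single block with three dimensions, and monthly series); real products use far more elaborate structures.

```python
# Block multicube: measures and time are ordinary dimensions of the block.
block = {
    "dims": ("measure", "product", "month"),
    "cells": {
        ("units", "Widget", "2005-01"): 10,
        ("units", "Widget", "2005-02"): 12,
        ("price", "Widget", "2005-01"): 2.5,
        ("price", "Widget", "2005-02"): 2.5,
    },
}

# Series multicube: one cube per variable, each holding time series keyed by
# that variable's own remaining dimensions.
MONTHS = ("2005-01", "2005-02")
series = {
    "units": {("Widget",): [10, 12]},
    "price": {("Widget",): [2.5, 2.5]},
}


def block_to_series(block, months):
    """Regroup a block into per-variable time series (assumes the dims above)."""
    out = {}
    for (measure, product, month), value in block["cells"].items():
        row = out.setdefault(measure, {}).setdefault((product,), [None] * len(months))
        row[months.index(month)] = value
    return out
```

In the block form, a viewer can pivot measure and month like any other dimension; in the series form, each variable is naturally handled as a whole time series.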
The block multicubes are more flexible, because they make no assumptions about
dimensions and allow multiple measures to be handled together, but are often less convenient
for reporting, because the multidimensional viewers can usually only look at one block at
a time (an example of this can be seen in TM1’s implementation of the former OLAP
Council’s APB-1 benchmark). The series multicubes do not have this restriction. Microsoft
Analysis Services, GentiaDB and especially Holos get round the problem by allowing cubes
to be joined, thus presenting a single cube view of data that is actually processed and stored
physically in multiple cubes. Holos allows views of views, something that not all products
support. Essbase transparent partitions are a less sophisticated version of this concept.
Series multicubes are the older form, and only one product (Speedware Media) whose
development first started after the early 1980s has used them. The block multicubes came
along in the mid 1980s, and now seem to be the most popular choice. There is one other
form, the atomic multicube, which was mentioned in the first edition of The OLAP Report,
but it appears not to have caught on.
Which is better?
We do not take sides on this matter. We have seen widespread, successful use of
hypercube products and of both flavors of multicube products by customers in all industries.
In general, multicubes are more efficient and versatile, but hypercubes are easier to understand.
End-users will relate better to hypercubes because of their higher level view; MIS professionals
with multidimensional experience prefer multicubes because of their greater tunability and
flexibility. Multicubes are a more efficient way of storing very sparse data and they can
reduce the pre-calculation database explosion effect, so large, sophisticated products tend to
use multicubes. Pre-built applications also tend to use multicubes so that the data structures
can be more finely adjusted to the known application needs.
One relatively recent development is that two products so far have introduced the
ability to de-couple the storage, processing and presentation layers. The now discontinued
GentiaDB and the Holos compound OLAP architecture allow data to be stored physically
as a multicube, but calculations to be defined as if it were a hypercube. This approach
potentially delivers the simplicity of the hypercube, with the more tunable storage of a
multicube. Microsoft’s Analysis Services also has similar concepts, with partitions and
virtual cubes.
In Summary
It is easy to understand the OLAP definition in five keywords: Fast Analysis of Shared
Multidimensional Information (FASMI). OLAP includes 18 rules, which are categorized into
four features: Basic Feature, Special Feature, Reporting Feature, and Dimensional Control.
A potential buyer should create a shortlist of OLAP vendors for detailed consideration that
fall largely into a single one of the four categories: Application OLAP, MOLAP, DOLAP,
ROLAP.
CHAPTER 18
OLAP APPLICATIONS
We define OLAP as Fast Analysis of Shared Multidimensional Information (FASMI).
There are many applications where this approach is relevant, and this section describes the
characteristics of some of them. In an increasing number of cases, specialist OLAP applications
have been pre-built and you can buy a solution that only needs limited customizing; in
others, a general-purpose OLAP tool can be used. A general-purpose tool will usually be
versatile enough to be used for many applications, but there may be much more application
development required for each. The overall software costs should be lower, and skills are
transferable, but implementation costs may rise and end-users may get less ad hoc flexibility
if a more technical product is used. In general, it is probably better to have a
general-purpose product which can be used for multiple applications, but some applications, such as
financial reporting, are sufficiently complex that it may be better to use a pre-built application,
and there are several available.
We would advise users never to engineer flexibility out of their applications: the only
thing you can predict about the future is that it will not be what you predicted. Try not to
hard code any more than you have to.
OLAP applications have been most commonly used in the financial and marketing
areas, but as we show here, their uses do extend to other functions. Data rich industries
have been the most typical users (consumer goods, retail, financial services and transport)
for the obvious reason that they had large quantities of good quality internal and external
data available, to which they needed to add value. However, there is also scope to use OLAP
technology in other industries. The applications will often be smaller, because of the lower
volumes of data available, which can open up a wider choice of products (because some
products cannot cope with very large data volumes).
18.1 MARKETING AND SALES ANALYSIS
Most commercial companies require this application, and most products are capable of
handling it to some degree. However, large-scale versions of this application occur in three
industries, each with its own peculiarities.
• Consumer goods industries often have large numbers of products and outlets, and
a high rate of change of both. They usually analyze data monthly, but sometimes
it may go down to weekly or, very occasionally, daily. There are usually a number
of dimensions, none especially large (rarely over 100,000). Data is often very sparse
because of the number of dimensions. Because of the competitiveness of these
industries, data is often analyzed using more sophisticated calculations than in
other industries. Often, the most suitable technology for these applications is one
of the hybrid OLAPs, which combine high analytical functionality with reasonably
large data capacity.
• Retailers, thanks to EPOS data and loyalty cards, now have the potential to analyze
huge amounts of data. Large retailers could have over 100,000 products (SKUs)
and hundreds of branches. They often go down to weekly or daily level, and may
sometimes track spending by individual customers. They may even track sales by
time of day. The data is not usually very sparse, unless customer level detail is
tracked. Relatively low analytical functionality is usually needed. Sometimes, the
volumes are so large that a ROLAP solution is required, and this is certainly true
of applications where individual private consumers are tracked.
• The financial services industry (insurance, banks etc) is a relatively new user of
OLAP technology for sales analysis. With an increasing need for product and
customer profitability, these companies are now sometimes analyzing data down to
individual customer level, which means that the largest dimension may have millions
of members. Because of the need to monitor a wide variety of risk factors, there
may be large numbers of attributes and dimensions, often with very flat hierarchies.
Most of the data will usually come from the sales ledger(s), but there may be customer
databases and some external data that has to be merged. In some industries (for example,
pharmaceuticals and CPG), large volumes of market and even competitor data is readily
available and this may need to be incorporated.
Getting the right data may not be easy. For example, most companies have problems
with incorrect coding of data, especially if the organization has grown through acquisition,
with different coding systems in use in different subsidiaries. If there have been multiple
different contracts with a customer, then the same single company may appear as multiple
different entities, and it will not be easy to measure the total business done with it. Another
complication might come with calculating the correct revenues by product. In many cases,
customers or resellers may get discounts based on cumulative business during a period. This
discount may appear as a retrospective credit to the account, and it should then be factored
against the multiple transactions to which it applies. This work may have been done in the
transaction processing systems or a data warehouse; if not, the OLAP tool will have
to do it.
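As an illustration of factoring a retrospective credit against the transactions to which it applies, the Python sketch below spreads the credit in proportion to each transaction's revenue. The pro-rata rule, the function name and the figures are assumptions made for the example; a real discount agreement may specify a different allocation.

```python
def allocate_credit(transactions, credit):
    """Spread a retrospective credit across transactions in proportion to revenue.

    transactions: list of (product, gross_revenue) pairs for the discount period.
    Returns net revenue per product. Pro-rata allocation is an assumption here.
    """
    total = sum(revenue for _product, revenue in transactions)
    if total == 0:
        raise ValueError("no revenue to allocate the credit against")
    net = {}
    for product, revenue in transactions:
        share = credit * revenue / total  # this transaction's part of the credit
        net[product] = net.get(product, 0.0) + revenue - share
    return net


transactions = [("Widget", 600.0), ("Gadget", 300.0), ("Widget", 100.0)]
net_revenue = allocate_credit(transactions, credit=50.0)
```

With these figures the credit of 50 reduces Widget revenue from 700 to 665 and Gadget revenue from 300 to 285, so revenue by product remains consistent with the account total.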
There are analyses that are possible along every dimension. Here are a dozen of the
questions that could be answered using a good marketing and sales analysis system:
1. Are we on target to achieve the month-end goals, by product and by region?
2. Are our back orders at the right levels to meet next month’s goals? Do we have
adequate production capacity and stocks to meet anticipated demand?
3. Are new products taking off at the right rate in all areas?
4. Have some new products failed to achieve their expected penetration, and should
they be withdrawn?
5. Are all areas achieving the expected product mix, or are some groups failing to sell
some otherwise popular products?
6. Is our advertising budget properly allocated? Do we see a rise in sales for products
and in areas where we run campaigns?
7. What average discounts are being given, by different sales groups or channels?
Should commission structures be altered to reflect this?
8. Is there a correlation between promotions and sales growth? Are the prices out of
line with the market? Are some sales groups achieving their monthly or quarterly
targets by excessive discounting?
9. Are new product offerings being introduced to established customers?
10. Is the revenue per head the same in all parts of the sales force? Why?
11. Do outlets with similar demographic characteristics perform in the same way, or
are some doing much worse than others? Why?
12. Based on history and known product plans, what are realistic, achievable targets
for each product, time period and sales channel?
The benefits of a good marketing and sales analysis system are that results should be
more predictable and manageable, opportunities will be spotted more readily and sales
forces should be more productive.
18.2 CLICKSTREAM ANALYSIS
This is one of the latest OLAP applications. Commercial Web sites generate gigabytes
of data a day that describe every action made by every visitor to the site. No bricks and
mortar retailer has the same level of detail available about how visitors browse the offerings,
the route they take and even where they abandon transactions. A large site has an almost
impossible volume of data to analyze, and a multidimensional framework is possibly the best
way of making sense of it. There are many dimensions to this analysis, including where the
visitors came from, the time of day, the route they take through the site, whether or not
they started/completed a transaction, and any demographic data available about customer
visitors.
Unlike a conventional retailer, an e-commerce site has the ability (almost an obligation,
it would seem) to be redesigned regularly, and this should be based, at least in part, on a
scientific analysis of how well the site serves its visitors and whether it is achieving its
business objectives, rather than a desire merely to reflect the latest design and technology
fashions. This means that it is necessary to have detailed information on the popularity and
success of each component of a site.
But the Web site should not be viewed in isolation. It is only one facet of an organization’s
business, and ideally, the Web statistics should be combined with other business data,
including product profitability, customer history and financial information. OLAP is an ideal
way of bringing these conventional and new forms of data together. This would allow, for
instance, Web sites to be targeted not simply to maximize transactions, but to generate
profitable business and to appeal to customers likely to create such business. OLAP can also
be used to assist in personalizing Web sites.
In the great rush to move business to the Web, many companies have managed to
ignore analytics, just as they did during the ERP craze of the mid and late 1990s. But the
sheer volume of data now available, and the shorter times in which to analyze it, make
exception and proactive reporting far more important than before. Failing to provide such
reporting is an invitation for a quick disaster.
Figure 18.1. An example of a graphically intense clickstream analysis application that uses OLAP
invisibly at its core: eBizinsights from Visual Insights. This analysis shows visitor segmentation
(browsers, abandoners, buyers) for various promotional activities at hourly intervals. This application
automatically builds four standard OLAP server cubes using a total of 24 dimensions for the analyses.
Many of the issues with clickstream analysis come long before the OLAP tool. The
biggest issue is to correctly identify real user sessions, as opposed to hits. This means
eliminating the many crawler bots that are constantly searching and indexing the Web, and
then grouping sets of hits that constitute a session. This cannot be done by IP address alone,
as Web proxies and NAT (network address translation) mask the true client IP address, so
techniques such as session cookies must be used in the many cases where surfers do not
identify themselves by other means. Indeed, vendors such as Visual Insights charge much
more for upgrades to the data capture and conversion features of their products than they
do for the reporting and analysis components, even though the latter are much more visible.
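The session identification step described above can be illustrated with a minimal sketch. It is not taken from any product: the Hit record, the 30-minute inactivity timeout and the user-agent substrings used to spot crawlers are all assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import groupby

SESSION_TIMEOUT = 30 * 60                   # assumed: 30 minutes of inactivity ends a session
BOT_MARKERS = ("bot", "crawler", "spider")  # assumed user-agent substrings for crawlers


@dataclass
class Hit:
    cookie_id: str    # session cookie value, used instead of the IP address
    timestamp: float  # seconds since some epoch
    user_agent: str
    url: str


def is_bot(hit):
    """Crude crawler test based on the user-agent string (an assumption)."""
    agent = hit.user_agent.lower()
    return any(marker in agent for marker in BOT_MARKERS)


def sessionize(hits):
    """Drop crawler hits, then group the remaining hits into sessions per cookie."""
    human = sorted((h for h in hits if not is_bot(h)),
                   key=lambda h: (h.cookie_id, h.timestamp))
    sessions = []
    for _, group in groupby(human, key=lambda h: h.cookie_id):
        current = []
        for hit in group:
            # A long gap since the previous hit starts a new session.
            if current and hit.timestamp - current[-1].timestamp > SESSION_TIMEOUT:
                sessions.append(current)
                current = []
            current.append(hit)
        sessions.append(current)
    return sessions
```

The resulting list of sessions, rather than the raw hits, is what would then be loaded into the cube.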
18.3 DATABASE MARKETING
This is a specialized marketing application that is not normally thought of as an OLAP
application, but is now taking advantage of multidimensional analysis, combined with other
statistical and data mining technologies. The purpose of the application is to determine who
are the best customers for targeted promotions for particular products or services based on
the disparate information from various different systems.
Database marketing professionals aim to:
• Determine who the preferred customers are, based on their purchase of profitable
products. This can be done with brute force data mining techniques (which are slow
and can be hard to interpret), or by experienced business users investigating hunches
using OLAP cubes (which is quicker and easier).
• Work to build loyalty packages for preferred customers via correct offerings. Once
the preferred customers have been identified, look at their product mix and buying
profile to see if there are denser clusters of product purchases over particular time
periods. Again, this is much easier in a multidimensional environment. These can
then form the basis for special offers to increase the loyalty of profitable customers.
• Determine a customer profile and use it to ‘clone’ the best customers. Look for
customers who have some, but not all of the characteristics of the preferred
customers, and target appropriate promotional offers at them.
If these goals are met, both parties profit. The customers will have a company that
knows what they want and provides it. The company will have loyal customers that generate
sufficient revenue and profits to continue a viable business.
Database marketing specialists try to model (using statistical or data mining techniques)
which pieces of information are most relevant for determining likelihood of subsequent
purchases, and how to weight their importance. In the past, pure marketers have looked for
triggers, which works, but only in one dimension. But a well established company may have
hundreds of pieces of information about customers, plus years of transaction data, so
multidimensional structures are a great way to investigate relationships quickly, and narrow
down the data which should be considered for modeling.
Once this is done, the customers can be scored using the weighted combination of
variables which compose the model. A measure can then be created, and cubes set up which
mix and match across multidimensional variables to determine optimal product mix for
customers. The users can determine the best product mix to market to the right customers
based on segments created from a combination of the product scores, the several demographic
dimensions, and the transactional data in aggregate.
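As a hedged illustration of scoring customers with a weighted combination of variables, the sketch below uses three hypothetical variables (recency, frequency, margin), each assumed to be already normalized to the range 0 to 1, and hypothetical weights. In practice both the variables and the weights would come from the statistical or data mining model.

```python
# Hypothetical model weights; a real model would supply these.
WEIGHTS = {"recency": 0.5, "frequency": 0.3, "margin": 0.2}


def score(customer, weights=WEIGHTS):
    """Weighted combination of the model variables for one customer."""
    return sum(weight * customer[name] for name, weight in weights.items())


def rank(customers, weights=WEIGHTS):
    """Customers ordered from highest to lowest score."""
    return sorted(customers, key=lambda c: score(c, weights), reverse=True)
```

The score can then be stored as a measure in the cube and analyzed against the demographic and transactional dimensions.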
Finally, in a more simplistic setting, the users can break the world into segments based
on combinations of dimensions that are relevant to targeting. They can then calculate a
return on investment on these combinations to determine which segments have been profitable
in the past, and which have not. Mailings can then be made only to those profitable segments.
Products like Express allow the users to fine tune the dimensions quickly to build one-off
promotions, determine how to structure profitable combinations of dimensions into segments,
and rank them in order of desirability.
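The segment-level return on investment calculation might be sketched as follows. The record layout (one row per past mailing, with a cost and the profit it generated) and the definition of ROI as (profit − cost) / cost are assumptions made for this example. The dimension names are hypothetical.

```python
from collections import defaultdict


def segment_roi(history, dimensions):
    """Return on investment for each combination of the chosen dimensions."""
    totals = defaultdict(lambda: {"cost": 0.0, "profit": 0.0})
    for row in history:
        segment = tuple(row[d] for d in dimensions)
        totals[segment]["cost"] += row["cost"]
        totals[segment]["profit"] += row["profit"]
    # Assumed definition: ROI = (profit generated - mailing cost) / mailing cost.
    return {segment: (t["profit"] - t["cost"]) / t["cost"]
            for segment, t in totals.items() if t["cost"] > 0}


def profitable_segments(history, dimensions):
    """Segments with a positive ROI, ranked from most to least desirable."""
    roi = segment_roi(history, dimensions)
    return sorted((s for s, value in roi.items() if value > 0),
                  key=lambda s: roi[s], reverse=True)
```

Changing the `dimensions` argument corresponds to fine tuning the dimensions that define the segments.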
18.4 BUDGETING
This is a painful saga that every organization has to endure at least once a year. Not
only is this balancing act difficult and tedious, but most contributors to the process get little
feedback and less satisfaction. It is also the forum for many political games, as sales managers
try and manipulate the system to get low targets and cost center managers try and hide
pockets of unallocated resources. This is inevitable, and balancing these instinctive pressures
against the top down goals of the organization usually means that setting a budget for a
large organization involves a number of arduous iterations that can last for many months.
We have even come across cases where setting the annual budget took more than a year,
so before next year’s budget had been set, the budgeting department had started work on
the following year’s budget!
Some companies try the top down approach. This is quick and easy, but often leads to
the setting of unachievable budgets. Managers lower down have no commitment to the
numbers assigned to them and make no serious effort to adhere to them. Very soon, the
budget is discredited and most people ignore it. So, others try the bottom up alternative.
This involves almost every manager in the company and is an immense distraction from
their normal duties. The resulting ‘budget’ is usually miles off being acceptable, so orders
come down from on high to trim costs and boost revenue. This can take several cycles before
the costs and revenues are in balance with the strategic plan. This process can take months
and is frustrating to all concerned, but it can lead to good quality, achievable budgets. Doing
this with a complex, multi-spreadsheet system is a fraught process, and the remaining
mainframe based systems are usually far too inflexible.
Ultimately, budgeting needs to combine the discipline of a top-down budget with the
commitment of a bottom-up process, preferably with a minimum of iterations. An OLAP tool
can help here by providing the analysis capacity, combined with the actual database, to
provide a good, realistic starting point. In order to speed up the process, this could be done
as a centrally generated, top down exercise. In order to allow for slippage, this first pass of
the budget would be designed to over achieve the required goals. These ‘suggested’ budget
numbers are then provided to the lower level managers to review and alter. Alterations
would require justification.
There would have to be some matching of the revenue projections from marketing
(which will probably be based on sales by product) and from sales (which will probably be
based on sales territories), together with the manufacturing and procurement plans, which
need to be in line with expected sales. Again, the OLAP approach allows all the data to be
viewed from any perspective, so discrepancies should be identified earlier. It is as dangerous
to budget for unachievable sales as it is to go too low, as cost budgets will be authorized
based on phantom revenues.
It should also be possible to build the system so that data need not always be entered
at the lowest possible level. For example, cost data may run at a standard monthly rate, so
it should be possible to enter it at a quarterly or even annual rate, and allow the system
to phase it using a standard profile. Many revenue streams might have a standard seasonality,
and systems like Cognos Planning and Geac MPC are able to apply this automatically.
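Phasing an annual figure with a standard profile amounts to a proportional spread. The following is a minimal sketch of the idea only, not a description of how any of the named products implement it. The function name and the example profiles are hypothetical.

```python
def phase(annual_amount, profile):
    """Spread an annual figure over periods in proportion to a profile.

    `profile` holds one relative weight per period; it need not sum to 1.
    """
    total = sum(profile)
    return [annual_amount * weight / total for weight in profile]


# A flat profile models a cost that runs at a standard monthly rate.
FLAT = [1] * 12
# A hypothetical seasonal profile with a peak in the last quarter.
SEASONAL = [6, 6, 7, 7, 8, 8, 7, 7, 9, 11, 12, 12]

monthly_rent = phase(1200, FLAT)         # 100.0 in every month
monthly_sales = phase(50000, SEASONAL)   # weighted towards the year end
```

The phased values always add back to the figure that was entered, so the annual total remains under the control of whoever entered it.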
There are many other calculations that are not just aggregations, because to make the
budgeting process as painless as possible, as many lines in the budget schedules as possible
should be calculated using standard formulae rather than being entered by hand.
An OLAP based budget will still be painful, but the process should be faster, and the
ability to spot out of line items will make it harder for astute managers to hide pockets of
costs or get away with unreasonably low revenue budgets. The greater perceived fairness
of such a process will make the process more tolerable, even if it is still unpopular. The
greater accuracy and reliability of such a budget should reduce the likelihood of having to
do a mid year budget revision. But if circumstances change and a re-budget is required, an
OLAP based system should make it a faster and less painful process.
18.5 FINANCIAL REPORTING AND CONSOLIDATION
Every medium and large organization has onerous responsibilities for producing financial
reports for internal (management) consumption. Publicly quoted companies or public sector
bodies also have to produce other, legally required, reports.
Accountants and financial analysts were early adopters of multidimensional software.
Even the simplest financial consolidation consists of at least three dimensions. It must have
a chart of accounts (general OLAP tools often refer to these as facts or measures, but to
accountants they will always be accounts), at least one organization structure plus time.
Usually it is necessary to compare different versions of data, such as actual, budget or
forecast. This makes the model four dimensional. Often line of business segmentation or
product line analysis can add fifth or sixth dimensions. Even back in the 1970s, when
consolidation systems were typically run on time-sharing mainframe computers, the dedicated
consolidation products typically had a four or more dimensional feel to them. Several of
today’s OLAP tools can trace their roots directly back to this ancestry and many more have
inherited design attributes from these early products.
To address this specific market, certain vendors have developed specialist products.
Although they are not generic OLAP tools (typically, they have specific dimensionality), we
have included some of them in this report because they could be viewed as conforming to
our definition of OLAP (Fast Analysis of Shared Multidimensional Information) and they
represent a significant proportion of the market. The market leader in this segment is
Hyperion Solutions. Other players include Cartesis, Geac, Longview Khalix, and SAS Institute,
but many smaller players also supply pre-built applications for financial reporting. In addition
to these specialist products, there is a school of thought that says that these types of
problems can be solved by building extra functionality on top of a generic OLAP tool.
There are several factors that distinguish this specialist sector from the general OLAP
area. They are:
Special Dimensionality
Normally we would not look favorably on a product that restricted the dimensionality
of the models that could be built with it. In this case, however, there are some compelling
reasons why this becomes advantageous.
In a financial consolidation certain dimensions possess special attributes. For example
the chart of accounts contains detail level and aggregation accounts. The detail level accounts
typically will sum to zero for one entity for one time period. This is because this unit of data
represents a trial balance (so called because, before the days of computers you did a trial
extraction of the balances from the ledger to ensure that they balanced). Since a balanced
accounting transaction consists of debits and credits of equal value which are normally
posted to one entity within one time period, the resultant array of account values should
also sum to zero. This may seem relatively unimportant to non-accountants, but to
accountants, this feature is a critical control to ensure that the data is ‘in balance’.
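The ‘in balance’ control can be expressed in a few lines. This sketch assumes a sign convention in which debits are held as positive values and credits as negative values, which the text does not specify. The account names and amounts are invented.

```python
from decimal import Decimal


def is_in_balance(trial_balance):
    """True if the detail level accounts for one entity and period sum to zero.

    Assumed sign convention: debits positive, credits negative.
    """
    return sum(trial_balance.values(), Decimal("0")) == 0


# Hypothetical trial balance for one entity in one time period.
trial_balance = {
    "Cash": Decimal("1500.00"),
    "Inventory": Decimal("500.00"),
    "Payables": Decimal("-800.00"),
    "Equity": Decimal("-1200.00"),
}
```

Decimal arithmetic is used rather than floating point so that the comparison with zero is exact.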
Another dimension that possesses special traits is the entity dimension. For an accurate
consolidation it is important that entities are aggregated into a consolidation once and only
once. The omission or inclusion of an entity twice in the same consolidated numbers would
invalidate the whole consolidation.
As well as the special aggregation rules that such dimensions possess, there are certain
known attributes that the members of certain dimensions possess. In the case of accounts,
it is useful to know:
• Is the account a debit or a credit?
• Is it an asset, liability, income or expense or some other special account, such as
an exchange gain or loss account?
• Is it used to track inter-company items (which must be treated specially on
consolidation)?
For the entity (or cost center or company) dimension:
• What currency does the company report in?
• Is this a normal company submitting results or an elimination company (which
only serves to hold entries used as part of the consolidation)?
By pre-defining these dimensions to understand that this information needs to be
captured, not only can the user be spared the trouble of knowing what dimensions to set up,
but also certain complex operations, such as currency translation and inter-company
eliminations can be completely automated. Variance reporting can also be simplified through
the system’s knowledge of which items are debits and which are credits.
Controls
Controls are a very important part of any consolidation system. It is crucial that controls
are available to ensure that once an entity is in balance it can only be updated by posting
balanced journal entries and keeping a comprehensive audit trail of all updates and who
posted them. It is important that reports cannot be produced from outdated consolidated
data that is no longer consistent with the detail data because of updates. When financial
statements are converted from source currency to reporting currency, it is critical that the
basis of translation conforms to the generally accepted accounting principles in the country
where the data is to be reported. In the case of a multinational company with a sophisticated
reporting structure, there may be a need to report in several currencies, potentially on
different bases. Although there has been considerable international harmonization of
accounting standards for currency translation and other accounting principles in recent
years, there are still differences, which can be significant.
Special Transformations of Data
Some non-accountants dismiss consolidation as simple aggregation of numbers. Indeed
the basis of a financial consolidation is that the financial statements of more than one
company are aggregated so as to produce financial statements that meaningfully present the
results of the combined operation. However, there are several reasons why consolidation is
a lot more complex than simply adding up the numbers.
Results are often in different currencies. Translating statements from one currency to
another is not as simple as multiplying a local currency value by an exchange rate to yield
a reporting currency value. Since the balance sheet shows a position at a point in time and
the profit and loss account shows activity over a period of time it is normal to multiply the
balance sheet by the closing rate for the period and the P&L account by the average rate
for the period. Since the trial balance balances in local currency the translation is always
going to induce an imbalance in the reporting currency. In simple terms this imbalance is
the gain or loss on exchange. This report is not designed to be an accounting primer, but
it is important to understand that it is not a trivial task even as described so far. When you
also take into account certain non current assets being translated at historic rates and
exchange gains and losses being calculated separately for current and non current assets,
you can appreciate the complexity of the task.
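The simple case described above (balance sheet at the closing rate, P&L at the average rate) can be shown with a small worked sketch. The accounts, amounts and rates are invented. Debits are assumed positive and credits negative. Historic rates and the separate treatment of current and non current assets are deliberately left out.

```python
from decimal import Decimal


def translate(trial_balance, closing_rate, average_rate):
    """Translate a local currency trial balance into the reporting currency.

    Each account maps to (kind, amount), where kind is "BS" for a balance
    sheet account or "PL" for a profit and loss account. Returns the
    translated balances and the imbalance that the translation induces.
    """
    translated = {}
    for account, (kind, amount) in trial_balance.items():
        rate = closing_rate if kind == "BS" else average_rate
        translated[account] = amount * rate
    imbalance = sum(translated.values(), Decimal("0"))
    return translated, imbalance


# Hypothetical local currency trial balance; it sums to zero.
local = {
    "Assets": ("BS", Decimal("1000")),
    "Liabilities": ("BS", Decimal("-700")),
    "Revenue": ("PL", Decimal("-500")),
    "Expenses": ("PL", Decimal("200")),
}

translated, imbalance = translate(local, Decimal("1.5"), Decimal("1.4"))
# Balance sheet: 1500 - 1050 = 450; P&L: -700 + 280 = -420; imbalance = 30.
```

The residual of 30 is the amount that would be posted to an exchange gain or loss account so that the translated trial balance is back in balance.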
Transactions between entities within the consolidation must be eliminated (see Figure
18.2).
[Figure: A common consolidation error. Company B: purchases $500, sales $1000. Company C:
purchases $200, sales $400. C sells $100 to B. True net sales: $1400. True net purchases: $700.]
Figure 18.2. Company A owns Companies B and C. Company C sells $100 worth of product to
Company B so C’s accounts show sales of $500 (including the $100 that it sold to Company B) and
purchases of $200. B shows sales of $1000 and purchases of $600 (including $100 that it bought from
C). A simple consolidation would show consolidated sales of $1500 and consolidated purchases of
$800. But if we consider A (which is purely a holding company and therefore has no sales and
purchases itself) both its sales and its purchases from the outside world are overstated by $100. The
$100 of sales and purchases become an inter-company elimination. This becomes much more
complicated when the parent owns less than 100 percent of subsidiaries and there are many subsidiaries
trading in multiple currencies.
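The arithmetic of Figure 18.2 can be reproduced with a short sketch. The data structures and the function name are illustrative assumptions. Full ownership and a single currency are assumed, as in the figure.

```python
def consolidate(entities, intercompany_sales):
    """Aggregate sales and purchases, then eliminate inter-company items.

    `intercompany_sales` is a list of (seller, buyer, amount) tuples.
    """
    gross_sales = sum(e["sales"] for e in entities.values())
    gross_purchases = sum(e["purchases"] for e in entities.values())
    eliminated = sum(amount for _seller, _buyer, amount in intercompany_sales)
    return {
        "gross_sales": gross_sales,
        "gross_purchases": gross_purchases,
        "net_sales": gross_sales - eliminated,
        "net_purchases": gross_purchases - eliminated,
    }


# Figures from the caption of Figure 18.2; A is a pure holding company.
entities = {
    "A": {"sales": 0, "purchases": 0},
    "B": {"sales": 1000, "purchases": 600},
    "C": {"sales": 500, "purchases": 200},
}
result = consolidate(entities, [("C", "B", 100)])
# Simple aggregation gives sales of 1500 and purchases of 800;
# after elimination the true figures are 1400 and 700.
```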
Even though we talked about the problems of eliminating inter-company entries on
consolidation, the problem does not stop there. How to ensure that the sales of $100 reported
by C actually agrees with the corresponding $100 purchase by B? What if C erroneously
reports the sales as $90? Consolidation systems have inter-company reconciliation modules
which will report any inter-company transactions (usually in total between any two entities)
that do not agree, on an exception basis. This is not a trivial task when you consider that
the two entities will often be trading in different currencies and translation will typically
yield minor rounding differences. Also accountants typically ignore balances which are less
than a specified materiality factor. Despite their green eyeshade image, accountants do not
like to spend their days chasing pennies!
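An inter-company reconciliation on an exception basis might look like the sketch below. It assumes that both sides have already been translated into a common currency and totaled for each (seller, buyer) pair. The function name and the materiality values are hypothetical.

```python
def unreconciled(reported_sales, reported_purchases, materiality):
    """Inter-company totals that disagree by at least the materiality factor.

    Both arguments map (seller, buyer) pairs to totals in a common currency.
    Differences smaller than `materiality` are ignored.
    """
    exceptions = {}
    for pair in set(reported_sales) | set(reported_purchases):
        difference = reported_sales.get(pair, 0) - reported_purchases.get(pair, 0)
        if abs(difference) >= materiality:
            exceptions[pair] = difference
    return exceptions


# C erroneously reports sales of $90 where B reports purchases of $100.
sales = {("C", "B"): 90}
purchases = {("C", "B"): 100}
```

With a materiality factor of 5 the $10 difference is reported; with a factor of 20 it is ignored.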
18.6 MANAGEMENT REPORTING
In most organizations, management reporting is quite distinct from formal financial
reporting. It will usually have more emphasis on the P&L and possibly cash flow, and less
on the balance sheet. It will probably be done more often, usually monthly rather than
annually and quarterly. There will be less detail but more analysis. More users will be
interested in viewing and analyzing the results. The emphasis is on faster rather than more
accurate reporting and there may be regular changes to the reporting requirements. Users
of OLAP based systems consistently report faster and more flexible reporting, with better
analysis than the alternative solutions. One popular saying is that “what gets measured,
gets managed,” so senior management will often use a reporting system to give (subtle or
not) direction to subordinates.
Many organizations have grown by acquisition, and may have two or more organizational
structures. There will be a legal structure, which will include dormant and non-trading
subsidiaries, and will often be largely ignored in the management reporting. There may also
be a different business structure, which might be based on products or market sectors, but
may blur the distinction between subsidiaries; it will be based on the company’s management
structure, which may be quite different to the legal structure. There may also be a marketing
structure, which could reflect a virtual (matrix) organization, crossing both the legal and the
management structures. Sometimes the same reporting tool will be expected to produce all
three sets of reports.
Management reporting usually involves the calculation of numerous business ratios,
comparing performance against history and budget. There is also an advantage to be gained
from comparing product groups or channels or markets against each other. Sophisticated
exception detection is important here, because the whole point of management reporting is
to manage the business by taking decisions.
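A very simple form of exception detection is to flag items whose variance against budget exceeds a tolerance. The sketch below is only an illustration of that idea. The 10 percent tolerance and the item names are assumptions, and real tools offer far more sophisticated tests.

```python
def budget_exceptions(actual, budget, tolerance=0.10):
    """Items whose actual figure deviates from budget by more than `tolerance`.

    Returns the relative variance, (actual - budget) / budget, per flagged item.
    """
    flagged = {}
    for item, planned in budget.items():
        if planned == 0:
            continue  # a relative variance is undefined for a zero budget
        variance = (actual.get(item, 0) - planned) / planned
        if abs(variance) > tolerance:
            flagged[item] = variance
    return flagged


# Hypothetical monthly revenue by market.
budget = {"North": 100, "South": 100, "West": 100}
actual = {"North": 120, "South": 98, "West": 85}
```

Only North (20 percent over) and West (15 percent under) would be brought to a manager's attention; South is within tolerance.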
The new Microsoft OLAP Services product and the many new client tools and applications
being developed for it will certainly drive down ‘per seat’ prices for general-purpose
management reporting applications, so that it will be economically possible to deploy good
solutions to many more users. Web deployments should make these easier to administer.
18.7 EIS
EIS is a branch of management reporting. The term became popular in the mid 1980s,
when it was defined to mean Executive Information Systems; some people also used the
term ESS (Executive Support System). Since then, the original concept has been discredited,
as the early systems were very proprietary, expensive, hard to maintain and generally
inflexible. Fewer people now use the term, but the acronym has not entirely disappeared;
along the way, the letters have been redefined many times. Here are some of the suggestions
(you can combine the words in any way you prefer):
With this proliferation of descriptions, the meaning of the term is now irretrievably
blurred. In essence, an EIS is a more highly customized, easier to use management reporting
system, but it is probably now better recognized as an attempt to provide intuitive ease of
use to those managers who do not have either a computer background or much patience.
There is no reason why all users of an OLAP based management reporting system should
not get consistently fast performance, great ease of use, reliability and flexibility, not just
top executives, who will probably use it much less than mid level managers.
The basic philosophy of EIS was that “what gets reported gets managed,” so if executives
could have fast, easy access to a number of key performance indicators (KPIs) and critical
success factors (CSFs), they would be able to manage their organizations better. But there
is little evidence that this worked for the buyers, and it certainly did not work for the
software vendors who specialized in this field, most of which suffered from a very poor
financial performance.
18.8 BALANCED SCORECARD
The balanced scorecard is a 1990s management methodology that in many respects
attempts to deliver the benefits that the 1980s executive information systems promised, but
rarely produced. The concept was originated by Robert Kaplan and David Norton based on
a 1990 study sponsored by the Nolan Norton Institute, the research arm of KPMG. The
results were summarized in an article entitled, ‘The Balanced Scorecard–Measures That
Drive Performance’ (Harvard Business Review, Jan/Feb 1992). Other HBR articles and a
book (The Balanced Scorecard, published by Harvard Business School Press in 1996) followed.
Kaplan is still a professor at the Harvard Business School and Norton, who was previously
president of the strategy group in Renaissance Worldwide, is now president of the Balanced
Scorecard Collaborative.
Renaissance says, “a Balanced Scorecard is a prescriptive framework that focuses on
shareholder, customer, internal and learning requirements of a business in order to create
a system of linked objectives, measures, targets and initiatives which collectively describe
the strategy of an organization and how that strategy can be achieved. A Balanced
Management System is a governance system which uses the Balanced Scorecard as the
centerpiece in order to communicate an organization’s strategy, create strategic linkage,
focus business planning, and provide feedback and learning against the company’s strategy.”
[Figure: the scorecard poses four questions:
• “If I succeed in my vision, how will I look to my shareholders?”
• “If I succeed in my vision, how will I look to my customers?”
• “What processes do I need to focus on to meet customer expectations?”
• “What do I have to invest in to support my internal and customer objectives?”
and makes three observations:
• Measurement is the language that gives clarity to vague concepts
• Measurement is used to communicate, not simply to control
• Building the Scorecard develops consensus and teamwork throughout the organisation]
Figure 18.3. The Balanced Scorecard Provides a Framework to Translate Strategy into
Operational Terms (source: Renaissance Worldwide)
The basic idea of the balanced scorecard is that traditional historic financial measures
are an inadequate way of measuring an organization’s performance. It aims to integrate the
strategic vision of the organization’s executives with the day to day focus of managers. The
scorecard should take into account the cause and effect relationships of business actions,
including estimates of the response times and significance of the linkages among the scorecard
measures. Its aim is to be at least as much forward as backward looking.
The scorecard is composed of four perspectives:
• Financial
• Customer
• Learning and growth
• Internal business process.
For each of these, there should be objectives, measures, targets and initiatives that
need to be tracked and reported. Many of the measures are soft and non-financial, rather
than just accounting data, and users have a two-way interaction with the system (including
entering comments and explanations). The objectives and measures should be selected using
a formal top-down process, rather than simply choosing data that is readily available from
the operational systems. This is likely to involve a significant amount of management
consultancy, both before any software is installed, and continuing afterwards as well; indeed,
the software may play a significant part in supporting the consulting activities. For the
process to succeed, it must be strongly endorsed by the top executives and all the management
team; it cannot just be an initiative sponsored by the IT department (unless it is only to be
deployed in IT), and it must not be regarded as just another reporting application.
However, we have noticed that while an increasing number of OLAP vendors are
launching balanced scorecard applications, some of them seem to be little more than rebadged
EISs, with no serious attempt to reflect the business processes that Kaplan and Norton
advocate. Such applications are unlikely to be any more successful than the many run-of-
the-mill executive information systems, but in the process, they are already diluting the
meaning of the balanced scorecard term.
18.9 PROFITABILITY ANALYSIS
This is an application that is growing in importance. Even highly profitable
organizations ought to know where the profits are coming from; less profitable organizations
have to know where to cut back.
Profitability analysis is important in setting prices (and discounts), deciding on
promotional activities, selecting areas for investment or disinvestment and anticipating
competitive pressures. Decisions in these areas are made every day by many individuals in
large organizations, and their decisions will be less effective if they are not well informed
about the differing levels of profitability of the company’s products and customers. Profitability
figures may be used to bias actions, by basing remuneration on profitability goals rather
than revenue or volume.
With deregulation in many industries, privatization and the reduction in trade barriers,
large new competitors are more likely to appear than in the past. And, with new technology
and the appearance of ‘virtual corporations’, new small competitors without expensive
infrastructures can successfully challenge the giants. In each case, the smart newcomers
will challenge the incumbents in their most profitable areas, because this is where the new
competitor can afford to offer lower prices or better quality and still be profitable. Often they
will focus on added value or better services, because a large incumbent will find it harder
to improve these than to cut prices. Thus, once a newcomer has become established in areas
that were formerly lucrative, the established player may find it hard to respond efficiently,
so the answer is to be proactive, reinforcing the vulnerable areas before they are under
attack.
This takes analysis: without knowledge of which customers or products are most
profitable, a large supplier may not realize that a pricing umbrella is being created for new
competitors to get established. However, in more and more industries, the “direct” labor and
material cost of producing a product is becoming a less and less significant part of the total
cost. With R&D, marketing, sales, administration and distribution costs, it is often hard to
know exactly which costs relate to which product or customer. Many of these costs are
relatively fixed, and apportioning them to the right revenue generating activity is hard, and
can be arbitrary. Sometimes, it can even be difficult to correctly assign revenues to products,
as described above in the marketing and sales application.
One popular way to assign costs to the right products or services is to use activity based
costing (ABC). This is much more scientific than simply allocating overhead costs in proportion
to revenues or floor space. It attempts to measure resources that are consumed by activities,
in terms of cost drivers. Typically costs are grouped into cost pools which are then applied
to products or customers using cost drivers, which must be measured. Some cost drivers
may be clearly based on the volume of activities; others may not be so obvious. They may,
for example, be connected with the introduction of new products or suppliers. Others may
be connected with the complexity of the organization (the variety of customers, products,
suppliers, production facilities, markets etc.). There are also infrastructure-sustaining costs
that cannot realistically be applied to activities. Even ignoring these, it is likely that the
costs of supplying the least profitable customers or products exceed the revenues they
generate. If these are known, the company can make changes to prices or other factors to
remedy the situation–possibly by withdrawing from some markets, dropping some products
or declining to bid for certain contracts.
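As a rough illustration of the mechanics described above, the following sketch applies two cost pools to two products in proportion to measured cost drivers. The activity names, driver volumes and revenue figures are invented for the example and are not taken from any real ABC product.

```python
# Hypothetical cost pools: total overhead collected per activity.
cost_pools = {
    "order_handling": 20000.0,  # assumed driver: number of orders
    "machine_setup": 30000.0,   # assumed driver: number of setups
}

# Measured driver volumes per product (illustrative figures only).
driver_volumes = {
    "order_handling": {"A": 100, "B": 300},
    "machine_setup": {"A": 40, "B": 10},
}


def allocate(cost_pools, driver_volumes):
    """Apply each cost pool to products in proportion to driver consumption."""
    allocated = {}
    for activity, pool in cost_pools.items():
        volumes = driver_volumes[activity]
        rate = pool / sum(volumes.values())  # cost per unit of the driver
        for product, volume in volumes.items():
            allocated[product] = allocated.get(product, 0.0) + rate * volume
    return allocated


revenues = {"A": 40000.0, "B": 18000.0}  # illustrative revenues
costs = allocate(cost_pools, driver_volumes)
margins = {product: revenues[product] - costs[product] for product in revenues}

for product in sorted(margins):
    print(product, costs[product], margins[product])
```

With these made-up figures product B consumes more overhead than it earns, which is the kind of result that would prompt a change to prices or to the product range.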
There are specialist ABC products on the market and these have many FASMI
characteristics. It is also possible to build ABC applications in OLAP tools, although the
application functionality may be less than could be achieved through the use of a good
specialist tool.
18.10 QUALITY ANALYSIS
Although quality improvement programs are less in vogue than they were in the early
1990s, the need for consistent quality and reliability in goods and services is as important
as ever. The measures should be objective and customer rather than producer focused. The
systems are just as relevant in service organizations and in the public sector. Indeed, many
public sector service organizations have specific service targets.
These systems are used not just to monitor an organization’s own output, but also that
of its suppliers. There may be, for example, service level agreements that affect contract
extensions and payments.
Quality systems can often involve multidimensional data if they monitor numeric
measures across different production facilities, products or services, time, locations and
customers. Many of the measures will be non-financial, but they may be just as important
as traditional financial measures in forming a balanced view of the organization. As with
financial measures, they may need analyzing over time and across the functions of the
organization. Many organizations are committed to continuous improvement, which requires
that there be formal measures that are quantifiable and tracked over long periods. OLAP
tools provide an excellent way of doing this, and of spotting disturbing trends before they
become too serious.
In Summary
OLAP has a wide area of applications, which are mainly: Marketing and Sales Analysis,
Budgeting, Financial Reporting and Consolidation, EIS etc.
VOLUME II
DATA MINING
Chapter 1: INTRODUCTION
In this chapter, a brief introduction to Data Mining is outlined.
The discussion includes the definitions of Data Mining, the stages
identified in the Data Mining process and the Data Mining models.
It also gives a brief description of Data Mining methods and some of
the applications and examples of Data Mining.
1.1 WHAT IS DATA MINING?
The past two decades have seen a dramatic increase in the amount of information or
data being stored in electronic format. This accumulation of data has taken place at an
explosive rate. It has been estimated that the amount of information in the world doubles
every 20 months and the size and number of databases are increasing even faster. The
increase in use of electronic data gathering devices such as point-of-sale or remote sensing
devices has contributed to this explosion of available data. Figure 1.1, from the Red Brick
company, illustrates the data explosion.
Figure 1.1. The Growing Base of Data (volume of data plotted against time, 1970 to 2000)
Data storage became easier as the availability of large amounts of computing power at
low cost, i.e. the falling cost of processing power and storage, made data cheap. There was
also the introduction of new machine learning methods for knowledge representation based
on logic programming etc. in addition to traditional statistical analysis of data. The new
methods tend to be computationally intensive, hence the demand for more processing power.
Having concentrated so much attention on the accumulation of data, the problem was
what to do with this valuable resource. It was recognized that information is at the heart
of business operations and that decision-makers could make use of the data stored to gain
valuable insight into the business. Database management systems gave access to the data
stored but this was only a small part of what could be gained from the data. Traditional
on-line transaction processing systems, OLTPs, are good at putting data into databases quickly,
safely and efficiently but are not good at delivering meaningful analysis in return. Analyzing
data can provide further knowledge about a business by going beyond the data explicitly
stored to derive knowledge about the business. This is where Data Mining or Knowledge
Discovery in Databases (KDD) has obvious benefits for any enterprise.
1.2 DEFINITIONS
The term data mining has been stretched beyond its limits to apply to any form of data
analysis. Some of the numerous definitions of Data Mining, or Knowledge Discovery in
Databases are:
Data Mining, or Knowledge Discovery in Databases (KDD) as it is also known, is
the nontrivial extraction of implicit, previously unknown, and potentially useful
information from data. This encompasses a number of different technical approaches,
such as clustering, data summarization, learning classification rules, finding
dependency networks, analyzing changes, and detecting anomalies.
William J Frawley, Gregory Piatetsky-Shapiro and Christopher J Matheus
Data mining is the search for relationships and global patterns that exist in large
databases but are ‘hidden’ among the vast amount of data, such as a relationship
between patient data and their medical diagnosis. These relationships represent
valuable knowledge about the database and the objects in the database and, if the
database is a faithful mirror, of the real world registered by the database.
Marcel Holsheimer & Arno Siebes (1994)
The analogy with the mining process is described as:
Data mining refers to “using a variety of techniques to identify nuggets of information
or decision-making knowledge in bodies of data, and extracting these in such a way
that they can be put to use in the areas such as decision support, prediction,
forecasting and estimation. The data is often voluminous, but as it stands of low
value as no direct use can be made of it; it is the hidden information in the data
that is useful”
Clementine User Guide, a data mining toolkit
Basically data mining is concerned with the analysis of data and the use of software
techniques for finding patterns and regularities in sets of data. It is the computer
that is responsible for finding the patterns by identifying the underlying rules and
features in the data. The idea is that it is possible to strike gold in unexpected places
as the data mining software extracts patterns not previously discernable or so obvious
that no one has noticed them before.
1.3 DATA MINING PROCESS
Data mining analysis tends to work from the data up and the best techniques are those
developed with an orientation towards large volumes of data, making use of as much of the
collected data as possible to arrive at reliable conclusions and decisions. The analysis process
starts with a set of data and uses a methodology to develop an optimal representation of the
structure of the data, during which time knowledge is acquired. Once knowledge has been
acquired, this can be extended to larger sets of data, working on the assumption that the
larger data set has a structure similar to the sample data. Again this is analogous to a
mining operation where large amounts of low grade materials are sifted through in order
to find something of value.
Figure 1.2 summarizes some of the stages/processes identified in data mining
and knowledge discovery by Usama Fayyad & Evangelos Simoudis, two of the leading exponents
of this area.
Figure 1.2. Stages/Process Identified in Data Mining (Data → Selection → Target Data →
Preprocessing → Preprocessed Data → Transformation → Transformed Data → Data Mining →
Patterns → Interpretation and Evaluation → Knowledge)
The phases depicted start with the raw data and finish with the extracted knowledge,
which was acquired as a result of the following stages:
• Selection–selecting or segmenting the data according to some criteria e.g. all those
people who own a car; in this way subsets of the data can be determined.
• Preprocessing–this is the data cleansing stage where information that is deemed
unnecessary and may slow down queries is removed; for example, it is unnecessary
to note the sex of a patient when studying pregnancy. Also the data is reconfigured
to ensure a consistent format, as there is a possibility of inconsistent formats because
the data is drawn from several sources e.g. sex may be recorded as f or m and also
as 1 or 0.
• Transformation–the data is not merely transferred across but transformed in that
overlays may be added such as the demographic overlays commonly used in market
research. The data is made useable and navigable.
• Data mining–this stage is concerned with the extraction of patterns from the data.
A pattern can be defined as follows: given a set of facts (data) F, a language L, and some
measure of certainty C, a pattern is a statement S in L that describes relationships
among a subset Fs of F with a certainty C, such that S is simpler in some sense
than the enumeration of all the facts in Fs.
• Interpretation and evaluation–the patterns identified by the system are interpreted
into knowledge which can then be used to support human decision-making e.g.
prediction and classification tasks, summarizing the contents of a database or
explaining observed phenomena.
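The stages listed above can be sketched as a chain of small functions. This is only a toy illustration: the records, the field names, the assumption that the codes 1 and 0 stand for f and m, the age bands used as an overlay, and the simple frequency count that stands in for the mining step are all invented for the example.

```python
from collections import Counter

# Hypothetical raw records drawn from sources that code sex differently.
raw = [
    {"id": 1, "sex": "f", "owns_car": True, "age": 34},
    {"id": 2, "sex": "1", "owns_car": True, "age": 51},
    {"id": 3, "sex": "m", "owns_car": False, "age": 27},
    {"id": 4, "sex": "0", "owns_car": True, "age": 45},
]

# Assumed mapping between the two coding schemes (not stated in the text).
SEX_CODES = {"f": "f", "m": "m", "1": "f", "0": "m"}


def select(records):
    """Selection: keep only the people who own a car."""
    return [r for r in records if r["owns_car"]]


def preprocess(records):
    """Preprocessing: bring the sex field into one consistent format."""
    return [{**r, "sex": SEX_CODES[r["sex"]]} for r in records]


def transform(records):
    """Transformation: add a derived overlay (here, an age band)."""
    return [
        {**r, "age_band": "under_40" if r["age"] < 40 else "40_plus"}
        for r in records
    ]


def mine(records):
    """Data mining stand-in: count how often each (sex, age band) pair occurs."""
    return Counter((r["sex"], r["age_band"]) for r in records)


patterns = mine(transform(preprocess(select(raw))))
print(patterns)
```

Interpretation and evaluation remain with the analyst, who decides whether the counted combinations mean anything for the business.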
1.4 DATA MINING BACKGROUND
Data mining research has drawn on a number of other fields such as inductive learning,
machine learning and statistics etc.
Inductive Learning
Induction is the inference of information from data and inductive learning is the model
building process where the environment i.e. database is analyzed with a view to finding
patterns. Similar objects are grouped in classes and rules formulated whereby it is possible
to predict the class of unseen objects. This process of classification identifies classes such
that each class has a unique pattern of values which forms the class description. The nature
of the environment is dynamic, hence the model must be adaptive i.e. should be able to learn.
Generally, it is only possible to use a small number of properties to characterize objects,
so we make abstractions in that objects which satisfy the same subset of properties are mapped
to the same internal representation.
Inductive learning, where the system infers knowledge itself from observing its
environment, has two main strategies:
• Supervised learning–this is learning from examples where a teacher helps the
system construct a model by defining classes and supplying examples of each class.
The system has to find a description of each class i.e. the common properties in the
examples. Once the description has been formulated the description and the class
form a classification rule which can be used to predict the class of previously
unseen objects. This is similar to discriminant analysis as in statistics.
• Unsupervised learning–this is learning from observation and discovery. The data
mining system is supplied with objects but no classes are defined so it has to observe
the examples and recognize patterns (i.e. class descriptions) by itself. This system
results in a set of class descriptions, one for each class discovered in the environment.
Again this is similar to cluster analysis as in statistics.
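The supervised case can be made concrete with a very small sketch in which the description of a class is taken literally as the set of property values shared by all of its examples. The animal examples and property names are invented, and real learning systems use far more robust descriptions than a plain intersection.

```python
def learn_descriptions(examples):
    """Return, per class, the (property, value) pairs common to all its examples."""
    descriptions = {}
    for properties, label in examples:
        items = set(properties.items())
        if label in descriptions:
            descriptions[label] &= items  # keep only what every example shares
        else:
            descriptions[label] = items
    return descriptions


def classify(descriptions, properties):
    """Predict the class whose description the object satisfies, if exactly one does."""
    items = set(properties.items())
    matches = [label for label, desc in descriptions.items() if desc <= items]
    return matches[0] if len(matches) == 1 else None


# Hypothetical training examples supplied by the "teacher".
training = [
    ({"wings": True, "lays_eggs": True, "flies": True}, "bird"),
    ({"wings": True, "lays_eggs": True, "flies": False}, "bird"),
    ({"wings": False, "lays_eggs": False, "flies": False}, "mammal"),
    ({"wings": True, "lays_eggs": False, "flies": True}, "mammal"),
]

descriptions = learn_descriptions(training)
unseen = {"wings": False, "lays_eggs": False, "flies": False}
print(descriptions)
print(classify(descriptions, unseen))
```

Each learned description together with its class label plays the role of the classification rule mentioned above; an object that matches no description, or more than one, is left unclassified in this sketch.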
Induction is therefore the extraction of patterns. The quality of the model produced by
inductive learning methods is such that the model could be used to predict the outcome of
future situations, in other words not only for states encountered but also for unseen states
that could occur. The problem is that most environments have different states, i.e. changes
within, and it is not always possible to verify a model by checking it for all possible situations.
Given a set of examples the system can construct multiple models, some of which will
be simpler than others. The simpler models are more likely to be correct if we adhere to
Ockham’s Razor, which states that if there are multiple explanations for a particular
phenomenon it makes sense to choose the simplest because it is more likely to capture the
nature of the phenomenon.
Statistics
Statistics has a solid theoretical foundation but the results from statistics can be
overwhelming and difficult to interpret, as they require user guidance as to where and how
to analyze the data. Data mining however allows the expert’s knowledge of the data and the
advanced analysis techniques of the computer to work together.
Statistical analysis systems such as SAS and SPSS have been used by analysts to detect
unusual patterns and explain patterns using statistical models such as linear models. Statistics
has a role to play and data mining will not replace such analysis; rather, statisticians can
carry out more directed analysis based on the results of data mining. An example of statistical
induction is something like the average rate of failure of machines.
Machine Learning
Machine learning is the automation of a learning process and learning is tantamount
to the construction of rules based on observations of environmental states and transitions.
This is a broad field which includes not only learning from examples, but also reinforcement
learning, learning with teacher, etc. A learning algorithm takes the data set and its
accompanying information as input and returns a statement e.g. a concept representing the
results of learning as output. Machine learning examines previous examples and their
outcomes and learns how to reproduce these and make generalizations about new cases.
Generally a machine learning system does not use single observations of its environment
but an entire finite set called the training set at once. This set contains examples i.e.
observations coded in some machine readable form. The training set is finite, hence not all
concepts can be learned exactly.
Differences between Data Mining and Machine Learning
Knowledge Discovery in Databases (KDD) or Data Mining, and the part of Machine
Learning (ML) dealing with learning from examples overlap in the algorithms used and the
problems addressed.
The main differences are:
• KDD is concerned with finding understandable knowledge, while ML is concerned
with improving performance of an agent. So training a neural network to balance
a pole is part of ML, but not of KDD. However, there are efforts to extract knowledge
from neural networks which are very relevant for KDD.
• KDD is concerned with very large, real-world databases, while ML typically (but
not always) looks at smaller data sets. So efficiency questions are much more
important for KDD.
• ML is a broader field which includes not only learning from examples, but also
reinforcement learning, learning with teacher, etc.
KDD is that part of ML which is concerned with finding understandable knowledge in
large sets of real-world examples. When integrating machine learning techniques into database
systems to implement KDD, some of the databases require:
• More efficient learning algorithms because realistic databases are normally very
large and noisy. The database is usually designed for purposes different
from data mining and so properties or attributes that would simplify the learning
task are not present nor can they be requested from the real world. Databases are
usually contaminated by errors so the data mining algorithm has to cope with noise
whereas ML has laboratory type examples i.e. as near perfect as possible.
• More expressive representations for both data, e.g. tuples in relational databases,
which represent instances of a problem domain, and knowledge, e.g. rules in a
rule-based system, which can be used to solve users’ problems in the domain, and the
semantic information contained in the relational schemata.
Practical KDD systems are expected to include three interconnected phases:
• Translation of standard database information into a form suitable for use by learning
facilities;
• Using machine learning techniques to produce knowledge bases from databases;
and
• Interpreting the knowledge produced to solve users’ problems and/or reduce data
spaces, data spaces being the number of examples.
1.5 DATA MINING MODELS
IBM has identified two types of model or modes of operation, which may be used to
unearth information of interest to the user.
Verification Model
The verification model takes a hypothesis from the user and tests the validity of it
against the data. The emphasis is with the user who is responsible for formulating the
hypothesis and issuing the query on the data to affirm or negate the hypothesis.
In a marketing division, for example, with a limited budget for a mailing campaign to
launch a new product, it is important to identify the section of the population most likely to
buy the new product. The user formulates a hypothesis to identify potential customers and
the characteristics they share. Historical data about customer purchases and demographic
information can then be queried to reveal comparable purchases and the characteristics
shared by those purchasers, which in turn can be used to target a mailing campaign. The
whole operation can be refined by ‘drilling down’ so that the hypothesis reduces the ‘set’
returned each time until the required limit is reached.
The problem with this model is the fact that no new information is created in the
retrieval process but rather the queries will always return records to verify or negate the
hypothesis. The search process here is iterative in that the output is reviewed, a new set
of questions or hypotheses is formulated to refine the search and the whole process repeated.
The user is discovering the facts about the data using a variety of techniques such as
queries, multidimensional analysis and visualization to guide the exploration of the data
being inspected.
Discovery Model
The discovery model differs in its emphasis in that it is the system automatically
discovering important information hidden in the data. The data is sifted in search of frequently
occurring patterns, trends and generalizations about the data without intervention or guidance
from the user. The discovery or data mining tools aim to reveal a large number of facts
about the data in as short a time as possible.
An example of such a model is a bank database, which is mined to discover the many
groups of customers to target for a mailing campaign. The data is searched with no hypothesis
in mind other than for the system to group the customers according to the common
characteristics found.
1.6 DATA MINING METHODS
Figure 1.3 shows a two-dimensional artificial dataset consisting of 23 cases. Each point on
the figure represents a person who has been given a loan by a particular bank at some time
in the past. The data has been classified into two classes: persons who have defaulted on
their loan and persons whose loans are in good status with the bank.
The two primary goals of data mining in practice tend to be prediction and description.
Prediction involves using some variables or fields in the database to predict unknown or
future values of other variables of interest. Description focuses on finding human interpretable
patterns describing the data. The relative importance of prediction and description for
particular data mining applications can vary considerably.
Figure 1.3. A Simple Data Set with Two Classes Used for Illustrative Purpose (axes: Income
and Debt; classes: have defaulted on their loans, good status with the bank)
• Classification is learning a function that maps a data item into one of several
predefined classes. Examples of classification methods used as part of knowledge
discovery applications include classifying trends in financial markets and automated
identification of objects of interest in large image databases. Figure 1.4 and Figure
1.5 show classifications of the loan data into two class regions. Note that it is not
possible to separate the classes perfectly using a linear decision boundary. The
bank might wish to use the classification regions to automatically decide whether
future loan applicants will be given a loan or not.
Figure 1.4. Classification Boundaries for a Nearest Neighbor Classifier (axes: Income and Debt)
Figure 1.5. An Example of Classification Boundaries Learned by a Non-Linear Classifier (such
as a neural network) for the Loan Data Set (axes: Income and Debt)
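A nearest neighbor classifier of the kind named in Figure 1.4 can be sketched in a few lines. The six loan records below are invented stand-ins for the 23 cases of the figure (income and debt in arbitrary units), and plain Euclidean distance is an assumption of this example rather than something the text specifies.

```python
import math

# Hypothetical training cases: ((income, debt), class label).
loans = [
    ((20, 30), "defaulted"),
    ((25, 40), "defaulted"),
    ((30, 35), "defaulted"),
    ((60, 10), "good"),
    ((70, 20), "good"),
    ((80, 15), "good"),
]


def distance(a, b):
    """Euclidean distance between two (income, debt) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def nearest_neighbor(training, point):
    """Assign the class of the single closest training case."""
    closest = min(training, key=lambda case: distance(case[0], point))
    return closest[1]


applicant_a = (65, 12)
applicant_b = (22, 33)
print(nearest_neighbor(loans, applicant_a))
print(nearest_neighbor(loans, applicant_b))
```

In practice the two attributes would normally be scaled to comparable ranges first, since the distance is otherwise dominated by whichever attribute has the larger numeric spread.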
• Regression is learning a function that maps a data item to a real-valued prediction
variable. Regression applications are many, e.g., predicting the amount of biomass
present in a forest given remotely-sensed microwave measurements, estimating the
probability that a patient will die given the results of a set of diagnostic tests,
predicting consumer demand for a new product as a function of advertising
expenditure, and time series prediction where the input variables can be
time-lagged versions of the prediction variable. Figure 1.6 shows the result of simple
linear regression where “total debt” is fitted as a linear function of “income”: the
fit is poor since there is only a weak correlation between the two variables.
• Clustering is a common descriptive task where one seeks to identify a finite set of
categories or clusters to describe the data. The categories may be mutually exclusive
and exhaustive, or consist of a richer representation such as hierarchical or
overlapping categories. Examples of clustering in a knowledge discovery context
include discovering homogeneous sub-populations for consumers in marketing
databases and identification of sub-categories of spectra from infrared sky
measurements. Figure 1.7 shows a possible clustering of the loan data set into 3
clusters: note that the clusters overlap, allowing data points to belong to more than
one cluster. The original class labels (denoted by two different colors) have been
replaced by “no color” to indicate that the class membership is no longer assumed.
Figure 1.6. A Simple Linear Regression for the Loan Data Set (axes: Income and Debt, with
the regression line)
Figure 1.7. A Simple Clustering of the Loan Data Set into Three Clusters (axes: Income and
Debt; Cluster 1, Cluster 2, Cluster 3)
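To make the simple linear regression of Figure 1.6 concrete, the sketch below fits debt as a linear function of income by ordinary least squares and reports the correlation coefficient. The four data points are invented and were chosen only so that the correlation comes out weak, as in the figure.

```python
import math

# Hypothetical (income, debt) observations.
income = [20.0, 40.0, 60.0, 80.0]
debt = [30.0, 10.0, 40.0, 30.0]


def fit_line(xs, ys):
    """Least-squares fit y = intercept + slope * x, plus the correlation r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


slope, intercept, r = fit_line(income, debt)
print(slope, intercept, r)
```

A correlation this far from 1 means the fitted line explains little of the variation in debt, which is why the text calls the fit poor.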
• Summarization involves methods for finding a compact description for a subset of
data. A simple example would be tabulating the mean and standard deviations for
all fields. More sophisticated methods involve the derivation of summary rules,
multivariate visualization techniques, and the discovery of functional relationships
between variables. Summarization techniques are often applied to interactive
exploratory data analysis and automated report generation.
• Dependency Modeling consists of finding a model that describes significant
dependencies between variables. Dependency models exist at two levels: the structural
level of the model specifies (often in graphical form) which variables are locally
dependent on each other, whereas the quantitative level of the model specifies the
strengths of the dependencies using some numerical scale. For example, probabilistic
dependency networks use conditional independence to specify the structural aspect
of the model and probabilities or correlation to specify the strengths of the
dependencies. Probabilistic dependency networks are increasingly finding
applications in areas as diverse as the development of probabilistic medical expert
systems from databases, information retrieval, and modeling of the human genome.
• Change and Deviation Detection focuses on discovering the most significant changes
in the data from previously measured values.
• Model Representation is the language L for describing discoverable patterns. If the
representation is too limited, then no amount of training time or examples will
produce an accurate model for the data. For example, a decision tree representation,
using univariate (single-field) node-splits, partitions the input space with hyperplanes
that are parallel to the attribute axes. Such a decision-tree method cannot discover
from data the formula x = y no matter how much training data it is given. Thus,
it is important that a data analyst fully comprehend the representational assumptions
that may be inherent to a particular method. It is equally important that an algorithm
designer clearly state which representational assumptions are being made by a
particular algorithm.
• Model Evaluation estimates how well a particular pattern (a model and its
parameters) meets the criteria of the KDD process. Evaluation of predictive accuracy
(validity) is based on cross validation. Evaluation of descriptive quality involves
predictive accuracy, novelty, utility, and understandability of the fitted model. Both
logical and statistical criteria can be used for model evaluation. For example, the
maximum likelihood principle chooses the parameters for the model that yield the
best fit to the training data.
• Search Method consists of two components: Parameter Search and Model Search.
In parameter search the algorithm must search for the parameters that optimize
the model evaluation criteria given observed data and a fixed model representation.
Model Search occurs as a loop over the parameter search method: the model
representation is changed so that a family of models is considered. For each specific
model representation, the parameter search method is instantiated to evaluate the
quality of that particular model. Implementations of model search methods tend to
use heuristic search techniques since the size of the space of possible models often
prohibits exhaustive search and closed form solutions are not easily obtainable.
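The relationship between the two search components can be sketched with a deliberately tiny model family: single-field threshold rules of the form "predict default when the field exceeds t". Choosing the threshold is the parameter search, looping over the candidate fields is the model search, and the number of training errors is the evaluation criterion. The data, the field names and the exhaustive search are assumptions of this example; realistic model spaces are far too large to enumerate this way.

```python
# Hypothetical loan records; labels are True where the loan defaulted.
rows = [
    {"income": 20, "debt": 30}, {"income": 25, "debt": 40},
    {"income": 30, "debt": 35}, {"income": 60, "debt": 10},
    {"income": 70, "debt": 20}, {"income": 80, "debt": 15},
]
labels = [True, True, True, False, False, False]


def parameter_search(rows, labels, field):
    """For the fixed model 'default when row[field] > t', find the best t.

    Candidate thresholds are midpoints between adjacent distinct values,
    so the field is assumed to take at least two distinct values.
    """
    values = sorted({row[field] for row in rows})
    best = None
    for low, high in zip(values, values[1:]):
        t = (low + high) / 2
        errors = sum((row[field] > t) != y for row, y in zip(rows, labels))
        if best is None or errors < best[1]:
            best = (t, errors)
    return best


def model_search(rows, labels, fields):
    """Loop over model representations, running parameter search for each."""
    best = None
    for field in fields:
        t, errors = parameter_search(rows, labels, field)
        if best is None or errors < best[2]:
            best = (field, t, errors)
    return best


best_model = model_search(rows, labels, ["income", "debt"])
print(best_model)
```

Because every rule in this family has the fixed direction "greater than", no threshold on income can describe the invented data well, which is a small-scale instance of a representation being too limited for the pattern.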
1.7 DATA MINING PROBLEMS/ISSUES
Data mining systems rely on databases to supply the raw data for input and this raises
problems in that databases tend to be dynamic, incomplete, noisy, and large. Other
problems arise as a result of the adequacy and relevance of the information stored.
Limited Information
A database is often designed for purposes different from data mining and sometimes
the properties or attributes that would simplify the learning task are not present nor can
they be requested from the real world. Inconclusive data causes problems because if some
attributes essential to knowledge about the application domain are not present in the data
it may be impossible to discover significant knowledge about a given domain. For example,
malaria cannot be diagnosed from a patient database if that database does not contain the
patient’s red blood cell count.
Noise and Missing Values
Databases are usually contaminated by errors so it cannot be assumed that the data
they contain is entirely correct. Attributes which rely on subjective or measurement
judgements can give rise to errors such that some examples may even be mis-classified.
Errors in either the values of attributes or class information are known as noise. Obviously
where possible it is desirable to eliminate noise from the classification information as this
affects the overall accuracy of the generated rules.
Missing data can be treated by discovery systems in a number of ways such as:
• Simply disregard missing values
• Omit the corresponding records
• Infer missing values from known values
• Treat missing data as a special value to be included additionally in the attribute domain
• Average over the missing values using Bayesian techniques.
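Three of the treatments listed above can be sketched on a toy table. Here a missing value is represented by `None`, inference from known values is reduced to substituting the mean of the known values, and the marker string `"UNKNOWN"` is an arbitrary choice; the Bayesian averaging option is not shown.

```python
# Hypothetical records with one missing income value.
records = [
    {"age": 30, "income": 40.0},
    {"age": 45, "income": None},
    {"age": 52, "income": 60.0},
]


def omit_records(records, field):
    """Omit the records whose value for the field is missing."""
    return [r for r in records if r[field] is not None]


def infer_from_known(records, field):
    """Replace missing values with the mean of the known values."""
    known = [r[field] for r in records if r[field] is not None]
    mean = sum(known) / len(known)
    return [{**r, field: mean if r[field] is None else r[field]} for r in records]


def as_special_value(records, field, marker="UNKNOWN"):
    """Treat 'missing' as an extra value in the attribute domain."""
    return [{**r, field: marker if r[field] is None else r[field]} for r in records]


omitted = omit_records(records, "income")
inferred = infer_from_known(records, "income")
marked = as_special_value(records, "income")
print(len(omitted), inferred[1]["income"], marked[1]["income"])
```

Which treatment is appropriate depends on why the values are missing; omitting records is the simplest but discards the other fields of those records as well.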
Noisy data in the sense of being imprecise is characteristic of all data collection and
typically fits a regular statistical distribution such as the Gaussian, while wrong values are
data entry errors. Statistical methods can treat problems of noisy data and separate different
types of noise.
Uncertainty
Uncertainty refers to the severity of the error and the degree of noise in the data. Data
precision is an important consideration in a discovery system.
Size, Updates, and Irrelevant Fields
Databases tend to be large and dynamic in that their contents are ever-changing as
information is added, modified or removed. The problem with this from the data mining
perspective is how to ensure that the rules are up-to-date and consistent with the most
current information. Also the learning system has to be time-sensitive as some data values
vary over time and the discovery system is affected by the ‘timeliness’ of the data.
Another issue is the relevance or irrelevance of the fields in the database to the current
focus of discovery. For example, post codes are fundamental to any studies trying to establish
a geographical connection to an item of interest such as the sales of a product.
1.8 POTENTIAL APPLICATIONS
Data mining has many and varied fields of application, some of which are listed below.
Retail/Marketing
• Identify buying patterns from customers
• Find associations among customer demographic characteristics
• Predict response to mailing campaigns
• Market basket analysis
Banking
• Detect patterns of fraudulent credit card use
• Identify ‘loyal’ customers
• Predict customers likely to change their credit card affiliation
• Determine credit card spending by customer groups
• Find hidden correlations between different financial indicators
• Identify stock trading rules from historical market data
Insurance and Health Care
• Claims analysis - i.e. which medical procedures are claimed together
• Predict which customers will buy new policies
• Identify behavior patterns of risky customers
• Identify fraudulent behavior
Transportation
• Determi ne the di stri buti on schedul es among outl ets
• Anal yze l oadi ng patterns
Medicine
• Characteri ze pati ent behavi or to predi ct offi ce vi si ts
• I denti fy successful medi cal therapi es for di fferent i l l nesses
1.9 DATA MINING EXAMPLES
Bass Brewers
Bass Brewers is the leading beer producer in the UK and has 23% of the market. The
company has a reputation for great brands and good service but realized the importance of
information in order to maintain a lead in the UK beer market.
We’ve been brewing beer since 1777, with increased competition comes
a demand to make faster, better informed decisions.
Mike Fisher, IS director, Bass Brewers
Bass decided to gather the data into a data warehouse on a system so that the users,
i.e. the decision-makers, could have consistent, reliable, online information. Prior to this,
users could expect a turnaround of 24 hours, but with the new system the answers should
be returned interactively.
For the first time, people will be able to do data mining - ask questions
we never dreamt we could get the answers to, look for patterns among
the data we could never recognize before.
Nigel Rowley, Information Infrastructure manager
This commitment to data mining has given Bass a competitive edge when it comes to
identifying market trends and taking advantage of them.
Northern Bank
A subsidiary of the National Australia Group, the Northern Bank has a major new
application based upon Holos from Holistic Systems now being used in each of the 107
branches in the Province. The new system is designed to deliver financial and sales
information such as volumes, margins, revenues, overheads and profits as well as quantities
of product held, sold, closed etc.
The application consists of two loosely coupled systems:
• a system to integrate the multiple data sources into a consolidated database,
• another system to deliver that information to the users in a meaningful way.
The Northern is addressing the need to convert data into information as their products
need to be measured outlet by outlet, and over a period of time.
The new system delivers management information in electronic form to
the branch network. The information is now more accessible, paperless
and timely. For the first time, all the various income streams are
attributed to the branches which generate the business.
Malcolm Longridge, MIS development team leader, Northern Bank
TSB Group PLC
The TSB Group is also using Holos supplied by Holistic Systems because of
its flexibility and its excellent multidimensional functionality, which it
provides without the need for a separate multidimensional database
Andrew Scott, End-User Computing Manager at TSB
The four major applications which have been developed are:
• a company-wide budget and forecasting model, BAF, for the finance department, which
has 50 potential users taking information out of the Millennium general ledger and
enables analysts to study this data using 40 multidimensional models written using
Holos;
• a mortgage management information system for middle and senior management in
TSB Home loans;
• a suite of actuarial models for TSB Insurance; and
• Business Analysis Project, BAP, used as an EIS type system by TSB Insurance to
obtain a better understanding of the company’s actual risk exposures for feeding
back into the actuarial modeling system.
BeursBase, Amsterdam
BeursBase, a real-time on-line stock/exchange relational database (RDB) fed with
information from the Amsterdam Stock Exchange, is now available for general access by
institutions or individuals. It will be augmented by data from the European Option
Exchange before January 1996. All stock, option and future prices and volumes are
being warehoused.
BeursBase has been in operation for about a year and contains approximately 1.8
million stock prices, over half a million quotes and about a million stock trade volumes. The
AEX (Amsterdam EOE Index), or the Dutch Dow Jones, based upon the 25 most active
securities traded (measured over a year), is refreshed via the database approximately every
30 seconds.
The RDB employs SQL/DS on a VM system and DB2/6000 on an AIX RS/6000 cluster.
A parallel edition of DB2/6000 will soon be ready for data mining purposes, data quality
measurement, plus a variety of other complex queries.
The project was founded by Martin P. Misseyer, assistant professor on the faculty of
economics, business administration and econometrics at Vrije University. BeursBase, unique
of its kind, is characterized by the following features. First, BeursBase contains both real-time
and historical data. Secondly, all data retrieved from the ASE are stored rather than a subset;
that is, all broadcast trade data are stored. Thirdly, the data, BeursBase itself and subsequent
applications form the basis for many research, education and public relations activities. A
similar data link with the Amsterdam European Option Exchange (EOE) will be established
as well.
Delphic Universities
The Delphic universities are a group of 24 universities within the MAC initiative who
have adopted Holos for their management information system (MIS) needs. Holos provides
complex modeling for IT-literate users in the planning departments while also giving the
senior management a user-friendly EIS.
Real value is added to data by multidimensional manipulation (being
able to easily compare many different views of the available information
in one report) and by modeling. In both these areas spreadsheets and
query-based tools are not able to compete with fully-fledged management
information systems such as Holos. These two features turn raw data
into useable information.
Michael O’Hara, chairman of the MIS Application Group at Delphic
Harvard - Holden
Harvard University has developed a centrally operated fund-raising system that allows
university institutions to share fund-raising information for the first time.
The new Sybase system, called HOLDEN (Harvard Online Development Network), is
expected to maximize the funds generated by the Harvard Development Office from
the current donor pool by more precisely targeting existing resources and eliminating wasted
efforts and redundancies across the university. Through this streamlining, HOLDEN will
allow Harvard to pursue one of the most ambitious fund-raising goals ever set by an American
institution: to raise $2 billion in five years.
Harvard University has enjoyed the nation’s premier university
endowment since 1636. Sybase technology has allowed us to develop
an information system that will preserve this legacy into the
twenty-first century
Jim Conway, director of development computing services, Harvard University
J.P. Morgan
This leading financial company was one of the first to employ data mining/forecasting
applications using Information Harvester software on the Convex Exemplar and C series.
The promise of data mining tools like Information Harvester is that
they are able to quickly wade through massive amounts of data to
identify relationships or trending information that would not have
been available without the tool
Charles Bonomo, vice president of advanced technology for J.P. Morgan
The flexibility of the Information Harvesting induction algorithm enables it to adapt to
any system. The data can be in the form of numbers, dates, codes, categories, text or any
combination thereof. Information Harvester is designed to handle faulty, missing and noisy
data. Large variations in the values of an individual field do not hamper the analysis.
Information Harvester has unique abilities to recognize and ignore irrelevant data fields
when searching for patterns. In full-scale parallel-processing versions, Information Harvester
can handle millions of rows and thousands of variables.
In Summary
Basically, data mining is concerned with the analysis of data and the use of software
techniques for finding patterns and regularities in sets of data. The stages/processes in data
mining and knowledge discovery include selection, preprocessing, transformation of data,
interpretation and evaluation. Data mining methods are classified by the functions they
perform, which include classification, clustering, regression etc.
Chapter 2: DATA MINING WITH
DECISION TREES
A decision tree is a classification scheme used to find
the description of several predefined classes and to classify a
data item into one of them. The main topics covered
in this chapter are:
• How the Decision Tree Works.
• Construction of Decision Trees.
• Issues in Data Mining with Decision Tree.
• Visualization of Decision tree in CABRO System.
• Strengths and Weaknesses.
INTRODUCTION
Decision trees are powerful and popular tools for classification and prediction. The
attractiveness of tree-based methods is due in large part to the fact that, in contrast to
neural networks, decision trees represent rules. Rules can readily be expressed in a form that
humans can understand, or in a database access language such as SQL so that the records
falling into a particular category may be retrieved.
In some applications, the accuracy of a classification or prediction is the only thing that
matters; if a direct mail firm obtains a model that can accurately predict which members
of a prospect pool are most likely to respond to a certain solicitation, they may not care how
or why the model works. In other situations, the ability to explain the reason for a decision
is crucial. In health insurance underwriting, for example, there are legal prohibitions against
discrimination based on certain variables. An insurance company could find itself in the
position of having to demonstrate to the satisfaction of a court of law that it has not used
illegal discriminatory practices in granting or denying coverage. There are a variety of
algorithms for building decision trees that share the desirable trait of explicability. Most
notable are two methods and systems, CART and C4.5 (See5/C5.0), that are gaining popularity
and are now available as commercial software.
2.1 HOW A DECISION TREE WORKS
A decision tree is a classifier in the form of a tree structure where each node is either:
• a leaf node, indicating a class of instances, or
• a decision node that specifies some test to be carried out on a single attribute value,
with one branch and sub-tree for each possible outcome of the test.
A decision tree can be used to classify an instance by starting at the root of the tree
and moving through it until a leaf node is reached, which provides the classification of the
instance.
Example: Decision making in the London stock market
Suppose that the major factors affecting the London stock market are:
• what it did yesterday;
• what the New York market is doing today;
• bank interest rate;
• unemployment rate;
• England’s prospect at cricket.
Table 2.1 is a small illustrative dataset of six days about the London stock market. The
lower part contains data of each day according to five questions, and the second row shows
the observed result (Yes (Y) or No (N) for “It rises today”). Figure 2.1 illustrates a typical
learned decision tree from the data in Table 2.1.
Table 2.1. Examples of a Small Dataset on the London Stock Market
Instance No.        1  2  3  4  5  6
It rises today      Y  Y  Y  N  N  N
It rose yesterday   Y  Y  N  Y  N  N
NY rises today      Y  N  N  N  N  N
Bank rate high      N  Y  N  Y  N  Y
Unemployment high   N  Y  Y  N  N  N
England is losing   Y  Y  Y  Y  Y  Y
Is unemployment high?
    YES: The London market will rise today {2, 3}
    NO:  Is the New York market rising today?
             YES: The London market will rise today {1}
             NO:  The London market will not rise today {4, 5, 6}
Figure 2.1. A Decision Tree for the London Stock Market
The process of predicting an instance by this decision tree can also be expressed by
answering the questions in the following order:
Is unemployment high?
    YES: The London market will rise today.
    NO: Is the New York market rising today?
        YES: The London market will rise today.
        NO: The London market will not rise today.
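The question sequence above maps directly onto nested conditionals. The following is a minimal sketch, not part of the original text: the function name `predict_london_rise` and the Boolean encoding of Table 2.1 are assumptions made for illustration.

```python
def predict_london_rise(unemployment_high: bool, ny_rises_today: bool) -> bool:
    """Walk the tree of Figure 2.1 from the root to a leaf."""
    if unemployment_high:        # root test
        return True              # leaf: the London market will rise today
    if ny_rises_today:           # second test, reached only on the NO branch
        return True              # leaf: the London market will rise today
    return False                 # leaf: the London market will not rise today


# Table 2.1, reduced to the two attributes the tree actually tests:
# instance -> (unemployment high, NY rises today, it rises today)
TABLE_2_1 = {
    1: (False, True, True),
    2: (True, False, True),
    3: (True, False, True),
    4: (False, False, False),
    5: (False, False, False),
    6: (False, False, False),
}

predictions = {
    day: predict_london_rise(unemployment, ny)
    for day, (unemployment, ny, _) in TABLE_2_1.items()
}
```

On the six instances of Table 2.1 the sketch reproduces the observed "It rises today" row, which is what one would expect of a tree learned from exactly these data.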
Decision tree induction is a typical inductive approach to learn knowledge on
classification. The key requirements to do mining with decision trees are:
• Attribute-value description: an object or case must be expressible in terms of a fixed
collection of properties or attributes.
• Predefined classes: The categories to which cases are to be assigned must have been
established beforehand (supervised data).
• Discrete classes: A case does or does not belong to a particular class, and there must
be far more cases than classes.
• Sufficient data: Usually hundreds or even thousands of training cases.
• “Logical” classification model: Classifier that can only be expressed as decision
trees or set of production rules
2.2 CONSTRUCTING DECISION TREES
The Basic Decision Tree Learning Algorithm. Most algorithms that have been developed
for learning decision trees are variations on a core algorithm that employs a top-down,
greedy search through the space of possible decision trees. Decision tree programs construct
a decision tree T from a set of training cases. The original idea of construction of decision
trees goes back to the work of Hovland and Hunt on concept learning systems (CLS) in the
late 1950s. Table 2.2 briefly describes this CLS scheme, which is in fact a recursive top-down
divide-and-conquer algorithm. The algorithm consists of five steps.
Table 2.2. CLS Algorithm
1. T ← the whole training set. Create a T node.
2. If all examples in T are positive, create a ‘P’ node with T as its parent and stop.
3. If all examples in T are negative, create an ‘N’ node with T as its parent and stop.
4. Select an attribute X with values $v_1, v_2, \ldots, v_N$ and partition T into subsets
$T_1, T_2, \ldots, T_N$ according to their values on X. Create N nodes $T_i$ (i = 1, ..., N) with T
as their parent and $X = v_i$ as the label of the branch from T to $T_i$.
5. For each $T_i$ do: T ← $T_i$ and goto step 2.
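The five steps can be sketched as a short recursive function. This is only an illustration under stated assumptions: Table 2.2 does not say how attribute X is selected, so the sketch simply takes the next unused attribute in the order given, and it adds a majority-vote fallback (not part of Table 2.2) for the case where no attributes remain. The nested-dictionary tree representation and the name `cls` are likewise hypothetical.

```python
from collections import Counter


def cls(examples, attributes):
    """Recursive divide-and-conquer in the spirit of Table 2.2.

    examples:   list of (feature_dict, label) pairs with label 'P' or 'N'
    attributes: attribute names still available for testing
    Returns 'P' or 'N' for a leaf, or {attribute: {value: subtree}}.
    """
    labels = [label for _, label in examples]
    if all(label == "P" for label in labels):      # step 2
        return "P"
    if all(label == "N" for label in labels):      # step 3
        return "N"
    if not attributes:
        # Assumption, not in Table 2.2: mixed labels but nothing left to test,
        # so fall back to the majority label.
        return Counter(labels).most_common(1)[0][0]
    x, rest = attributes[0], attributes[1:]        # step 4 (naive choice of X)
    subsets = {}
    for features, label in examples:
        subsets.setdefault(features[x], []).append((features, label))
    # step 5: repeat from step 2 on every subset T_i
    return {x: {value: cls(subset, rest) for value, subset in subsets.items()}}


# Usage on the London stock market data of Table 2.1 (two attributes only).
london = [
    ({"unemployment_high": "N", "ny_rises": "Y"}, "P"),
    ({"unemployment_high": "Y", "ny_rises": "N"}, "P"),
    ({"unemployment_high": "Y", "ny_rises": "N"}, "P"),
    ({"unemployment_high": "N", "ny_rises": "N"}, "N"),
    ({"unemployment_high": "N", "ny_rises": "N"}, "N"),
    ({"unemployment_high": "N", "ny_rises": "N"}, "N"),
]
tree = cls(london, ["unemployment_high", "ny_rises"])
```

With this attribute order the result has the same shape as the tree in Figure 2.1. A different order would generally give a different tree, which is why the choice of X matters.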
We present here the basic algorithm for decision tree learning, corresponding
approximately to the ID3 algorithm of Quinlan, and its successors C4.5, See5/C5.0. To
illustrate the operation of ID3, consider the learning task represented by training examples
of Table 2.3. Here the target attribute PlayTennis (also called class attribute), which can
have values yes or no for different Saturday mornings, is to be predicted based on other
attributes of the morning in question.
Table 2.3. Training Examples for the Target Concept PlayTennis
Day   Outlook    Temperature   Humidity   Wind     Play Tennis?
D1    Sunny      Hot           High       Weak     No
D2    Sunny      Hot           High       Strong   No
D3    Overcast   Hot           High       Weak     Yes
D4    Rain       Mild          High       Weak     Yes
D5    Rain       Cool          Normal     Weak     Yes
D6    Rain       Cool          Normal     Strong   No
D7    Overcast   Cool          Normal     Strong   Yes
D8    Sunny      Mild          High       Weak     No
D9    Sunny      Cool          Normal     Weak     Yes
D10   Rain       Mild          Normal     Weak     Yes
D11   Sunny      Mild          Normal     Strong   Yes
D12   Overcast   Mild          High       Strong   Yes
D13   Overcast   Hot           Normal     Weak     Yes
D14   Rain       Mild          High       Strong   No
Which Attribute is the Best Classifier?
The central choice in the ID3 algorithm is selecting which attribute to test at each node in
the tree, which corresponds to the first task in step 4 of the CLS algorithm. We would like to
select the attribute that is most useful for classifying examples. What is a good quantitative
measure of the worth of an attribute? We will define a statistical property called information
gain that measures how well a given attribute separates the training examples according
to their target classification. ID3 uses this information gain measure to select among the
candidate attributes at each step while growing the tree.
Entropy Measures Homogeneity of Examples. In order to define information gain
precisely, we begin by defining a measure commonly used in information theory, called
entropy, that characterizes the impurity of an arbitrary collection of examples. Given a
collection S, containing positive and negative examples of some target concept, the entropy
of S relative to this Boolean classification is
\[ \mathrm{Entropy}(S) = -p_{\oplus} \log_2 p_{\oplus} - p_{\ominus} \log_2 p_{\ominus} \tag{2.1} \]
where $p_{\oplus}$ is the proportion of positive examples in S and $p_{\ominus}$ is the proportion of
negative examples in S. In all calculations involving entropy we define 0 log 0 to be 0.
To illustrate, suppose S is a collection of 14 examples of some Boolean concept, including
9 positive and 5 negative examples (we adopt the notation [9+, 5–] to summarize such a
sample of data). Then the entropy of S relative to this Boolean classification is
\[ \mathrm{Entropy}([9+, 5-]) = -(9/14) \log_2 (9/14) - (5/14) \log_2 (5/14) = 0.940 \tag{2.2} \]
Notice that the entropy is 0 if all members of S belong to the same class. For example,
if all members are positive ($p_{\oplus} = 1$), then $p_{\ominus}$ is 0, and
Entropy(S) = $-1 \times \log_2 1 - 0 \times \log_2 0 = -1 \times 0 - 0 \times \log_2 0 = 0$.
Note the entropy is 1 when the collection contains an equal number
of positive and negative examples. If the collection contains unequal numbers of positive and
negative examples, the entropy is between 0 and 1. Figure 2.2 shows the form of the entropy
function relative to a Boolean classification, as $p_{\oplus}$ varies between 0 and 1.
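Equation (2.1) and the boundary cases just described are easy to check numerically. The sketch below is illustrative only; the function name `boolean_entropy` and its count-based signature are assumptions, and the convention 0 log 0 = 0 is implemented by skipping empty classes.

```python
import math


def boolean_entropy(positives: int, negatives: int) -> float:
    """Entropy of a collection with the given class counts, per Equation (2.1)."""
    total = positives + negatives
    result = 0.0
    for count in (positives, negatives):
        if count:                      # 0 log 0 is defined to be 0, so skip it
            p = count / total
            result -= p * math.log2(p)
    return result


entropy_9_5 = boolean_entropy(9, 5)    # the [9+, 5-] sample of Equation (2.2)
entropy_pure = boolean_entropy(14, 0)  # all members in one class
entropy_even = boolean_entropy(7, 7)   # equal numbers of each class
```

Rounded to three decimals, `entropy_9_5` agrees with the 0.940 of Equation (2.2), while the pure and evenly split collections give the extremes 0 and 1.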
One interpretation of entropy from information theory is that it specifies the minimum
number of bits of information needed to encode the classification of an arbitrary member of
S (i.e., a member of S drawn at random with uniform probability). For example, if $p_{\oplus}$ is 1,
the receiver knows the drawn example will be positive, so no message needs to be sent, and
the entropy is 0. On the other hand, if $p_{\oplus}$ is 0.5, one bit is required to indicate whether
the drawn example is positive or negative. If $p_{\oplus}$ is 0.8, then a collection of messages can
be encoded using on average less than 1 bit per message by assigning shorter codes to
collections of positive examples and longer codes to less likely negative examples.
Figure 2.2. The Entropy Function Relative to a Boolean Classification, as the Proportion of
Positive Examples $p_{\oplus}$ Varies between 0 and 1.
Thus far we have discussed entropy in the special case where the target classification
is Boolean. More generally, if the target attribute can take on c different values, then the
entropy of S relative to this c-wise classification is defined as
\[ \mathrm{Entropy}(S) = \sum_{i=1}^{c} -p_i \log_2 p_i \tag{2.3} \]
where $p_i$ is the proportion of S belonging to class i. Note the logarithm is still base 2 because
entropy is a measure of the expected encoding length measured in bits. Note also that if the
target attribute can take on c possible values, the entropy can be as large as $\log_2 c$.
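The c-wise definition of Equation (2.3) can be sketched by counting class labels. The function name `entropy` and the use of a plain list of labels are assumptions for illustration; classes with no members never appear in the count, so the 0 log 0 case does not arise.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy of a collection of class labels, per Equation (2.3)."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


four_way = entropy(["a", "b", "c", "d"])          # c = 4, uniform proportions
play_tennis = entropy(["Yes"] * 9 + ["No"] * 5)   # Boolean case, [9+, 5-]
```

A uniform split over four classes reaches the upper bound $\log_2 4 = 2$ bits, and the Boolean case reduces to the value obtained from Equation (2.1).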
Information Gain Measures the Expected Reduction in Entropy. Given entropy as a
measure of the impurity in a collection of training examples, we can now define a measure
of the effectiveness of an attribute in classifying the training data. The measure we will use,
called information gain, is simply the expected reduction in entropy caused by partitioning
the examples according to this attribute. More precisely, the information gain, Gain (S, A)
of an attribute A, relative to a collection of examples S, is defined as
\[ \mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} \, \mathrm{Entropy}(S_v) \tag{2.4} \]
where Values (A) is the set of all possible values for attribute A, and $S_v$ is the subset of S
for which attribute A has value v (i.e., $S_v = \{s \in S \mid A(s) = v\}$). Note the first term in
Equation (2.4) is just the entropy of the original collection S and the second term is the
expected value of the entropy after S is partitioned using attribute A. The expected entropy
described by this second term is simply the sum of the entropies of each subset $S_v$, weighted
by the fraction of examples $|S_v| / |S|$ that belong to $S_v$. Gain (S, A) is therefore the expected
reduction in entropy caused by knowing the value of attribute A. Put another way, Gain (S,
A) is the information provided about the target function value, given the value of some other
attribute A. The value of Gain (S, A) is the number of bits saved when encoding the target
value of an arbitrary member of S, by knowing the value of attribute A.
For example, suppose S is a collection of training-example days described in Table 2.3
by attributes including Wind, which can have the values Weak or Strong. As before, assume
S is a collection containing 14 examples ([9+, 5–]). Of these 14 examples, 6 of the positive
and 2 of the negative examples have Wind = Weak, and the remainder have Wind = Strong.
The information gain due to sorting the original 14 examples by the attribute Wind may
then be calculated as
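\[
\begin{aligned}
\mathrm{Values}(\mathit{Wind}) &= \{\mathit{Weak}, \mathit{Strong}\} \\
S &= [9+, 5-] \\
S_{\mathit{Weak}} &\leftarrow [6+, 2-] \\
S_{\mathit{Strong}} &\leftarrow [3+, 3-] \\
\mathrm{Gain}(S, \mathit{Wind}) &= \mathrm{Entropy}(S) - \sum_{v \in \{\mathit{Weak}, \mathit{Strong}\}} \frac{|S_v|}{|S|} \, \mathrm{Entropy}(S_v) \\
&= \mathrm{Entropy}(S) - (8/14) \, \mathrm{Entropy}(S_{\mathit{Weak}}) - (6/14) \, \mathrm{Entropy}(S_{\mathit{Strong}}) \\
&= 0.940 - (8/14) \, 0.811 - (6/14) \, 1.00 \\
&= 0.048
\end{aligned}
\]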
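The arithmetic of this calculation can be reproduced from the class counts alone. The sketch below is a hedged illustration of Equation (2.4); the helper names `entropy` and `gain` and the (positives, negatives) tuple encoding are assumptions, not notation from the text.

```python
import math


def entropy(positives: int, negatives: int) -> float:
    """Boolean entropy from class counts, with 0 log 0 taken as 0."""
    total = positives + negatives
    return -sum(
        (count / total) * math.log2(count / total)
        for count in (positives, negatives)
        if count
    )


def gain(parent, partitions):
    """Equation (2.4) for count pairs: parent and each partition are (pos, neg)."""
    total = sum(parent)
    expected_entropy = sum(
        (pos + neg) / total * entropy(pos, neg) for pos, neg in partitions
    )
    return entropy(*parent) - expected_entropy


# S = [9+, 5-], S_Weak = [6+, 2-], S_Strong = [3+, 3-]
wind_gain = gain((9, 5), [(6, 2), (3, 3)])
```

Rounded to three decimals, `wind_gain` matches the 0.048 obtained above, and the intermediate value `entropy(6, 2)` rounds to the 0.811 used in the third line of the calculation.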
Information gain is precisely the measure used by ID3 to select the best attribute at
each step in growing the tree.
An Illustrative Example. Consider the first step through the algorithm, in which the
topmost node of the decision tree is created. Which attribute should be tested first in the
tree? ID3 determines the information gain for each candidate attribute (i.e., Outlook,
Temperature, Humidity, and Wind), then selects the one with highest information gain. The
information gain values for all four attributes are
Gain(S, Outlook) = 0.246
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Gain(S, Temperature) = 0.029
where S denotes the collection of training examples from Table 2.3.
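These four values can be recomputed directly from Table 2.3. The following sketch is an illustration under assumptions: the dictionary-per-row encoding and the names `EXAMPLES`, `entropy` and `gain` are hypothetical choices, and only the data come from the table.

```python
import math
from collections import Counter

COLUMNS = ["Outlook", "Temperature", "Humidity", "Wind", "PlayTennis"]
# Rows D1..D14 of Table 2.3, in order.
TABLE_2_3 = """
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No
"""
EXAMPLES = [dict(zip(COLUMNS, line.split())) for line in TABLE_2_3.strip().splitlines()]


def entropy(examples, target="PlayTennis"):
    """Equation (2.3) over the target values of the given examples."""
    counts = Counter(e[target] for e in examples)
    return -sum(n / len(examples) * math.log2(n / len(examples)) for n in counts.values())


def gain(examples, attribute, target="PlayTennis"):
    """Equation (2.4): entropy of S minus the weighted entropy of each S_v."""
    expected_entropy = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == value]
        expected_entropy += len(subset) / len(examples) * entropy(subset, target)
    return entropy(examples, target) - expected_entropy


gains = {attribute: gain(EXAMPLES, attribute) for attribute in COLUMNS[:-1]}
best_attribute = max(gains, key=gains.get)
```

The computed gains agree with the listed values to within rounding in the third decimal place (the printed figures appear to be truncated rather than rounded), and `best_attribute` is Outlook.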
According to the information gain measure, the Outlook attribute provides the best
prediction of the target attribute, PlayTennis, over the training examples. Therefore, Outlook
is selected as the decision attribute for the root node, and branches are created below the
root for each of its possible values (i.e., Sunny, Overcast, and Rain). The final tree is shown
in Figure 2.3.
Outlook
    Sunny:    Humidity
                  High:   No
                  Normal: Yes
    Overcast: Yes
    Rain:     Wind
                  Strong: No
                  Weak:   Yes
Figure 2.3. A Decision Tree for the Concept Play Tennis
The process of selecting a new attribute and partitioning the training examples is now
repeated for each non-terminal descendant node, this time using only the training examples
associated with that node. Attributes that have been incorporated higher in the tree are
excluded, so that any given attribute can appear at most once along any path through the
tree. This process continues for each new leaf node until either of two conditions is met:
1. every attribute has already been included along this path through the tree, or
2. all the training examples associated with this leaf node have the same target
attribute value (i.e., their entropy is zero).
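Putting the attribute selection and the two stopping conditions together gives a compact recursive learner. This is a sketch in the spirit of ID3, not Quinlan's implementation: the nested-dictionary tree, the majority-vote leaf used when attributes run out, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter

ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Wind"]
TARGET = "PlayTennis"
# Rows D1..D14 of Table 2.3, in order.
TABLE_2_3 = """
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No
"""
EXAMPLES = [
    dict(zip(ATTRIBUTES + [TARGET], line.split()))
    for line in TABLE_2_3.strip().splitlines()
]


def entropy(examples):
    counts = Counter(e[TARGET] for e in examples)
    return -sum(n / len(examples) * math.log2(n / len(examples)) for n in counts.values())


def gain(examples, attribute):
    expected_entropy = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == value]
        expected_entropy += len(subset) / len(examples) * entropy(subset)
    return entropy(examples) - expected_entropy


def id3(examples, attributes):
    labels = [e[TARGET] for e in examples]
    if len(set(labels)) == 1:        # condition 2: entropy is zero
        return labels[0]
    if not attributes:               # condition 1: every attribute already used
        return Counter(labels).most_common(1)[0][0]   # assumed majority-vote leaf
    best = max(attributes, key=lambda a: gain(examples, a))
    remaining = [a for a in attributes if a != best]  # used at most once per path
    return {
        best: {
            value: id3([e for e in examples if e[best] == value], remaining)
            for value in sorted({e[best] for e in examples})
        }
    }


tree = id3(EXAMPLES, ATTRIBUTES)
```

Run on Table 2.3, the sketch yields Outlook at the root, Humidity under Sunny, Wind under Rain and a Yes leaf under Overcast, which is the structure of Figure 2.3.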
2.3 ISSUES IN DATA MINING WITH DECISION TREES
Practical issues in learning decision trees include determining how deeply to grow the
decision tree, handling continuous attributes, choosing an appropriate attribute selection
measure, handling training data with missing attribute values, handling attributes with
differing costs, and improving computational efficiency. Below we discuss each of these
issues and extensions to the basic ID3 algorithm that address them. ID3 has itself been
extended to address most of these issues, with the resulting system renamed as C4.5 and
See5/C5.0.
Avoiding Over-Fitting the Data
The CLS algorithm described in Table 2.2 grows each branch of the tree just deeply
enough to perfectly classify the training examples. While this is sometimes a reasonable
strategy, in fact it can lead to difficulties when there is noise in the data, or when the
number of training examples is too small to produce a representative sample of the true
target function. In either of these cases, this simple algorithm can produce trees that
over-fit the training examples.
Over-fitting is a significant practical difficulty for decision tree learning and many
other learning methods. For example, in one experimental study of ID3 involving five different
learning tasks with noisy, non-deterministic data, over-fitting was found to decrease the
accuracy of learned decision trees by 10-25% on most problems.
There are several approaches to avoid over-fitting in decision tree learning. These can
be grouped into two classes:
• approaches that stop growing the tree earlier, before it reaches the point where it
perfectly classifies the training data,
• approaches that allow the tree to over-fit the data, and then post-prune the tree.
Although the first of these approaches might seem more direct, the second approach of
post-pruning over-fit trees has been found to be more successful in practice. This is due to
the difficulty in the first approach of estimating precisely when to stop growing the tree.
Regardless of whether the correct tree size is found by stopping early or by
post-pruning, a key question is what criterion is to be used to determine the correct final tree
size. Approaches include:
• Use a separate set of examples, distinct from the training examples, to evaluate the
utility of post-pruning nodes from the tree.
• Use all the available data for training, but apply a statistical test to estimate whether
expanding (or pruning) a particular node is likely to produce an improvement beyond
the training set. For example, Quinlan uses a chi-square test to estimate whether
further expanding a node is likely to improve performance over the entire instance
distribution, or only on the current sample of training data.
• Use an explicit measure of the complexity for encoding the training examples and
the decision tree, halting growth of the tree when this encoding size is minimized.
This approach is based on a heuristic called the Minimum Description Length principle.
The first of the above approaches is the most common and is often referred to as the
training and validation set approach. We discuss the two main variants of this approach
below. In this approach, the available data are separated into two sets of examples: a
training set, which is used to form the learned hypothesis, and a separate validation set,
which is used to evaluate the accuracy of this hypothesis over subsequent data and, in
particular, to evaluate the impact of pruning this hypothesis.
Reduced error pruning. How exactly might we use a validation set to prevent
over-fitting? One approach, called reduced-error pruning (Quinlan, 1987), is to consider each of
the decision nodes in the tree to be candidates for pruning. Pruning a decision node consists
of removing the sub-tree rooted at that node, making it a leaf node, and assigning it the
most common classification of the training examples affiliated with that node. Nodes are
removed only if the resulting pruned tree performs no worse than the original over the
validation set. This has the effect that any leaf node added due to coincidental regularities
in the training set is likely to be pruned because these same coincidences are unlikely to
occur in the validation set. Nodes are pruned iteratively, always choosing the node whose
removal most increases the decision tree accuracy over the validation set. Pruning of nodes
continues until further pruning is harmful (i.e., decreases accuracy of the tree over the
validation set).
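The pruning loop described above can be sketched as follows. Everything here is an assumption made for illustration: the tree is a nested dictionary `{attribute: {value: subtree}}` with class labels as leaves, a node is addressed by the (attribute, value) path leading to it, and every path is assumed to be covered by at least one training example. The small tree and data sets in the usage part are invented, not taken from the text.

```python
from collections import Counter


def classify(tree, example):
    """Follow the tests from the root until a leaf label is reached."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches.get(example[attribute])   # None for an unseen value
    return tree


def accuracy(tree, examples, target):
    return sum(classify(tree, e) == e[target] for e in examples) / len(examples)


def decision_paths(tree, path=()):
    """Yield the (attribute, value) path to every decision node."""
    if isinstance(tree, dict):
        yield path
        attribute, branches = next(iter(tree.items()))
        for value, subtree in branches.items():
            yield from decision_paths(subtree, path + ((attribute, value),))


def replace_at(tree, path, leaf):
    """Return a copy of the tree with the node at `path` replaced by `leaf`."""
    if not path:
        return leaf
    (attribute, value), rest = path[0], path[1:]
    branches = dict(tree[attribute])
    branches[value] = replace_at(branches[value], rest, leaf)
    return {attribute: branches}


def majority_label(training, path, target):
    """Most common class among the training examples that reach the node."""
    reached = [e for e in training if all(e[a] == v for a, v in path)]
    return Counter(e[target] for e in reached).most_common(1)[0][0]


def reduced_error_prune(tree, training, validation, target):
    while isinstance(tree, dict):
        baseline = accuracy(tree, validation, target)
        candidates = [
            replace_at(tree, path, majority_label(training, path, target))
            for path in decision_paths(tree)
        ]
        best = max(candidates, key=lambda t: accuracy(t, validation, target))
        if accuracy(best, validation, target) < baseline:
            break                     # further pruning would be harmful
        tree = best
    return tree


# Hypothetical example: the Wind test under Sunny fits noise in the training set.
overfit = {"Outlook": {"Sunny": {"Wind": {"Weak": "Yes", "Strong": "No"}}, "Rain": "No"}}
training = [
    {"Outlook": "Sunny", "Wind": "Weak", "Play": "Yes"},
    {"Outlook": "Sunny", "Wind": "Weak", "Play": "Yes"},
    {"Outlook": "Sunny", "Wind": "Strong", "Play": "No"},
    {"Outlook": "Rain", "Wind": "Weak", "Play": "No"},
    {"Outlook": "Rain", "Wind": "Strong", "Play": "No"},
]
validation = [
    {"Outlook": "Sunny", "Wind": "Strong", "Play": "Yes"},
    {"Outlook": "Sunny", "Wind": "Weak", "Play": "Yes"},
    {"Outlook": "Rain", "Wind": "Weak", "Play": "No"},
    {"Outlook": "Sunny", "Wind": "Strong", "Play": "Yes"},
]
pruned = reduced_error_prune(overfit, training, validation, "Play")
```

In this invented case the Wind sub-tree is collapsed into a Yes leaf, which raises validation accuracy from 0.5 to 1.0, and pruning then stops because collapsing the root would lower it. Ties are resolved in favour of pruning, following the "no worse than the original" rule stated above.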
Rule Post-Pruning
In practice, one quite successful method for finding high accuracy hypotheses is a
technique called rule post-pruning. A variant of this pruning method is used by C4.5. Rule
post-pruning involves the following steps:
1. Infer the decision tree from the training set, growing the tree until the training
data is fit as well as possible and allowing over-fitting to occur.
2. Convert the learned tree into an equivalent set of rules by creating one rule for
each path from the root node to a leaf node.
3. Prune (generalize) each rule by removing any preconditions that result in improving
its estimated accuracy.
4. Sort the pruned rules by their estimated accuracy, and consider them in this
sequence when classifying subsequent instances.
To i l l ustrate, consi der agai n the deci si on tree i n Fi gure 2.3. I n rul e post-pruni ng, one
rul e i s generated for each l eaf node i n the tree. Each attri bute test al ong the path from the
root to the l eaf becomes a rul e antecedent (precondi ti on) and the cl assi fi cati on at the l eaf
node becomes the rul e consequent (post-condi ti on). For exampl e, the l eftmost path of the
tree i n Fi gure 2.2 i s transl ated i nto the rul e
I F (Outlook =Sunny)
^
(Humidity =High)
THEN PlayTennis = No
Next, each such rule is pruned by removing any antecedent, or precondition, whose
removal does not worsen its estimated accuracy. Given the above rule, for example, rule
post-pruning would consider removing the preconditions (Outlook = Sunny) and (Humidity
= High). It would select whichever of these pruning steps produces the greatest
improvement in estimated rule accuracy, then consider pruning the second precondition as
a further pruning step. No pruning step is performed if it reduces the estimated rule
accuracy.
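The pruning loop described above can be sketched in Python. This is only a minimal illustration, not the actual C4.5 procedure: it assumes that rule accuracy is estimated on a held-out validation set (C4.5 itself uses a pessimistic estimate computed from the training data), and the function names, the dictionary-based example format, and the validation data are all hypothetical.

```python
def rule_accuracy(preconditions, consequent, examples):
    """Fraction of the examples matched by the preconditions whose label equals the consequent."""
    covered = [ex for ex in examples
               if all(ex[attr] == value for attr, value in preconditions)]
    if not covered:
        return 0.0  # assumption: a rule that covers nothing is scored as 0
    return sum(ex["label"] == consequent for ex in covered) / len(covered)


def prune_rule(preconditions, consequent, examples):
    """Greedily drop preconditions while the estimated accuracy does not get worse."""
    current = tuple(preconditions)
    while current:
        baseline = rule_accuracy(current, consequent, examples)
        # Each candidate is the current rule with exactly one precondition removed.
        candidates = [current[:i] + current[i + 1:] for i in range(len(current))]
        best = max(candidates, key=lambda c: rule_accuracy(c, consequent, examples))
        if rule_accuracy(best, consequent, examples) < baseline:
            break  # every possible removal reduces the estimated accuracy
        current = best
    return current


# Hypothetical validation examples, made up for illustration.
validation = [
    {"Outlook": "Sunny", "Humidity": "High", "label": "No"},
    {"Outlook": "Sunny", "Humidity": "Normal", "label": "No"},
    {"Outlook": "Overcast", "Humidity": "High", "label": "Yes"},
    {"Outlook": "Rain", "Humidity": "High", "label": "Yes"},
    {"Outlook": "Rain", "Humidity": "Normal", "label": "Yes"},
]

rule = (("Outlook", "Sunny"), ("Humidity", "High"))
pruned = prune_rule(rule, "No", validation)
print(pruned)
```

On this made-up validation set, dropping the Humidity test leaves the estimated accuracy unchanged, while dropping the Outlook test lowers it, so only the Outlook precondition survives.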
Why convert the decision tree to rules before pruning? There are three main
advantages.
• Converting to rules allows distinguishing among the different contexts in which a
decision node is used. Because each distinct path through the decision tree node
produces a distinct rule, the pruning decision regarding that attribute test can be
made differently for each path. In contrast, if the tree itself were pruned, the only
two choices would be to remove the decision node completely, or to retain it in its
original form.
• Converting to rules removes the distinction between attribute tests that occur near
the root of the tree and those that occur near the leaves. Thus, we avoid messy
book-keeping issues such as how to reorganize the tree if the root node is pruned
while retaining part of the sub-tree below this test.
• Converting to rules improves readability. Rules are often easier for people to
understand.
Incorporating Continuous-Valued Attributes
The initial definition of ID3 is restricted to attributes that take on a discrete set of
values. First, the target attribute whose value is predicted by the learned tree must be
discrete valued. Second, the attributes tested in the decision nodes of the tree must also be
discrete valued. This second restriction can easily be removed so that continuous-valued
decision attributes can be incorporated into the learned tree. This can be accomplished by
dynamically defining new discrete-valued attributes that partition the continuous attribute
value into a discrete set of intervals. In particular, for an attribute A that is
continuous-valued, the algorithm can dynamically create a new Boolean attribute A_c that is
true if A < c and false otherwise. The only question is how to select the best value for the
threshold c. As an example, suppose we wish to include the continuous-valued attribute
Temperature in describing the training example days in Table 2.3. Suppose further that the
training examples associated with a particular node in the decision tree have the following
values for Temperature and the target attribute PlayTennis.
Temperature:  40   48   60   72   80   90
PlayTennis:   No   No   Yes  Yes  Yes  No
What threshold-based Boolean attribute should be defined based on Temperature? Clearly,
we would like to pick a threshold, c, that produces the greatest information gain. By sorting
the examples according to the continuous attribute A, then identifying adjacent examples
that differ in their target classification, we can generate a set of candidate thresholds
midway between the corresponding values of A. It can be shown that the value of c that
maximizes information gain must always lie at such a boundary. These candidate thresholds
can then be evaluated by computing the information gain associated with each. In the
current example, there are two candidate thresholds, corresponding to the values of
Temperature at which the value of PlayTennis changes: (48 + 60)/2 = 54 and (80 + 90)/2 = 85.
The information gain can then be computed for each of the candidate attributes,
Temperature > 54 and Temperature > 85, and the best can then be selected (Temperature > 54).
This dynamically created Boolean attribute can then compete with the other discrete-valued
candidate attributes available for growing the decision tree.
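The threshold selection just described can be checked with a short Python sketch. It uses the six Temperature and PlayTennis values from the example; the function names are hypothetical, and the code assumes that no two examples share the same Temperature value.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values whose class labels differ."""
    pairs = sorted(zip(values, labels))
    return [(a + b) / 2
            for (a, label_a), (b, label_b) in zip(pairs, pairs[1:])
            if label_a != label_b]


def information_gain(values, labels, threshold):
    """Gain from splitting the examples into value > threshold and value <= threshold."""
    above = [lab for v, lab in zip(values, labels) if v > threshold]
    below = [lab for v, lab in zip(values, labels) if v <= threshold]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (above, below) if part)
    return entropy(labels) - remainder


temperature = [40, 48, 60, 72, 80, 90]
play_tennis = ["No", "No", "Yes", "Yes", "Yes", "No"]

thresholds = candidate_thresholds(temperature, play_tennis)
gains = {c: information_gain(temperature, play_tennis, c) for c in thresholds}
best = max(gains, key=gains.get)
print(thresholds, gains, best)
```

With these six examples the two candidates come out at roughly 0.46 bits for the threshold 54 and 0.19 bits for the threshold 85, which agrees with the choice of Temperature > 54.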
Handling Training Examples with Missing Attribute Values
In certain cases, the available data may be missing values for some attributes. For
example, in a medical domain in which we wish to predict patient outcome based on various
laboratory tests, it may be that the lab test Blood-Test-Result is available only for a subset
of the patients. In such cases, it is common to estimate the missing attribute value based
on other examples for which this attribute has a known value.
Consider the situation in which Gain(S, A) is to be calculated at node n in the decision
tree to evaluate whether the attribute A is the best attribute to test at this decision node.
Suppose that <x, c(x)> is one of the training examples in S and that the value A(x) is
unknown, where c(x) is the class label of x.
One strategy for dealing with the missing attribute value is to assign it the value that
is most common among training examples at node n. Alternatively, we might assign it the
most common value among examples at node n that have the classification c(x). The elaborated
training example using this estimated value for A(x) can then be used directly by the
existing decision tree learning algorithm.
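Both variants of this first strategy fit in a few lines of Python. The sketch below is illustrative only: the function name, the use of `None` to mark an unknown value, and the sample examples at node n are assumptions made for the illustration.

```python
from collections import Counter


def most_common_value(examples, attribute, label=None):
    """Most common known value of `attribute`, optionally restricted to one class label.

    Assumes at least one matching example has a known value.
    """
    known = [ex[attribute] for ex in examples
             if ex[attribute] is not None
             and (label is None or ex["label"] == label)]
    return Counter(known).most_common(1)[0][0]


# Hypothetical examples at node n; the last one has an unknown value for A.
node_examples = [
    {"A": "high", "label": "pos"},
    {"A": "high", "label": "pos"},
    {"A": "low", "label": "neg"},
    {"A": "low", "label": "neg"},
    {"A": "low", "label": "neg"},
    {"A": None, "label": "pos"},
]

x = node_examples[-1]
overall_guess = most_common_value(node_examples, "A")
class_guess = most_common_value(node_examples, "A", label=x["label"])
print(overall_guess, class_guess)
```

In this made-up data the two variants disagree: "low" is the most common value overall, but "high" is the most common value among examples that share the class of x.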
A second, more complex procedure is to assign a probability to each of the possible
values of A rather than simply assigning the most common value to A(x). These probabilities
can be estimated again based on the observed frequencies of the various values for A among
the examples at node n. For example, given a Boolean attribute A, if node n contains six
known examples with A = 1 and four with A = 0, then we would say the probability that
A(x) = 1 is 0.6, and the probability that A(x) = 0 is 0.4. A fractional 0.6 of instance x is now
distributed down the branch for A = 1 and a fractional 0.4 of x down the other tree branch.
These fractional examples are used for the purpose of computing information gain and can
be further subdivided at subsequent branches of the tree if a second missing attribute value
must be tested. This same fractioning of examples can also be applied after learning, to
classify new instances whose attribute values are unknown. In this case, the classification
of the new instance is simply the most probable classification, computed by summing the
weights of the instance fragments classified in different ways at the leaf nodes of the tree.
This method for handling missing attribute values is used in C4.5.
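The fractional bookkeeping can be sketched as follows, using the six-and-four example from the text. This is a simplified illustration and not C4.5's implementation; the function names and the leaf labels in the final step are hypothetical.

```python
from collections import Counter


def value_probabilities(known_values):
    """Estimate P(A = v) from the observed frequencies of the known values at a node."""
    counts = Counter(known_values)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def distribute(weight, probabilities):
    """Split an example of the given weight into fractions, one per branch."""
    return {value: weight * p for value, p in probabilities.items()}


def most_probable_class(fragments):
    """Sum the weights of (label, weight) fragments reaching leaves and pick the heaviest."""
    totals = Counter()
    for label, weight in fragments:
        totals[label] += weight
    return totals.most_common(1)[0][0]


# Node n: six known examples with A = 1 and four with A = 0.
probabilities = value_probabilities([1] * 6 + [0] * 4)
fractions = distribute(1.0, probabilities)
print(fractions)

# Assumed leaf labels: the A = 1 branch ends in "Yes" and the A = 0 branch in "No".
prediction = most_probable_class([("Yes", fractions[1]), ("No", fractions[0])])
print(prediction)
```

A fragment that meets a second unknown attribute further down the tree would simply be passed through `distribute` again with its reduced weight.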
2.4 VISUALIZATION OF DECISION TREES IN SYSTEM CABRO
In this section we briefly describe CABRO, a system for mining decision trees that focuses
on visualization and model selection techniques in decision tree learning.
Though decision trees are a simple notion, it is not easy to understand and analyze
large decision trees generated from huge data sets. For example, the widely used program
C4.5 produces a decision tree of nearly 18,500 nodes with 2624 leaf nodes from the census
bureau database given recently to the KDD community, which consists of 199,523 instances
described by 40 numeric and symbolic attributes (103 Mbytes). It is extremely difficult for
the user to understand and use a tree that big in its text form. In such cases, a graphic
visualization of discovered decision trees, with different ways of accessing and viewing
trees, is of great support, and it has recently received much attention from KDD researchers
and users. The MineSet system from Silicon Graphics provides a 3D visualization of decision
trees. The CART system (Salford Systems) provides a tree map that can be used to navigate
large decision trees. The interactive visualization system CABRO, together with a newly
proposed technique called T2.5D (which stands for Tree 2.5 Dimensions), offers an alternative,
efficient way for the user to manipulate large trees graphically and interactively in data
mining.
In CABRO, the mining process is concerned with model selection, in which the user tries
different settings of decision tree induction to obtain the most appropriate decision trees.
To help the user understand the effects of the settings on the resulting trees, the tree
visualizer is capable of handling multiple views of different trees at any stage in the
induction. There are several view modes: zoomed, tiny, tightly-coupled, fish-eyed, and
T2.5D. In each mode the user can interactively change the layout of the structure to fit the
current interests.
The tree visualizer helps the user to understand decision trees by providing different
views, each convenient to use in different situations. The available views are:
• Standard: The tree is drawn in proportion: the size of a node depends on the length
of its label, the parent is vertically located at the middle of its children, and siblings
are horizontally aligned.
• Tightly-coupled: The window is divided into two panels; one displays the tree in a
tiny size, the other displays it in a normal size. The first panel is a map for navigating
the tree; the second displays the corresponding area of the tree.
• Fish-eyes: This view distorts the magnified image so that nodes around the center
of interest are displayed at high magnification, and the rest of the tree is
progressively compressed.
• T2.5D: In this view, the Z-order of tree nodes is used to create 3D effects; nodes and
links may overlap. The focused path of the tree is drawn in front and in
highlighted colors.
The tree visualizer allows the user to customize the above views by several operations that
include: zoom, collapse/expand, and view node content.
• Zoom: The user can zoom in or zoom out of the drawn tree.
• Collapse/expand: The user can choose to view some parts of the tree by collapsing
and expanding paths.
• View node content: The user can see the content of a node, such as attribute/value,
class, population, etc.
In model selection, tree visualization has an important role in assisting the user to
understand and interpret generated decision trees. Without its help, the user cannot decide
which decision trees to favor over the others. Moreover, if the mining process is set
to run interactively, the system allows the user to take part at every step of the induction.
He or she can manually choose which attribute will be used to branch at the node under
consideration. The tree visualizer can then display several hypothetical decision trees side
by side to help the user decide which ones are worth developing further. We can say that an
interactive tree visualizer integrated with the system allows the user to use domain
knowledge by actively taking part in mining processes.
Very large hierarchical structures are still difficult to navigate and view even with
tightly-coupled and fish-eye views. To address the problem, we have been developing a
special technique called T2.5D (Tree 2.5 Dimensions). 3D browsers can usually display
more nodes in a compact area of the screen, but they require currently expensive 3D animation
support and the visualized structures are difficult to navigate, while 2D browsers are limited
in how many nodes they can display in one view. The T2.5D technique combines the advantages
of both 2D and 3D drawing techniques to provide the user with a browser that has a low
processing cost and can display more than 1000 nodes in one view; a large number of them may
be partially overlapped, but they are all shown in full size. In T2.5D, a node can be
highlighted or dim. The highlighted nodes are those the user is currently interested in, and
they are displayed in 2D to be viewed and navigated with ease. The dim nodes are displayed in
3D and they allow the user to get an idea of the overall structure of the hierarchy (Figure 2.4).
Figure 2.4. T2.5 Visualization of Large Decision Trees.
2.5 STRENGTHS AND WEAKNESSES OF DECISION TREE METHODS
The Strengths of Decision Tree Methods
The strengths of decision tree methods are:
• Decision trees are able to generate understandable rules.
• Decision trees perform classification without requiring much computation.
• Decision trees are able to handle both continuous and categorical variables.
• Decision trees provide a clear indication of which fields are most important for
prediction or classification.
• Ability to Generate Understandable Rules. The ability of decision trees to
generate rules that can be translated into comprehensible English or SQL is the
greatest strength of this technique. Even when a complex domain or a domain that
does not decompose easily into rectangular regions causes the decision tree to be large
and complex, it is generally fairly easy to follow any one path through the tree. So
the explanation for any particular classification or prediction is relatively
straightforward.
• Ability to Perform in Rule-Oriented Domains. It may sound obvious, but rule
induction in general, and decision trees in particular, are an excellent choice in
domains where there really are rules to be found. The authors had this fact driven
home to them by an experience at Caterpillar. Caterpillar called upon MRJ to help
design and oversee some experiments in data mining. One of the areas where we
felt that data mining might prove useful was in the automatic approval of warranty
repair claims. Many domains, ranging from genetics to industrial processes, really
do have underlying rules, though these may be quite complex and obscured by
noisy data. Decision trees are a natural choice when you suspect the existence of
underlying rules.
• Ease of Calculation at Classification Time. Although, as we have seen, a
decision tree can take many forms, in practice, the algorithms used to produce
decision trees generally yield trees with a low branching factor and simple tests at
each node. Typical tests include numeric comparisons, set membership, and simple
conjunctions. When implemented on a computer, these tests translate into simple
Boolean and integer operations that are fast and inexpensive. This is an important
point because in a commercial environment, a predictive model is likely to be used
to classify many millions or even billions of records.
• Ability to Handle both Continuous and Categorical Variables. Decision-tree
methods are equally adept at handling continuous and categorical variables.
Categorical variables, which pose problems for neural networks and statistical
techniques, come ready-made with their own splitting criteria: one branch for each
category. Continuous variables are equally easy to split by picking a number
somewhere in their range of values.
• Ability to Clearly Indicate Best Fields. Decision-tree building algorithms put
the field that does the best job of splitting the training records at the root node of
the tree.
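To make the first strength concrete, here is a small Python sketch of how one root-to-leaf path could be rendered as a SQL query. The table name, column names, and the `rule_to_sql` helper are hypothetical; the helper simply formats Python literals into the query text, so it is meant for illustration and not for use with untrusted input.

```python
def rule_to_sql(tests, table):
    """Render a list of (column, operator, value) tests as a SQL SELECT statement."""
    where = " AND ".join(f"{column} {op} {value!r}" for column, op, value in tests)
    return f"SELECT * FROM {table} WHERE {where}"


# One hypothetical path through a tree: a categorical test followed by a numeric comparison.
path = [("outlook", "=", "Sunny"), ("humidity", ">", 75)]
query = rule_to_sql(path, "days")
print(query)
```

Each test on the path becomes one condition in the WHERE clause, which is why even a large tree yields rules that are easy to read one path at a time.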
The Weaknesses of Decision Tree Methods
Decision trees are less appropriate for estimation tasks where the goal is to predict the
value of a continuous variable such as income, blood pressure, or interest rate. Decision
trees are also problematic for time-series data unless a lot of effort is put into presenting
the data in such a way that trends and sequential patterns are made visible.
• Error-Prone with Too Many Classes. Some decision-tree algorithms can only
deal with binary-valued target classes (yes/no, accept/reject). Others are able to
assign records to an arbitrary number of classes, but are error-prone when the
number of training examples per class gets small. This can happen rather quickly
in a tree with many levels and/or many branches per node.
• Computationally Expensive to Train. The process of growing a decision tree is
computationally expensive. At each node, each candidate splitting field must be
sorted before its best split can be found. In some algorithms, combinations of fields
are used and a search must be made for optimal combining weights. Pruning
algorithms can also be expensive since many candidate sub-trees must be formed
and compared.
• Trouble with Non-Rectangular Regions. Most decision-tree algorithms only
examine a single field at a time. This leads to rectangular classification boxes that
may not correspond well with the actual distribution of records in the decision
space.
In Summary
A decision tree is a classification scheme whose structure is a tree in which each
node is either a leaf node or a decision node. The original idea for constructing decision
trees is based on the Concept Learning System (CLS), a top-down divide-and-conquer algorithm.
Practical issues in learning decision trees include avoiding over-fitting the data, reduced
error pruning, incorporating continuous-valued attributes, and handling training examples
with missing attribute values. Visualization and model selection in decision tree learning
are described using the CABRO system.
Chapter 3: DATA MINING WITH ASSOCIATION RULES
An appeal of market analysis comes from the clarity and
utility of its results, which are in the form of association rules. The
discovery of association rules can help the retailer develop
strategies by giving insight into purchasing patterns, and it also helps
in inventory management, sales promotion strategies, etc. The
main topics covered in this chapter are:
• Association Rule Analysis.
• Process of Mining Association Rules.
• Strengths and Weaknesses.
3.1 WHEN IS ASSOCIATION RULE ANALYSIS USEFUL?
An appeal of market analysis comes from the clarity and utility of its results, which are
in the form of association rules. There is an intuitive appeal to a market analysis because
it expresses how tangible products and services relate to each other, how they tend to group
together. A rule like “if a customer purchases three-way calling, then that customer will
also purchase call waiting” is clear. Even better, it suggests a specific course of action, like
bundling three-way calling with call waiting into a single service package. While association
rules are easy to understand, they are not always useful. The following three rules are
examples of real rules generated from real data:
• On Thursdays, grocery store consumers often purchase diapers and beer together.
• Customers who purchase maintenance agreements are very likely to purchase large
appliances.
• When a new hardware store opens, one of the most commonly sold items is toilet
rings.
These three examples illustrate the three common types of rules produced by association
rule analysis: the useful, the trivial, and the inexplicable.
The useful rule contains high quality, actionable information. In fact, once the pattern
is found, it is often not hard to justify. The rule about diapers and beer on Thursdays
suggests that on Thursday evenings, young couples prepare for the weekend by stocking up
on diapers for the infants and beer for dad (who, for the sake of argument, we stereotypically
assume is watching football on Sunday with a six-pack). By locating its own brand of
diapers near the aisle containing the beer, a retailer can increase sales of a high-margin
product. Because the rule is easily understood, it suggests plausible causes, leading to other
interventions: placing other baby products within sight of the beer so that customers do not
“forget” anything and putting other leisure foods, like potato chips and pretzels, near the
baby products.
Trivial results are already known by anyone at all familiar with the business. The second
example, “Customers who purchase maintenance agreements are very likely to purchase large
appliances”, is an example of a trivial rule. In fact, we already know that customers purchase
maintenance agreements and large appliances at the same time. Why else would they
purchase maintenance agreements? The maintenance agreements are advertised with large
appliances and are rarely sold separately. This rule, though, was based on analyzing hundreds
of thousands of point-of-sale transactions from Sears. Although it is valid and well-supported
in the data, it is still useless. Similar results abound: people who buy 2-by-4s also purchase
nails; customers who purchase paint buy paint brushes; oil and oil filters are purchased
together, as are hamburgers and hamburger buns, and charcoal and lighter fluid.
A subtler problem falls into the same category. A seemingly interesting result, like the
fact that people who buy the three-way calling option on their local telephone service almost
always buy call waiting, may be the result of marketing programs and product bundles. In
the case of telephone service options, three-way calling is typically bundled with call waiting,
so it is difficult to order it separately. In this case, the analysis is not producing actionable
results; it is producing already acted-upon results. Although a danger for any data mining
technique, association rule analysis is particularly susceptible to reproducing the success of
previous marketing campaigns because of its dependence on un-summarized point-of-sale
data, exactly the same data that defines the success of the campaign. Results from association
rule analysis may simply be measuring the success of previous marketing campaigns.
Inexplicable results seem to have no explanation and do not suggest a course of action.
The third pattern (“When a new hardware store opens, one of the most commonly sold items
is toilet rings”) is intriguing, tempting us with a new fact but providing information that
does not give insight into consumer behavior or the merchandise, or suggest further actions.
In this case, a large hardware company discovered the pattern for new store openings, but
did not figure out how to profit from it. Many items are on sale during the store openings,
but the toilet rings stand out. More investigation might give some explanation: Is the
discount on toilet rings much larger than for other products? Are they consistently placed
in a high-traffic area for store openings but hidden at other times? Is the result an anomaly
from a handful of stores? Are they difficult to find at other times? Whatever the cause, it
is doubtful that further analysis of just the association rule data can give a credible
explanation.
3.2 HOW DOES ASSOCIATION RULE ANALYSIS WORK?
Association rule analysis starts with transactions containing one or more products or
service offerings and some rudimentary information about the transaction. For the purpose
of analysis, we call the products and service offerings items. Table 3.1 illustrates five
transactions in a grocery store that carries five products. These transactions are simplified
to include only the items purchased. How to use information like the date and time and
whether the customer used cash will be discussed later in this chapter. Each of these
transactions gives us information about which products are purchased with which other
products. Using this data, we can create a co-occurrence table that tells the number of times
any pair of products was purchased together (see Table 3.2). For instance, by looking at the
box where the “Soda” row intersects the “OJ” column, we see that two transactions contain
both soda and orange juice. The values along the diagonal (for instance, the value in the
“OJ” column and the “OJ” row) represent the number of transactions containing that
item.
Table 3.1. Grocery Point-of-sale Transactions
Customer   Items
1          orange juice, soda
2          milk, orange juice, window cleaner
3          orange juice, detergent
4          orange juice, detergent, soda
5          window cleaner, soda
The co-occurrence table contains some simple patterns:
• No pair of items is purchased together more often than OJ and soda.
• Detergent is never purchased with window cleaner or milk.
• Milk is never purchased with soda or detergent.
These simple observations are examples of associations and may suggest a formal rule
like: “If a customer purchases soda, then the customer also purchases orange juice”. For now,
we defer discussion of how we find this rule automatically. Instead, we ask the question: How
good is this rule? In the data, two of the five transactions include both soda and orange
juice. These two transactions support the rule. Another way of expressing this is as a
percentage. The support for the rule is two out of five, or 40 percent.
Table 3.2. Co-occurrence of Products
Items            OJ   Cleaner   Milk   Soda   Detergent
OJ                4      1        1      2       2
Window Cleaner    1      2        1      1       0
Milk              1      1        1      0       0
Soda              2      1        0      3       1
Detergent         2      0        0      1       2
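A co-occurrence table of this kind can be computed directly from the transactions. The Python sketch below uses the five transactions exactly as listed in Table 3.1; the function names are hypothetical. Counted this way, orange juice and detergent appear together in two transactions (customers 3 and 4).

```python
from collections import Counter
from itertools import combinations_with_replacement

# The five transactions of Table 3.1, one set of items per customer.
transactions = [
    {"orange juice", "soda"},
    {"milk", "orange juice", "window cleaner"},
    {"orange juice", "detergent"},
    {"orange juice", "detergent", "soda"},
    {"window cleaner", "soda"},
]


def co_occurrence(transactions):
    """Count, for every pair of items (and each item with itself), the transactions containing both."""
    counts = Counter()
    for items in transactions:
        # Sorting gives each unordered pair one canonical key; (a, a) is the diagonal.
        for a, b in combinations_with_replacement(sorted(items), 2):
            counts[(a, b)] += 1
    return counts


def pair_count(counts, first, second):
    """Look up a pair regardless of the order in which the two items are given."""
    return counts[tuple(sorted((first, second)))]


counts = co_occurrence(transactions)
print(pair_count(counts, "soda", "orange juice"))
print(pair_count(counts, "orange juice", "orange juice"))
print(pair_count(counts, "milk", "soda"))
```

Storing only one key per unordered pair keeps the table symmetric by construction, and pairs that never occur simply read as zero.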
Two of the three transactions that contain soda also contain orange juice, so there is a
fairly high degree of confidence in the rule as well: the rule “if soda, then orange juice” has
a confidence of two out of three, or about 67 percent. We are less confident about the inverse
rule, “if orange juice, then soda”, because of the four transactions with orange juice, only two
also have soda. Its confidence, then, is just 50 percent. More formally, confidence is the ratio
of the number of transactions supporting the rule to the number of transactions where the
conditional part of the rule holds. Another way of saying this is that confidence is the ratio
of the number of transactions with all the items to the number of transactions with just the
“if” items.
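The definitions of support and confidence translate directly into code. The sketch below computes both from the five transactions as listed in Table 3.1; the function names are hypothetical. Soda appears in three of those transactions (customers 1, 4 and 5), two of which also contain orange juice, so the count-based confidence of “if soda, then orange juice” comes out as two out of three.

```python
# The five transactions of Table 3.1, one set of items per customer.
transactions = [
    {"orange juice", "soda"},
    {"milk", "orange juice", "window cleaner"},
    {"orange juice", "detergent"},
    {"orange juice", "detergent", "soda"},
    {"window cleaner", "soda"},
]


def support(transactions, items):
    """Fraction of all transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, if_items, then_items):
    """Among transactions containing the "if" items, the fraction that also contain the "then" items."""
    if_items, then_items = set(if_items), set(then_items)
    matching = [t for t in transactions if if_items <= t]
    if not matching:
        return 0.0  # assumption: a rule whose condition never holds gets confidence 0
    return sum(then_items <= t for t in matching) / len(matching)


rule_support = support(transactions, {"soda", "orange juice"})
soda_to_oj = confidence(transactions, {"soda"}, {"orange juice"})
oj_to_soda = confidence(transactions, {"orange juice"}, {"soda"})
print(rule_support, soda_to_oj, oj_to_soda)
```

Support is symmetric in the two items, while confidence depends on which item is placed in the “if” part of the rule.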
3.3 THE BASIC PROCESS OF MINING ASSOCIATION RULES
The basic process for association rule analysis involves three important concerns:
• Choosing the right set of items.
• Generating rules by deciphering the counts in the co-occurrence matrix.
• Overcoming the practical limits imposed by thousands or tens of thousands of items
appearing in combinations large enough to be interesting.
Choosing the Right Set of Items. The data used for association rule analysis is typically
the detailed transaction data captured at the point of sale. Gathering and using this data
is a critical part of applying association rule analysis, depending crucially on the items
chosen for analysis. What constitutes a particular item depends on the business need.
Within a grocery store where there are tens of thousands of products on the shelves, a
frozen pizza might be considered an item for analysis purposes, regardless of its toppings
(extra cheese, pepperoni, or mushrooms), its crust (extra thick, whole wheat, or white), or
its size. So, the purchase of a large whole wheat vegetarian pizza contains the same “frozen
pizza” item as the purchase of a single-serving pepperoni with extra cheese. A sample of
such transactions at this summarized level might look like Table 3.3.
Table 3.3. Transactions with More Summarized Items
Pizza Milk Sugar Apples Coffee
1 √
2 √ √
3 √ √ √
4 √ √
5 √ √ √ √
On the other hand, the manager of frozen foods or a chain of pizza restaurants may be
very interested in the particular combinations of toppings that are ordered. He or she might
decompose a pizza order into constituent parts, as shown in Table 3.4.
Table 3.4. Transactions with More Detailed Items
Cheese Onions Peppers Mush. Olives
1 √ √ √
2 √
3 √ √ √
4 √
5 √ √ √ √
At some later point in time, the grocery store may become interested in more detail in
its transactions, so the single “frozen pizza” item would no longer be sufficient. Or, the pizza
restaurants might broaden their menu choices and become less interested in all the different
toppings. The items of interest may change over time. This can pose a problem when trying
to use historical data if the transaction data has been summarized.
Choosing the right level of detail is a critical consideration for the analysis. If the
transaction data in the grocery store keeps track of every type, brand, and size of frozen
pizza (which probably account for several dozen products), then all these items need to be
mapped to the “frozen pizza” item for analysis.
Taxonomies Help to Generalize Items. In the real world, items have product codes and
stock-keeping unit codes (SKUs) that fall into hierarchical categories, called a taxonomy.
When approaching a problem with association rule analysis, what level of the taxonomy is
the right one to use? This brings up issues such as:
• Are large fries and small fries the same product?
• Is the brand of ice cream more relevant than its flavor?
• Which is more important: the size, style, pattern, or designer of clothing?
• Is the energy-saving option on a large appliance indicative of customer behavior?
The number of combinations to consider grows very fast as the number of items used
in the analysis increases. This suggests using items from higher levels of the taxonomy,
“frozen desserts” instead of “ice cream”. On the other hand, the more specific the items are,
the more likely the results are to be actionable. Knowing what sells with a particular brand
of frozen pizza, for instance, can help in managing the relationship with the producer. One
compromise is to use more general items initially, then to repeat the rule generation on
more specific items. As the analysis focuses on more specific items, use only the subset
of transactions containing those items.
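To give a rough sense of how quickly the number of combinations grows, the sketch below counts the possible item sets of a given size with Python's standard `math.comb`. The item counts are illustrative figures chosen for the example, not values taken from the text, and `candidate_item_sets` is a hypothetical helper name.

```python
import math


def candidate_item_sets(num_items, set_size):
    """Number of distinct item combinations of the given size."""
    return math.comb(num_items, set_size)


for num_items in (100, 1000, 10000):
    pairs = candidate_item_sets(num_items, 2)
    triples = candidate_item_sets(num_items, 3)
    print(f"{num_items} items: {pairs} pairs, {triples} triples")
```

Moving from 100 to 10,000 items multiplies the number of pairs by about ten thousand, which is one reason to start the analysis with generalized items.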
The complexity of a rule refers to the number of items it contains. The more items in
the transactions, the longer it takes to generate rules of a given complexity. So, the desired
complexity of the rules also determines how specific or general the items should be. In some
circumstances, customers do not make large purchases. For instance, customers purchase
relatively few items at any one time at a convenience store or through some catalogs, so
looking for rules containing four or more items may apply to very few transactions and be
a wasted effort. In other cases, like in a supermarket, the average transaction is larger, so
more complex rules are useful.
Moving up the taxonomy hierarchy reduces the number of items. Dozens or hundreds
of items may be reduced to a single generalized item, often corresponding to a single
department or product line. An item like a pint of Ben & Jerry’s Cherry Garcia gets
generalized to “ice cream” or “frozen desserts”. Instead of investigating “orange juice”,
investigate “fruit juices”. Instead of looking at 2 percent milk, map it to “dairy products”.
Often, the appropriate level of the hierarchy ends up matching a department with a
product-line manager, so that using generalized items has the practical effect of finding
interdepartmental relationships. Because the structure of the organization is likely to hide
relationships between departments, these relationships are more likely to be actionable.
Generalized items also help find rules with sufficient support. Many times more
transactions support items at higher levels of the taxonomy than at lower levels.
Just because some items are generalized, it does not mean that all items need to move
up to the same level. The appropriate level depends on the item, on its importance for
producing actionable results, and on its frequency in the data. For instance, in a department
store big-ticket items (like appliances) might stay at a low level in the hierarchy while less
expensive items (such as books) might be higher. This hybrid approach is also useful when
looking at individual products. Since there are often thousands of products in the data,
generalize everything except the product or products of interest.
Association rule analysis produces the best results when the items occur in roughly the
same number of transactions in the data. This helps to prevent rules from being dominated
by the most common items. Taxonomies can help here. Roll up rare items to higher levels
in the taxonomy so that they become more frequent. More common items may not have to be
rolled up at all.
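The roll-up of rare items can be sketched in a few lines of Python. This is only an illustration: the taxonomy mapping, the sample transactions, and the `min_count` threshold are all hypothetical and do not come from the text.

```python
from collections import Counter

# Hypothetical taxonomy: each specific item maps to its parent category.
TAXONOMY = {
    "Cherry Garcia pint": "ice cream",
    "vanilla ice cream": "ice cream",
    "orange juice": "fruit juices",
    "apple juice": "fruit juices",
    "2 percent milk": "dairy products",
}


def roll_up_rare_items(transactions, taxonomy, min_count):
    """Replace items seen in fewer than min_count transactions by their parent."""
    counts = Counter(item for t in transactions for item in set(t))
    rolled = []
    for t in transactions:
        rolled.append({
            taxonomy.get(item, item) if counts[item] < min_count else item
            for item in t
        })
    return rolled


# Made-up transactions: milk is common, everything else is rare.
transactions = [
    {"Cherry Garcia pint", "2 percent milk"},
    {"vanilla ice cream", "2 percent milk"},
    {"orange juice", "2 percent milk"},
    {"apple juice", "vanilla ice cream"},
]
rolled = roll_up_rare_items(transactions, TAXONOMY, min_count=3)
```

Here the common item (milk) stays at its specific level while the rare items are replaced by their categories, which is the hybrid approach described above.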
Generating Rules from All This Data. Calculating the number of times that a given
combination of items appears in the transaction data is well and good, but a combination
of items is not a rule. Sometimes, just the combination is interesting in itself, as in the
diaper, beer, and Thursday example. But in other circumstances, it makes more sense to
find an underlying rule. What is a rule? A rule has two parts, a condition and a result, and
is usually represented as a statement:
If condition then result.
If the rule says
If 3-way calling then call-waiting.
we read it as: “if a customer has 3-way calling, then the customer also has call-waiting”. In
practice, the most actionable rules have just one item as the result. So, a rule like
If diapers and Thursday, then beer
is more useful than
If Thursday, then diapers and beer.
Constructs like the co-occurrence table provide the information about which combinations
of items occur most commonly in the transactions. For the sake of illustration, let’s say the
most common combination has three items, A, B, and C. The only rules to consider are those
with all three items in the rule and with exactly one item in the result:
If A and B, then C
If A and C, then B
If B and C, then A
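Enumerating the candidate rules for a combination is mechanical. The sketch below (the function name `single_result_rules` is made up for illustration) lists every rule that uses all items of a combination and has exactly one item in the result.

```python
def single_result_rules(combination):
    """Yield (condition, result) pairs with exactly one item in the result."""
    items = sorted(combination)
    for result in items:
        condition = tuple(i for i in items if i != result)
        yield condition, result


rules = list(single_result_rules({"A", "B", "C"}))
for condition, result in rules:
    print("If " + " and ".join(condition) + ", then " + result)
```

A combination of n items yields n such rules, so the number of candidate rules stays small even when the number of combinations is large.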
What about their confidence level? Confidence is the ratio of the number of transactions
with all the items in the rule to the number of transactions with just the items in the
condition. What is confidence really saying? Saying that the rule “if B and C then A” has
a confidence of 0.33 is equivalent to saying that when B and C appear in a transaction, there
is a 33 percent chance that A also appears in it. That is, one time in three A occurs with
B and C, and the other two times, A does not.
The most confident rule is the best rule, so we are tempted to choose “if B and C then
A”. But there is a problem. This rule is actually worse than just randomly saying that A
appears in the transaction. A occurs in 45 percent of the transactions but the rule only gives
33 percent confidence. The rule does worse than just randomly guessing. This suggests
another measure called improvement. Improvement tells how much better a rule is at
predicting the result than just assuming the result in the first place. It is given by the
following formula:
$$\text{improvement} = \frac{p(\text{condition and result})}{p(\text{condition})\, p(\text{result})}$$
When improvement is greater than 1, then the resulting rule is better at predicting the
result than random chance. When it is less than 1, it is worse. The rule “if A then B” is 1.31
times better at predicting when B is in a transaction than randomly guessing. In this case,
as in many cases, the best rule actually contains fewer items than other rules being considered.
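Confidence and improvement can both be computed directly from transaction counts. The following is a minimal sketch; the ten transactions are invented for illustration and do not reproduce the probabilities quoted in the text (they give the same confidence of one in three, but a different frequency for A).

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, condition, result):
    """p(condition and result) / p(condition)."""
    both = support(transactions, set(condition) | {result})
    return both / support(transactions, condition)


def improvement(transactions, condition, result):
    """p(condition and result) / (p(condition) * p(result))."""
    return confidence(transactions, condition, result) / support(transactions, [result])


# Made-up data: A is in 6 of 10 transactions, B and C together in 3, all three in 1.
transactions = [
    {"A", "B", "C"}, {"B", "C"}, {"B", "C"}, {"A"}, {"A"},
    {"A", "B"}, {"A"}, {"C"}, {"B"}, {"A", "C"},
]
conf = confidence(transactions, ["B", "C"], "A")   # one time in three
impr = improvement(transactions, ["B", "C"], "A")  # below 1: worse than guessing A
```

Because improvement divides confidence by the plain frequency of the result, a rule with respectable confidence can still score below 1 when its result is very common.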
When improvement is less than 1, negating the result produces a better rule. If the rule
If B and C then A
has a confidence of 0.33, then the rule
If B and C then NOT A
has a confidence of 0.67. Since A appears in 45 percent of the transactions, it does NOT
occur in 55 percent of them. Applying the same improvement measure shows that the
improvement of this new rule is 1.22 (0.67/0.55). The negative rule is useful. The rule “If A
and B then NOT C” has an improvement of 1.33, better than any of the other rules. Rules
are generated from the basic probabilities available in the co-occurrence table. Useful rules
have an improvement that is greater than 1. When the improvement scores are low, you can
increase them by negating the rules. However, you may find that negated rules are not as
useful as the original association rules when it comes to acting on the results.
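The arithmetic for the negated rule can be checked with a short calculation. The helper name `negated_rule_stats` is hypothetical; the two inputs (confidence 0.33, and A present in 45 percent of transactions) are the figures quoted above.

```python
def negated_rule_stats(conf, p_result):
    """Confidence and improvement of 'if condition then NOT result'."""
    neg_conf = 1.0 - conf            # the result is absent whenever it is not present
    p_not_result = 1.0 - p_result    # baseline frequency of the negated result
    return neg_conf, neg_conf / p_not_result


neg_conf, neg_improvement = negated_rule_stats(conf=0.33, p_result=0.45)
original_improvement = 0.33 / 0.45   # about 0.73, so the original rule is worse than chance
print(round(neg_conf, 2), round(neg_improvement, 2))
```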
Overcoming Practical Limits. Generating association rules is a multi-step process. The
general algorithm is:
• Generate the co-occurrence matrix for single items.
• Generate the co-occurrence matrix for two items. Use this to find rules with two
items.
• Generate the co-occurrence matrix for three items. Use this to find rules with three
items.
• And so on.
For instance, in the grocery store that sells orange juice, milk, detergent, soda, and
window cleaner, the first step calculates the counts for each of these items. During the
second step, the following counts are created:
• OJ and milk, OJ and detergent, OJ and soda, OJ and cleaner.
• Milk and detergent, milk and soda, milk and cleaner.
• Detergent and soda, detergent and cleaner.
• Soda and cleaner.
This is a total of 10 counts. The third pass takes all combinations of three items and
so on. Of course, each of these stages may require a separate pass through the data, or
multiple stages can be combined into a single pass by considering different numbers of
combinations at the same time.
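The passes described above can be sketched with the standard library. The five item names come from the grocery example; the sample transactions and the function name `count_combinations` are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

ITEMS = ["OJ", "milk", "detergent", "soda", "cleaner"]


def count_combinations(transactions, size):
    """Count how many transactions contain each combination of `size` items."""
    counts = Counter()
    for t in transactions:
        # Sorting gives each combination one canonical ordering.
        for combo in combinations(sorted(t), size):
            counts[combo] += 1
    return counts


# All possible pairs of the five items: the 10 counts of the second step.
pairs = list(combinations(ITEMS, 2))

# Hypothetical transactions for illustration.
transactions = [
    {"OJ", "soda"},
    {"milk", "OJ", "cleaner"},
    {"OJ", "detergent"},
    {"OJ", "detergent", "soda"},
    {"cleaner", "soda"},
]
pair_counts = count_combinations(transactions, 2)
```

Calling `count_combinations(transactions, 3)` performs the third pass in the same way.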
Although it is not obvious when there are just five items, increasing the number of
items in the combinations requires exponentially more computation. This results in
exponentially growing run times and long, long waits when considering combinations with
more than three or four items. The solution is pruning. Pruning is a technique for reducing
the number of items and combinations of items being considered at each step. At each stage,
the algorithm throws out a certain number of combinations that do not meet some threshold
criterion.
The most common pruning mechanism is called minimum support pruning. Recall that
support refers to the number of transactions in the database where the rule holds. Minimum
support pruning requires that a rule hold on a minimum number of transactions. For
instance, if there are 1 million transactions and the minimum support is 1 percent, then
only rules supported by at least 10,000 transactions are of interest. This makes sense, because
the purpose of generating these rules is to pursue some sort of action (such as putting
own-brand diapers in the same aisle as beer), and the action must affect enough transactions
to be worthwhile.
The minimum support constraint has a cascading effect. Say we are considering a rule
with four items in it, like
If A, B, and C, then D.
Using minimum support pruning, this rule has to be true on at least 10,000 transactions
in the data. It follows that:
A must appear in at least 10,000 transactions; and,
B must appear in at least 10,000 transactions; and,
C must appear in at least 10,000 transactions; and,
D must appear in at least 10,000 transactions.
In other words, minimum support pruning eliminates items that do not appear in
enough transactions! There are two ways to do this. The first way is to eliminate the items
from consideration. The second way is to use the taxonomy to generalize the items so the
resulting generalized items meet the threshold criterion.
The threshold criterion applies to each step in the algorithm. The minimum threshold
also implies that:
A and B must appear together in at least 10,000 transactions; and,
A and C must appear together in at least 10,000 transactions; and,
A and D must appear together in at least 10,000 transactions;
And so on.
Each step of the calculation of the co-occurrence table can eliminate combinations of
items that do not meet the threshold, reducing its size and the number of combinations to
consider during the next pass. The best choice for minimum support depends on the data
and the situation. It is also possible to vary the minimum support as the algorithm progresses.
For instance, using different levels at different stages you can find uncommon combinations
of common items (by decreasing the support level for successive steps) or relatively common
combinations of uncommon items (by increasing the support level). Varying the minimum
support helps to find actionable rules, so the rules generated are not all like finding that
peanut butter and jelly are often purchased together.
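The cascading effect of minimum support can be sketched as a level-wise search. This is an illustrative sketch in the spirit of what is commonly called the Apriori approach (a name the text itself does not use); the function name, the sample data, and the fixed threshold are assumptions.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size):
    """Level-wise search keeping combinations found in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    current = [frozenset([i]) for i in items]
    frequent = {}
    for size in range(1, max_size + 1):
        counts = {c: sum(c <= t for t in transactions) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Build the next level only from surviving items, and keep a candidate
        # only if every one of its smaller sub-combinations also survived.
        alive = sorted({i for c in survivors for i in c})
        current = [
            frozenset(combo)
            for combo in combinations(alive, size + 1)
            if all(frozenset(sub) in survivors for sub in combinations(combo, size))
        ]
        if not current:
            break
    return frequent


# Hypothetical data: D appears only once and is pruned in the first pass.
data = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"A", "B", "C"}, {"D"}]
result = frequent_itemsets(data, min_support=2, max_size=3)
```

Varying the threshold per level, as suggested above, would only require passing a list of thresholds instead of the single `min_support` value.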
3.4 THE PROBLEM OF LARGE DATASETS
A typical fast-food restaurant offers several dozen items on its menu; say there
are 100. To use probabilities to generate association rules, counts have to be calculated for
each combination of items. The number of combinations of a given size tends to grow
exponentially. A combination with three items might be a small fries, cheeseburger, and
medium diet Coke. On a menu with 100 items, how many combinations are there with three
menu items? There are 161,700! (This is based on the binomial formula from mathematics.)
On the other hand, a typical supermarket has at least 10,000 different items in stock, and
more typically 20,000 or 30,000.
Calculating the support, confidence, and improvement quickly gets out of hand as the
number of items in the combinations grows. There are almost 50 million possible combinations
of two items in the grocery store and over 100 billion combinations of three items. Although
computers are getting faster and cheaper, it is still very expensive to calculate the counts
for this number of combinations. Calculating the counts for five or more items is prohibitively
expensive. The use of taxonomies reduces the number of items to a manageable size.
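The counts quoted above follow from the binomial coefficient and can be checked with `math.comb` (Python 3.8 or later). The grocery figures assume the lower bound of 10,000 distinct items mentioned in the text.

```python
from math import comb

menu_triples = comb(100, 3)        # three-item combinations on a 100-item menu
grocery_pairs = comb(10_000, 2)    # two-item combinations among 10,000 items
grocery_triples = comb(10_000, 3)  # three-item combinations among 10,000 items

print(menu_triples, grocery_pairs, grocery_triples)
```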
The number of transactions is also very large. In the course of a year, a decent-size
chain of supermarkets will generate tens of millions of transactions. Each of these transactions
consists of one or more items, often several dozen at a time. So, determining whether a
particular combination of items is present in a particular transaction may require only a bit
of effort, but that effort is multiplied a million-fold across all the transactions.
3.5 STRENGTHS AND WEAKNESSES OF ASSOCIATION RULES ANALYSIS
The strengths of association rule analysis are:
• It produces clear and understandable results.
• It supports undirected data mining.
• It works on variable-length data.
• The computations it uses are simple to understand.
• Results are clearly understood. The results of association rule analysis are
association rules; these are readily expressed in English or as a statement in a
query language such as SQL. The expression of patterns in the data as “if-then”
rules makes the results easy to understand and facilitates turning the results into
action. In some circumstances, merely the set of related items is of interest and
rules do not even need to be produced.
• Association rule analysis is strong for undirected data mining. Undirected
data mining is very important when approaching a large set of data and you do not
know where to begin. Association rule analysis is an appropriate technique, when
it can be applied, to analyze data and to get a start. Most data mining techniques
are not primarily used for undirected data mining. Association rule analysis, on the
other hand, is used in this case and provides clear results.
• Association rule analysis works on variable-length data. Association rule
analysis can handle variable-length data without the need for summarization. Other
techniques tend to require records in a fixed format, which is not a natural way to
represent items in a transaction. Association rule analysis can handle transactions
without any loss of information.
• Computationally simple. The computations needed to apply association rule
analysis are rather simple, although the number of computations grows very quickly
with the number of transactions and the number of different items in the analysis.
Smaller problems can be set up on the desktop using a spreadsheet. This makes the
technique more comfortable to use than complex techniques, like genetic algorithms
or neural networks.
The weaknesses of association rule analysis are:
• It requires exponentially more computational effort as the problem size grows.
• It has limited support for attributes of the data.
• It is difficult to determine the right number of items.
• It discounts rare items.
• Exponential growth as problem size increases. The computations required to
generate association rules grow exponentially with the number of items and the
complexity of the rules being considered. The solution is to reduce the number of
items by generalizing them. However, more general items are usually less actionable.
Methods to control the number of computations, such as minimum support pruning,
may eliminate important rules from consideration.
• Limited support for data attributes. Association rule analysis is a technique
specialized for items in a transaction. Items are assumed to be identical except for
one identifying characteristic, such as the product type. When applicable, association
rule analysis is very powerful. However, not all problems fit this description. The
use of item taxonomies and virtual items helps to make rules more expressive.
• Determining the right items. Probably the most difficult problem when applying
association rule analysis is determining the right set of items to use in the analysis.
By generalizing items up their taxonomy, you can ensure that the frequencies of
the items used in the analysis are about the same. Although this generalization
process loses some information, virtual items can then be reinserted into the analysis
to capture information that spans generalized items.
• Association rule analysis has trouble with rare items. Association rule analysis
works best when all items have approximately the same frequency in the data.
Items that rarely occur are in very few transactions and will be pruned. Modifying
the minimum support threshold to take into account product value is one way to ensure
that expensive items remain in consideration, even though they may be rare in the
data. The use of item taxonomies can ensure that rare items are rolled up and
included in the analysis in some form.
In Summary
The appeal of association rules as a form of market analysis comes from the clarity and
utility of their results. Association rule analysis starts with transactions, and the basic
process involves three important concerns: choosing the right set of items, generating rules,
and overcoming practical limits.
Chapter 4: AUTOMATIC CLUSTERING DETECTION
Clustering is a method of grouping data into different groups
and is used for identifying a finite set of categories or clusters
to describe the data. The main topics that are covered in this
chapter are:
• K-Means Method
• Agglomeration Methods
• Evaluating Clusters
• Strengths and Weaknesses.
INTRODUCTION
When human beings try to make sense of complex questions, our natural tendency is
to break the subject into smaller pieces, each of which can be explained more simply.
Clustering is a technique used for combining observed objects into groups or clusters such
that:
• Each group or cluster is homogeneous or compact with respect to certain
characteristics. That is, objects in each group are similar to each other.
• Each group should be different from other groups with respect to the same
characteristics; that is, objects of one group should be different from the objects of
other groups.
4.1 SEARCHING FOR CLUSTERS
For most data mining tasks, we start out with a pre-classified training set and attempt
to develop a model capable of predicting how a new record will be classified. In clustering,
there is no pre-classified data and no distinction between independent and dependent
variables. Instead, we are searching for groups of records (the clusters) that are similar to
one another, in the expectation that similar records represent similar customers or suppliers
or products that will behave in similar ways.
Automatic cluster detection is rarely used in isolation because finding clusters is not an
end in itself. Once clusters have been detected, other methods must be applied in order to
figure out what the clusters mean. When clustering is successful, the results can be dramatic:
one famous early application of cluster detection led to our current understanding of stellar
evolution.
Star Light, Star Bright. Early in the twentieth century, astronomers trying to understand the
relationship between the luminosity (brightness) of stars and their temperatures made
scatter plots like the one in Figure 4.1. The vertical scale measures luminosity in multiples
of the brightness of our own sun. The horizontal scale measures surface temperature in
degrees Kelvin (degrees centigrade above absolute zero, the theoretical coldest possible
temperature, where molecular motion ceases). As you can see, the stars plotted by the
astronomers Hertzsprung and Russell fall into three clusters. We now understand that these
three clusters represent stars in very different phases of the stellar life cycle.
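[Figure: scatter plot of luminosity (vertical axis, logarithmic scale, in multiples of the sun's brightness) against surface temperature (horizontal axis, 40,000 down to 2,500), with three labelled clusters: White Dwarfs, Red Giants, and Main Sequence]
Figure 4.1. The Hertzsprung-Russell Diagram Clusters Stars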
The relationship between luminosity and temperature is consistent within each cluster,
but the relationship is different in each cluster because a fundamentally different process
is generating the heat and light. The 80 percent of stars that fall on the main sequence are
generating energy by converting hydrogen to helium through nuclear fusion. This is how all
stars spend most of their life.
But after 10 billion years or so, the hydrogen gets used up. Depending on the star’s
mass, it then begins fusing helium or the fusion stops. In the latter case, the core of the star
begins to collapse, generating a great deal of heat in the process. At the same time, the outer
layer of gases expands away from the core. A red giant is formed. Eventually, the outer
layer of gases is stripped away and the remaining core begins to cool. The star is now a
white dwarf. A recent query of the Alta Vista web index using the search terms “HR
diagram” and “main sequence” returned many pages of links to current astronomical research
based on cluster detection of this kind. This simple, two-variable cluster diagram is being
used today to hunt for new kinds of stars like “brown dwarfs” and to understand main
sequence stellar evolution.
Fitting the troops. We chose the Hertzsprung-Russell diagram as our introductory example
of clustering because with only two variables, it is easy to spot the clusters visually. Even
in three dimensions, it is easy to pick out clusters by eye from a scatter plot cube. If all
problems had so few dimensions, there would be no need for automatic cluster detection
algorithms. As the number of dimensions (independent variables) increases, our ability to
visualize clusters and our intuition about the distance between two points quickly break
down.
When we speak of a problem as having many dimensions, we are making a geometric
analogy. We consider each of the things that must be measured independently in order to
describe something to be a dimension. In other words, if there are N variables, we imagine
a space in which the value of each variable represents a distance along the corresponding
axis in an N-dimensional space. A single record containing a value for each of the N variables
can be thought of as the vector that defines a particular point in that space.
4.2 THE K-MEANS METHOD
The K-means method of cluster detection is the most commonly used in practice. It has
many variations, but the form described here was first published by J.B. MacQueen in 1967.
For ease of drawing, we illustrate the process using two-dimensional diagrams, but bear in
mind that in practice we will usually be working in a space of many more dimensions. That
means that instead of points described by a two-element vector $(x_1, x_2)$, we work with points
described by an n-element vector $(x_1, x_2, \ldots, x_n)$. The procedure itself is unchanged.
In the first step, we select K data points to be the seeds. MacQueen’s algorithm simply
takes the first K records. In cases where the records have some meaningful order, it may
be desirable to choose widely spaced records instead. Each of the seeds is an embryonic
cluster with only one element. In this example, we use outside information about the data
to set the number of clusters to 3.
In the second step, we assign each record to the cluster whose centroid is nearest. In
Figure 4.2 we have done the first two steps. Drawing the boundaries between the clusters
is easy if you recall that given two points, X and Y, all points that are equidistant from X
and Y fall along a line that passes through the midpoint of the line segment that joins X and
Y and is perpendicular to it. In Figure 4.2, the initial seeds are joined by dashed lines and the
cluster boundaries constructed from them are solid lines. Of course, in three dimensions, these
boundaries would be planes and in N dimensions they would be hyperplanes of dimension
N–1.
[Figure: scatter plot on axes X1 and X2 showing Seed 1, Seed 2, and Seed 3 with the initial cluster boundaries]
Figure 4.2. The Initial Seeds Determine the Initial Cluster Boundaries
As we continue to work through the K-means algorithm, pay particular attention to the
fate of the point with the box drawn around it. On the basis of the initial seeds, it is assigned
to the cluster controlled by seed number 2 because it is closer to that seed than to either
of the others.
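The assignment step can be sketched as follows. The seed and point coordinates are made up for illustration (they are not read off Figure 4.2), and `math.dist` requires Python 3.8 or later.

```python
from math import dist


def nearest_cluster(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda k: dist(point, centroids[k]))


# Hypothetical seeds 1, 2, and 3 (indices 0, 1, and 2) and a few records.
seeds = [(1.0, 5.0), (4.0, 1.0), (8.0, 4.0)]
points = [(1.5, 4.0), (4.5, 2.0), (7.0, 5.0), (3.0, 2.5)]
assignments = [nearest_cluster(p, seeds) for p in points]
```

The last point, for example, lands in the second cluster simply because its distance to that seed is the smallest of the three.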
At this point, every point has been assigned to one or another of the three clusters
centered about the original seeds. The next step is to calculate the centroids of the new
clusters. This is simply a matter of averaging the positions of each point in the cluster along
each dimension. If there are 200 records assigned to a cluster and we are clustering based
on four fields from those records, then geometrically we have 200 points in a 4-dimensional
space. The location of each point is described by a vector of the values of the four fields. The
vectors have the form $(X_1, X_2, X_3, X_4)$. The value of $X_1$ for the new centroid is the mean of
all 200 $X_1$s, and similarly for $X_2$, $X_3$ and $X_4$.
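The centroid calculation is a per-dimension mean. A minimal sketch, using three made-up four-field records rather than 200:

```python
def centroid(points):
    """Mean position of the points along each dimension."""
    n = len(points)
    # zip(*points) groups the values of each dimension together.
    return tuple(sum(coords) / n for coords in zip(*points))


cluster = [(1.0, 2.0, 3.0, 4.0), (3.0, 4.0, 5.0, 6.0), (5.0, 6.0, 7.0, 8.0)]
c = centroid(cluster)
```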
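[Figure: scatter plot on axes x1 and x2; crosses mark the new centroids and arrows show their movement from the original seeds]
Figure 4.3. Calculating the Centroids of the New Clusters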
In Figure 4.3 the new centroids are marked with a cross. The arrows show the motion
from the position of the original seeds to the new centroids of the clusters formed from those
seeds. Once the new centroids have been found, each point is once again assigned to the
cluster with the closest centroid. Figure 4.4 shows the new cluster boundaries, formed as
before by drawing lines equidistant between each pair of centroids. Notice that the point
with the box around it, which was originally assigned to cluster number 2, has now been
assigned to cluster number 1. The process of assigning points to clusters and then recalculating
centroids continues until the cluster boundaries stop changing.
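Putting the steps together gives the following sketch of the whole loop. It follows the description above (first K records as seeds, reassign, recompute centroids, stop when assignments no longer change); the iteration cap, the handling of empty clusters, and the sample points are assumptions added for illustration.

```python
from math import dist


def k_means(points, k, max_iterations=100):
    """K-means using the first k records as seeds, as described in the text."""
    centroids = [tuple(p) for p in points[:k]]
    assignments = None
    for _ in range(max_iterations):
        new_assignments = [
            min(range(k), key=lambda j: dist(p, centroids[j])) for p in points
        ]
        if new_assignments == assignments:
            break  # cluster membership stopped changing
        assignments = new_assignments
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments


# Made-up two-dimensional records forming two obvious groups.
records = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
centroids, assignments = k_means(records, k=2)
```

Comparing assignments between iterations is equivalent to checking that the boundaries have stopped moving, since the boundaries are determined entirely by the centroids.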
Similarity, Association, and Distance
After reading the preceding description of the K-means algorithm, we hope you agree
that once the records in a database have been mapped to points in space, automatic cluster
detection is really quite simple: a little geometry, some vector means, and that’s all. The
problem, of course, is that the databases we encounter in marketing, sales, and customer
support are not about points in space. They are about purchases, phone calls, airplane trips,
car registrations, and a thousand other things that have no obvious connection to the dots
in a cluster diagram. When we speak of clustering records of this sort, we have an intuitive
notion that members of a cluster have some kind of natural association, that they are more
similar to each other than to records in another cluster. Since it is difficult to convey
intuitive notions to a computer, we must translate the vague concept of association into
some sort of numeric measure of the degree of similarity. The most common method, but by
no means the only one, is to translate all fields into numeric values so that the records may
be treated as points in space. Then, if two points are close in the geometric sense, we assume
that they represent similar records in the database. Two main problems with this approach
are:
1. Many variable types, including all categorical variables and many numeric variables
such as rankings, do not have the right behavior to be treated properly as components
of a position vector.
2. In geometry, the contributions of each dimension are of equal importance, but in
our databases, a small change in one field may be much more important than a
large change in another field.
[Figure: scatter plot on axes X1 and X2 showing the recalculated cluster boundaries]
Figure 4.4. At Each Iteration All Cluster Assignments are Reevaluated
A Variety of Variables. Variables can be categorized in various ways: by mathematical
properties (continuous, discrete), by storage type (character, integer, floating point), and by
other properties (quantitative, qualitative). For this discussion, however, the most important
classification is how much the variable can tell us about its placement along the axis that
corresponds to it in our geometric model. For this purpose, we can divide variables into four
classes, listed here in increasing order of suitability for the geometric model: Categories,
Ranks, Intervals, True measures.
Categorical variables only tell us to which of several unordered categories a thing
belongs. We can say that this ice cream is pistachio while that one is mint-cookie, but we
cannot say that one is greater than the other or judge which one is closer to black cherry.
In mathematical terms, we can tell that X ≠ Y, but not whether X > Y or X < Y.
Ranks allow us to put things in order, but don’t tell us how much bigger one thing is
than another. The valedictorian has better grades than the salutatorian, but we don’t know
by how much. If X, Y, and Z are ranked 1, 2, and 3, we know that X > Y > Z, but not whether
(X–Y) > (Y–Z).
Intervals allow us to measure the distance between two observations. If we are told that
it is 56° in San Francisco and 78° in San Jose, we know that it is 22 degrees warmer at one
end of the bay than the other.
True measures are interval variables that measure from a meaningful zero point. This
trait is important because it means that the ratio of two values of the variable is meaningful.
The Fahrenheit temperature scale used in the United States and the Celsius scale used in
most of the rest of the world do not have this property. In neither system does it make sense
to say that a 30° day is twice as warm as a 15° day. Similarly, a size 12 dress is not twice
as large as a size 6, and gypsum is not twice as hard as talc though they are 2 and 1 on the
hardness scale. It does make perfect sense, however, to say that a 50-year-old is twice as
old as a 25-year-old or that a 10-pound bag of sugar is twice as heavy as a 5-pound one. Age,
weight, length, and volume are examples of true measures.
Geometric distance metrics are well-defined for interval variables and true measures.
In order to use categorical variables and rankings, it is necessary to transform them into
interval variables. Unfortunately, these transformations add spurious information. If we
number ice cream flavors 1 to 28, it will appear that flavors 5 and 6 are closely related while
flavors 1 and 28 are far apart. The inverse problem arises when we transform interval
variables and true measures into ranks or categories. As we go from age (true measure) to
seniority (position on a list) to broad categories like “veteran” and “new hire”, we lose
information at each step.
Formal Measures of Association
There are dozens if not hundreds of published techniques for measuring the similarity
of two records. Some have been developed for specialized applications such as comparing
passages of text. Others are designed especially for use with certain types of data such as
binary variables or categorical variables. Of the three we present here, the first two are
suitable for use with interval variables and true measures while the third is suitable for
categorical variables.
The Distance between Two Points. Each fi el d i n a record becomes one el ement i n a
vector descri bi ng a poi nt i n space. The di stance between two poi nts i s used as the measure
of associ ati on. I f two poi nts are cl ose i n di stance, the correspondi ng records are consi dered
si mi l ar. There are actual l y a number of metri cs that can be used to measure the di stance
between two poi nts (see asi de), but the most common one i s the Eucl i dean di stance we al l
learned in high school. To find the Euclidean distance between X and Y, we first find the
differences between the corresponding elements of X and Y (the distance along each axis)
and square them. The distance is the square root of the sum of these squared differences.
Distance Metrics. Any functi on that takes two poi nts and produces a si ngl e number
descri bi ng a rel ati onshi p between them i s a candi date measure of associ ati on, but to be a
true di stance metri c, i t must meet the fol l owi ng cri teri a:
Distance(X,Y) = 0 if and only if X = Y
Distance(X,Y) ≥ 0 for all X and all Y
Distance(X,Y) = Distance(Y,X)
Distance(X,Y) ≤ Distance(X,Z) + Distance(Z,Y)
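The Euclidean distance described above can be sketched in a few lines of Python. This is only an illustration: the function name and the three sample records are our own assumptions, not part of any particular tool.

```python
import math

def euclidean_distance(x, y):
    """Square root of the sum of squared differences along each axis."""
    if len(x) != len(y):
        raise ValueError("records must have the same number of fields")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Three hypothetical records, each with three numeric fields.
x = (1.0, 2.0, 3.0)
y = (4.0, 6.0, 3.0)
z = (0.0, 0.0, 0.0)

print(euclidean_distance(x, y))  # differences 3, 4, 0 -> sqrt(9 + 16 + 0) = 5.0

# Spot checks of the distance-metric criteria on these sample points.
print(euclidean_distance(x, x) == 0)
print(euclidean_distance(x, y) == euclidean_distance(y, x))
print(euclidean_distance(x, y) <= euclidean_distance(x, z) + euclidean_distance(z, y))
```

The checks at the end test the criteria only for these sample points; they are not a proof that a function is a true distance metric.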
The Angle between Two Vectors. Someti mes we woul d l i ke to consi der two records to be
cl osel y associ ated because of si mi l ari ti es i n the way the fi el ds within each record are rel ated.
We woul d l i ke to cl uster mi nnows wi th sardi nes, cod, and tuna, whi l e cl usteri ng ki ttens wi th
cougars, l i ons, and ti gers even though i n a database of body-part l engths, the sardi ne i s
cl oser to the ki tten than i t i s to the tuna.
The sol uti on i s to use a di fferent geometri c i nterpretati on of the same data. I nstead of
thi nki ng of X and Y as poi nts i n space and measuri ng the di stance between them, we thi nk
of them as vectors and measure the angl e between them. I n thi s context, a vector i s the l i ne
segment connecti ng the ori gi n of our coordi nate system to the poi nt descri bed by the vector
val ues. A vector has both magni tude (the di stance from the ori gi n to the poi nt) and di recti on.
For our purposes, i t i s the di recti on that matters.
I f we take the val ues for l ength of whi skers, l ength of tai l , overal l body l ength, l ength
of teeth, and l ength of cl aws for a l i on and a house cat and pl ot them as si ngl e poi nts, they
wi l l be very far apart. But i f the rati os of l engths of these body parts to one another are
similar in the two species, then the vectors will be nearly parallel. The angle between
vectors provi des a measure of associ ati on that i s not i nfl uenced by di fferences i n magni tude
between the two thi ngs bei ng compared (see Fi gure 4.5). Actual l y, the si ne of the angl e i s
a better measure si nce i t wi l l range from 0 when the vectors are cl osest (most nearl y
paral l el ) to 1 when they are perpendi cul ar wi thout our havi ng to worry about the actual
angl es or thei r si gns.
Figure 4.5. The Angle Between Vectors as a Measure of Association (vectors labeled Big Cat, Little Cat, Big Fish, and Little Fish)
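As a rough sketch of the angle-based measure, the sine of the angle can be derived from the dot product of the two vectors. The body-part measurements below are invented for illustration, and the function name is our own.

```python
import math

def sine_of_angle(x, y):
    """Sine of the angle between two vectors: 0 when parallel, 1 when perpendicular."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("the angle is undefined for a zero-length vector")
    # Clamp to [-1, 1] to guard against floating-point round-off.
    cosine = max(-1.0, min(1.0, dot / (norm_x * norm_y)))
    return math.sqrt(1.0 - cosine * cosine)

# Hypothetical lengths: whiskers, tail, body, teeth, claws.
house_cat = (7.0, 30.0, 46.0, 1.0, 1.5)
lion = tuple(4 * v for v in house_cat)   # same proportions, four times larger
fish = (0.0, 8.0, 30.0, 0.2, 0.0)        # different proportions

print(sine_of_angle(house_cat, lion))    # very close to 0: the vectors are parallel
print(sine_of_angle(house_cat, fish))    # noticeably larger
```

The lion and the house cat are far apart as points, yet the measure treats them as nearly identical because only direction matters.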
The Number of Features i n Common. When the vari abl es i n the records we wi sh to
compare are categori cal ones, we abandon geometri c measures and turn i nstead to measures
based on the degree of overl ap between records. As wi th the geometri c measures, there are
many vari ati ons on thi s i dea. I n al l vari ati ons, we compare two records fi el d by fi el d and
count the number of fi el ds that match and the number of fi el ds that don’t match. The
si mpl est measure i s the rati o of matches to the total number of fi el ds. I n i ts si mpl est form,
thi s measure counts two nul l fi el ds as matchi ng wi th the resul t that everythi ng we don’t
know much about ends up i n the same cl uster. A si mpl e i mprovement i s to not i ncl ude
matches of thi s sort i n the match count. I f, on the other hand, the usual degree of overl ap
i s l ow, you can gi ve extra wei ght to matches to make sure that even a smal l overl ap i s
rewarded.
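The simplest overlap measure, and the improvement of ignoring matches between null fields, might look like the following sketch. We assume here that a missing value is represented by Python's None; the records are made up.

```python
def match_ratio(a, b, count_null_matches=False):
    """Ratio of matching fields to the total number of fields.

    By default, two null (None) fields are not counted as a match.
    """
    if len(a) != len(b):
        raise ValueError("records must have the same number of fields")
    matches = 0
    for u, v in zip(a, b):
        if u is None and v is None:
            if count_null_matches:
                matches += 1
        elif u == v:
            matches += 1
    return matches / len(a)

# Two hypothetical categorical records; the third field is unknown in both.
r1 = ("red", "suv", None, "NY")
r2 = ("red", "sedan", None, "NY")

print(match_ratio(r1, r2, count_null_matches=True))  # 3 of 4 fields -> 0.75
print(match_ratio(r1, r2))                           # 2 of 4 fields -> 0.5
```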
What K-Means
Clusters form when some subset of the field variables tend to vary together. If all the variables
are trul y i ndependent, no cl usters wi l l form. At the opposi te extreme, i f al l the vari abl es are
dependent on the same thi ng (i n other words, i f they are co-l i near), then al l the records wi l l
form a si ngl e cl uster. I n between these extremes, we don’t real l y know how many cl usters
to expect. I f we go l ooki ng for a certai n number of cl usters, we may fi nd them. But that
doesn’t mean that there aren’t other perfectl y good cl usters l urki ng i n the data where we
coul d fi nd them by tryi ng a di fferent val ue of K. I n hi s excel l ent 1973 book, Cluster Analysis
for Applications, M. Anderberg uses a deck of pl ayi ng cards to i l l ustrate many aspects of
cl usteri ng. We have borrowed hi s i dea to i l l ustrate the way that the i ni ti al choi ce of K, the
number of cl uster seeds, can have a l arge effect on the ki nds of cl usters that wi l l be found.
I n descri pti ons of K-means and rel ated al gori thms, the sel ecti on of K i s often gl ossed over.
But since, in many cases, there is no a priori reason to select a particular value, we are
really dealing with an outer loop: performing automatic cluster detection using one value
of K, evaluating the results, then trying again with another value of K.
Figure 4.6. K=2 Clustered by Color (the 52 cards arranged in rows by rank, A down to 2, and in columns by suit: ♠ ♣ ♦ ♥)
Figure 4.7. K=2 Clustered by Old Maid Rules
After each tri al , the strength of the resul ti ng cl usters can be eval uated by compari ng
the average di stance between records i n a cl uster wi th the average di stance between cl usters,
and by other procedures descri bed l ater i n thi s chapter. But the cl usters must al so be
eval uated on a more subjecti ve basi s to determi ne thei r useful ness for a gi ven appl i cati on.
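The outer loop over K can be sketched as follows. This is a minimal K-means written for illustration only: the sample points, the random choice of seeds, and the two evaluation functions are our own assumptions, and the within-cluster figure is simplified to the average distance from each record to its cluster centroid.

```python
import math
import random

def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, rng, max_iter=100):
    """Basic K-means: assign each record to the nearest seed, then move the seeds."""
    seeds = rng.sample(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: distance(p, seeds[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its old seed.
        new_seeds = [centroid(c) if c else seeds[i] for i, c in enumerate(clusters)]
        if new_seeds == seeds:
            break
        seeds = new_seeds
    return seeds, clusters

def average_within(seeds, clusters):
    """Average distance from each record to the centroid of its own cluster."""
    total = sum(distance(p, s) for s, c in zip(seeds, clusters) for p in c)
    return total / sum(len(c) for c in clusters)

def average_between(seeds):
    """Average distance between cluster centroids (requires at least two)."""
    pairs = [(a, b) for i, a in enumerate(seeds) for b in seeds[i + 1:]]
    return sum(distance(a, b) for a, b in pairs) / len(pairs)

# Hypothetical two-dimensional records.
points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (8.5, 9), (9, 8), (1, 9), (1.5, 8.5), (2, 9)]

# The outer loop: cluster with one value of K, evaluate, then try another.
for k in (2, 3, 4):
    seeds, clusters = k_means(points, k, random.Random(0))
    print(k, average_within(seeds, clusters), average_between(seeds))
```

A small within-cluster distance relative to the between-cluster distance suggests strong clusters, but the result for any one K also depends on the initial seeds.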
As shown i n Fi gures 4.6, 4.7, 4.8, 4.9, and 4.10, i t i s easy to create very good cl usters
from a deck of pl ayi ng cards usi ng vari ous val ues for K and vari ous di stance measures. I n
the case of pl ayi ng cards, the di stance measures are di ctated by the rul es of vari ous games.
The di stance from Ace to Ki ng, for exampl e, mi ght be 1 or 12 dependi ng on the game.
Figure 4.8. K=2 Clustered by Rules for War, Beggar My Neighbor, and other Games
Figure 4.9. K=3 Clustered by Rules of Hearts
K = N. Even wi th pl ayi ng cards, some val ues of K don’t l ead to good cl usters–at l east
not wi th di stance measures suggested by the card games known to the authors. There are
obvi ous cl usteri ng rul es for K = 1, 2, 3, 4, 8, 13, 26, and 52. For these val ues we can come
up wi th “perfect” cl usters where each el ement of a cl uster i s equi di stant from every other
member of the cl uster, and equal l y far away from the members of some other cl uster. For
other val ues of K, we have the more fami l i ar si tuati on that some cards do not seem to fi t
parti cul arl y wel l i n any cl uster.
Figure 4.10. K=4 Clustered by Suit
The Importance of Weights
I t i s i mportant to di fferenti ate between the noti ons of scaling and weighting. They are
not the same, but they are often confused. Scal i ng deal s wi th the probl em that di fferent
vari abl es are measured i n di fferent uni ts. Wei ghti ng deal s wi th the probl em that we care
about some vari abl es more than others.
I n geometry, al l di mensi ons are equal l y i mportant. Two poi nts that di ffer by 2 i n
di mensi ons X and Y and by 1 i n di mensi on Z are the same di stance from one another as two
other poi nts that di ffer by 1 i n di mensi on X and by 2 i n di mensi ons Y and Z. We don’t even
ask what uni ts X, Y, and Z are measured i n; i t doesn’t matter, so l ong as they are the same.
But what i f X i s measured i n yards, Y i s measured i n centi meters, and Z i s measured
i n nauti cal mi l es? A di fference of 1 i n Z i s now equi val ent to a di fference of 185,200 i n Y
or 2,025 i n X. Cl earl y, they must al l be converted to a common scal e before di stances wi l l
make any sense. Unfortunatel y, i n commerci al data mi ni ng there i s usual l y no common
scal e avai l abl e because the di fferent uni ts bei ng used are measuri ng qui te di fferent thi ngs.
I f we are l ooki ng at pl ot si ze, house-hol d si ze, car ownershi p, and fami l y i ncome, we cannot
convert al l of them to acres or dol l ars. On the other hand, i t seems bothersome that a
di fference of 20 acres i n pl ot si ze i s i ndi sti ngui shabl e from a change of $20 i n i ncome. The
sol uti on i s to map al l the vari abl es to a common range (often 0 to 1 or –1 to 1). That way,
at l east the rati os of change become comparabl e—doubl i ng pl ot si ze wi l l have the same
effect as doubl i ng i ncome. We refer to thi s re-mappi ng to a common range as scaling. But
what i f we thi nk that two fami l i es wi th the same i ncome have more i n common than two
fami l i es on the same si ze pl ot, and i f we want that to be taken i nto consi derati on duri ng
cl usteri ng? That i s where wei ghti ng comes i n.
There are three common ways of scal i ng vari abl es to bri ng them al l i nto comparabl e
ranges:
1. Di vi de each vari abl e by the mean of al l the val ues i t takes on.
2. Di vi de each vari abl e by the range (the di fference between the l owest and hi ghest
val ue i t takes on) after subtracti ng the l owest val ue.
3. Subtract the mean val ue from each vari abl e and then di vi de by the standard
devi ati on. Thi s i s often cal l ed “converti ng to z scores.”
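The three scaling methods can be sketched as small Python functions. The function names and the income figures are illustrative assumptions; the sketch uses the population standard deviation and does not guard against a zero mean, range, or deviation.

```python
import statistics

def scale_by_mean(values):
    """Method 1: divide each value by the mean of all the values."""
    m = statistics.mean(values)
    return [v / m for v in values]

def scale_by_range(values):
    """Method 2: subtract the lowest value, then divide by the range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_scores(values):
    """Method 3: subtract the mean, then divide by the standard deviation."""
    m = statistics.mean(values)
    s = statistics.pstdev(values)  # population standard deviation (an assumption)
    return [(v - m) / s for v in values]

# Hypothetical family incomes in dollars.
incomes = [20000, 40000, 60000, 80000]

print(scale_by_mean(incomes))   # values centered around 1
print(scale_by_range(incomes))  # values from 0 to 1
print(z_scores(incomes))        # mean 0, standard deviation 1
```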
Use Weights to Encode Outside Information. Scaling takes care of the problem that
changes in one variable appear more significant than changes in another simply because of
differences in the size of the units in which they are measured. Many
books recommend scal i ng al l vari abl es to a normal form wi th a mean of zero and a vari ance
of one. That way, al l fi el ds contri bute equal l y when the di stance between two records i s
computed. We suggest goi ng further. The whol e poi nt of automati c cl uster detecti on i s to
fi nd cl usters that make sense to you. I f, for your purposes, whether peopl e have chi l dren i s
much more i mportant than the number of credi t cards they carry, there i s no reason not to
bi as the outcome of the cl usteri ng by mul ti pl yi ng the number of chi l dren fi el d by a hi gher
wei ght than the number of credi t cards fi el d.
After scal i ng to get ri d of bi as that i s due to the uni ts, you shoul d use wei ghts to
i ntroduce bi as based on your knowl edge of the busi ness context. Of course, i f you want to
eval uate the effects of di fferent wei ghti ng strategi es, you wi l l have to add yet another outer
l oop to the cl usteri ng process. I n fact, choosi ng wei ghts i s one of the opti mi zati on probl ems
that can be addressed wi th geneti c al gori thms.
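A minimal sketch of weighting, assuming the fields have already been scaled: each field is multiplied by its weight before the distance is computed. The records and the weights below are invented for illustration.

```python
import math

def apply_weights(record, weights):
    """Multiply each (already scaled) field by its weight."""
    return [w * v for v, w in zip(record, weights)]

def weighted_distance(x, y, weights):
    """Euclidean distance between two records after weighting their fields."""
    wx, wy = apply_weights(x, weights), apply_weights(y, weights)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(wx, wy)))

# Hypothetical scaled records: (number of children, number of credit cards).
a = (1.0, 0.0)
b = (0.0, 1.0)
c = (1.0, 1.0)

equal = (1.0, 1.0)
favor_children = (3.0, 1.0)  # children count three times as much (an assumption)

print(weighted_distance(a, c, equal), weighted_distance(b, c, equal))
# 1.0 1.0: without weights, c is equally close to a and b
print(weighted_distance(a, c, favor_children), weighted_distance(b, c, favor_children))
# 1.0 3.0: with weights, c is much closer to a, which has the same number of children
```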
Variations on the K-Means Method
The basi c K-means al gori thm has many vari ati ons. I t i s l i kel y that the commerci al
software tool s you fi nd to do automati c cl usteri ng wi l l i ncorporate some of these vari ati ons.
Among the di fferences you are l i kel y to encounter are:
• Al ternate methods of choosi ng the i ni ti al seeds
• Al ternate methods of computi ng the next centroi d
• Usi ng probabi l i ty densi ty rather than di stance to associ ate records wi th cl usters.
Of these, onl y the l ast i s i mportant enough to meri t further di scussi on here.
Gaussian Mixture Models. The K-means method as we have descr i bed i t has some
drawbacks.
• I t does not do wel l wi th overl appi ng cl usters.
• The cl usters are easi l y pul l ed off center by outl i ers.
• Each record i s ei ther i n or out of a cl uster; there i s no noti on of some records bei ng
more or l ess l i kel y than others to real l y bel ong to the cl uster to whi ch they have
been assi gned.
Figure 4.11. In the Estimation Step, Each Gaussian is Assigned Some Responsibility for Each
Point. Thicker Lines Indicate Greater Responsibility. (Gaussians G1, G2, and G3 plotted on axes x1 and x2.)
Gaussian mixture models are a probabi l i sti c vari ant of K-means. Thei r name comes
from the Gaussi an di stri buti on, a probabi l i ty di stri buti on often assumed for hi gh-di mensi onal
probl ems. As before, we start by choosi ng K seeds. Thi s ti me, however, we regard the seeds
as means of Gaussi an di stri buti ons. We then i terate over two steps cal l ed the esti mati on
step and the maxi mi zati on step. I n the esti mati on step, we cal cul ate the responsi bi l i ty that
each Gaussi an has for each data poi nt (see Fi gur e 4.11). Each Gaussi an has str ong
responsi bi l i ty for poi nts that are cl ose to i t and weak responsi bi l i ty for poi nts that are
di stant. The responsi bi l i ti es wi l l be used as wei ghts i n the next step. I n the maxi mi zati on
step, the mean of each Gaussi an i s moved towards the centroi d of the enti re data set,
wei ghted by the responsi bi l i ti es as i l l ustrated i n Fi gure 4.12.
Figure 4.12. Each Gaussian Mean is Moved to the Centroid of All the Data Points Weighted by
the Responsibilities for Each Point. Thicker Arrows Indicate Higher Weights (axes x1 and x2).
These steps are repeated unti l the Gaussi ans are no l onger movi ng. The Gaussi ans
themsel ves can grow as wel l as move, but si nce the di stri buti on must al ways i ntegrate to
one, a Gaussi an gets weaker as i t gets bi gger. Responsi bi l i ty i s cal cul ated i n such a way that
a gi ven poi nt may get equal responsi bi l i ty from a nearby Gaussi an wi th l ow vari ance and
from a more di stant one wi th hi gher vari ance. Thi s i s cal l ed a “mi xture model ” because the
probabi l i ty at each data poi nt i s the sum of a mi xture of several di stri buti ons. At the end
of the process, each poi nt i s ti ed to the vari ous cl usters wi th hi gher or l ower probabi l i ty.
Thi s i s someti mes cal l ed soft clustering.
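The estimation and maximization steps can be sketched for one-dimensional data as follows. This is an illustrative assumption rather than the method of any particular product: the data, the starting seeds, the fixed count of 20 iterations, and the small floor on the variance are all our own choices. Besides moving the means, the sketch lets each Gaussian's variance and mixing proportion change, which is how a Gaussian can "grow".

```python
import math

def gaussian_pdf(x, mean, variance):
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def em_step(data, means, variances, mix):
    """One estimation step followed by one maximization step."""
    k = len(means)
    # Estimation: the responsibility of each Gaussian for each data point.
    resp = []
    for x in data:
        dens = [mix[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
        total = sum(dens)
        resp.append([d / total for d in dens])
    # Maximization: move each Gaussian to the responsibility-weighted centroid.
    new_means, new_vars, new_mix = [], [], []
    for j in range(k):
        nj = sum(r[j] for r in resp)
        mean_j = sum(r[j] * x for r, x in zip(resp, data)) / nj
        var_j = sum(r[j] * (x - mean_j) ** 2 for r, x in zip(resp, data)) / nj
        new_means.append(mean_j)
        new_vars.append(max(var_j, 1e-6))  # floor keeps a Gaussian from collapsing
        new_mix.append(nj / len(data))
    return new_means, new_vars, new_mix, resp

# Hypothetical one-dimensional data with two obvious groups.
data = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.9, 5.1]
means, variances, mix = [0.0, 6.0], [1.0, 1.0], [0.5, 0.5]

for _ in range(20):
    means, variances, mix, resp = em_step(data, means, variances, mix)

print(means)    # the two means settle near the centers of the two groups
print(resp[0])  # soft assignment of the first point to each Gaussian
```

Each row of responsibilities sums to one, so every record keeps a probability of belonging to each cluster instead of a hard assignment.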
4.3 AGGLOMERATIVE METHODS
I n the K-means approach to cl usteri ng, we start out wi th a fi xed number of cl usters and
gather al l records i nto them. There i s another cl ass of methods that works by aggl omerati on.
I n these methods, we start out wi th each data poi nt formi ng i ts own cl uster and gradual l y
merge cl usters unti l al l poi nts have been gathered together i n one bi g cl uster. Towards the
begi nni ng of the process, the cl usters are very smal l and very pure–the members of each
cl uster are few, but very cl osel y rel ated. Towards the end of the process, the cl usters are
l arge and l ess wel l -defi ned. The enti re hi story i s preserved so that you can choose the l evel
of cl usteri ng that works best for your appl i cati on.
The Agglomerative Algorithm
The fi rst step i s to create a similarity matrix. The si mi l ari ty matri x i s a tabl e of al l the
pai r-wi se di stances or degrees of associ ati on between poi nts. As before, we can use any of
a l arge number of measures of associ ati on between records, i ncl udi ng the Eucl i dean di stance,
the angl e between vectors, and the rati o of matchi ng to non-matchi ng categori cal fi el ds. The
i ssues rai sed by the choi ce of di stance measures are exactl y the same as previ ousl y di scussed
i n rel ati on to the K-means approach.
Figure 4.13. Three Methods of Measuring the Distance Between Clusters (clusters C1, C2, and C3 on axes X1 and X2, marking the closest clusters by the single linkage, complete linkage, and centroid methods)
At first glance you might think that if we have N data points we will need to make N²
measurements to create the distance table, but if we assume that our association measure
is a true distance metric, we actually need fewer than half of that, N(N–1)/2, because all
true distance metrics follow the rules that Distance(X,Y) = Distance(Y,X) and
Distance(X,X) = 0. In the vocabulary of mathematics, the similarity matrix is symmetric, so
only its lower triangle needs to be stored. At the beginning of the process there are N rows in
the table, one for each record.
Next, we fi nd the smal l est val ue i n the si mi l ari ty matri x. Thi s i denti fi es the two cl usters
that are most si mi l ar to one another. We merge these two cl usters and update the si mi l ari ty
matrix by replacing the two rows that described the parent clusters with a new row that
descri bes the di stance between the merged cl uster and the remai ni ng cl usters. There are
now N–1 cl usters and N–1 rows i n the si mi l ari ty matri x.
We repeat the merge step N–1 ti mes, after whi ch al l records bel ong to the same l arge
cl uster. At each i terati on we make a record of whi ch cl usters were merged and how far apart
they were. Thi s i nformati on wi l l be hel pful i n deci di ng whi ch l evel of cl usteri ng to make use
of.
Distance between Clusters. We need to say a l i ttl e more about how to measure
di stance between cl usters. On the fi rst tri p through the merge step, the cl usters to be
merged consi st of si ngl e records so the di stance between cl usters i s the same as the di stance
between records, a subject we may al ready have sai d too much about. But on the second and
subsequent tri ps around the l oop, we need to update the si mi l ari ty matri x wi th the di stances
from the new, mul ti -record cl uster to al l the others. How do we measure thi s di stance? As
usual , there i s a choi ce of approaches. Three common ones are:
• Si ngl e l i nkage
• Compl ete l i nkage
• Compari son of centroi ds
I n the si ngl e l i nkage method, the di stance between two cl usters i s gi ven by the di stance
between thei r closest members. Thi s method produces cl usters wi th the property that every
member of a cl uster i s more cl osel y rel ated to at l east one member of i ts cl uster than to any
poi nt outsi de i t.
I n the compl ete l i nkage method, the di stance between two cl usters i s gi ven by the
di stance between thei r most distant members. Thi s method produces cl usters wi th the
property that al l members l i e wi thi n some known maxi mum di stance of one another.
I n the thi rd method, the di stance between two cl usters i s measured between the centroi ds
of each. The centroi d of a cl uster i s i ts average el ement. Fi gure 4.13 gi ves a pi ctori al
representati on of al l three methods.
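The three ways of measuring distance between clusters, and the merge loop that uses them, can be sketched as below. For brevity this sketch recomputes the cluster distances on every pass instead of maintaining a similarity matrix; the function names and the sample points are our own assumptions.

```python
import math

def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(c1, c2):
    """Distance between the closest members of the two clusters."""
    return min(distance(p, q) for p in c1 for q in c2)

def complete_linkage(c1, c2):
    """Distance between the most distant members of the two clusters."""
    return max(distance(p, q) for p in c1 for q in c2)

def centroid_linkage(c1, c2):
    """Distance between the centroids (average elements) of the two clusters."""
    def centroid(c):
        return tuple(sum(v) / len(c) for v in zip(*c))
    return distance(centroid(c1), centroid(c2))

def agglomerate(points, linkage):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [[p] for p in points]
    history = []
    while len(clusters) > 1:
        pairs = ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters)))
        i, j = min(pairs, key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]))
        d = linkage(clusters[i], clusters[j])
        merged = clusters[i] + clusters[j]
        history.append((d, merged))  # which clusters merged, and how far apart they were
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return history

# Hypothetical two-dimensional records.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]

for d, merged in agglomerate(points, single_linkage):
    print(round(d, 2), merged)
```

Passing complete_linkage or centroid_linkage instead of single_linkage changes which clusters are considered closest, and so can change the order of the merges.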
Clusters and Trees. The aggl omerati on al gori thm creates hi erarchi cal cl usters. At each
l evel i n the hi erarchy, cl usters are formed from the uni on of two cl usters at the next l evel
down. Another way of l ooki ng at thi s i s as a tree, much l i ke the deci si on trees except that
cl uster trees are bui l t by starti ng from the l eaves and worki ng towards the root.
Clustering People by Age: An Example of Agglomerative Clustering. To i l l ustrate
aggl omerati ve cl usteri ng, we have chosen an exampl e of cl usteri ng i n one di mensi on usi ng
the si ngl e l i nkage measure for di stance between cl usters. These choi ces shoul d enabl e you
to fol l ow the al gori thm through al l i ts i terati ons i n your head wi thout havi ng to worry about
squares and square roots.
The data consi sts of the ages of peopl e at a fami l y gatheri ng. Our goal i s to cl uster the
parti ci pants by age. Our metri c for the di stance between two peopl e i s si mpl y the di fference
i n thei r ages. Our metri c for the di stance between two cl usters of peopl e i s the di fference
i n age between the ol dest member of the younger cl uster and the youngest member of the
ol der cl uster. (The one-di mensi onal versi on of the si ngl e l i nkage measure.)
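In one dimension, single linkage reduces to sorting the values and repeatedly merging the two neighboring clusters separated by the smallest gap. The sketch below follows that idea; the ages are hypothetical and the function name is our own.

```python
def cluster_ages(ages):
    """One-dimensional single-linkage clustering; returns (gap, merged cluster) per merge."""
    clusters = [[a] for a in sorted(ages)]
    history = []
    while len(clusters) > 1:
        # Gap between the oldest member of each cluster and the youngest of the next.
        gaps = [clusters[i + 1][0] - clusters[i][-1] for i in range(len(clusters) - 1)]
        i = gaps.index(min(gaps))
        merged = clusters[i] + clusters[i + 1]
        history.append((gaps[i], merged))
        clusters[i:i + 2] = [merged]
    return history

# Hypothetical ages at a family gathering.
for gap, merged in cluster_ages([37, 3, 65, 8, 5, 41]):
    print(gap, merged)
```

With these ages the merges happen at gaps of 2, 3, 4, 24, and 29 years, so cutting the history before the two large gaps leaves three clusters: children, parents, and a grandparent.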
Because the di stances are so easy to cal cul ate, we di spense wi th the si mi l ari ty matri x.
Our procedure i s to sort the parti ci pants by age, then begi n cl usteri ng by fi rst mergi ng
Inside the Cluster. Once you have found a strong cl uster, you wi l l want to anal yze what
makes i t speci al . What i s i t about the records i n thi s cl uster that causes them to be l umped
together? Even more i mportantl y, i s i t possi bl e to fi nd rul es and patterns wi thi n thi s cl uster
now that the noi se from the rest of the database has been el i mi nated?
The easi est way to approach the fi rst questi on i s to take the mean of each vari abl e
wi thi n the cl uster and compare i t to the mean of the same vari abl e i n the parent popul ati on.
Rank order the vari abl es by the magni tude of the di fference. Looki ng at the vari abl es that
show the l argest di fference between the cl uster and the rest of the database wi l l go a l ong
way towards expl ai ni ng what makes the cl uster speci al . As for the second questi on, that i s
what al l the other data mi ni ng techni ques are for!
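The comparison of cluster means with population means can be sketched as follows. The field names and values are invented, and we assume the variables have already been scaled so that differences are comparable across fields.

```python
from statistics import mean

def profile_cluster(cluster_rows, population_rows, fields):
    """Rank variables by how far the cluster mean lies from the population mean."""
    diffs = []
    for idx, name in enumerate(fields):
        cluster_mean = mean(r[idx] for r in cluster_rows)
        population_mean = mean(r[idx] for r in population_rows)
        diffs.append((name, cluster_mean - population_mean))
    return sorted(diffs, key=lambda d: abs(d[1]), reverse=True)

# Hypothetical scaled records: (children, credit cards, income).
fields = ("children", "credit_cards", "income")
population = [(1.0, 0.0, 0.2), (1.2, 0.1, -0.2), (-1.0, 0.0, 0.1), (-1.2, -0.1, -0.1)]
cluster = population[:2]

for name, diff in profile_cluster(cluster, population, fields):
    print(name, round(diff, 3))
```

Here the number of children shows by far the largest difference, so it is the first place to look when explaining what makes this cluster special.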
Outside the Cluster. Cl usteri ng can be useful even when onl y a si ngl e cl uster i s found.
When screeni ng for a very rare defect, there may not be enough exampl es to trai n a di rected
data mi ni ng model to detect i t. One exampl e i s testi ng el ectri c motors at the factory where
they are made. Cl uster detecti on methods can be used on a sampl e contai ni ng onl y good
motors to determi ne the shape and si ze of the “normal ” cl uster. When a motor comes al ong
that fal l s outsi de the cl uster for any reason, i t i s suspect. Thi s approach has been used i n
medi ci ne to detect the presence of abnormal cel l s i n ti ssue sampl es.
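One simple way to sketch this idea is to describe the "normal" cluster by its centroid and the distance of its farthest member, then flag anything that falls outside. The measurements, the use of the maximum distance as the radius, and the slack parameter are all illustrative assumptions; real data would need scaling first.

```python
import math

def fit_normal_cluster(good_records):
    """Center and radius of the cluster formed by known-good records."""
    n = len(good_records)
    center = tuple(sum(v) / n for v in zip(*good_records))
    radius = max(math.dist(center, r) for r in good_records)  # math.dist needs Python 3.8+
    return center, radius

def is_suspect(record, center, radius, slack=1.0):
    """True when the record falls outside the normal cluster."""
    return math.dist(center, record) > slack * radius

# Hypothetical test-bench readings for good motors: (current, vibration).
good_motors = [(10.0, 0.5), (10.2, 0.6), (9.8, 0.4), (10.1, 0.5)]
center, radius = fit_normal_cluster(good_motors)

print(is_suspect((10.0, 0.5), center, radius))  # False: inside the normal cluster
print(is_suspect((12.0, 1.5), center, radius))  # True: outside, so suspect
```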
4.5 OTHER APPROACHES TO CLUSTER DETECTION
I n addi ti on to the two approaches to automati c cl uster detecti on descri bed i n thi s
chapter, there are two other approaches that make use of vari ati ons of techni ques of deci si on
trees and neural networks.
Divisive Methods. We have al ready noted the si mi l ari ty between the tree formed by the
aggl omerati ve cl usteri ng techni ques and the ones formed by deci si on tree al gori thms such
as C4.5. Al though the aggl omerati ve methods work from the l eaves to the root, whi l e the
deci si on tr ee al gor i thms wor k fr om the r oot to the l eaves, they both cr eate a si mi l ar
hi erarchi cal structure. The hi erarchi cal structure refl ects another si mi l ari ty between the
methods. Deci si ons made earl y on i n the process are never revi si ted, whi ch means that some
fai rl y si mpl e cl usters wi l l not be detected i f an earl y spl i t or aggl omerati on destroys the
structure.
Seei ng the si mi l ari ty between the trees produced by the two methods, i t i s natural to
ask whether the al gori thms used for deci si on trees may al so be used for cl usteri ng. The
answer i s yes. A deci si on tree al gori thm starts wi th the enti re col l ecti on of records and l ooks
for a way to split it into clusters that are purer, in some sense defined by a diversity function.
Al l that i s requi red to turn thi s i nto a cl usteri ng al gori thm i s to suppl y a di versi ty functi on
chosen to ei ther mi ni mi ze the average i ntra-cl uster di stance or maxi mi ze the i nter-cl uster
di stances.
Self-Organizing Maps. Sel f-organi zi ng maps are a vari ant of neural networks that have
been used for many years i n appl i cati ons such as feature detecti on i n two-di mensi onal
i mages. More recentl y, they have been appl i ed successful l y for more general cl usteri ng
appl i cati ons. There i s a di scussi on of sel f-organi zi ng neural networks.
4.6 STRENGTHS AND WEAKNESSES OF AUTOMATIC CLUSTER DETECTION
The mai n strengths of automati c cl uster detecti on are:
• Automatic cluster detection is an unsupervised knowledge discovery technique.
• Automatic cluster detection works well with categorical, numeric, and textual data.
• Automatic cluster detection is easy to apply.
• Automatic Cluster Detection is Unsupervised. The chi ef strength of automati c
cl uster detecti on i s that i t i s undi rected. Thi s means that i t can be appl i ed even
when you have no pri or knowl edge of the i nternal structure of a database. Automati c
cl uster detecti on can be used to uncover hi dden structure that can be used to
i mprove the performance of more di rected techni ques.
• Clustering can be Performed on Diverse Data Types. By choosi ng di fferent
di stance measures, automati c cl usteri ng can be appl i ed to al most any ki nd of data.
It is as easy to find clusters in collections of news stories or insurance claims as in
astronomi cal or fi nanci al data.
• Automatic Cluster Detection is Easy to Apply. Most cl uster detecti on techni ques
requi re very l i ttl e massagi ng of the i nput data and there i s no need to i denti fy
parti cul ar fi el ds as i nputs and others as outputs.
The mai n weaknesses of automati c cl uster detecti on are:
• I t can be di ffi cul t to choose the ri ght di stance measures and wei ghts.
• Sensi ti vi ty to i ni ti al parameters.
• I t can be hard to i nterpret the resul ti ng cl usters.
• Difficulty with Weights and Measures. The performance of automati c cl uster
detecti on al gori thms i s hi ghl y dependent on the choi ce of a di stance metri c or other
si mi l ari ty measure. I t i s someti mes qui te di ffi cul t to devi se di stance metri cs for
data that contai ns a mi xture of vari abl e types. I t can al so be di ffi cul t to determi ne
a proper wei ghti ng scheme for di sparate vari abl e types.
• Sensitivity to Initial Parameters. I n the K-means method, the ori gi nal choi ce
of a val ue for K determi nes the number of cl usters that wi l l be found. I f thi s
number does not match the natural structure of the data, the techni que wi l l not
obtai n good resul ts.
• Difficulty Interpreting Results. A strength of automati c cl uster detecti on i s
that i t i s an unsupervi sed knowl edge di scovery techni que. The fl i p si de i s that
when you don’t know what you are l ooki ng for, you may not recogni ze i t when you
fi nd i t! The cl usters you di scover are not guaranteed to have any practi cal val ue.
In Summary
Clustering is a technique for combining observed objects into groups or clusters. The
most commonly used cluster detection method is K-means. This algorithm has many
variations: alternate methods of choosing the initial seeds, alternate methods of computing
the next centroid, and using probability density rather than distance to associate records
with clusters.
Chapter 5: DATA MINING WITH
NEURAL NETWORKS
Neural networks are a paradigm for computing. Their appeal is
that they bridge the gap between the digital computer and the human
brain by modeling the brain's neural connections.
The mai n topi cs that are covered i n thi s chapter are:
• Neural Network Topol ogi es
• Neural Network Model s
• I terati ve Process
• Strengths and Weaknesses
INTRODUCTION
Arti fi ci al neural networks are popul ar because they have a proven track record i n many
data mi ni ng and deci si on-support appl i cati ons. They have been appl i ed across a broad range
of industries, from predicting financial series to diagnosing medical conditions, from
i denti fyi ng cl usters of val uabl e customers to i denti fyi ng fraudul ent credi t card transacti ons,
from recogni zi ng numbers wri tten on checks to predi cti ng the fai l ure rates of engi nes.
Whereas peopl e are good at general i zi ng from experi ence, computers usual l y excel at
fol l owi ng expl i ci t i nstructi ons over and over. The appeal of neural networks i s that they
bri dge thi s gap by model i ng, on a di gi tal computer, the neural connecti ons i n human brai ns.
When used i n wel l -defi ned domai ns, thei r abi l i ty to general i ze and l earn from data mi mi cs
our own abi l i ty to l earn from experi ence. Thi s abi l i ty i s useful for data mi ni ng and i t al so
makes neural networks an exci ti ng area for research, promi si ng new and better resul ts i n
the future.
5.1 NEURAL NETWORKS FOR DATA MINING
A neural processi ng el ement recei ves i nputs from other connected processi ng el ements.
These i nput si gnal s or val ues pass through wei ghted connecti ons, whi ch ei ther ampl i fy or
di mi ni sh the si gnal s. I nsi de the neural processi ng el ement, al l of these i nput si gnal s are
summed together to gi ve the total i nput to the uni t. Thi s total i nput val ue i s then passed
through a mathemati cal functi on to produce an output or deci si on val ue rangi ng from 0 to
1. Noti ce that thi s i s a real val ued (anal og) output, not a di gi tal 0/1 output. I f the i nput
si gnal matches the connecti on wei ghts exactl y, then the output i s cl ose to 1. I f the i nput
si gnal total l y mi smatches the connecti on wei ghts then the output i s cl ose to 0. Varyi ng
degrees of si mi l ari ty are represented by the i ntermedi ate val ues. Now, of course, we can
force the neural processi ng el ement to make a bi nary (1/0) deci si on, but by usi ng anal og
val ues rangi ng between 0.0 and 1.0 as the outputs, we are retai ni ng more i nformati on to
pass on to the next l ayer of neural processi ng uni ts. I n a very real sense, neural networks
are anal og computers.
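A single neural processing element can be sketched in a few lines. We assume here that the mathematical function is the logistic (sigmoid) function, and that an optional threshold value is added to the total input; the weights and inputs are invented for illustration.

```python
import math

def sigmoid(total_input):
    """Logistic function: maps any total input to an output between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-total_input))

def processing_element(inputs, weights, threshold=0.0):
    """Sum the weighted inputs, add the threshold, and squash the total."""
    total = sum(i * w for i, w in zip(inputs, weights)) + threshold
    return sigmoid(total)

# Hypothetical connection weights that favor the input pattern (1, 0, 1).
weights = (4.0, -4.0, 4.0)

print(processing_element((1, 0, 1), weights))  # close to 1: input matches the weights
print(processing_element((0, 1, 0), weights))  # close to 0: input mismatches the weights
print(processing_element((1, 1, 0), weights))  # 0.5: an intermediate degree of match
```

The output is a real value, so the next layer receives the degree of match rather than a bare yes or no.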
Each neural processi ng el ement acts as a si mpl e pattern recogni ti on machi ne. I t checks
the i nput si gnal s agai nst i ts memory traces (connecti on wei ghts) and produces an output
si gnal that corresponds to the degree of match between those patterns. I n typi cal neural
networks, there are hundreds of neural processi ng el ements whose pattern recogni ti on and
deci si on maki ng abi l i ti es are harnessed together to sol ve probl ems.
5.2 NEURAL NETWORK TOPOLOGIES
The arrangement of neural processi ng uni ts and thei r i nterconnecti ons can have a
profound i mpact on the processi ng capabi l i ti es of the neural networks. I n general , al l neural
networks have some set of processi ng uni ts that recei ve i nputs from the outsi de worl d,
whi ch we refer to appropri atel y as the “i nput uni ts.” Many neural networks al so have one
or more l ayers of “hi dden” processi ng uni ts that recei ve i nputs onl y from other processi ng
uni ts. A l ayer or “sl ab” of processi ng uni ts recei ves a vector of data or the outputs of a
previ ous l ayer of uni ts and processes them i n paral l el . The set of processi ng uni ts that
represents the fi nal resul t of the neural network computati on i s desi gnated as the “output
uni ts”. There are three major connecti on topol ogi es that defi ne how data fl ows between the
input, hidden, and output processing units. These main categories (feed-forward, limited
recurrent, and fully recurrent networks) are described in detail in the next sections.
Feed-Forward Networks
Feed-forward networks are used in situations when we can bring all of the information
to bear on a problem at once, and we can present it to the neural network. It is like a pop
quiz, where the teacher walks in, writes a set of facts on the board, and says, “OK, tell me
the answer.” You must take the data, process it, and “jump to a conclusion.” In this type of
neural network, the data flows through the network in one direction, and the answer is
based solely on the current set of inputs.
In Figure 5.1, we see a typical feed-forward neural network topology. Data enters the
neural network through the input units on the left. The input values are assigned to the
input units as the unit activation values. The output values of the units are modulated by
the connection weights, either being magnified if the connection weight is positive and
greater than 1.0, or being diminished if the connection weight is between 0.0 and 1.0. If the
connection weight is negative, the signal is magnified or diminished in the opposite direction.
Figure 5.1. Feed-Forward Neural Networks (layers labeled Input, Hidden, Output)
Each processing unit combines all of the input signals coming into the unit along with
a threshold value. This total input signal is then passed through an activation function to
determine the actual output of the processing unit, which in turn becomes the input to
another layer of units in a multi-layer network. The most typical activation function used
in neural networks is the S-shaped or sigmoid (also called the logistic) function. This function
converts an input value to an output ranging from 0 to 1. The effect of the threshold weights
is to shift the curve right or left, thereby making the output value higher or lower, depending
on the sign of the threshold weight. As shown in Figure 5.1, the data flows from the input
layer through zero, one, or more succeeding hidden layers and then to the output layer. In
most networks, the units from one layer are fully connected to the units in the next layer.
However, this is not a requirement of feed-forward neural networks. In some cases, especially
when the neural network connections and weights are constructed from a rule or predicate
form, there could be fewer connection weights than in a fully connected network. There are
also techniques for pruning unnecessary weights from a neural network after it is trained.
In general, the fewer weights there are, the faster the network will be able to process data
and the better it will generalize to unseen inputs. It is important to remember that
“feed-forward” is a definition of connection topology and data flow. It does not imply any
specific type of activation function or training paradigm.
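The forward pass just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (sigmoid, layer_forward, feed_forward) and all weight values are made up for the example, and the sigmoid is used because the text names it as the most typical activation function.

```python
import math

def sigmoid(x):
    """Logistic activation: maps any real input to the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weights, thresholds):
    """Compute the outputs of one fully connected layer.

    weights[j][i] is the connection weight from input i to unit j;
    thresholds[j] is the threshold weight of unit j.
    """
    outputs = []
    for unit_weights, threshold in zip(weights, thresholds):
        total = threshold + sum(w * x for w, x in zip(unit_weights, inputs))
        outputs.append(sigmoid(total))
    return outputs

def feed_forward(inputs, layers):
    """Pass data through the layers in one direction, input to output."""
    activations = inputs
    for weights, thresholds in layers:
        activations = layer_forward(activations, weights, thresholds)
    return activations

# Hypothetical 2-input, 2-hidden, 1-output network with made-up weights.
hidden = ([[0.5, -1.2], [1.5, 0.3]], [0.0, -0.5])
output = ([[2.0, -1.0]], [0.1])
result = feed_forward([1.0, 0.0], [hidden, output])
print(result)
```

Changing a threshold value while holding the inputs fixed raises or lowers the unit's output, which is the curve-shifting effect mentioned above.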
Limited Recurrent Networks
Recurrent networks are used in situations when we have current information to give
the network, but the sequence of inputs is important, and we need the neural network to
somehow store a record of the prior inputs and factor them in with the current data to
produce an answer. In recurrent networks, information about past inputs is fed back into
and mixed with the inputs through recurrent or feedback connections for hidden or output
units. In this way, the neural network contains a memory of the past inputs via the activations
(see Figure 5.2).
Figure 5.2. Partial Recurrent Neural Networks (two diagrams, each with units labeled Context, Input, Hidden, Output)
Two major architectures for limited recurrent networks are widely used. Elman (1990)
suggested allowing feedback from the hidden units to a set of additional inputs called
context units. Earlier, Jordan (1986) described a network with feedback from the output
units back to a set of context units. This form of recurrence is a compromise between the
simplicity of a feed-forward network and the complexity of a fully recurrent neural network
because it still allows the popular back propagation training algorithm (described in
Section 5.3) to be used.
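To make the role of the context units concrete, here is a minimal sketch of one time step of an Elman-style network, in which the context units simply hold a copy of the previous hidden activations. The function name elman_step and all weights are assumptions chosen for illustration; no training is shown.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def elman_step(inputs, context, w_in, w_context, w_out):
    """One time step of an Elman-style network.

    The hidden units see the current inputs plus the context units,
    which hold a copy of the hidden activations from the previous step.
    """
    hidden = []
    for j in range(len(w_in)):
        total = sum(w * x for w, x in zip(w_in[j], inputs))
        total += sum(w * c for w, c in zip(w_context[j], context))
        hidden.append(sigmoid(total))
    outputs = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_out]
    return outputs, hidden  # hidden becomes the next step's context

# Made-up weights: 1 input, 2 hidden (and 2 context) units, 1 output.
w_in = [[1.0], [-1.0]]
w_context = [[0.5, 0.0], [0.0, 0.5]]
w_out = [[1.0, -1.0]]

context = [0.0, 0.0]
history = []
for x in [1.0, 0.0, 0.0]:
    out, context = elman_step([x], context, w_in, w_context, w_out)
    history.append(out[0])
print(history)
```

The second and third inputs are identical, yet the outputs differ because the context units carry a trace of the earlier inputs.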
Fully Recurrent Networks
Fully recurrent networks, as their name suggests, provide two-way connections between
all processors in the neural network. A subset of the units is designated as the input
processors, and they are assigned or clamped to the specified input values. The data then
flows to all adjacent connected units and circulates back and forth until the activation of the
units stabilizes. Figure 5.3 shows the input units feeding into both the hidden units (if any)
and the output units. The activations of the hidden and output units then are recomputed
until the neural network stabilizes. At this point, the output values can be read from the
output layer of processing units.
Figure 5.3. Fully Recurrent Neural Networks (units labeled Input, Hidden, Output)
Fully recurrent networks are complex dynamical systems, and they exhibit all of the
power and instability associated with the limit cycles and chaotic behavior of such systems.
Unlike feed-forward network variants, which have a deterministic time to produce an output
value (based on the time for the data to flow through the network), fully recurrent networks
can take an indeterminate amount of time.
In the best case, the neural network will reverberate a few times and quickly settle into
a stable, minimal energy state. At this time, the output values can be read from the output
units. In less optimal circumstances, the network might cycle quite a few times before it
settles into an answer. In the worst cases, the network will fall into a limit cycle, visiting the
same set of answer states over and over without ever settling down. Another possibility is
that the network will enter a chaotic pattern and never visit the same output state.
By placing some constraints on the connection weights, we can ensure that the network
will enter a stable state. In particular, the connections between units must be symmetrical.
Fully recurrent networks are used primarily for optimization problems and as associative
memories. A nice attribute with optimization problems is that, depending on the time
available, you can choose to get the recurrent network’s current answer or wait a longer
time for it to settle into a better one. This behavior is similar to the performance of people
in certain tasks.
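The settling behavior can be illustrated with a small Hopfield-style associative memory. The text does not name a specific model, so the choice of a Hopfield-style network, the bipolar (+1/-1) coding, and the function names are assumptions for this sketch. The weights are symmetric with a zero diagonal, which is the kind of constraint mentioned above.

```python
def train_symmetric_weights(pattern):
    """Build symmetric weights (w[i][j] == w[j][i], zero diagonal) storing one pattern."""
    n = len(pattern)
    return [[0 if i == j else pattern[i] * pattern[j] for j in range(n)]
            for i in range(n)]

def settle(weights, state, max_sweeps=10):
    """Update units one at a time until no activation changes (a stable state)."""
    state = list(state)
    for sweep in range(1, max_sweeps + 1):
        changed = False
        for i in range(len(state)):
            total = sum(w * s for w, s in zip(weights[i], state))
            # Keep the current value when the total input is exactly zero.
            new_value = state[i] if total == 0 else (1 if total > 0 else -1)
            if new_value != state[i]:
                state[i] = new_value
                changed = True
        if not changed:
            return state, sweep
    return state, max_sweeps

stored = [1, -1, 1, -1, 1, -1]
weights = train_symmetric_weights(stored)
noisy = [-1, -1, 1, -1, 1, -1]   # first unit flipped
recalled, sweeps = settle(weights, noisy)
print(recalled, sweeps)
```

The loop also shows the "read the current answer or wait longer" trade-off: max_sweeps caps how long we are willing to let the network circulate before reading the state.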
5.3 NEURAL NETWORK MODELS
The combination of topology, learning paradigm (supervised or non-supervised learning),
and learning algorithm defines a neural network model. There is a wide selection of popular
neural network models. For data mining, perhaps the back propagation network and the
Kohonen feature map are the most popular. However, there are many different types of
neural networks in use. Some are optimized for fast training, others for fast recall of
stored memories, and still others for computing the best possible answer regardless of
training or recall time. But the best model for a given application or data mining function
depends on the data and the function required.
The discussion that follows is intended to provide an intuitive understanding of the
differences between the major types of neural networks. No details of the mathematics
behind these models are provided.
Back Propagation Networks
A back propagation neural network uses a feed-forward topology, supervised learning,
and the (what else) back propagation learning algorithm. This algorithm was responsible in
large part for the reemergence of neural networks in the mid 1980s.
Figure 5.4. Back Propagation Networks (diagram labels: Input, Actual Output, Specific Desired Output, Error Tolerance, Adjust Weights using Error (Desired - Actual), Learn Rate, Momentum; steps numbered 1 to 3)
Back propagation is a general-purpose learning algorithm. It is powerful but also
expensive in terms of computational requirements for training. A back propagation network
with a single hidden layer of processing elements can model any continuous function to any
degree of accuracy (given enough processing elements in the hidden layer). There are literally
hundreds of variations of back propagation in the neural network literature, and all claim
to be superior to “basic” back propagation in one way or the other. Indeed, since back
propagation is based on a relatively simple form of optimization known as gradient descent,
mathematically astute observers soon proposed modifications using more powerful techniques
such as conjugate gradient and Newton’s methods. However, “basic” back propagation is still
the most widely used variant. Its two primary virtues are that it is simple and easy to
understand, and it works for a wide range of problems.
The basic back propagation algorithm consists of three steps (see Figure 5.4). The input
pattern is presented to the input layer of the network. These inputs are propagated through
the network until they reach the output units. This forward pass produces the actual or
predicted output pattern. Because back propagation is a supervised learning algorithm, the
desired outputs are given as part of the training vector. The actual network outputs are
subtracted from the desired outputs and an error signal is produced. This error signal is
then the basis for the back propagation step, whereby the errors are passed back through
the neural network by computing the contribution of each hidden processing unit and deriving
the corresponding adjustment needed to produce the correct output. The connection weights
are then adjusted and the neural network has just “learned” from an experience.
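The three steps can be sketched as a single training step for a tiny network with one hidden layer and one output unit. This is an illustrative sketch, not a reference implementation: the sigmoid activation, the squared-error measure, the function name train_step, and the starting weights are all assumptions, and momentum is left out for brevity.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_step(x, desired, w_hidden, w_output, learn_rate=0.5):
    """One back propagation step for one hidden layer and one output unit.

    w_hidden[j] holds the weights of hidden unit j (last entry is its threshold);
    w_output holds the output unit's weights (last entry is its threshold).
    Weights are updated in place. Returns the squared error before the update.
    """
    # Step 1: forward pass.
    xs = x + [1.0]  # constant input that feeds the threshold weight
    hidden = [sigmoid(sum(w * v for w, v in zip(ws, xs))) for ws in w_hidden]
    hs = hidden + [1.0]
    actual = sigmoid(sum(w * h for w, h in zip(w_output, hs)))

    # Step 2: error signal (desired minus actual).
    error = desired - actual

    # Step 3: pass the error back and adjust the weights.
    delta_out = error * actual * (1.0 - actual)
    delta_hidden = [delta_out * w_output[j] * hidden[j] * (1.0 - hidden[j])
                    for j in range(len(hidden))]
    for j in range(len(hs)):
        w_output[j] += learn_rate * delta_out * hs[j]
    for j, ws in enumerate(w_hidden):
        for i in range(len(xs)):
            ws[i] += learn_rate * delta_hidden[j] * xs[i]
    return error * error

# Made-up starting weights for a 2-input, 2-hidden, 1-output network.
w_hidden = [[0.2, -0.4, 0.1], [0.7, 0.3, -0.2]]
w_output = [0.5, -0.6, 0.05]
errors = [train_step([1.0, 0.0], 1.0, w_hidden, w_output) for _ in range(200)]
print(errors[0], errors[-1])
```

Repeating the step on the same pattern drives the squared error down, which is the sense in which the network "learns" from each experience.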
As mentioned earlier, back propagation is a powerful and flexible tool for data modeling
and analysis. Suppose you want to do linear regression. A back propagation network with
no hidden units can be easily used to build a regression model relating multiple input
parameters to multiple outputs or dependent variables. This type of back propagation network
actually uses an algorithm called the delta rule, first proposed by Widrow and Hoff (1960).
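As a small illustration of the delta rule, the sketch below fits a straight line with a single linear output unit and no hidden units. The training data, learning rate, and epoch count are made up for the example, and only one input is used to keep it short.

```python
def train_delta_rule(samples, learn_rate=0.1, epochs=2000):
    """Fit y = weight * x + bias with the Widrow-Hoff delta rule (no hidden units)."""
    weight, bias = 0.0, 0.0
    for _ in range(epochs):
        for x, desired in samples:
            actual = weight * x + bias          # linear output unit
            error = desired - actual
            weight += learn_rate * error * x    # delta rule update
            bias += learn_rate * error
    return weight, bias

# Made-up training data generated from y = 2x + 1.
samples = [(x, 2.0 * x + 1.0) for x in (0.0, 0.25, 0.5, 0.75, 1.0)]
weight, bias = train_delta_rule(samples)
print(round(weight, 3), round(bias, 3))
```

Because the data here is exactly linear, the learned weight and bias approach the slope and intercept that generated it.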
Adding a single layer of hidden units turns the linear neural network into a nonlinear
one, capable of performing multivariate logistic regression, but with some distinct advantages
over the traditional statistical technique. Using a back propagation network to do logistic
regression allows you to model multiple outputs at the same time. Confounding effects from
multiple input parameters can be captured in a single back propagation network model.
Back propagation neural networks can be used for classification, modeling, and time-series
forecasting. For classification problems, the input attributes are mapped to the desired
classification categories. The training of the neural network amounts to setting up the
correct set of discriminant functions to correctly classify the inputs. For building models or
function approximation, the input attributes are mapped to the function output. This could
be a single output such as a pricing model, or it could be complex models with multiple
outputs such as trying to predict two or more functions at once.
Two major learning parameters are used to control the training process of a back
propagation network. The learn rate is used to specify whether the neural network is going
to make major adjustments after each learning trial or if it is only going to make minor
adjustments. Momentum is used to control possible oscillations in the weights, which could
be caused by alternately signed error signals. While most commercial back propagation tools
provide anywhere from 1 to 10 or more parameters for you to set, these two will usually
produce the most impact on the neural network training time and performance.
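One common way to combine the two parameters, shown here as a sketch, is to add a fraction of the previous weight change to the current one. The function name momentum_update and the numbers are assumptions for illustration; commercial tools may define these parameters differently.

```python
def momentum_update(weight, gradient_term, prev_change, learn_rate, momentum):
    """Return (new_weight, change) for one weight.

    gradient_term is the error-driven adjustment direction for this weight
    (for example error * input in the delta rule).
    """
    change = learn_rate * gradient_term + momentum * prev_change
    return weight + change, change

# Alternately signed error terms, as in the oscillation case described above.
terms = [1.0, -1.0, 1.0, -1.0]
for momentum in (0.0, 0.5):
    weight, change = 0.0, 0.0
    trace = []
    for term in terms:
        weight, change = momentum_update(weight, term, change, 0.1, momentum)
        trace.append(round(weight, 4))
    print(momentum, trace)
```

With momentum set to zero the weight swings by the full learn-rate step each time; with momentum the swings between successive values are smaller, because each change is partly offset by the previous one.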
Kohonen Feature Maps
Kohonen feature maps are feed-forward networks that use an unsupervised training
algorithm, and through a process called self-organization, configure the output units into a
topological or spatial map. Kohonen (1988) was one of the few researchers who continued
working on neural networks and associative memory even after they lost their cachet as a
research topic in the 1960s. His work was reevaluated during the late 1980s, and the utility
of the self-organizing feature map was recognized. Kohonen has presented several
enhancements to this model, including a supervised learning variant known as Learning
Vector Quantization (LVQ).
A feature map neural network consists of two layers of processing units: an input layer
fully connected to a competitive output layer. There are no hidden units. When an input
pattern is presented to the feature map, the units in the output layer compete with each
other for the right to be declared the winner. The winning output unit is typically the unit
whose incoming connection weights are the closest to the input pattern (in terms of Euclidean
distance). Thus the input is presented and each output unit computes its closeness or match
score to the input pattern. The output that is deemed closest to the input pattern is declared
the winner and so earns the right to have its connection weights adjusted. The connection
weights are moved in the direction of the input pattern by a factor determined by a learning
rate parameter. This is the basic nature of competitive neural networks.
The Kohonen feature map creates a topological mapping by adjusting not only the
winner’s weights, but also adjusting the weights of the adjacent output units in close proximity
or in the neighborhood of the winner. So not only does the winner get adjusted, but the
whole neighborhood of output units gets moved closer to the input pattern. Starting from
randomized weight values, the output units slowly align themselves such that when an
input pattern is presented, a neighborhood of units responds to the input pattern. As training
progresses, the size of the neighborhood radiating out from the winning unit is decreased.
Initially large numbers of output units will be updated, and later on smaller and smaller
numbers are updated until at the end of training only the winning unit is adjusted. Similarly,
the learning rate will decrease as training progresses, and in some implementations, the
learn rate decays with the distance from the winning output unit.
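A single training step of this competitive process might look like the following sketch. For brevity it assumes a one-dimensional row of output units rather than a two-dimensional map, and the function names, weights, learning rate, and neighborhood radius are made up for the example.

```python
def find_winner(weights, pattern):
    """Index of the output unit whose weights are closest (squared Euclidean)."""
    distances = [sum((w - p) ** 2 for w, p in zip(unit, pattern)) for unit in weights]
    return distances.index(min(distances))

def som_step(weights, pattern, learn_rate, radius):
    """Move the winner and its neighbors on a 1-D map toward the input pattern."""
    winner = find_winner(weights, pattern)
    for index, unit in enumerate(weights):
        if abs(index - winner) <= radius:
            for k in range(len(unit)):
                unit[k] += learn_rate * (pattern[k] - unit[k])
    return winner

# Made-up 1-D map of four output units with two inputs each.
weights = [[0.1, 0.1], [0.4, 0.4], [0.6, 0.6], [0.9, 0.9]]
winner = som_step(weights, [1.0, 1.0], learn_rate=0.5, radius=1)
print(winner, weights)
```

In a full training run, both learn_rate and radius would be reduced over time, so that late in training only the winning unit is adjusted.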
Figure 5.5. Kohonen Self-Organizing Feature Maps (diagram labels: Input, Output compete to be Winner, Adjust Weights of Winner toward Input Pattern, Winner, Neighbor, Learn Rate; steps numbered 1 to 3)
Looking at the feature map from the perspective of the connection weights, the Kohonen
map has performed a process called vector quantization or code book generation in the
engineering literature. The connection weights represent a typical or prototype input pattern
for the subset of inputs that fall into that cluster. The process of taking a set of
high-dimensional data and reducing it to a set of clusters is called segmentation. The
high-dimensional input space is reduced to a two-dimensional map. If the index of the winning
output unit is used, it essentially partitions the input patterns into a set of categories or
clusters.
From a data mining perspective, two sets of useful information are available from a
trained feature map. Similar customers, products, or behaviors are automatically clustered
together or segmented so that marketing messages can be targeted at homogeneous groups.
The information in the connection weights of each cluster defines the typical attributes of
an item that falls into that segment. This information lends itself to immediate use for
evaluating what the clusters mean. When combined with appropriate visualization tools
and/or analysis of both the population and segment statistics, the makeup of the segments
identified by the feature map can be analyzed and turned into valuable business intelligence.
Recurrent Back Propagation
Recurrent back propagation is, as the name suggests, a back propagation network with
feedback or recurrent connections. Typically, the feedback is limited to either the hidden
layer units or the output units. In either configuration, adding feedback from the activation
of outputs from the prior pattern introduces a kind of memory to the process. Thus adding
recurrent connections to a back propagation network enhances its ability to learn temporal
sequences without fundamentally changing the training process. Recurrent back propagation
networks will, in general, perform better than regular back propagation networks on
time-series prediction problems.
Radial Basis Function
Radial basis function (RBF) networks are feed-forward networks trained using a
supervised training algorithm. They are typically configured with a single hidden layer of
units whose activation function is selected from a class of functions called basis functions.
While similar to back propagation in many respects, radial basis function networks have
several advantages. They usually train much faster than back propagation networks. They
are less susceptible to problems with non-stationary inputs because of the behavior of the
radial basis function hidden units. Radial basis function networks are similar to the
probabilistic neural networks in many respects (Wasserman, 1993). Popularized by Moody
and Darken (1989), radial basis function networks have proven to be a useful neural network
architecture. The major difference between radial basis function networks and back
propagation networks is the behavior of the single hidden layer. Rather than using the
sigmoidal or S-shaped activation function as in back propagation, the hidden units in RBF
networks use a Gaussian or some other basis kernel function. Each hidden unit acts as a
locally tuned processor that computes a score for the match between the input vector and
its connection weights or centers. In effect, the basis units are highly specialized pattern
detectors. The weights connecting the basis units to the outputs are used to take linear
combinations of the hidden units to produce the final classification or output.
Remember that in a back propagation network, all weights in all of the layers are
adjusted at the same time. In radial basis function networks, however, the weights in the
hidden layer basis units are usually set before the second layer of weights is adjusted. As
the input moves away from the connection weights, the activation value falls off. This
behavior leads to the use of the term “center” for the first-layer weights. These center
weights can be computed using Kohonen feature maps, statistical methods such as K-Means
clustering, or some other means. In any case, they are then used to set the areas of sensitivity
for the RBF hidden units, which then remain fixed. Once the hidden layer weights are set,
a second phase of training is used to adjust the output weights. This process typically uses
the standard back propagation training rule.
In its simplest form, all hidden units in the RBF network have the same width or
degree of sensitivity to inputs. However, in portions of the input space where there are few
patterns, it is sometimes desirable to have hidden units with a wide area of reception.
Likewise, in portions of the input space that are crowded, it might be desirable to have
very highly tuned processors with narrow reception fields. Computing these individual widths
increases the performance of the RBF network at the expense of a more complicated training
process.
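The behavior of the hidden layer can be sketched with a Gaussian basis function, which is the kernel the text mentions. The centers, widths, output weights, and function names below are made up for illustration, and no training of either layer is shown.

```python
import math

def rbf_activation(inputs, center, width):
    """Gaussian basis unit: 1.0 at the center, falling off with distance."""
    squared_distance = sum((x - c) ** 2 for x, c in zip(inputs, center))
    return math.exp(-squared_distance / (2.0 * width ** 2))

def rbf_output(inputs, centers, widths, output_weights):
    """Linear combination of the hidden basis units."""
    hidden = [rbf_activation(inputs, c, s) for c, s in zip(centers, widths)]
    return sum(w * h for w, h in zip(output_weights, hidden))

# Made-up centers: a narrow unit in a crowded region, a wide unit in a sparse one.
centers = [[0.0, 0.0], [3.0, 3.0]]
widths = [0.5, 2.0]
output_weights = [1.0, -1.0]
print(rbf_activation([0.0, 0.0], centers[0], widths[0]))
print(rbf_output([0.2, 0.1], centers, widths, output_weights))
```

At the same distance from its center, a unit with a small width responds much more weakly than one with a large width, which is what "narrow" and "wide" reception fields mean in practice.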
Adaptive Resonance Theory
Adaptive resonance theory (ART) networks are a family of recurrent networks that can
be used for clustering. Based on the work of researcher Stephen Grossberg (1987), the ART
models are designed to be biologically plausible. Input patterns are presented to the network,
and an output unit is declared a winner in a process similar to the Kohonen feature maps.
However, the feedback connections from the winner output encode the expected input pattern
template. If the actual input pattern does not match the expected connection weights to a
sufficient degree, then the winner output is shut off, and the next closest output unit is
declared as the winner. This process continues until the expectation of one of the output
units is satisfied to within the required tolerance. If none of the output units wins, then a new
output unit is committed with the initial expected pattern set to the current input pattern.
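The search-and-commit cycle can be sketched for binary inputs as follows. This is a simplified, ART1-style illustration rather than Grossberg's full model: the choice score, the vigilance test, the fast-learning update, and the parameter values (vigilance, beta) are assumptions commonly used in textbook presentations, and each pattern is assumed to contain at least one 1.

```python
def art1_cluster(patterns, vigilance=0.7, beta=0.5):
    """Assign binary patterns to clusters with an ART1-style search.

    Prototypes are tried in order of a choice score; one that fails the
    vigilance (match) test is shut off and the next is tried. If none
    passes, a new output unit is committed with the input as its template.
    """
    prototypes, labels = [], []
    for pattern in patterns:
        overlaps = [sum(p & w for p, w in zip(pattern, proto)) for proto in prototypes]
        order = sorted(range(len(prototypes)),
                       key=lambda j: overlaps[j] / (beta + sum(prototypes[j])),
                       reverse=True)
        chosen = None
        for j in order:
            if overlaps[j] / sum(pattern) >= vigilance:   # match test
                chosen = j
                break
        if chosen is None:
            prototypes.append(list(pattern))               # commit a new unit
            chosen = len(prototypes) - 1
        else:
            # Fast learning: keep only the features shared with the input.
            prototypes[chosen] = [p & w for p, w in zip(pattern, prototypes[chosen])]
        labels.append(chosen)
    return labels, prototypes

patterns = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 1, 0]]
labels, prototypes = art1_cluster(patterns)
print(labels)
```

Raising the vigilance value makes the match test stricter, so more output units are committed and the clusters become finer.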
The ART family of networks has been expanded through the addition of fuzzy logic,
which allows real-valued inputs, and through the ARTMAP architecture, which allows
supervised training. The ARTMAP architecture uses back-to-back ART networks, one to
classify the input patterns and one to encode the matching output patterns. The MAP part
of ARTMAP is a field of units (or indexes, depending on the implementation) that serves as
an index between the input ART network and the output ART network. While the details
of the training algorithm are quite complex, the basic operation for recall is surprisingly
simple. The input pattern is presented to the input ART network, which comes up with a
winner output. This winner output is mapped to a corresponding output unit in the output
ART network. The expected pattern is read out of the output ART network, which provides
the overall output or prediction pattern.
Probabilistic Neural Networks
Probabilistic neural networks (PNN) feature a feed-forward architecture and supervised
training algorithm similar to back propagation (Specht, 1990). Instead of adjusting the input
layer weights using the generalized delta rule, each training input pattern is used as the
connection weights to a new hidden unit. In effect, each input pattern is incorporated into
the PNN architecture. This technique is extremely fast, since only one pass through the
network is required to set the input connection weights. Additional passes might be used to
adjust the output weights to fine-tune the network outputs.
Several researchers have recognized that adding a hidden unit for each input pattern
might be overkill. Various clustering schemes have been proposed to cut down on the
number of hidden units when input patterns are close in input space and can be represented
by a single hidden unit. Probabilistic neural networks offer several advantages over back
propagation networks (Wasserman, 1993). Training is much faster, usually a single pass.
Given enough input data, the PNN will converge to a Bayesian (optimum) classifier.
Probabilistic neural networks allow true incremental learning where new training data can
be added at any time without requiring retraining of the entire network. And because of the
statistical basis for the PNN, it can give an indication of the amount of evidence it has for
basing its decision.
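The single-pass, one-hidden-unit-per-pattern idea can be sketched as below. The class name ProbabilisticNetwork, the Gaussian kernel, the fixed width, and the toy data are assumptions for illustration; the per-class averaging is one simple way to turn the hidden activations into class evidence.

```python
import math

class ProbabilisticNetwork:
    """Minimal PNN-style classifier: one Gaussian hidden unit per training pattern."""

    def __init__(self, width=0.5):
        self.width = width
        self.units = []          # (pattern, class_label) pairs

    def add_pattern(self, pattern, label):
        """Single-pass training: store the pattern as a new hidden unit's weights."""
        self.units.append((list(pattern), label))

    def class_scores(self, inputs):
        """Average kernel activation per class (an estimate of the evidence)."""
        totals, counts = {}, {}
        for pattern, label in self.units:
            d2 = sum((x - p) ** 2 for x, p in zip(inputs, pattern))
            totals[label] = totals.get(label, 0.0) + math.exp(-d2 / (2.0 * self.width ** 2))
            counts[label] = counts.get(label, 0) + 1
        return {label: totals[label] / counts[label] for label in totals}

    def classify(self, inputs):
        scores = self.class_scores(inputs)
        return max(scores, key=scores.get)

net = ProbabilisticNetwork()
for pattern, label in [([0.0, 0.0], "low"), ([0.2, 0.1], "low"), ([1.0, 1.0], "high")]:
    net.add_pattern(pattern, label)
print(net.classify([0.1, 0.1]), net.classify([0.9, 1.1]))
net.add_pattern([0.8, 0.9], "high")   # incremental learning, no retraining
```

The scores returned by class_scores give a rough indication of how much evidence supports each class, and new patterns can be added at any time without touching the existing units.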
Table 5.1. Neural Network Models and Their Functions

Model                             Training Paradigm   Topology            Primary Functions
Adaptive resonance theory         Unsupervised        Recurrent           Clustering
ARTMAP                            Supervised          Recurrent           Classification
Back propagation                  Supervised          Feed-forward        Classification, modeling, time-series
Radial basis function networks    Supervised          Feed-forward        Classification, modeling, time-series
Probabilistic neural networks     Supervised          Feed-forward        Classification
Kohonen feature map               Unsupervised        Feed-forward        Clustering
Learning vector quantization      Supervised          Feed-forward        Classification
Recurrent back propagation        Supervised          Limited recurrent   Modeling, time-series
Temporal difference learning      Reinforcement       Feed-forward        Time-series
Key Issues in Selecting Models and Architecture
Selecting which neural network model to use for a particular application is
straightforward if you use the following process. First, select the function you want to
perform. This can include clustering, classification, modeling, or time-series approximation.
Then look at the input data with which you have to train the network. If the data is all
binary, or if it contains real-valued inputs, that might disqualify some of the network
architectures. Next you should determine how much data you have and how fast you need
to train the network. This might suggest using probabilistic neural networks or radial basis
function networks rather than a back propagation network. Table 5.1 can be used to aid in
this selection process. Most commercial neural network tools should support at least one
variant of these algorithms.
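As a rough sketch of how Table 5.1 can drive the first step of this process, the table can be held as a small lookup structure and filtered by the required function. The rows are transcribed from Table 5.1; the name candidate_models and the lowercase function labels are assumptions for the example.

```python
# Rows transcribed from Table 5.1: (model, training paradigm, topology, functions).
MODELS = [
    ("Adaptive resonance theory", "Unsupervised", "Recurrent", {"clustering"}),
    ("ARTMAP", "Supervised", "Recurrent", {"classification"}),
    ("Back propagation", "Supervised", "Feed-forward",
     {"classification", "modeling", "time-series"}),
    ("Radial basis function networks", "Supervised", "Feed-forward",
     {"classification", "modeling", "time-series"}),
    ("Probabilistic neural networks", "Supervised", "Feed-forward", {"classification"}),
    ("Kohonen feature map", "Unsupervised", "Feed-forward", {"clustering"}),
    ("Learning vector quantization", "Supervised", "Feed-forward", {"classification"}),
    ("Recurrent back propagation", "Supervised", "Limited recurrent",
     {"modeling", "time-series"}),
    ("Temporal difference learning", "Reinforcement", "Feed-forward", {"time-series"}),
]

def candidate_models(function, paradigm=None):
    """List the models in Table 5.1 that support the requested function."""
    return [name for name, model_paradigm, _topology, functions in MODELS
            if function in functions
            and (paradigm is None or model_paradigm == paradigm)]

print(candidate_models("clustering"))
print(candidate_models("time-series", paradigm="Supervised"))
```

The remaining steps (data type, data quantity, and training speed) would narrow this candidate list further.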
Our definition of architecture is the number of input, hidden, and output units. Under
this definition, you might select a back propagation model but explore several different
architectures having different numbers of hidden layers and/or hidden units.
Data Type and Quantity. In some cases, whether the data is all binary or contains some
real numbers might help to determine which neural network model to use. The
standard ART network (called ART 1) works only with binary data and is probably preferable
to Kohonen maps for clustering if the data is all binary. If the input data has real values,
then fuzzy ART or Kohonen maps should be used.
Training Requirements. (Online or batch learning) In general, whenever we want
online learning, then training speed becomes the overriding factor in determining which
neural network model to use. Back propagation and recurrent back propagation train quite
slowly and so are almost never used in real-time or online learning situations. ART and
radial basis function networks, however, train quite fast, usually in a few passes over the
data.
Functional Requirements. Based on the function required, some models can be disqualified.
For example, ART and Kohonen feature maps are clustering algorithms. They cannot be
used for modeling or time-series forecasting. If you need to do clustering, then back
propagation could be used, but it will train much more slowly than ART or Kohonen
maps.
5.4 ITERATIVE DEVELOPMENT PROCESS
Despite all your selections, it is quite possible that the first or second time that you try
to train it, the neural network will not be able to meet your acceptance criteria. When this
happens you are then in a troubleshooting mode. What can be wrong and how can you fix
it?
The major steps of the iterative development process are data selection and
representation, neural network model selection, architecture specification, training parameter
selection, and choosing appropriate acceptance criteria. If any of these decisions is off
the mark, the neural network might not be able to learn what you are trying to teach it.
In the following sections, I describe the major decision points and the recovery options when
things go wrong during training.
Network Convergence Issues
How do you know that you are in trouble while training a neural network model? The
first hint is that the network takes a long, long time to train while you are monitoring
its classification accuracy or prediction accuracy. If you are
plotting the RMS error, you will see that it falls quickly and then stays flat, or that it
oscillates up and down. Either of these two conditions might mean that the network is
trapped in a local minimum, while the objective is to reach the global minimum.
There are two primary ways around this problem. First, you can add some random
noise to the neural network weights in order to try to break it free from the local minimum.
The other option is to reset the network weights to new random values and start training
all over again. Even this might not be enough to get the neural network to converge on a
solution. Any of the design decisions you made might be negatively impacting the ability of
the neural network to learn the function you are trying to teach.
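A monitoring loop could check for these symptoms and apply either remedy roughly as sketched below. The function names, the window size, the tolerance, and the noise scales are all hypothetical choices for illustration, not recommended settings.

```python
import random

def diagnose(rms_history, window=4, tolerance=0.01):
    """Classify recent RMS error behavior as 'flat', 'oscillating', or 'trending'."""
    recent = rms_history[-window:]
    if max(recent) - min(recent) < tolerance:
        return "flat"
    changes = [b - a for a, b in zip(recent, recent[1:])]
    sign_flips = sum(1 for a, b in zip(changes, changes[1:]) if a * b < 0)
    return "oscillating" if sign_flips >= 2 else "trending"

def add_noise(weights, scale=0.05, rng=random):
    """Return a copy of the weights with small random noise added to each."""
    return [w + rng.uniform(-scale, scale) for w in weights]

def reset_weights(count, scale=0.5, rng=random):
    """Return fresh random weights to restart training from scratch."""
    return [rng.uniform(-scale, scale) for _ in range(count)]

print(diagnose([0.9, 0.5, 0.3, 0.3, 0.3, 0.3]))
print(diagnose([0.5, 0.3, 0.5, 0.3, 0.5]))
print(diagnose([0.9, 0.7, 0.5, 0.3]))
```

A "flat" or "oscillating" result is only a hint that the network may be stuck; as noted above, the cause may lie in an earlier design decision rather than in the weights themselves.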
Model Selection
It is sometimes best to revisit your major choices in the same order as your original
decisions. Did you select an inappropriate neural network model for the function you are
trying to perform? If so, picking a neural network model that can perform the function is
the solution. If not, it is most likely a simple matter of adding more hidden units or another
layer of hidden units. In practice, one layer of hidden units usually will suffice. Two layers
are required only if you have added a large number of hidden units and the network still
has not converged. If you do not provide enough hidden units, the neural network will not
have the computational power to learn some complex nonlinear functions.
Other factors besides the neural network architecture could be at work. Maybe the data
has a strong temporal or time element embedded in it. Often a recurrent back propagation
or a radial basis function network will perform better than regular back propagation. If the
inputs are non-stationary, that is they change slowly over time, then radial basis function
networks are definitely going to work best.
Data Representation
If a neural network does not converge to a solution, and if you are sure that your model
architecture is appropriate for the problem, then the next thing is to reevaluate your data
representation decisions. In some cases, a key input parameter is not being scaled or coded
in a manner that lets the neural network learn its importance to the function at hand. One
example is a continuous variable, which has a large range in the original domain and is
scaled down to a 0 to 1 value for presentation to the neural network. Perhaps a thermometer
coding with one unit for each magnitude of 10 is in order. This would change the
representation of the input parameter from a single input to 5, 6, or 7 inputs, depending on
the range of the value.
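To make the contrast concrete, here is a minimal Python sketch of the two representations
discussed above. It assumes one possible reading of thermometer coding, in which unit k
switches on once the value reaches 10 to the power k. The function names and the sample
range of 0 to 1,000,000 are illustrative assumptions, not taken from any particular tool.

```python
def min_max_scale(value, lo, hi):
    """Linearly map value from the range [lo, hi] onto [0, 1]."""
    return (value - lo) / (hi - lo)


def thermometer_code(value, n_units):
    """Unit k is 1.0 when value >= 10**k, else 0.0, for k = 0 .. n_units - 1.

    This is an assumed interpretation of "one unit for each magnitude of 10".
    """
    return [1.0 if value >= 10 ** k else 0.0 for k in range(n_units)]


# Two values that differ by two orders of magnitude.
small, large = 50, 5000

# Plain scaling over an assumed range of 0 .. 1,000,000.
scaled_small = min_max_scale(small, 0, 1_000_000)
scaled_large = min_max_scale(large, 0, 1_000_000)

# Thermometer coding with six units.
code_small = thermometer_code(small, 6)
code_large = thermometer_code(large, 6)

print(scaled_small, scaled_large)
print(code_small)
print(code_large)
```

With plain scaling, 50 and 5,000 both land at or below 0.005 and are hard for the network
to tell apart, whereas their thermometer codes differ in two of the six units.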
A more serious problem is a key parameter missing from the training data. In some
ways, this is the most difficult problem to detect. You can easily spend much time playing
around with the data representation trying to get the network to converge. Unfortunately,
this is one area where experience is required to know what a normal training process feels
like and what one that is doomed to failure feels like. It is also important to have a domain
expert involved who can provide ideas when things are not working. A domain expert might
recognize an important parameter missing from the training data.
Model Architectures
In some cases, we have done everything right, but the network just won’t converge. It
could be that the problem is just too complex for the architecture you have specified. By
adding additional hidden units, and even another hidden layer, you are enhancing the
computational abilities of the neural network. Each new connection weight is another free
variable, which can be adjusted. That is why it is good practice to start out with an abundant
supply of hidden units when you first start working on a problem. Once you are sure that
the neural network can learn the function, you can start reducing the number of hidden
units until the generalization performance meets your requirements. But beware. Too much
of a good thing can be bad, too!
If some additional hidden units are good, is adding a few more better? In most cases,
no! Giving the neural network more hidden units (and the associated connection weights)
can actually make it too easy for the network. In some cases, the neural network will simply
learn to memorize the training patterns. The neural network has optimized to the training
set’s particular patterns and has not extracted the important relationships in the data. You
could have saved yourself time and money by just using a lookup table. The whole point is
to get the neural network to detect key features in the data in order to generalize when
presented with patterns it has not seen before. There is nothing worse than a fat, lazy
neural network. By keeping the hidden layers as thin as possible, you usually get the best
results.
Avoiding Over-Training
While training a neural network, it is important to understand when to stop. It is
natural to think that if 100 epochs are good, then 1000 epochs will be much better. However,
this intuitive idea of “more practice is better” doesn’t hold with neural networks. If the same
training patterns or examples are given to the neural network over and over, and the
weights are adjusted to match the desired outputs, we are essentially telling the network
to memorize the patterns, rather than to extract the essence of the relationships. What
happens is that the neural network performs extremely well on the training data. However,
when it is presented with patterns it hasn’t seen before, it cannot generalize and does not
perform well. What is the problem? It is called over-training.
Over-training a neural network is similar to an athlete who practices and practices for
an event on a home court. When the actual competition starts and the athlete is faced with
an unfamiliar arena and circumstances, it might be impossible to react and perform at the
same level as during training.
It is important to remember that we are not trying to get the neural network to make
the best predictions it can on the training data. We are trying to optimize its performance
on the testing and validation data. Most commercial neural network tools provide the means
to switch automatically between training and testing data. The idea is to check the network
performance on the testing data while you are training.
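The idea of checking test performance during training can be sketched as an early-stopping
loop. In this Python sketch, train_one_epoch and test_error are hypothetical callbacks
standing in for whatever a particular tool provides, and patience (the number of epochs to
wait without improvement) is an assumed parameter. The error curve is simulated so that the
example runs on its own.

```python
def train_with_early_stopping(train_one_epoch, test_error, max_epochs, patience):
    """Train until the test error has not improved for `patience` epochs.

    Returns the epoch with the lowest test error and that error.
    """
    best_error = float("inf")
    best_epoch = 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        error = test_error()
        if error < best_error:
            best_error, best_epoch = error, epoch
        elif epoch - best_epoch >= patience:
            break  # test error stopped improving: further epochs only memorize
    return best_epoch, best_error


# Simulated training run (an assumption for illustration): the test error
# falls until epoch 20 and then rises again as over-training sets in.
state = {"epoch": 0}


def fake_train_one_epoch():
    state["epoch"] += 1


def fake_test_error():
    return 0.1 + ((state["epoch"] - 20) ** 2) / 400.0


best_epoch, best_error = train_with_early_stopping(
    fake_train_one_epoch, fake_test_error, max_epochs=1000, patience=5
)
print(best_epoch, best_error, state["epoch"])
```

Here training stops after 25 epochs rather than 1000, and epoch 20 is reported as the best.
In a real setting you would also save the weights at the best epoch and restore them after
the loop ends.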
Automating the Process
What has been described in the preceding sections is the manual process of building a
neural network model. It requires some degree of skill and experience with neural networks
and model building in order to be successful. Having to tweak many parameters and make
somewhat arbitrary decisions concerning the neural network architecture does not seem like
a great advantage to some application developers. Because of this, researchers have worked
in a variety of ways to minimize these problems.
Perhaps the first attempt was to automate the selection of the appropriate number of
hidden layers and hidden units in the neural network. This was approached in a number
of ways: a priori attempts to compute the required architecture by looking at the data,
building arbitrarily large networks and then pruning out nodes and connections until the
smallest network that could do the job is produced, and starting with a small network and
then growing it until it can perform the task appropriately.
Genetic algorithms are often used to optimize functions using parallel search methods
based on the biological theory of natural selection. If we view the selection of the number
of hidden layers and hidden units as an optimization problem, genetic algorithms can be
used to find the optimum architecture.
The idea of pruning nodes and weights from neural networks in order to improve their
generalization capabilities has been explored by several research groups (Sietsma and Dow,
1988). A network with an arbitrarily large number of hidden units is created and trained
to perform some processing function. Then the weights connected to a node are analyzed to
see if they contribute to the accurate prediction of the output pattern. If the weights are
extremely small, or if they do not impact the prediction error when they are removed, then
that node and its weights are pruned or removed from the network. This process continues
until the removal of any additional node causes a decrease in the performance on the test
set.
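The pruning loop described above can be sketched in a few lines of Python. This is a
simplified greedy variant written for illustration, not the procedure of the cited work: the
network is a single hidden layer of tanh units with hand-set weights, the helper names are
hypothetical, and a node is dropped whenever the test error stays within an assumed
tolerance of the unpruned network's error.

```python
import math


def predict(x, hidden, out_bias):
    """One hidden layer of tanh units feeding a single linear output.

    hidden is a list of (input_weights, bias, output_weight) tuples.
    """
    total = out_bias
    for w_in, bias, w_out in hidden:
        activation = math.tanh(sum(w * xi for w, xi in zip(w_in, x)) + bias)
        total += w_out * activation
    return total


def mean_squared_error(data, hidden, out_bias):
    return sum((predict(x, hidden, out_bias) - y) ** 2 for x, y in data) / len(data)


def prune(hidden, out_bias, test_data, tolerance):
    """Remove hidden nodes one at a time while the test error stays within
    `tolerance` of the error of the original, unpruned network."""
    hidden = list(hidden)
    baseline = mean_squared_error(test_data, hidden, out_bias)
    removed_one = True
    while removed_one and len(hidden) > 1:
        removed_one = False
        for i in range(len(hidden)):
            candidate = hidden[:i] + hidden[i + 1:]
            if mean_squared_error(test_data, candidate, out_bias) <= baseline + tolerance:
                hidden = candidate
                removed_one = True
                break
    return hidden


# Illustrative network: one node does the work, two have near-zero output weights.
useful = ([1.0], 0.0, 2.0)
idle_a = ([0.5], 0.1, 1e-6)
idle_b = ([-0.7], 0.2, -1e-6)
test_data = [([x], 2.0 * math.tanh(x)) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]

pruned = prune([useful, idle_a, idle_b], 0.0, test_data, tolerance=1e-6)
print(len(pruned), pruned)
```

Only the node that actually contributes to the prediction survives. A practical
implementation would normally retrain the remaining weights after each removal, which this
sketch omits.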
Several researchers have also explored the opposite approach to pruning. That is, a
small neural network is created, and additional hidden nodes and weights are added
incrementally. The network prediction error is monitored, and as long as performance on
the test data is improving, additional hidden units are added. The cascade correlation
network allocates a whole set of potential new network nodes. These new nodes compete
with each other and the one that reduces the prediction error the most is added to the
network. Perhaps the highest level of automation of the neural network data mining process
will come with the use of intelligent agents.
5.5 STRENGTHS AND WEAKNESSES OF ARTIFICIAL NEURAL NETWORKS
Strengths of Artificial Neural Networks
• Neural Networks are Versatile. Neural networks provide a very general way of
approaching problems. When the output of the network is continuous, such as the
appraised value of a home, then it is performing prediction. When the output has
discrete values, then it is doing classification. A simple re-arrangement of the
neurons and the network becomes adept at detecting clusters.
The fact that neural networks are so versatile definitely accounts for their popularity.
The effort needed to learn how to use them and to learn how to massage data is
not wasted, since the knowledge can be applied wherever neural networks would
be appropriate.
• Neural Networks can Produce Good Results in Complicated Domains. Neural
networks produce good results. Across a large number of industries and a large
number of applications, neural networks have proven themselves over and over
again. These results come in complicated domains, such as analyzing time series
and detecting fraud, that are not easily amenable to other techniques. The largest
neural network in production use is probably the system that AT&T uses for reading
numbers on checks. This neural network has hundreds of thousands of units
organized into seven layers.
As compared to standard statistics or to decision-tree approaches, neural networks
are much more powerful. They incorporate non-linear combinations of features into
their results, not limiting themselves to rectangular regions of the solution space.
They are able to take advantage of all the possible combinations of features to
arrive at the best solution.
• Neural Networks can Handle Categorical and Continuous Data Types.
Although the data has to be massaged, neural networks have proven themselves
using both categorical and continuous data, both for inputs and outputs. Categorical
data can be handled in two different ways, either by using a single unit with each
category given a subset of the range from 0 to 1 or by using a separate unit for each
category. Continuous data is easily mapped into the necessary range.
• Neural Networks are Available in Many Off-the-Shelf Packages. Because of
the versatility of neural networks and their track record of good results, many
software vendors provide off-the-shelf tools for neural networks. The competition
between vendors makes these packages easy to use and ensures that advances in
the theory of neural networks are brought to market.
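The two ways of handling categorical data mentioned in the list above can be illustrated
with a short Python sketch. The function names are hypothetical, and placing each category
at the midpoint of an equal-width slice of the 0 to 1 range is an assumption made for the
example; a tool may assign the subranges differently.

```python
def single_unit_code(category, categories):
    """Encode a category on one unit: each category owns an equal-width
    slice of [0, 1] and is represented by the midpoint of its slice."""
    index = categories.index(category)
    return (index + 0.5) / len(categories)


def one_unit_per_category_code(category, categories):
    """Encode a category on len(categories) units: the matching unit is 1.0
    and all the others are 0.0."""
    return [1.0 if c == category else 0.0 for c in categories]


colors = ["red", "green", "blue", "black"]

single = single_unit_code("green", colors)
separate = one_unit_per_category_code("green", colors)

print(single)
print(separate)
```

The single-unit form keeps the network small but implies an ordering among the categories
that may not exist in the data. The separate-unit form avoids that at the cost of more
inputs and connection weights.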
Weaknesses of Artificial Neural Networks
• All Inputs and Outputs Must be Massaged to [0, 1]. The inputs to a neural
network must be massaged to be in a particular range, usually between 0 and 1.
This requires additional transforms and manipulations of the input data that require
additional time, CPU power, and disk space. In addition, the choice of transform
can affect the results of the network. Fortunately, tools try to make this massaging
process as simple as possible. Good tools provide histograms for seeing categorical
values and automatically transform numeric values into the range. Still, skewed
distributions with a few outliers can result in poor neural network performance.
The requirement to massage the data is actually a mixed blessing. It requires
analyzing the training set to verify the data values and their ranges. Since data
quality is the number one issue in data mining, this additional perusal of the data
can actually forestall problems later in the analysis.
• Neural Networks cannot Explain Results. This is the biggest criticism directed
at neural networks. In domains where explaining rules may be critical, such as
denying loan applications, neural networks are not the tool of choice. They are the
tool of choice when acting on the results is more important than understanding
them. Even though neural networks cannot produce explicit rules, sensitivity analysis
does enable them to explain which inputs are more important than others. This
analysis can be performed inside the network, by using the errors generated from
back propagation, or it can be performed externally by poking the network with
specific inputs.
• Neural Networks may Converge on an Inferior Solution. Neural networks
usually converge on some solution for any given training set. Unfortunately, there
is no guarantee that this solution provides the best model of the data. Use the test
set to determine when a model provides good enough performance to be used on
unknown data.
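The external form of sensitivity analysis mentioned in the list above, poking the network
with specific inputs, can be sketched as follows. The Python code treats the trained network
as any callable that maps a list of inputs to one output. The stand-in model, the sample
points, and the step size delta are all assumptions chosen for illustration.

```python
def sensitivity(model, samples, delta=0.01):
    """For each input, nudge it by `delta` on every sample and report the
    average absolute change in the output per unit of input change."""
    n_inputs = len(samples[0])
    scores = []
    for i in range(n_inputs):
        total = 0.0
        for x in samples:
            nudged = list(x)
            nudged[i] += delta
            total += abs(model(nudged) - model(x))
        scores.append(total / (len(samples) * delta))
    return scores


def stand_in_model(x):
    """Hypothetical stand-in for a trained network: it depends strongly on
    the first input and only weakly on the second."""
    return 3.0 * x[0] + 0.1 * x[1]


samples = [[0.2, 0.9], [0.5, 0.1], [0.8, 0.4]]
scores = sensitivity(stand_in_model, samples)
print(scores)
```

The first score comes out close to 3 and the second close to 0.1, ranking the first input
as the more important one. This ranks inputs but still does not yield explicit rules.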
In Summary
Neural networks are a different paradigm for computing, and they act as a bridge
between the digital computer and the neural connections in the human brain. There are
three major connection topologies that define how data flows between the input, hidden and
output processing units of networks. These main categories are feed forward, limited
recurrent and fully recurrent networks.
The key issues in selecting models and architecture include data type and quantity,
training requirements, and functional requirements.
