Data Warehousing and Business Intelligence
In this section of techtiks, you will find detailed information and must-read articles
related to Data Warehousing and Business Intelligence. The target audience ranges
from beginners to experts. These articles are written by highly qualified Data
Whenever you see or hear the words "Data Warehouse", you might think of some
large building that has bits of information stored on shelves, waiting for someone to
retrieve them. Think of a traditional warehouse: what does it contain?
It contains goods stored in such a way that they are easy to identify and
can be quickly retrieved. A Data Warehouse functions in a similar way. Then
how is a Data Warehouse different from a traditional relational database? A relational
database is similar to a Data Warehouse, but there are certain defining differences. You
will see the differences in the following articles.
• Data Warehouse Defined
• Elements of Data Warehouse
• History of Data Warehousing
• Dimensional Modeling
Data Warehouse Definition
Consider an example where a business analyst uses the systems containing
operational data (the data that runs the daily transactions of your business).
Analysts can use information about which products were sold in which regions at
what time of the year to look for anomalies or to project future sales. However,
there are several problems if the analyst accesses operational data directly:
• The analyst might not have the expertise to query the operational database. For
example, querying IMS databases requires an application program that uses a
specialized type of data manipulation language. In general, those
programmers who have the expertise to query the operational database have
a full-time job maintaining the database and its applications.
• Performance is critical for many operational databases, such as databases for
a bank. The system cannot handle users making ad-hoc queries.
• The operational data generally is not in the best format to be used for
Data warehousing solves these problems. In data warehousing, you create stores of
informational data: data that is extracted from the operational data and then
transformed for reporting and decision making. For example, a data warehousing
tool might copy all the sales data from the operational database, perform
calculations to summarize the data, and write it to a new database. End users can
query the new database (the warehouse) without impacting the operational
databases.
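The copy, summarize, and write steps described above can be sketched in a few lines. This is only a minimal illustration using Python's built-in sqlite3 module with two in-memory databases; the table names, column names, and sample rows are assumptions made for the example and do not come from any particular warehousing tool.

```python
import sqlite3

# Hypothetical operational store: one row per sale (all names are illustrative).
operational = sqlite3.connect(":memory:")
operational.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
operational.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("NorthWest", "kettle", 20.0),
        ("NorthWest", "toaster", 30.0),
        ("ASIA", "kettle", 25.0),
    ],
)

# Extract and transform: summarize the sales data per region.
summary = operational.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# Load: write the summarized data to a separate warehouse database.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales_by_region (region TEXT, total_amount REAL)")
warehouse.executemany("INSERT INTO sales_by_region VALUES (?, ?)", summary)

# Analysts query the warehouse copy, not the operational system.
totals = dict(warehouse.execute("SELECT region, total_amount FROM sales_by_region"))
```

After the load step, reporting queries touch only the warehouse connection, so they place no load on the operational one.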
• The purpose of a data warehouse is to store data consistently across the
organization and to make the organizational information accessible.
• It is an adaptive and resilient source of information. When new data is added to
the Data Warehouse, the existing data and technologies are not disrupted.
The design of separate data marts that make up the data warehouse must be
distributed and incremental. Anything else is a compromise.
• The data warehouse not only controls access to the data, but gives its
owners great visibility into the uses and abuses of the data, even after it has
left the data warehouse.
• The data warehouse is the foundation for decision-making.
Elements of Data Warehouse
Typically, in any organization the data is stored in various databases, usually divided
up by system. There may be data for marketing, sales, payroll, engineering,
etc. These systems might be legacy/mainframe systems or relational database
systems.
The data coming from various source systems is first kept in a staging area. The
staging area is used to clean, transform, combine, de-duplicate, household, archive,
and prepare source data for use in the data warehouse. The data coming from a source
system is kept as it is in this area. The staging area need not be based on the relational model.
Sometimes managers of the data are comfortable with a normalized set of data. In
these cases, a normalized structure for the data staging storage is certainly acceptable.
Also, the staging area doesn't provide querying/presentation services.
Once the data is in the staging area, it is cleansed, transformed and then sent to the Data
Warehouse. You may or may not have an ODS before transferring data to the Data
Warehouse.
The data in the Data Warehouse has to be easily manipulated in order to answer the
business questions from management and other users. This is accomplished by
connecting the data to fast and easy-to-use tools known as Online Analytical
Processing (OLAP) tools. OLAP tools can be thought of as super high-speed forklifts
that have knowledge of the warehouse and the operators built into them in order to
allow ordinary people off the street to jump in and quickly find products by asking
English-like questions. Within the OLAP server, data is reorganized to meet the
reporting and analysis requirements of the business, including:
• Exception reporting
• Ad-hoc analysis
• Actual vs. budget reporting
• Data mining (looking for trends or anomalies in the data)
In order to process business queries at high speed, answers to common questions
are preprocessed in some OLAP servers, resulting in exceptional query response times at
the cost of having an OLAP database that may be several times bigger than the data
warehouse.
A data mart is a logical subset of the complete data warehouse. It is often viewed as
the restriction of the data warehouse to a single business process or to a group of
related business processes targeted toward a particular business group. For
example, an organization may have a data mart for Sales or Inventory.
Quick Reference Guide to Dimensional Modeling
Dimensional modeling is the design concept used by many data warehouse designers
to build their data warehouse. The dimensional model is the underlying data model used
by many of the commercial OLAP products available today. Designing a
data warehouse is very different from designing an online transaction processing
(OLTP) system. In contrast to an OLTP system, in which the purpose is to capture
high rates of data changes and additions, the purpose of a data warehouse is to
organize large amounts of stable data for ease of analysis and retrieval. Because of
these differing purposes, there are many considerations in data warehouse design
that differ from OLTP database design. In a dimensional model, all data is contained in
two types of tables: fact tables and dimension tables.
Each data warehouse or data mart includes one or more fact tables. The fact table
captures the data that measures the organization's business operations. A fact table
might contain business sales events such as cash register transactions or the
contributions and expenditures of a nonprofit organization. Fact tables usually
contain large numbers of rows, sometimes in the hundreds of millions of records
when they contain one or more years of history for a large organization. A key
characteristic of a fact table is that it contains numerical data (facts) that can be
summarized to provide information about the history of the operation of the
organization. Each fact table also includes a multipart index that contains as foreign
keys the primary keys of related dimension tables, which contain the attributes of
the fact records. Fact tables should not contain descriptive information or any data
other than the numerical measurement fields and the index fields that relate the
facts to corresponding entries in the dimension tables. An example of a fact table is
a Sales_Fact table that might contain information like sale_amount, unit_price,
Dimension tables contain attributes that describe fact records in the fact table. Some
of these attributes provide descriptive information; others are used to specify how
fact table data should be summarized to provide useful information to the analyst.
Dimension tables contain hierarchies of attributes that aid in summarization. For
example, a dimension containing product information would often contain a hierarchy
that separates products into categories such as food, drink, and non-consumable
items, with each of these categories further subdivided a number of times until the
individual product is reached at the lowest level.
Dimensional modeling produces dimension tables in which each table contains fact
attributes that are independent of those in other dimensions. For example, a
customer dimension table contains data about customers, a product dimension table
contains information about products, and a store dimension table contains
information about stores. Queries use attributes in dimensions to specify a view into
the fact information. For example, a query might use the product, store, and time
dimensions to ask the question "What was the cost of non-consumable goods sold in
the northeast region in 1999?" Subsequent queries might drill down along one or
more dimensions to examine more detailed data, such as "What was the cost of
kitchen products in New York City in the third quarter of 1999?" In these examples,
the dimension tables are used to specify how a measure (sale_amount) in the fact
table is to be summarized.
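To make the idea concrete, here is a minimal sketch of such queries using Python's built-in sqlite3 module. The column names (product_key, category, region, city, year, quarter) and the sample rows are assumptions invented for illustration; only the sale_amount measure and the general table names follow the text.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE Product_Dimension (product_key INTEGER PRIMARY KEY, category TEXT, name TEXT);
CREATE TABLE Store_Dimension (store_key INTEGER PRIMARY KEY, region TEXT, city TEXT);
CREATE TABLE Time_Dimension (time_key INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
CREATE TABLE Sales_Fact (product_key INTEGER, store_key INTEGER, time_key INTEGER,
                         sale_amount REAL);
INSERT INTO Product_Dimension VALUES (1, 'non-consumable', 'kitchen scale'), (2, 'food', 'bread');
INSERT INTO Store_Dimension VALUES (1, 'northeast', 'New York City'), (2, 'northeast', 'Boston');
INSERT INTO Time_Dimension VALUES (1, 1999, 3), (2, 1999, 4);
INSERT INTO Sales_Fact VALUES (1, 1, 1, 40.0), (1, 2, 2, 60.0), (2, 1, 1, 5.0);
""")

# The fact table is joined to each dimension through its key column.
JOINS = """
FROM Sales_Fact f
JOIN Product_Dimension p ON p.product_key = f.product_key
JOIN Store_Dimension s ON s.store_key = f.store_key
JOIN Time_Dimension t ON t.time_key = f.time_key
"""

# Non-consumable goods sold in the northeast region in 1999.
region_total = db.execute(
    "SELECT SUM(f.sale_amount)" + JOINS +
    "WHERE p.category = 'non-consumable' AND s.region = 'northeast' AND t.year = 1999"
).fetchone()[0]

# Drill down: the same measure, restricted to one city and one quarter.
city_total = db.execute(
    "SELECT SUM(f.sale_amount)" + JOINS +
    "WHERE p.category = 'non-consumable' AND s.city = 'New York City' "
    "AND t.year = 1999 AND t.quarter = 3"
).fetchone()[0]
```

Both queries summarize the same measure; only the dimension attributes in the WHERE clause change, which is what drilling down amounts to in this sketch.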
Consider an example of a Sales_Fact table where the various attributes that describe this
fact are Store, Product, Time and, say, Sales Person. In this case we will have four
dimension tables, viz. Store_Dimension, Product_Dimension, Time_Dimension and
Sales_Person_Dimension.
You may notice that all of these dimensions contain a Key field. This is called the
Surrogate Key. This key is a substitute for a natural key in dimensions (e.g., in
Sales_Person_Dimension, we have the natural key as ID). In a data warehouse a
surrogate key is a generalization of the natural production key and is one of the basic
elements of a data warehouse.
As the fact table is described by the four dimension tables described above, it will
contain the Surrogate Keys of all these dimensions. This is how the Sales_Fact table
will look:
Now if you look carefully at the structure of the above tables and how they are linked, the
schema will look like this:
You can easily tell that this looks like a STAR. Hence it is known as a Star Schema.
Advantages of having Star Schema
• Star Schema is very easy to understand, even for non-technical business
users
• Star Schema provides better performance and shorter query times
• Star Schema is easily extensible and will handle future changes easily
Slowly Changing Dimensions
Handling changes to dimensional data across time is one of the most important aspects of
designing a data warehouse. In dimensional modeling, a dimension rarely
remains static over time. For example, a customer address may
change; a company may phase out old products and introduce new products. What if
a customer name changes, a sales person changes his region of sale, or a company
assigns a new sales territory? How do you record the history or preserve the old version of
history? Here comes the concept of Slowly Changing Dimensions. The term Slowly
Changing Dimension is about variation in dimensional attributes over time. The word
slowly, in this context, might seem incorrect. A sales person may change his territory
rapidly. But in general, when compared to measures in the fact table, the changes in
dimensions occur slowly.
Types of Slowly Changing Dimensions
In reference to Figure 3 above, let's say a sales person changes his region of sale. We
may handle this change in several ways. These methods fall into various categories
based on the company's need to preserve an accurate history of dimensional changes.
Ralph Kimball categorized the dimensional changes into three categories:
• Type One: Changes that overwrite history
• Type Two: Preserve history
• Type Three: Preserve a version of history
Type One (Overwrite History)
A Type One change overwrites an existing dimensional attribute with new information. In
the Sales Person Region change example, the old region name will be overwritten by the
new region. Say a sales person, Rob, has ASIA as his territory.
Sales_Person_Key  ID      Name     Region     ...
100               203234  Rob Doe  ASIA       ...
Now, if he starts looking after the NorthWest Region, by implementing a Type 1 dimension,
the dimension table will look like this:
Sales_Person_Key  ID      Name     Region     ...
100               203234  Rob Doe  NorthWest  ...
• This is the easiest way to handle the Slowly Changing Dimension problem,
since there is no need to keep track of the old information.
• All history is lost. By applying this methodology, it is not possible to trace
back in history. For example, in this case, the company would not be able to
know that Rob previously covered the ASIA region.
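A Type One change amounts to a plain in-place update. The following is a minimal sketch with Python's built-in sqlite3 module; the lower-case column names are assumptions chosen for the example, and the row values mirror the sample table.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE Sales_Person_Dimension "
    "(sales_person_key INTEGER PRIMARY KEY, id INTEGER, name TEXT, region TEXT)"
)
db.execute("INSERT INTO Sales_Person_Dimension VALUES (100, 203234, 'Rob Doe', 'ASIA')")

# Type One: overwrite the attribute in place; the old value is not kept anywhere.
db.execute("UPDATE Sales_Person_Dimension SET region = 'NorthWest' WHERE id = 203234")

rows = db.execute(
    "SELECT sales_person_key, region FROM Sales_Person_Dimension"
).fetchall()
```

The surrogate key stays the same and the table still holds one row, which is why no earlier region can be recovered afterwards.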
Type Two (Preserve History)
A Type Two change writes a record with the new attribute information and preserves
a record of the old dimensional data. Type Two changes let you preserve historical
data. Implementing Type Two changes within a data warehouse might require
significant analysis and development. Type Two changes accurately partition history
across time more effectively than other types. However, because Type Two changes
add records, they can significantly increase the database's size.
In our example, let's say we identify Region as a Type Two attribute. This can be
handled as follows:
Sales_Person_Key  ID      Name     Region     ...
100               203234  Rob Doe  ASIA       ...
153               203234  Rob Doe  NorthWest  ...
• This allows us to accurately keep all historical information.
• This will cause the size of the table to grow fast. In cases where the number
of rows for the table is very high to start with, storage and performance can
become a concern.
• This necessarily complicates the ETL process.
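A Type Two change can be sketched as "keep the old row, add a new row under a new surrogate key". The example below uses Python's built-in sqlite3 module. The is_current flag and the change_region helper are assumptions added for illustration (the text does not prescribe how the current row is marked), and the new key value is passed in by hand rather than generated.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE Sales_Person_Dimension ("
    "sales_person_key INTEGER PRIMARY KEY, id INTEGER, name TEXT, region TEXT, "
    "is_current INTEGER)"  # is_current is an assumed helper column
)
db.execute("INSERT INTO Sales_Person_Dimension VALUES (100, 203234, 'Rob Doe', 'ASIA', 1)")


def change_region(conn, natural_id, new_key, new_region):
    """Type Two: retire the current row and add a new row under a new surrogate key."""
    name = conn.execute(
        "SELECT name FROM Sales_Person_Dimension WHERE id = ? AND is_current = 1",
        (natural_id,),
    ).fetchone()[0]
    conn.execute(
        "UPDATE Sales_Person_Dimension SET is_current = 0 WHERE id = ? AND is_current = 1",
        (natural_id,),
    )
    conn.execute(
        "INSERT INTO Sales_Person_Dimension VALUES (?, ?, ?, ?, 1)",
        (new_key, natural_id, name, new_region),
    )


change_region(db, 203234, 153, "NorthWest")
history = db.execute(
    "SELECT sales_person_key, region, is_current FROM Sales_Person_Dimension "
    "ORDER BY sales_person_key"
).fetchall()
```

Fact rows loaded before the change keep pointing at the old surrogate key and later rows point at the new one, so sales remain attributed to the region that applied at the time.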
Type Three (Preserve a Version of History)
You usually implement Type Three changes only if you have a limited need to
preserve and accurately describe history, such as when someone gets married and
you need to retain the previous name. Instead of creating a new dimensional record
to hold the attribute change, a Type Three change places a value for the change in
the original dimensional record. You can create multiple fields to hold distinct values
for separate points in time. In the case of the region change example, you could create
an OLD_REGION and a NEW_REGION field and a REGION_CHANGE_EFF_DATE field to
record when the change occurs. This method preserves the change. But how would
you handle a second name change, or a third, and so on? The side effects of this
method are increased table size and, more important, increased complexity of the
queries that analyze historical values from these old fields. After more than a couple
of iterations, queries become impossibly complex, and ultimately you're constrained
by the maximum number of attributes allowed on a table.
This is how the table will look after a Type Three change:
Sales_Person_Key  ID      Name     Old Region  New Region  ...
100               203234  Rob Doe  ASIA        NorthWest   ...
• This does not increase the number of rows in the table, since the existing record is updated.
• This allows us to keep some part of history.
• Type 3 will not be able to keep all history where an attribute is changed more
than once. For example, if Rob is later reassigned to a third region, the ASIA
information will be lost.
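The Type Three approach, including its limitation, can be sketched with Python's built-in sqlite3 module. The lower-case column names follow the OLD_REGION, NEW_REGION and REGION_CHANGE_EFF_DATE fields mentioned in the text; the change_region and regions helpers, the effective dates, and the third region ("Europe") are hypothetical values chosen only for this example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE Sales_Person_Dimension ("
    "sales_person_key INTEGER PRIMARY KEY, id INTEGER, name TEXT, "
    "old_region TEXT, new_region TEXT, region_change_eff_date TEXT)"
)
db.execute(
    "INSERT INTO Sales_Person_Dimension VALUES (100, 203234, 'Rob Doe', NULL, 'ASIA', NULL)"
)


def change_region(conn, natural_id, region, eff_date):
    """Type Three: shift the current value into old_region within the same row."""
    conn.execute(
        "UPDATE Sales_Person_Dimension "
        "SET old_region = new_region, new_region = ?, region_change_eff_date = ? "
        "WHERE id = ?",
        (region, eff_date, natural_id),
    )


def regions(conn):
    """Return (old_region, new_region) for the example sales person."""
    return conn.execute(
        "SELECT old_region, new_region FROM Sales_Person_Dimension WHERE id = 203234"
    ).fetchone()


change_region(db, 203234, "NorthWest", "2003-01-01")  # hypothetical date
after_first = regions(db)   # ASIA is kept in old_region
change_region(db, 203234, "Europe", "2003-06-01")     # hypothetical second change
after_second = regions(db)  # ASIA has been overwritten
```

After the second call only the two most recent regions remain in the row, which is the loss of history described in the last point above.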
Because most business requirements include tracking changes over time, data
warehouse architects commonly implement Type Two changes. A data warehouse
might use Type Two changes for all attributes in all tables. As an alternative, you can
implement a mix of Type One and Type Two changes at an attribute level by
implementing Type 2 changes only for attributes whose historical values are
important when you're slicing and dicing. For example, users might not need to know
an individual's previous name if a name change occurs, so a Type One change would
suffice. Users might want the system to show only the person's current name.
However, if the company reassigns sales territories, users might need to track who
sold what, at what time, and in what territory, necessitating a Type Two change.
Although most data warehouses include Type Two changes, you need to seriously
examine the business need to record historical data. Implementing Type Two
changes might be necessary, but those changes will increase the database size,
degrade performance, and lengthen the development time. You need to carefully
evaluate using a Type Two implementation, a Type One implementation, or a hybrid