Data Warehousing and Business Intelligence
In this section of techtiks, you will find detailed information and must-read articles related to Data Warehousing and
Business Intelligence. The target audience ranges from beginners to experts. These articles are written by highly
qualified Data Warehouse Engineers.
Whenever you see or hear the words "Data Warehouse", you might picture a large building with bits of
information stored on shelves, waiting for someone to retrieve them. Consider a traditional
warehouse: it contains goods stored in such a way that they are easy to identify and can
be retrieved quickly. A Data Warehouse functions in a similar way. How, then, is a Data Warehouse
different from a traditional relational database? The two are similar, but there are certain
defining differences, which the following articles describe.
Datawarehouse Defined
Elements of Data Warehouse
History of Data Warehousing
Dimensional Modeling
Datawarehouse Defined
Consider an example in which a business analyst uses systems containing operational data (the data that runs the
daily transactions of your business). Analysts can use information about which products were sold in which regions
at what time of the year to look for anomalies or to project future sales. However, several problems arise if the
analyst accesses operational data directly:
The analyst might not have the expertise to query the operational database. For example, querying IMS databases
requires an application program that uses a specialized type of data manipulation language. In general,
those programmers who have the expertise to query the operational database have a full-time job in
maintaining the database and its applications.
Performance is critical for many operational databases, such as databases for a bank. The system cannot
handle users making ad-hoc queries.
The operational data generally is not in the best format to be used for reporting queries.
Data warehousing solves these problems. In data warehousing, you create stores of informational data: data that is
extracted from the operational data and then transformed for reporting and decision making. For example, a data
warehousing tool might copy all the sales data from the operational database, perform calculations to summarize the
data, and write it to a new database. End-users can query the new database (the warehouse) without impacting the
operational databases.
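The copy, summarize, and write steps described above can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module, not a real warehousing tool; the table names (sales, sales_summary), the columns, and the sample rows are all assumptions made for the example.

```python
import sqlite3

# Hypothetical operational database: one row per sales transaction.
operational = sqlite3.connect(":memory:")
operational.execute(
    "CREATE TABLE sales (product TEXT, region TEXT, sale_date TEXT, amount REAL)"
)
operational.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Kettle", "Northeast", "1999-07-04", 25.0),
        ("Kettle", "Northeast", "1999-08-15", 30.0),
        ("Toaster", "Northwest", "1999-07-20", 40.0),
    ],
)

# Extract and transform: summarize the sales per product and region.
summary = operational.execute(
    "SELECT product, region, SUM(amount), COUNT(*) "
    "FROM sales GROUP BY product, region ORDER BY product, region"
).fetchall()

# Load: write the summarized data to a separate warehouse database.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales_summary "
    "(product TEXT, region TEXT, total_amount REAL, sale_count INTEGER)"
)
warehouse.executemany("INSERT INTO sales_summary VALUES (?, ?, ?, ?)", summary)
warehouse.commit()

# End users query the warehouse; the operational database is not touched.
report = warehouse.execute(
    "SELECT product, region, total_amount FROM sales_summary ORDER BY product"
).fetchall()
print(report)
```

In practice the extraction would run on a schedule against the real operational system, so that reporting queries never compete with daily transactions.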
To summarize:
The purpose of a data warehouse is to store data consistently across the organization and to make the
organizational information accessible.
It is an adaptive and resilient source of information. When new data is added to the data warehouse, the
existing data and technologies are not disrupted. The design of the separate data marts that make up the data
warehouse must be distributed and incremental. Anything else is a compromise.
The data warehouse not only controls access to the data, but also gives its owners great visibility into the
uses and abuses of the data, even after it has left the data warehouse.
A data warehouse is the foundation for decision-making.
Elements of Data Warehouse
Source Systems
Typically in any organization the data is stored in various databases, usually divided up by the systems. There may
be data for marketing, sales, payroll, engineering, etc. These systems might be legacy/mainframe systems or
relational database systems.
Staging Area
The data coming from various source systems is first kept in a staging area. The staging area is used to clean,
transform, combine, de-duplicate, household, archive, and prepare source data for use in the data warehouse. The data
coming from the source systems is kept as it is in this area. The staging area need not be based on relational technology. Sometimes
the managers of the data are comfortable with a normalized set of data. In these cases, a normalized structure for the data
staging storage is certainly acceptable. Also, the staging area doesn’t provide querying/presentation services.
Presentation Server
Once the data is in the staging area, it is cleansed, transformed, and then sent to the data warehouse. You may or may not
have an Operational Data Store (ODS) before transferring data to the data warehouse.
OLAP
The data in the Data Warehouse has to be easily manipulated in order to answer the business questions from
management and other users. This is accomplished by connecting the data to fast and easy-to-use tools known as
Online Analytical Processing (OLAP) tools. OLAP tools can be thought of as super high-speed forklifts that have
knowledge of the warehouse and the operators built into them in order to allow ordinary people off the street to
jump in and quickly find products by asking English-like questions. Within the OLAP server, data is reorganized to
meet the reporting and analysis requirements of the business, including:
Exception reporting
Ad-hoc analysis
Actual vs. budget reporting
Data mining (looking for trends or anomalies in the data)
In order to process business queries at high speed, answers to common questions are preprocessed in some OLAP
servers, resulting in exceptional query responses at the cost of having an OLAP database that may be several times
bigger than the data warehouse itself.
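The trade-off between preprocessing and storage can be illustrated with a small sketch. It precomputes a total for every combination of dimension values, which is one simple way to preprocess answers and is not claimed to be how any particular OLAP server works. The sample rows, the build_cube function, and the "ALL" marker are hypothetical.

```python
from itertools import combinations
from collections import defaultdict

# Hypothetical warehouse rows: (product, region, year, sale_amount).
rows = [
    ("Kettle", "Northeast", 1999, 25.0),
    ("Kettle", "Northwest", 1999, 30.0),
    ("Toaster", "Northeast", 1999, 40.0),
]
DIMENSIONS = ("product", "region", "year")


def build_cube(rows):
    """Precompute sale_amount totals for every subset of the dimensions."""
    cube = defaultdict(float)
    positions = range(len(DIMENSIONS))
    for row in rows:
        values, amount = row[:-1], row[-1]
        for size in range(len(DIMENSIONS) + 1):
            for kept in combinations(positions, size):
                # Dimensions that are not kept are rolled up and marked "ALL".
                key = tuple(values[i] if i in kept else "ALL" for i in positions)
                cube[key] += amount
    return dict(cube)


cube = build_cube(rows)

# A common question becomes a dictionary lookup instead of a scan.
kettle_total = cube[("Kettle", "ALL", "ALL")]
northeast_1999 = cube[("ALL", "Northeast", 1999)]
print(kettle_total, northeast_1999)

# The precomputed store holds more entries than the detail rows it came from.
print(len(rows), len(cube))
```

Even with three detail rows, the precomputed structure holds several times as many entries, which mirrors the size cost described above.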
Data Mart
A data mart is a logical subset of the complete data warehouse. It is often viewed as the restriction of the data warehouse to a
single business process or to a group of related business processes targeted toward a particular business group. For
example, an organization may have a data mart for Sales or Inventory.
History of Data Warehousing
Ralph Kimball Vs. Bill Inmon's Paradigm of Data Warehouse
In the data warehousing field, we often hear discussions about whether a person's or organization's philosophy falls into
Bill Inmon's camp or Ralph Kimball's camp. The difference between the two philosophies is as follows:
Bill Inmon's paradigm
The data warehouse is one part of the overall business intelligence system. An enterprise has one data warehouse, and
data marts source their information from the data warehouse. In the data warehouse, information is stored in third
normal form.
Ralph Kimball's paradigm
The data warehouse is the conglomerate of all data marts within the enterprise. Information is always stored in the
dimensional model.
There is no right or wrong between these two ideas, as they represent different data warehousing philosophies. In
reality, the data warehouse in most enterprises is closer to Ralph Kimball's idea. This is because most data
warehouses started out as a departmental effort, and hence they originated as a data mart. Only when more data
marts are built later do they evolve into a data warehouse.
Dimensional Modeling
Quick Reference Guide to Dimensional Modeling
Dimensional modeling is the design concept used by many data warehouse designers to build their data warehouses.
The dimensional model is the underlying data model used by many of the commercial OLAP products available
today. Designing a data warehouse is very different from designing an online transaction processing (OLTP)
system. In contrast to an OLTP system, whose purpose is to capture high rates of data changes and additions,
the purpose of a data warehouse is to organize large amounts of stable data for ease of analysis and retrieval.
Because of these differing purposes, many considerations in data warehouse design differ from those in OLTP
database design. In a dimensional model, all data is contained in two types of tables: the Fact Table and the Dimension
Table.
Fact Table
Each data warehouse or data mart includes one or more fact tables. The fact table captures the data that measures the
organization's business operations. A fact table might contain business sales events such as cash register transactions
or the contributions and expenditures of a nonprofit organization. Fact tables usually contain large numbers of rows,
sometimes in the hundreds of millions of records when they contain one or more years of history for a large
organization. A key characteristic of a fact table is that it contains numerical data (facts) that can be summarized to
provide information about the history of the operation of the organization. Each fact table also includes a multipart
index that contains as foreign keys the primary keys of related dimension tables, which contain the attributes of the
fact records. Fact tables should not contain descriptive information or any data other than the numerical
measurement fields and the index fields that relate the facts to corresponding entries in the dimension tables. An
example of a fact table is a Sales_Fact table, which might contain measures such as sale_amount, unit_price, and
discount.
Dimension Table
Dimension tables contain attributes that describe fact records in the fact table. Some of these attributes provide
descriptive information; others are used to specify how fact table data should be summarized to provide useful
information to the analyst. Dimension tables contain hierarchies of attributes that aid in summarization. For
example, a dimension containing product information would often contain a hierarchy that separates products into
categories such as food, drink, and non-consumable items, with each of these categories further subdivided a number
of times until the individual product is reached at the lowest level.
Dimensional modeling produces dimension tables in which each table contains fact attributes that are independent of
those in other dimensions. For example, a customer dimension table contains data about customers, a product
dimension table contains information about products, and a store dimension table contains information about stores.
Queries use attributes in dimensions to specify a view into the fact information. For example, a query might use the
product, store, and time dimensions to ask the question "What was the cost of non-consumable goods sold in the
northeast region in 1999?" Subsequent queries might drill down along one or more dimensions to examine more
detailed data, such as "What was the cost of kitchen products in New York City in the third quarter of 1999?" In
these examples, the dimension tables are used to specify how a measure (sale_amount) in the fact table is to be
summarized.
Consider an example of a Sales_Fact table in which the attributes that describe each fact are Store, Product, Time,
and, say, Sales Person. In this case we will have four dimension tables: Store_Dimension, Product_Dimension,
Time_Dimension, and Sales_Person_Dimension.
Figure 1
You may notice that all of these dimensions contain a Key field. This is called a Surrogate Key. This key is a substitute
for the natural key in a dimension (e.g., in Sales_Person_Dimension, the natural key is ID). In a data warehouse, a
surrogate key is a generalization of the natural production key and is one of the basic elements of data warehouse design.
Because the fact table is described by the four dimension tables above, it will contain the Surrogate Keys of all
these dimensions. This is what the Sales_Fact table will look like:
Figure 2
Now, if you look carefully at the structure of the above tables and how they are linked, the schema will look like this:
Figure 3
You can easily tell that this looks like a star. Hence it is known as a Star Schema.
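Because the figures are not reproduced here, the following is a minimal sketch of such a star schema and of the roll-up and drill-down queries discussed earlier, using Python's built-in sqlite3 module. Only the table names, the Sales_Person_Dimension columns, and sale_amount come from the text; every other column name, the sample rows, and the total_sales helper are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Store_Dimension (Store_Key INTEGER PRIMARY KEY, City TEXT, Region TEXT);
CREATE TABLE Product_Dimension (Product_Key INTEGER PRIMARY KEY, Name TEXT, Category TEXT);
CREATE TABLE Time_Dimension (Time_Key INTEGER PRIMARY KEY, Year INTEGER, Quarter INTEGER);
CREATE TABLE Sales_Person_Dimension (
    Sales_Person_Key INTEGER PRIMARY KEY, ID INTEGER, Name TEXT, Region TEXT);

-- The fact table holds only surrogate keys and numeric measures.
CREATE TABLE Sales_Fact (
    Store_Key INTEGER REFERENCES Store_Dimension (Store_Key),
    Product_Key INTEGER REFERENCES Product_Dimension (Product_Key),
    Time_Key INTEGER REFERENCES Time_Dimension (Time_Key),
    Sales_Person_Key INTEGER REFERENCES Sales_Person_Dimension (Sales_Person_Key),
    sale_amount REAL);

INSERT INTO Store_Dimension VALUES
    (1, 'New York City', 'Northeast'), (2, 'Boston', 'Northeast'), (3, 'Seattle', 'Northwest');
INSERT INTO Product_Dimension VALUES (1, 'Kettle', 'Kitchen'), (2, 'Lamp', 'Lighting');
INSERT INTO Time_Dimension VALUES (1, 1999, 3), (2, 1999, 4);
INSERT INTO Sales_Person_Dimension VALUES (100, 203234, 'Rob Doe', 'ASIA');
INSERT INTO Sales_Fact VALUES
    (1, 1, 1, 100, 25.0), (2, 1, 2, 100, 30.0), (1, 2, 1, 100, 40.0), (3, 1, 1, 100, 15.0);
""")


def total_sales(where, params):
    """Sum sale_amount for the fact rows selected through dimension attributes."""
    sql = (
        "SELECT SUM(f.sale_amount) FROM Sales_Fact f "
        "JOIN Store_Dimension s ON s.Store_Key = f.Store_Key "
        "JOIN Product_Dimension p ON p.Product_Key = f.Product_Key "
        "JOIN Time_Dimension t ON t.Time_Key = f.Time_Key "
        "WHERE " + where
    )
    return conn.execute(sql, params).fetchone()[0]


# Roll-up: all sales in the Northeast region in 1999.
northeast_1999 = total_sales("s.Region = ? AND t.Year = ?", ("Northeast", 1999))

# Drill-down: kitchen products in New York City in the third quarter of 1999.
nyc_kitchen_q3 = total_sales(
    "s.City = ? AND p.Category = ? AND t.Year = ? AND t.Quarter = ?",
    ("New York City", "Kitchen", 1999, 3),
)
print(northeast_1999, nyc_kitchen_q3)
```

Every query has the same shape: join the fact table to the dimensions it needs, filter on dimension attributes, and aggregate a measure.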
Advantages of having Star Schema
Star Schema is very easy to understand, even for non-technical business managers
Star Schema provides better performance and shorter query times
Star schema is easily extensible and will handle future changes easily
Slowly Changing Dimensions
Handling changes to dimensional data over time is one of the most important aspects of designing a data warehouse. In
dimensional modeling, a dimension rarely remains static over time. For example, a
customer address may change, or a company may phase out old products and introduce new ones. What if a
customer name changes, a sales person changes his region of sale, or a company assigns new sales territories? How do you
record the history or preserve the old version of it? This is where the concept of Slowly Changing Dimensions comes in.
The term Slowly Changing Dimension refers to variation in dimensional attributes over time. The word "slowly", in
this context, might seem incorrect: a sales person may change territory rapidly. But in general, when compared
to the measures in a fact table, changes in dimensions occur slowly.
Types of Slowly Changing Dimensions
With reference to Figure 3 above, let’s say a sales person changes his region of sale. We may handle this change in
several ways. These methods fall into categories based on the company’s need to preserve an accurate history of
dimensional changes. Ralph Kimball grouped dimensional changes into three categories:
Type One: Changes that overwrite history
Type Two: Preserve history
Type Three: Preserve a version of history
Type One (Overwrite History)
A Type One change overwrites an existing dimensional attribute with new information. In the sales person region change
example, the old region name is overwritten by the new region. Say a sales person, Rob, has ASIA as his territory.
Sales_Person_Dimension
Sales_Person_Key   ID       Name      Region      ...
100                203234   Rob Doe   ASIA        ...
Now, if he starts looking after the NorthWest region and the change is handled as a Type One change, the dimension table will look
like this:
Sales_Person_Dimension
Sales_Person_Key   ID       Name      Region      ...
100                203234   Rob Doe   NorthWest   ...
Advantages:
This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep
track of the old information.
Disadvantages:
All history is lost. By applying this methodology, it is not possible to trace back in history. For example, in
this case, the company would not be able to know that Rob Doe previously covered the ASIA region.
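A Type One change amounts to a single in-place UPDATE. The sketch below uses Python's built-in sqlite3 module and the Sales_Person_Dimension columns from the example; the apply_type_one function is a hypothetical helper, not part of any tool.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Sales_Person_Dimension ("
    "Sales_Person_Key INTEGER PRIMARY KEY, ID INTEGER, Name TEXT, Region TEXT)"
)
conn.execute(
    "INSERT INTO Sales_Person_Dimension VALUES (100, 203234, 'Rob Doe', 'ASIA')"
)


def apply_type_one(conn, natural_id, new_region):
    """Overwrite Region in place; the previous value is not kept anywhere."""
    conn.execute(
        "UPDATE Sales_Person_Dimension SET Region = ? WHERE ID = ?",
        (new_region, natural_id),
    )


apply_type_one(conn, 203234, "NorthWest")
type_one_rows = conn.execute("SELECT * FROM Sales_Person_Dimension").fetchall()
print(type_one_rows)
```

The surrogate key stays at 100, so existing fact rows now appear under the new region, including sales that were made while the region was ASIA.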
Type Two (Preserve History)
A Type Two change writes a record with the new attribute information and preserves a record of the old dimensional
data. Type Two changes let you preserve historical data. Implementing Type Two changes within a data warehouse
might require significant analysis and development. Type Two changes accurately partition history across time more
effectively than other types. However, because Type Two changes add records, they can significantly increase the
database's size.
In our example, let's say we identify Region as a Type Two attribute. The change is then handled by adding a new row:
Sales_Person_Dimension
Sales_Person_Key   ID       Name      Region      ...
100                203234   Rob Doe   ASIA        ...
153                203234   Rob Doe   NorthWest   ...
Advantages:
This allows us to accurately keep all historical information.
Disadvantages:
This will cause the size of the table to grow fast. In cases where the number of rows for the table is very
high to start with, storage and performance can become a concern.
This necessarily complicates the ETL process.
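A Type Two change can be sketched as an INSERT that reuses the natural key (ID) and receives a new surrogate key. The example uses Python's built-in sqlite3 module; apply_type_two is a hypothetical helper, and the surrogate key is assigned by the database here, so it will not match the value 153 shown in the table above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Sales_Person_Dimension ("
    "Sales_Person_Key INTEGER PRIMARY KEY, ID INTEGER, Name TEXT, Region TEXT)"
)
conn.execute(
    "INSERT INTO Sales_Person_Dimension VALUES (100, 203234, 'Rob Doe', 'ASIA')"
)


def apply_type_two(conn, natural_id, new_region):
    """Add a row with a new surrogate key; the old row is left untouched."""
    # Copy the unchanged attributes from the most recent row for this person.
    name = conn.execute(
        "SELECT Name FROM Sales_Person_Dimension "
        "WHERE ID = ? ORDER BY Sales_Person_Key DESC LIMIT 1",
        (natural_id,),
    ).fetchone()[0]
    cursor = conn.execute(
        "INSERT INTO Sales_Person_Dimension (ID, Name, Region) VALUES (?, ?, ?)",
        (natural_id, name, new_region),
    )
    return cursor.lastrowid  # the newly assigned surrogate key


new_key = apply_type_two(conn, 203234, "NorthWest")
history = conn.execute(
    "SELECT Sales_Person_Key, Region FROM Sales_Person_Dimension "
    "WHERE ID = ? ORDER BY Sales_Person_Key",
    (203234,),
).fetchall()
print(new_key, history)
```

Fact rows loaded before the change keep pointing at key 100 (ASIA), while later fact rows use the new key, which is how history is partitioned. Real implementations often also add effective-date or current-row columns, which this sketch leaves out.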
Type Three (Preserve a Version of History)
You usually implement Type Three changes only if you have a limited need to preserve and accurately describe
history, such as when someone gets married and you need to retain the previous name. Instead of creating a new
dimensional record to hold the attribute change, a Type Three change places a value for the change in the original
dimensional record. You can create multiple fields to hold distinct values for separate points in time. In the
region change example, you could create OLD_REGION and NEW_REGION fields and a
REGION_CHANGE_EFF_DATE field to record when the change occurs. This method preserves the change. But
how would you handle a second change, or a third, and so on? The side effects of this method are increased
table size and, more important, increased complexity of the queries that analyze historical values from these old
fields. After more than a couple of iterations, queries become impossibly complex, and ultimately you're constrained
by the maximum number of attributes allowed on a table.
This is how the table will look after a Type Three change:
Sales_Person_Dimension
Sales_Person_Key   ID       Name      Old Region   New Region   ...
100                203234   Rob Doe   ASIA         NorthWest    ...
Advantages:
This does not increase the size of the table, since new information is updated.
This allows us to keep some part of history.
Disadvantages:
Type Three will not be able to keep all history where an attribute is changed more than once. For example, if
Rob Doe is later reassigned to a third region, the ASIA information will be lost.
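The Type Three behavior, including the loss of history on a second change, can be sketched with the OLD_REGION, NEW_REGION, and REGION_CHANGE_EFF_DATE fields named above. The example uses Python's built-in sqlite3 module; the apply_type_three and regions helpers, the effective dates, and the third region ("Europe") are hypothetical values chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Sales_Person_Dimension ("
    "Sales_Person_Key INTEGER PRIMARY KEY, ID INTEGER, Name TEXT, "
    "OLD_REGION TEXT, NEW_REGION TEXT, REGION_CHANGE_EFF_DATE TEXT)"
)
conn.execute(
    "INSERT INTO Sales_Person_Dimension VALUES "
    "(100, 203234, 'Rob Doe', NULL, 'ASIA', NULL)"
)


def apply_type_three(conn, natural_id, new_region, effective_date):
    """Shift the current region into OLD_REGION and store the new one."""
    # The right-hand sides are evaluated against the row as it was before
    # the update, so OLD_REGION receives the previous NEW_REGION value.
    conn.execute(
        "UPDATE Sales_Person_Dimension "
        "SET OLD_REGION = NEW_REGION, NEW_REGION = ?, REGION_CHANGE_EFF_DATE = ? "
        "WHERE ID = ?",
        (new_region, effective_date, natural_id),
    )


def regions(conn, natural_id):
    """Return (OLD_REGION, NEW_REGION) for one sales person."""
    return conn.execute(
        "SELECT OLD_REGION, NEW_REGION FROM Sales_Person_Dimension WHERE ID = ?",
        (natural_id,),
    ).fetchone()


apply_type_three(conn, 203234, "NorthWest", "2003-01-15")
after_first = regions(conn, 203234)

apply_type_three(conn, 203234, "Europe", "2003-12-15")
after_second = regions(conn, 203234)
print(after_first, after_second)
```

After the second change the row no longer mentions ASIA at all, which is the limitation described above. Keeping more history would require adding further columns for each earlier value.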
Because most business requirements include tracking changes over time, data warehouse architects commonly
implement Type Two changes. A data warehouse might use Type Two changes for all attributes in all tables. As an
alternative, you can implement a mix of Type One and Type Two changes at an attribute level by implementing
Type Two changes for only those attributes whose historical values are important when you're slicing and dicing. For
example, users might not need to know an individual's previous name if a name change occurs, so a Type One
change would suffice. Users might want the system to show only the person's current name. However, if the
company reassigns sales territories, users might need to track who sold what, at what time, and in what territory,
necessitating a Type Two change.
Although most data warehouses include Type Two changes, you need to seriously examine the business need to
record historical data. Implementing Type Two changes might be necessary, but those changes will increase the
database size, degrade performance, and lengthen the development time. You need to carefully evaluate using a
Type Two implementation, a Type One implementation, or a hybrid implementation.