List of Figures

1.1 A simple Data Warehouse
2.1 Elements of a Data Warehouse
2.2 A single DDS architecture
3.1 The Four-Step Dimensional Design Process
5.1 The Four-Step Dimensional Design Process
5.2 Product Sales Data mart
5.3 The Product dimension Hierarchy
5.4 The Customer dimension Hierarchy
5.5 The Date dimension Hierarchy
5.6 The Office dimension Hierarchy
5.7 The Date dimension
5.8 The Office dimension
5.9 The Product dimension
5.10 The Customer dimension
5.11 The Product Sales Data mart
Creating the Profit report
Building the Profit report
Designing the report matrix
Sales by country report
Model Sales Report
Model Sales Report
Type 2 response to SCD
Type 3 response to SCD
Faculty of Engineering and Science Aalborg University
Department of Computer Science
TITLE: Building a Data Warehouse
PROJECT PERIOD: DE, Sept 1st 2008 to Dec 19th 2008
ABSTRACT: This report documents our experiences while trying to learn the fundamental aspects of data warehousing. It tries to present our journey into data warehousing and the obstacles encountered along the way.
PROJECT GROUP: DE-1
GROUP MEMBERS: Dovydas Sabunas, Femi Adisa
SUPERVISOR: Liu Xiufeng
NUMBER OF COPIES: 4
REPORT PAGES: ??
TOTAL PAGES: ??
Chapter 1 Introduction

What is a Data Warehouse? Before we get down to work and try to build a data warehouse, we feel it is very important to first define a data warehouse and related terminology, and to explain why organizations decide to implement one. Further down we will discuss what should be the driving force behind the need to build a data warehouse, and what the focus should be during implementation. While various definitions abound for what is and what constitutes a data warehouse, the definition which we believe best describes a data warehouse is given in [1]: A data warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management's decision making process. We take a moment to go through the definition. A data warehouse is subject oriented; this means that it is specifically designed to address a particular business domain. A data warehouse is integrated; it is a repository of data from multiple, possibly heterogeneous data sources, presented with consistent and coherent semantics. Data in a data warehouse comes from one or more source systems. These are usually OLTP, or online transaction processing, systems that handle the day-to-day transactions of a business or organization. A data warehouse is time-variant; each unit of data in a data warehouse is relevant to some moment in time. A data warehouse is non-volatile; it contains historic snapshots of various operational system data, and is durable. Data in the data warehouse is usually neither updated nor deleted; rather, new rows are uploaded, usually in batches on a regular basis. A data warehouse supports management's decision making process; the main
reason for building a data warehouse is to be able to query it for business intelligence and other analytical activities. Users use various front-end tools such as spreadsheets, pivot tables, reporting tools, and SQL query tools to probe, retrieve and analyze (slice and dice) the data in a data warehouse to get a deeper understanding of their businesses. They can analyze sales by time, customer, and product. Users can also analyze the revenue and cost for a certain month, region, and product type. ETL: Data from source systems is moved into the data warehouse by a process known as ETL (Extract, Transform and Load). It is basically a system that connects to the source systems, reads the data, transforms the data, and loads it into a target system. It is the ETL system that integrates, transforms, and loads the data into a dimensional data store (DDS). A DDS is a database that stores the data warehouse data in a different format than OLTP. The data is moved from the source system into the DDS because data in the DDS is arranged in a dimensional format that is more suitable for analysis, and this helps to avoid querying the source system directly. Another reason is that a DDS is a one-stop shop for data from several source systems.
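The extract, transform, load sequence described above can be sketched in a few lines of Python. This is a minimal, illustrative sketch only; the row fields and the "dds_fact_sales" target are hypothetical and not taken from the actual Classic Cars schema or from SSIS.

```python
# A minimal, illustrative ETL pipeline: extract rows from a source,
# transform them into a dimensional shape, and load them into a
# DDS-like target. All names here are hypothetical.

def extract(source_rows):
    """Read raw transactional rows from the OLTP source."""
    return list(source_rows)

def transform(rows):
    """Clean and reshape rows into the dimensional format."""
    transformed = []
    for row in rows:
        transformed.append({
            "date_key": row["order_date"].replace("-", ""),  # 2008-09-01 -> 20080901
            "product_code": row["product_code"].strip().upper(),
            "quantity": int(row["quantity"]),
            "sales_amount": round(row["quantity"] * row["unit_price"], 2),
        })
    return transformed

def load(target, rows):
    """Append-only load: warehouse rows are added, not updated in place."""
    target.extend(rows)

# Usage: one ETL run over a tiny in-memory "source system".
source = [
    {"order_date": "2008-09-01", "product_code": " s18_1749 ",
     "quantity": 3, "unit_price": 136.0},
    {"order_date": "2008-09-02", "product_code": "s18_2248",
     "quantity": 1, "unit_price": 55.09},
]
dds_fact_sales = []
load(dds_fact_sales, transform(extract(source)))
print(dds_fact_sales[0]["sales_amount"])  # 408.0
```

In a real deployment the three steps would be SSIS data-flow components rather than Python functions, but the contract is the same: the target only ever receives cleaned, batch-appended rows.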
Figure 1.1: A Data Warehouse in its simplest form.
What is Data Warehousing? This is the process of designing, building, and maintaining a data warehouse system.
Why build a Data Warehouse? The most compelling reason why an organization should want a data warehouse is to help it make sense of the vast amount of transactional data that the business is generating, the volume of which is growing tremendously on a day-to-day basis. Typically, before the advent of data warehousing, data from OLTP systems was regularly archived onto magnetic disk and kept in storage over a period of time, in case something went wrong and the data needed to be restored, or, as in the case of the banking and insurance industries, because regulations required it, and also for performance reasons. It was not until much later that the potential these data hold for analysis of business activities over time, as well as for forecasting and analyzing trends, was realized. Even then, it was not quite feasible to get a consolidated or integrated overview of the data, due to the lack of available technology and also because most of the information often came from several disparate systems that the available reporting tools were not able to deal with. Technology has come a long way, and data warehousing has matured along with it. Any organization that implements an OLTP system in the day-to-day running of its business knows that the value of the information contained within these systems, when analyzed properly, can help leverage the business and support management decision making. It is important to mention at this very early juncture that the decision to build a data warehouse should, to a large extent, be a purely business decision and not one of technology. Early data warehouse projects failed because project managers focused more on delivering a technology, and at the end of the day they succeeded. But what they delivered was beautiful nonsense: nice to look at and state of the art, but of little benefit to business users. The business users and their needs were not properly aligned and incorporated in the data warehousing effort; instead the focus was on delivering the technology. These projects failed not because the data warehouse was not delivered. On the contrary, they delivered a product that did not meet or satisfy the needs of the business users and, as a result, it was abandoned.
It is of utmost importance to get business users involved in every stage of the data warehouse development cycle, and to put in place a mechanism for constant interaction and feedback sessions, from the moment a need is identified until the final delivery of a fully functional data warehouse.
The Classic Car Case study During the course of this project, we will be building a data warehouse for a fictitious company called Classic Cars Inc. We try to cover all the core aspects of data warehousing: architecture, methodology, requirements, data modeling, ETL, metadata, and reports. Building a complete data warehouse given our time frame and human resources is not feasible. It is very important that we define a scope for our project, and this we do by analyzing the source system to know what kind of data resides in it and
what we can derive out of it. The Classic Cars source database contains sales order transaction data, which makes it ideal for constructing a sales data mart. Classic Cars Inc. is a company that is in the business of selling scale models of classic/vintage cars, aeroplanes, ships, trucks, motorbikes, trains and buses. Their customer base spans the globe. They sell only to retailers in different regions. There is usually more than one customer in a country. The company itself is headquartered in the USA and has branch offices in several countries. Each branch office is responsible for different geographical regions. Customers send in their orders and the company ships them via courier. Each customer has a responsible employee that deals with it. The company also gives credit facilities to the customers, and each customer has a credit limit depending on their level of standing with the company. The customers usually mail in their payment checks after they receive their orders. The company itself does not manufacture the products it sells, but there is no information in the database about its suppliers. We can only assume that its operations are not fully computerized or that it runs several disparate systems.
Summary In this chapter we gave a breakdown of what data warehousing is. We explained what should be the driving force behind every decision to build a data warehouse. We finished by giving an introduction to our case study. In the next chapter we will look at the various data warehousing architectures.
Chapter 2 The Data Warehouse Architecture
In this chapter we will give a brief overview of data warehouse elements. We will explain typical data warehouse architectures, and explain which one we have chosen and why. A data warehouse system comprises two architectures: the data flow architecture and the system architecture. The system architecture deals with the physical configuration of the servers, network, software, storage, and clients, and will not be discussed in this report. Choosing what architecture to implement when building a data warehouse is largely based on the business environment that the warehouse will be operating in; for example, how many source systems feed into the data warehouse, how the data flows within the data stores to the users, or what kind of data will be requested by end-user applications. Figure 2.1 illustrates the basic elements of a data warehouse.
Figure 2.1: Basic elements of a Data Warehouse
Data Flow Architecture. According to [3], there are four data flow architectures: single Dimensional Data Store (DDS), Normalized Data Store (NDS) + DDS, Operational Data Store (ODS) + DDS, and federated data warehouse. The first three use a dimensional model as their back-end data stores, but they differ in the middle-tier data store. The federated data warehouse architecture consists of several data warehouses integrated by a data retrieval layer. We have chosen to implement the single DDS architecture because our data warehouse will be fed from only one source system. Not only is the single DDS the simplest, quickest and most straightforward architecture to implement, but our DDS will also consist of only the sales data mart. The architecture is by every means extensible: it can quite easily be scaled up to be fed by more than one source system, and the DDS can also comprise several data marts.
Figure 2.2: A single DDS Data Warehouse architecture.
A data store is one or more databases or files containing data warehouse data, arranged in a particular format and involved in data warehouse processes [3]. The stage is an internal data store used for transforming and preparing the data obtained from the source systems before the data is loaded into the DDS. Extracting data into the stage minimizes the connection time with the source system and allows processing to be done in the staging area without undue strain on the OLTP systems. We have incorporated the staging area to make the design extensible as well, because if in the future the DDS is fed from multiple source systems, the staging area will be vital for the processing and transformation. The dimensional data store (DDS) is a user-facing data store, in the form of a database, made up of one or more data marts, each comprising dimension and fact tables arranged in dimensional format for the purpose of supporting analytical queries. We will describe the format of the DDS later. For applications that require the data to be in the form of a multidimensional database (MDB) rather than a relational database, an MDB is incorporated into our design. An MDB is a database where the data is stored in cells, and the position of each cell is defined by a number of variables called dimensions. Each cell represents a business event, and the value of the dimensions shows when and where this event happened. The MDB is populated from the DDS. In between the data stores sit the ETL processes that move data from one data store (the source) into another data store (the target). Embedded within the ETL is the logic to extract, transform and load the data. Information about each ETL process is stored in metadata.
This includes the source and target information, the transformation applied, the parent process, and each ETL process's run schedule. The technology we have chosen for this data warehousing project is Microsoft SQL Server Integration Services and Analysis Services (MSSIS, MSSAS). It provides a platform for building data integration and workflow applications. It is an integrated set of tools that provides database, multidimensional cube updating, ETL and reporting capabilities. It also includes the Business Intelligence Development Studio (BIDS), which allows us to edit SSIS packages.
Summary In this chapter we explained what constitutes a data warehouse architecture. We mentioned the four types of data flow architecture available, explained why we adopted the single DDS architecture, and went on to describe it in detail. We also introduced the technology we will be using. In the next chapter we will explain the methodology we will be following to build the data warehouse, and why we have adopted that particular approach.
Chapter 3 The Methodology
In this chapter we discuss the process which we will be adopting in building our data warehouse. We have chosen to go with Ralph Kimball's Four-Step Dimensional Design Process [2]. The approach was mentioned and recommended in all the different literature we read. It is followed by experts in the field, and it was quite easy to see why after consulting The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling ourselves. Dimensional modelling was well outlined and quite straightforward, and we felt it provided us with the right footing to literally hit the ground running when it came to building our own data warehouse.
Figure 3.1: Key Input to the four-step dimensional design process
The Four-Step Design Process. STEP 1: Selecting a business process to model. A process is a natural business activity performed in an organization that is typically supported by a source system.
It should not be confused with a business department. Orders, purchasing, shipments, invoicing and inventory all fall under business processes. For example, a single dimensional model is built to handle orders rather than building separate models for the sales and marketing departments. That way both departments can access orders data; data is published once and inconsistencies can be avoided. After a careful analysis of our source system database, we have selected sales as the business process to model, because this is the only model that can be supported by the data available to us in the source system. We will build a sales data mart for Classic Cars Inc., which should allow business users to analyze individual and overall product sales and individual stores' performance. The norm would have been to set up a series of meetings with the prospective users of the data warehouse as a means of gathering the requirements and selecting which model to implement, but because we do not have this opportunity, we are confined to selecting a model which we feel can best be implemented based on the data available from our source system database.

STEP 2: Declaring the grain of the business process. Here we identify what exactly constitutes a row in a fact table. The grain conveys the level of detail associated with the fact table measurements [3]. Kimball and Ross recommend that a dimensional model be developed for the most atomic information captured by a business process. Typical examples of suitable candidates:

• An individual line item on a customer's retail sales ticket as measured by a scanner device.

• A daily snapshot of the inventory levels of each product in a warehouse.

• A monthly snapshot for each bank account.
When data is in its atomic form, it provides maximum analytic flexibility because it can be rolled up and cut through (sliced and diced) in every possible manner. Detailed data in a dimensional model is most suitable for ad hoc user queries, a must if the data warehouse is to be accepted by the users.
STEP 3: Choosing the dimensions. By choosing the correct grain for the fact table, the dimensions automatically become evident. These are basically fields that describe the grain items. We try to create very robust dimensions, and this means enriching them with descriptive, text-like attributes: fields like order date, which represents the date the order was made, or product description, which helps to describe the product, and so on.
As we understand the problem better, more dimensions will be added as required. Sometimes adding a new dimension causes us to take a closer look at the fact table. Adding additional dimensions should, however, not cause additional fact rows to be generated.
STEP 4: Identifying the numeric facts that will populate each fact table row. Numeric facts are basically business performance measures. According to Kimball and Ross, all candidate facts in a design must be true to the grain defined in step 2. In our case, an individual order details line includes facts like quantity sold, unit cost amount and total sale amount. These facts are numeric, additive figures that allow for slicing and dicing; their sums will be correct across dimensions, and additional measures can be derived or computed from them. With the proper facts, measures like gross profit (sales amount - cost amount) can easily be computed, and this derived figure is also additive across dimensions. In building a data warehouse, it is highly important to keep the business users' requirements and the realities of the source data in tandem. One should normally use an understanding of the business to determine what dimensions and facts are required to build the dimensional model. We will do our best to apply Kimball and Ross' four-step methodology to what we believe would be the normal business requirements for this project.
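Additivity is what makes these facts useful: because each fact row carries quantity, sales and cost, any grouping of rows sums correctly, and gross profit (sales minus cost) derived per group is still correct. A small sketch with made-up numbers, not data from the Classic Cars database:

```python
# Additive facts: per-row measures (quantity, sales, cost) can be summed
# along any dimension, and the derived gross profit stays correct.
# All rows and values here are hypothetical.

fact_rows = [
    {"product": "car",   "country": "USA",    "quantity": 2, "sales": 200.0, "cost": 120.0},
    {"product": "car",   "country": "France", "quantity": 1, "sales": 110.0, "cost": 60.0},
    {"product": "truck", "country": "USA",    "quantity": 4, "sales": 480.0, "cost": 300.0},
]

def roll_up(rows, dimension):
    """Sum the additive facts grouped by one dimension attribute."""
    totals = {}
    for r in rows:
        t = totals.setdefault(r[dimension], {"sales": 0.0, "cost": 0.0})
        t["sales"] += r["sales"]
        t["cost"] += r["cost"]
    # Gross profit is additive, so it can be derived after aggregation
    # and still agree with a row-by-row computation.
    for t in totals.values():
        t["profit"] = t["sales"] - t["cost"]
    return totals

by_country = roll_up(fact_rows, "country")
print(by_country["USA"]["profit"])  # 260.0
```

The same three rows roll up just as correctly by product or by any other dimension attribute, which is exactly the "sums correct across dimensions" property the text relies on.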
Summary In this chapter, we outlined Ralph Kimball's four-step methodology and presented why it is very popular amongst the data warehousing community. We talked briefly about our constraint of not having business users to interact with as a means of gathering business requirements for this project, and how we hope to work around this. In the next chapter, we will discuss the functional requirements for the data warehouse.
Chapter 4 Functional Requirements
Before diving into the process of data warehousing, it is important to define what is expected from the completed data mart, i.e. what the business users expect to be able to do with our system, or, as in our case, what we believe will help Classic Cars achieve their business objectives. Functional requirements define what the system does. By defining the functional requirements, we have a measure of success at the completion of the project, as we can easily look at the data warehouse and determine how well it conforms or provides answers to the various requirements posed in table 4.1. In trying to define the functional requirements, we explored the source system and tried to analyze the business operations of Classic Cars. In the end, we agreed that the data warehouse should be able to help users provide answers to:

No.  Requirement                            Priority
1    Customer purchase history              High
2    Product order history                  High
3    Product sales per geographic region    High
4    Store sales performance                High
5    Customer payment history               High
6    Buying patterns per geographic region  High

Table 4.1: Functional requirements for the Classic Cars Data Warehouse.
In this short but very important chapter, we tried to outline what the business users expect from our finished data warehouse. This will very much be the yardstick which determines whether the data warehouse is accepted by the users or not. A data warehouse that does not meet the expectations of the business users would not be used, and from that perspective would be deemed to have failed. In the next chapter, we combine the functional requirements and the methodology and try to come up with a dimensional model of our data warehouse.
Chapter 5 Data Modelling

We start off this chapter by explaining some dimensional modeling terms. We will design the data stores. By looking at the functional requirements, we are able to know what to include in our data stores. We will be using the dimensional modeling approach and follow the Four-Step Dimensional Design Process outlined in the previous chapter. We will first define and then build our fact table surrounded by the dimension tables. The contents of our fact and dimension tables will be dictated by the functional requirements defined in the previous chapter. We will construct a data hierarchy and also construct a metadata database.
Fact Table: A fact table is the primary table in a dimensional model, where the numerical performance measurements of the business are stored. Measurements from a single business process are stored in a single data mart. A fact represents a business measure, e.g. quantities sold, or dollar sales amount per product, per day, in a store. The most useful facts in a fact table are numeric and additive. This is because the usual operation on warehouse data is retrieving thousands of rows and adding them up. Fact tables contain a primary key which is a combination of primary keys from the dimension tables (foreign keys). Also known as a composite or concatenated key, this helps to form a many-to-many relationship between the fact table and the
dimension tables. Not every foreign key in the fact table is needed to guarantee uniqueness. Fact tables may also contain a degenerate dimension (DD) column. This is a dimension with only one attribute, and as such it is added to the fact table as opposed to having a dimension table of its own with only one column.

Dimension Tables: These contain textual descriptors that accompany the data in the fact table. The aim is to include as many descriptive attributes as possible, because they serve as the primary source of query constraints, groupings, and report labels. E.g. when a user wants to see model sales by country and region, country and region must be available as dimension attributes. They are the key to making the data warehouse usable and understandable, and should contain verbose business terminology as opposed to cryptic abbreviations. Dimension tables are highly denormalized and as a result contain redundant data, but this is a small price to pay for the trade-off: what we achieve is ease of use and better query performance, as fewer joins are required. The data warehouse is only as good as its dimension attributes. Dimension tables also represent hierarchical relationships in the business.
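The relationship between a fact table, its foreign keys, and a degenerate dimension can be shown with a tiny in-memory database. This is our own sketch under assumed names (dim_product, dim_date, fact_sales, order_number), not the final Classic Cars design:

```python
import sqlite3

# A tiny star-schema sketch: one fact table whose foreign keys point at
# denormalized dimension tables. Table and column names are illustrative.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY,
                          product_name TEXT, product_line TEXT);
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY,
                          full_date TEXT, year INTEGER, month INTEGER);
CREATE TABLE fact_sales  (date_key INTEGER REFERENCES dim_date,
                          product_key INTEGER REFERENCES dim_product,
                          order_number INTEGER,  -- degenerate dimension
                          quantity INTEGER, sales_amount REAL);
""")
con.execute("INSERT INTO dim_product VALUES (1, '1952 Alpine Renault', 'Classic Cars')")
con.execute("INSERT INTO dim_date VALUES (20080901, '2008-09-01', 2008, 9)")
con.execute("INSERT INTO fact_sales VALUES (20080901, 1, 10100, 3, 408.0)")

# A typical star-join query: constrain and group by dimension attributes.
row = con.execute("""
    SELECT d.year, p.product_line, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_date d ON f.date_key = d.date_key
    JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY d.year, p.product_line
""").fetchone()
print(row)  # (2008, 'Classic Cars', 408.0)
```

Note how order_number lives directly in the fact table: a dimension table holding nothing but order numbers would add a join without adding any descriptive attributes, which is exactly why it is kept degenerate.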
5.2.1 Dimensional Model
When we join the fact table together with the corresponding dimension tables, we get what is known as a data mart. This forms a kind of star-like structure and is also referred to as the star join schema. The star schema is based on simplicity and symmetry. It is very easy to understand and navigate. Because data in the dimension tables is highly denormalized and contains meaningful and verbose business descriptors, users can quickly recognize that the dimensional model properly represents their business. Another advantage of using a dimensional model is that it is gracefully extensible to accommodate changes. It can easily withstand unexpected changes in user behavior. We can easily add completely new dimensions to the schema, as long as a single value of that dimension is defined for each existing fact row. It has no built-in bias as to query expectations, and certainly no preferences for likely business questions. All dimensions are equal and present symmetrical, equal entry points into the fact table. The schema should not have to be adjusted
every time users come up with new ways to analyze the business. The key to achieving this lies in the process of choosing the granularity, as the most granular or atomic data has the most dimensionality. According to Kimball and Ross, atomic data that has not been aggregated is the most expressive. Because the fact table incorporates atomic data, it should be able to withstand ad hoc user queries; a must if our warehouse is to be useful and durable. Creating a report should be as simple as dragging and dropping dimensional attributes and facts into a simple report.
5.2.2 Metadata
Metadata is the encyclopedia of a data warehouse. It contains all the information about the data in the data warehouse. It supports the various activities required to keep the data warehouse functioning, be it technical (information about source systems, source tables, target tables, load times, last successful load, transformations on data, etc.), administrative (indexes, view definitions, security privileges and access rights, ETL run schedules, run-log results, usage statistics, etc.) or business user support (user documentation, business names and definitions, etc.). We build a metadata database, which will serve as the catalogue of the data warehouse.
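What one entry in such a catalogue might hold can be sketched as a plain record. The field names below are our own illustration of the technical and administrative metadata listed above, not the schema of our actual metadata database:

```python
# A sketch of one metadata catalogue entry for a single ETL process.
# Every field name and value here is hypothetical.
etl_metadata = {
    "process_name": "load_fact_sales",
    "source_table": "orderdetails",   # source-system table
    "target_table": "fact_sales",     # DDS table
    "transformation": "join orders, compute sales_amount = qty * price",
    "schedule": "daily 02:00",
    "last_successful_load": "2008-12-01T02:04:11",
    "rows_loaded": 2996,
}

def describe(entry):
    """Render a one-line run-log summary from a catalogue entry."""
    return "{process_name}: {source_table} -> {target_table} ({rows_loaded} rows)".format(**entry)

print(describe(etl_metadata))
# load_fact_sales: orderdetails -> fact_sales (2996 rows)
```

In practice such entries would live in metadata tables queried by the ETL scheduler and by administrators checking run-log results.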
5.3 Designing the Dimensional Data Store
In order to do a good DDS design, we must ensure that the design of the DDS is driven by the functional requirements defined in the previous chapter. This is because the functional requirements represent the kind of analysis that the business users will want to perform on the data in the warehouse.
Figure 5.1: Key Input to the four-step dimensional design process
5.3.1 STEP 1: Selecting the Business Model
Understanding the business requirements, coupled with analysis of the available data, helps us to choose what business process to model. In a normal real-life situation, we would choose an area that would have an immediate and substantial impact on business users, as a means of getting them to adopt the system quite easily. However, we are constrained by the fact that the only data available to us in our source system is sales data. So our business process to model is product sales. We will build a Product-Sales data mart. A data mart is simply a fact table surrounded by its corresponding dimension tables that model a business process. It will allow users to answer the questions posed in the functional requirements. The product sales event happens when a customer, through a sales rep, places an order for some of the products. The roles (who, what, where) in this case are the customer, the product, and the store. The measures are the quantity, unit price and value of sales. We will put the measures into the fact table and the roles (plus dates) in the dimension tables. The business events become individual rows in the fact table.
Figure 5.2: Preliminary Sales Data Mart
5.3.2 STEP 2: Declaring the Grain
Declaring the grain means deciding what level of data detail should be available in the dimensional model. The goal is to create a dimensional model for the most atomic information captured by the business process outlined in step 1. Different arguments abound about how low the atomicity of the grain should be. According to Ralph Kimball, tackling data at its lowest, most atomic grain makes sense on multiple fronts. Atomic data is highly dimensional: the more detailed and atomic the fact measurement, the more things we know for sure. Atomic data provides maximum analytic flexibility because it can be constrained and rolled up in every possible way; detailed data in a dimensional model is poised and ready for the ad hoc attack by the business users. Selecting a higher-level grain limits the potential to less detailed dimensions and makes the model vulnerable to unexpected user requests to drill down into the details. The same would also be true if summary or aggregated data were used.

In our Classic Cars study, we have chosen an individual line item in the order details transaction table as the most granular data item. In other words, the grain, or one row of the Product Sales fact table, corresponds to one unit of a model sold (car, truck, motorcycle, etc.). By choosing such a low-level grain, we are not restricting the potential of the data warehouse by anticipating user queries, but ensuring maximum dimensionality and flexibility, because queries need to cut through details (slicing and dicing) in precise ways, whether users want to compare sales between particular days or compare model sales according to scale model size. While users will probably not want to analyze every single line item sale in a particular order, providing access to summarized data only would not be able to answer such questions.
5.3.3 STEP 3: Choosing the dimensions
After we have identified what constitutes the business measure of the event we are modeling (Product Sales), certain fields which describe or qualify the event (roles) become obvious: product, store, customer and date will form the dimensions. We will also have the order number as a dimension, but because it does not have any other attributes of its own, it will sit in our fact table as a degenerate dimension. It will help to identify products belonging to a particular order.

Dimension tables need to be robust and as verbose as possible. Dimensions implement the user interface to a data warehouse, and it is not uncommon to have a dimension table containing 50-100 columns. Unlike fact tables, they are updated infrequently, and updates are usually minor additions such as adding a new product or customer or updating prices.
Slowly Changing Dimensions
This brings us to the problem of slowly changing dimensions (SCD) and how it is handled. If we recall our definition of what a data warehouse is, we know that it stores historical data, so what happens, for example, if the value of a dimensional attribute changes? Say, for example, an office that was overseeing a particular region, or a customer, changes address. Surely, merely updating this dimension by simply changing the address will mean that all previous transactions carried out under the old region or address can no longer be isolated, and we might not be able to analyze the information because queries would have no means to refer to them explicitly, since they are now part of the new region or address; a fundamental function of our data warehouse, storing historical data, would no longer hold.

The problem of SCD can be handled by either overwriting existing values (type 1 SCD), preserving the old attribute values as rows (type 2), or storing them as columns (type 3).

A type 1 response is only suitable if the attribute change is a correction or there is no value in retaining the old description. This is not usually desirable, and it should be up to the business users to determine whether they want to keep the history or not.

A type 2 response is the most common technique, as it is the most flexible to implement and does not limit the number of times we can reflect a change in a dimension attribute. It involves adding a new dimension row every time an attribute changes: the current value is preserved in the current row and the new value is reflected in the new row. Using this method we stay true to our definition of a data warehouse, keeping historical data while also allowing users to track historical changes and perform analysis constrained on either or both values.

Let us suppose, in our case study, that a particular car model is only sold in Region 1 up until a certain period, and then Classic Cars decides to discontinue its sale there and move it to Region 2. Obviously, under a type 1 response, from the moment the field attribute is corrected to reflect Region 2 as the new region, there will be no way of analyzing car model X's sales performance prior to when it was moved to Region 2. Furthermore, analysis of the sales figures in Region 2 will, incorrectly, count car model X's sales figures from when it was in Region 1 as part of Region 2's.

Using the type 2 approach, when the region changed we would add a new dimension row to reflect the change in the region attribute. We would then have two product dimension rows for car model X:

Product Key   Product Code   Product Description   Region
1233          FERR-12        Ferrari Blazer        Region 1
1246          FERR-12        Ferrari Blazer        Region 2

Table 5.1: Type 2 response to SCD
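The type 2 mechanics can be sketched in a few lines. This is an illustrative sketch only, using SQLite, hypothetical surrogate keys and an assumed `is_current` flag column (the actual implementation uses date stamps and valid/invalid markers as described below):

```python
import sqlite3

# Sketch of a type 2 SCD change: expire the current row, add a new row.
# Keys and column names are illustrative.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,  -- surrogate key
    product_code TEXT,                 -- natural key
    description  TEXT,
    region       TEXT,
    is_current   INTEGER               -- 1 = current row, 0 = expired
)""")
# Car model X starts out in Region 1.
cur.execute("INSERT INTO dim_product VALUES "
            "(1233, 'FERR-12', 'Ferrari Blazer', 'Region 1', 1)")

def scd2_change_region(code, new_region, next_key):
    """Expire the current row, then insert a new row carrying the new value."""
    cur.execute("UPDATE dim_product SET is_current = 0 "
                "WHERE product_code = ? AND is_current = 1", (code,))
    cur.execute("INSERT INTO dim_product "
                "SELECT ?, product_code, description, ?, 1 "
                "FROM dim_product WHERE product_code = ? AND is_current = 0 "
                "LIMIT 1",
                (next_key, new_region, code))

scd2_change_region('FERR-12', 'Region 2', 1246)
rows = cur.execute("SELECT product_key, region, is_current FROM dim_product "
                   "ORDER BY product_key").fetchall()
print(rows)  # [(1233, 'Region 1', 0), (1246, 'Region 2', 1)]
```

Both rows survive, so queries constrained on Region 1 before the change date and Region 2 after it each see the correct product key.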
The above table also helps us to see why we have to introduce surrogate keys into our dimension tables, as opposed to using the natural keys. The surrogate keys help to identify a unique product attribute profile that was true for a span of time. In addition, we do not need to go into the fact table to modify the product keys, and the new dimension row also helps to automatically partition history in the fact table. Constraining a query on Region 1 for car model X prior to the change date will only reflect product key 1233, when car model X was still in Region 1, and constraining on a date after the change will no longer reflect the same product key, because it now rolls up into Region 2.

We also introduce a date stamp column on the dimension row, which will help track new rows that are added; a valid or invalid attribute is also added to indicate the state of the attributes. Effective and expiration dates are necessary in the staging area because they help to determine which surrogate keys are valid when the ETL is loading historical fact records.

A type 3 response uses a technique that requires adding a new column to the dimension table to reflect the new attribute. The advantage it offers is that, unlike the type 2 response, it allows us to associate the new value with old fact history, or vice versa. Recall that in the type 2 response the new row had to be assigned a new product key (surrogate key) to guarantee uniqueness, so the only way to connect the two rows was through the product code (natural key). Using a type 3 response, the solution would look like this:

Product Key   Product Code   Product Description   Region     Prior Region
1233          FERR-12        Ferrari Blazer        Region 2   Region 1

Table 5.2: Type 3 response to SCD
A type 3 response is suitable when there is a need to support both the current and previous views of an attribute value simultaneously. But it is quite obvious that adding a new column involves structural changes to the physical design of the underlying dimension table, so this approach is preferable only if the business users decide that just the last two or three prior attribute values need to be tracked. Its biggest drawback appears if we need to track the impact of the intermediate attribute values [2]. There are hybrid methods for solving the problem of SCD which combine features of the above techniques, but while they can offer greater flexibility, they usually introduce more complexity and, where possible, should be avoided.

We introduce surrogate keys into our dimension tables and use them as the primary keys. This approach is more suitable because, for one, it helps to tackle the problem of SCD. It is also essential for the stage ETL process, especially because we have chosen the type 2 response for dealing with SCDs. Surrogate keys help the ETL process keep track of rows that already exist in the data warehouse and avoid reloading them. Surrogate keys are very easy to automate and assign because they are usually integer values: the last assigned value is stored in the metadata and is easily retrieved and incremented on the next run.
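The retrieve-and-increment scheme for surrogate keys can be sketched as below. The metadata table name, its columns and the starting value are all hypothetical:

```python
import sqlite3

# Sketch: assigning surrogate keys from a last-value counter kept in the
# metadata database. Table and column names are illustrative.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE metadata_keys "
            "(table_name TEXT PRIMARY KEY, last_key INTEGER)")
cur.execute("INSERT INTO metadata_keys VALUES ('dim_product', 1246)")

def next_surrogate_key(table_name):
    """Increment the last assigned key in metadata and return the new value."""
    cur.execute("UPDATE metadata_keys SET last_key = last_key + 1 "
                "WHERE table_name = ?", (table_name,))
    (key,) = cur.execute("SELECT last_key FROM metadata_keys "
                         "WHERE table_name = ?", (table_name,)).fetchone()
    return key

k1 = next_surrogate_key('dim_product')
k2 = next_surrogate_key('dim_product')
print(k1, k2)  # 1247 1248
```

Because the counter is persisted in metadata rather than in the ETL process itself, each run resumes numbering where the previous run stopped.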
Data Hierarchy
Dimension tables often also represent hierarchical relationships in the business. Hierarchies help us to roll up and drill down to analyze information based on related facts. For example, state rolls up into country and country into region. Or, in the date dimension, days roll up into weeks, weeks into months, and months into periods. Products roll up into product line, and product line into vendor. Having a hierarchy translates into better query performance and more efficient slicing and dicing through grouping along a path. Users are able, for example, to view a product's performance during a week and later roll it up into a month and further into a quarter or period. All four of our dimensions have hierarchies.
Figure 5.3: The Product dimension hierarchy
Figure 5.4: The Customer dimension hierarchy
Figure 5.5: The Date dimension hierarchy
Figure 5.6: The Oﬃce dimension hierarchy with multiple paths.
The Date Dimension
Every business event that takes place happens on a particular date, and so the date dimension is very important to a data warehouse. It is the primary basis of every report, and virtually every data mart is a time series. It is also common to every data mart in a data warehouse, and as a result it must be designed correctly. When modeling the date dimension, care must be taken to make sure that it is filled with attributes that are necessary for every fact table that will be using it. Assigning the right columns will make it possible to create reports that, for example, compare sales on a Monday with sales on a Sunday, or compare one particular month with another. The columns or attributes in a date dimension can be categorized into four groups:
• Date formats: the date format columns contain dates in various formats.

• Calendar date attributes: the calendar date attribute columns contain various elements of a date, such as day, month name, and year.

• Fiscal attributes: the fiscal attribute columns contain elements related to the fiscal calendar, such as fiscal week, fiscal period, and fiscal year.

• Indicator columns: these contain Boolean values used to determine whether a particular date satisfies a certain condition, e.g. a national holiday.
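Generating rows for such a date dimension can be sketched as follows. The column set and the holiday list are illustrative assumptions, not the project's actual attributes:

```python
from datetime import date, timedelta

# Sketch: building date dimension rows covering the four column groups.
# The holiday set and column names are hypothetical.
HOLIDAYS = {date(2008, 1, 1), date(2008, 12, 25)}

def date_dimension_row(d):
    return {
        "date_key": int(d.strftime("%Y%m%d")),   # e.g. 20081225
        "full_date": d.isoformat(),              # date format column
        "day_name": d.strftime("%A"),            # calendar date attributes
        "month_name": d.strftime("%B"),
        "year": d.year,
        "is_weekend": d.weekday() >= 5,          # indicator columns
        "is_national_holiday": d in HOLIDAYS,
    }

start = date(2008, 12, 24)
rows = [date_dimension_row(start + timedelta(days=i)) for i in range(3)]
for r in rows:
    print(r["date_key"], r["day_name"], r["is_national_holiday"])
```

With the day-name and indicator columns in place, a "Monday versus Sunday" report reduces to a simple constraint on the dimension rather than date arithmetic in every query.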
Figure 5.7: The Date dimension table.
The Office Dimension
The office dimension describes every branch office outlet in the business. It is a geographic dimension. Each outlet is a location and so can be rolled up into city, state or country. Each office can also easily be rolled up into its corresponding geographic region. To accommodate the movement of an office's coverage region, we have introduced the store key as a surrogate key, and this will be used to implement a type 2 SCD response.
Figure 5.8: The Oﬃce dimension table.
The Product Dimension
The product dimension describes the complete portfolio of products sold by the company. We have introduced the product key as the surrogate key. It is mapped to the product code in the source system(s). This helps to integrate product information sourced from different operational systems. It also helps to overcome the problem that arises when the company discontinues a product and assigns the same code to a new product, and, as we have mentioned earlier, the problem of SCD. Apart from a few dimension attributes changing over time, most attributes stay the same.

Hierarchies are also very apparent in our product dimension. Products roll up into product line, product scale and product vendor, and business users will normally constrain on a product hierarchy attribute. Drilling down simply means adding more row headers, and drilling up is just the opposite. As with all dimension tables, we try to make our attributes as rich and textually verbose as possible, since they will also be used to construct row headers for reports.
Figure 5.9: The product dimension table.
The Customer Dimension
The customer forms an important part of the product sales event; the customer is actually the initiator of this event. All Classic Cars' customers are commercial entities, since they are all resellers. The customer name field in this respect makes sense as only one column, but we do have a contact first name and contact last name for correspondence. The customer key is a surrogate key that helps with SCD. Attributes are chosen based on the business users' requirements outlined in the functional requirements.
Figure 5.10: The customer dimension table.
Step 4: Identifying the Facts
The final step is identifying the facts that will form the columns of the fact table. The facts are actually dictated by the grain declared in step 2: the facts must be true to the grain, which in our case is an individual order line item. The facts available to us are the sales quantity, buy price per unit and the sales amount, all purely additive across all dimensions. We will be able to calculate gross profit (sales amount - buy price) on items sold. We can calculate the gross profit of any combination of products sold in any set of stores over any number of days. And in the cases where stores sell products at slightly different prices from the recommended retail price, we should also be able to calculate the average selling price for a product in a series of stores or across a period of time.

Kimball et al. recommend that these computed facts be stored in the physical database to eliminate the possibility of user error: the cost of a user incorrectly representing gross profit overwhelms the minor incremental storage cost. We agree with this, since storage cost is no longer the issue it once was.

Since the fact table connects to our dimension tables to form a data mart, it is necessary that it contain attributes that link it with the dimension tables, in other words, attributes that enforce referential integrity. All the surrogate keys in the dimension tables are present in the fact table as foreign keys. A combination of these keys will help us define a primary key for our fact table to guarantee uniqueness.
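The derivation of the stored computed fact can be sketched as below. Field names are illustrative, and we read "buy price" as per unit, as the text states, so the total cost of a line item is quantity times buy price:

```python
# Sketch: deriving the gross-profit computed fact from the base measures at
# load time, so users never have to re-derive it in queries.
fact_rows = [
    # (quantity, buy_price_per_unit, sales_amount) -- illustrative values
    (2, 40.0, 100.0),
    (1, 55.0, 60.0),
]

def with_gross_profit(rows):
    out = []
    for qty, buy_price, sales_amount in rows:
        cost = qty * buy_price
        out.append({
            "quantity": qty,
            "buy_price": buy_price,
            "sales_amount": sales_amount,
            "gross_profit": sales_amount - cost,  # stored, not user-computed
        })
    return out

rows = with_gross_profit(fact_rows)
print([r["gross_profit"] for r in rows])  # [20.0, 5.0]
```

Because gross profit is additive, it rolls up correctly across any combination of products, stores and days, which is exactly the analysis described above.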
Our fact table also contains two degenerate dimensions, namely the order number and the order line number.
Figure 5.11: The Product Sales Data mart.
Source System Mapping
After completing the DDS design, the next step is to map the source system columns to the DDS columns. This will help the ETL process, during the extraction phase, to know which columns to extract from and which target columns to populate. Since the fact table columns comprise attributes from different tables, which in turn could also come from different source systems, we need a source system code in order to identify the source system a record comes from, and for the ETL to be able to map to the column in whatever system it might reside in. The only requirement is that the source system code and its mapping be stored in the metadata database. At this stage we also consider the necessary transformations and calculations to be performed by the ETL logic during extraction, but because our source system database is rather simple and straightforward, we will not be performing any.
Summary
In this chapter we went in depth into data modeling and what guides the modeling process, and then designed our DDS. We started by defining some data modeling jargon. We used the Kimball four-step approach in our DDS construction process. We also looked at how columns from the source system are mapped to the columns in the DDS. In the next chapter, we will be looking at the physical elements of our data warehouse.
Chapter 6

The Physical Database Design

6.1

In this chapter we look at the physical structure of our data warehouse and its supporting technology. We will show how we implement our DDS and data warehouse structure using Microsoft SQL Server. We will not be discussing the hardware structure or requirements, as these are beyond our defined scope for this project.

In a normal business environment, the source system, the ETL server and the DDS would ideally be running on separate systems, more so because the source system is an OLTP system and we must not interfere with its smooth running. For the purpose of implementation, we needed to find a way to simulate multiple systems on a single computer. Our solution is to represent each element as a separate database running on a single SQL Server installation on one computer. What we have is an environment where each database behaves like an individual system, and using MSSIS we can connect and move data between the different elements of the data warehouse through OLEDB, just as we would if the databases were residing on separate systems.
The source system database
This database simulates our source system. It is a database of transactions pulled from the order management system of the Classic Cars Company. It is an OLTP system and records the day-to-day transactions of receiving and dispatching orders, as well as inventories and all the supporting data. In our case it is a single system, but as Classic Cars has stores in various regions, it would ideally be OLTP data from the various stores. We as data warehouse designers and implementers do not create the source systems, but they are the first place we start our feasibility study for the functional requirements of the data warehouse. Careful thought must be put into giving the OLTP source system as little interference as possible from the other elements of the data warehouse. According to Kimball and Ross, a well-designed data warehouse can help to relieve OLTP systems of the responsibility of storing historical data.
The staging area database
In trying to conform to the last sentence of the above paragraph, it is essential that we have a staging area. A data warehouse differs from an OLTP system in that the data in a data warehouse is accurate only up until the last time it was updated; a data warehouse does not contain live data and is not updated in real time. Updating the data in a data warehouse might mean uploading hundreds of megabytes to tens of gigabytes of data from OLTP systems on a daily basis, and OLTP systems are not designed to tolerate this kind of extraction. So, in order to avoid slowing the source systems down, we create a stage database: our stage ETL connects to the source system at a predefined time of day (usually at a time of low transaction traffic), extracts the data, dumps it into the stage database, and immediately disconnects from the source system database.

The internal structure of the staging database is basically the same as that of the source system, except that the tables have been stripped of all constraints and indexes. We have added two columns, the source system code and the date of record creation, as a means of identifying the originating source of the data and the date of extraction as a bookmark. These are for auditing and ETL purposes; this way the ETL can avoid reloading the same data on the next load. The stage ETL performs all the necessary transformations on the extracted data in this area and then loads it into the dimension and fact tables of the DDS.
The staging area is akin to a workshop in that it is not accessible to user queries. It is just an intermediate place that data warehouse data passes through on its way to the DDS.
The DDS database
The DDS database houses the Classic Cars DDS that contains our dimension and fact tables. Our data mart contains four dimensions and one fact table, but ideally, in the real world, it could house tens of data marts, and we would recommend giving it a standalone system of its own. This is our data presentation area and will be accessed by various report writers, analytic applications, data mining and other data access tools. We aim to design a DDS that is unbiased and transparent to the accessing application or tool; this way, users are not tied to any particular tool for querying or analysis purposes. Due to referential integrity, it is important to create the dimensions before the fact tables.
The Metadata database
The metadata database maintains all the information in a data warehouse that is not actual data itself; it is data about data. Kimball likens it to the encyclopedia of the data warehouse. Under normal circumstances, it would be filled with tons of information about everything that is done in the warehouse and how it is done, and it would support all user groups, from technical to administrative to business users. Our metadata database is a stripped-down version whose primary purpose is to support our ETL processes. We store information about the source systems and column mappings, and about ETL scheduling. Information about the date and time of the last successful and unsuccessful loads is recorded, as are the last increments of the surrogate keys. The metadata is the starting point of every ETL process.
A view is a database object akin to a table, with rows and columns, but it is not physically stored on disk; it is a virtual table formed by using a join to select subsets of tables' rows and columns. We created a view in order to be able to link a sale to a particular store. This was because the store table does not connect to the orders transaction table, and the only way to deduce the store in which a transaction took place was through the employee making the sale. To extract this information, we had to join the order transaction table to the employees table through the salesRepEmployeeNumber, and from that we could retrieve the store ID.
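The join path just described can be sketched against a toy copy of the schema. The exact table and column names here are assumptions (the report only names salesRepEmployeeNumber), and SQLite stands in for SQL Server:

```python
import sqlite3

# Sketch of the view: link each order to the office of the sales rep who
# made the sale (orders -> customers -> employees -> office).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE offices   (officeCode TEXT PRIMARY KEY, city TEXT);
CREATE TABLE employees (employeeNumber INTEGER PRIMARY KEY, officeCode TEXT);
CREATE TABLE customers (customerNumber INTEGER PRIMARY KEY,
                        salesRepEmployeeNumber INTEGER);
CREATE TABLE orders    (orderNumber INTEGER PRIMARY KEY,
                        customerNumber INTEGER);

INSERT INTO offices   VALUES ('1', 'San Francisco');
INSERT INTO employees VALUES (1165, '1');
INSERT INTO customers VALUES (103, 1165);
INSERT INTO orders    VALUES (10100, 103);

-- The view is virtual: nothing extra is stored, only the join definition.
CREATE VIEW order_office AS
SELECT o.orderNumber, e.officeCode
FROM orders o
JOIN customers c ON c.customerNumber = o.customerNumber
JOIN employees e ON e.employeeNumber = c.salesRepEmployeeNumber;
""")

result = cur.execute("SELECT * FROM order_office").fetchall()
print(result)  # [(10100, '1')]
```

The ETL can then read `order_office` like any table, without the orders table ever referencing an office directly.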
This chapter looked at the physical components of our data warehouse. We explained how we were able to simulate the various elements and environments of a data warehouse on a single system. We built our databases and can now look forward to the next phase in our implementation: populating the data warehouse. In the next chapter, we will be looking at how to move data from our source system into the DDS.
Chapter 7

Populating the Data Warehouse

7.1

In this chapter we will look at how we move data from our source system into the data warehouse. Populating our data warehouse is done in the following steps: the first step is to move the data from our source database to the staging database, where the necessary transformations are applied; thereafter all the data is transferred to the DDS. While transferring the data from the staging database to the DDS, we need to denormalize it first; this is a necessary step in preparing it for the DDS. To achieve this goal we have implemented two ETL processes:
• The Stage ETL: this connects to the source system, moves the data to the stage database, and disconnects from the source system.

• The DDS ETL: this denormalizes the data and then loads it into the DDS.
Both steps are illustrated in the figure below.
Figure 7.1: Data ﬂow through the data warehouse showing ETL processes
Populating the Stage database
As we mentioned in an earlier chapter, our decision to include the stage database in our data warehouse architecture is primarily to reduce the amount of time during which our ETL is connected to the source system. In order to minimize this burden on the source database, we have chosen to implement the incremental extract method in the Stage ETL. Using this approach, only the initial load process of the ETL requires that all the data in the source system be moved into the stage database. Thereafter, at regular intervals, usually once a day and normally at a time when the OLTP system is handling fewer transactions, the ETL connects, picks up only the new or updated records since its last connection to the source system, and loads them into the data warehouse, hence the name incremental extract method. To enable the ETL to recognize and extract the data incrementally, we have added the created and lastUpdated timestamp columns to each table in our source database. Below is an extract from the Customer table:
Figure 7.2: Sample customer table showing the created and the lastUpdated columns.
We use the metadata database to store the last successful extraction time (LSET) and the current extraction time (CET) for each table in the source system. This is a mechanism to help the ETL process figure out where to begin the next run of incremental extraction, and also to help in the case of error recovery, if there is a failure during an extract and the process does not complete.
Figure 7.3: Snapshot from the metadata data flow table showing the LSET and the CET.
From figure 7.3, we can see that the last ETL run successfully loaded all the records from the source database into the stage up until the 11th of November 2008 (the LSET). Therefore, in the next extraction session we need the ETL process to load only those records which were created or updated after the last successful extraction time. As an example, let us assume that we are running our ETL process on the 11th of December 2008 (the CET). Of the three customers shown in the above figure, only one, Atelier graphique, will be transferred to our staging area, the reason being that this record was last updated on 2008/12/08, which was after our last successful extraction time for the Customers table (2008/11/11).

In order to pick up all the new or updated records from the source database, we must first save the current time as the CET in our metadata data flow table. Next we need to get the LSET for a particular table, for example the Customers table. This is achieved with a simple SQL query like this:

SELECT LSET FROM metadata.data_flow WHERE name = 'Customers'

The query returns the LSET for the Customers table. Armed with these two parameters, we can then proceed to extract new or updated customers from our source database with the following query:

SELECT * FROM Customers
WHERE (created > LSET AND created <= CET)
   OR (lastUpdated > LSET AND lastUpdated <= CET)

Picking logically correct data here also requires that the record's created or lastUpdated timestamp be no greater than the current extraction time. For even better performance of our stage ETL process, our staging database tables do away with constraints and indexes.
While transferring the data to the staging area we do not want our ETL process to be bogged down by unnecessary checks for constraint violations. We use constraints to ensure data quality only when inserting new or updating old data in our source systems. This way we are sure that the data that comes into our staging area is correct, and there is no need to double-check it again.
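The extraction-window logic described above can be sketched in a few lines. The following is an illustrative Python/sqlite3 sketch, not the actual SSIS package; the sample customer rows, the underscore in the metadata table name, and the hardcoded extraction dates are assumptions made to mirror the example in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (name TEXT, created TEXT, lastUpdated TEXT);
INSERT INTO Customers VALUES
  ('Atelier graphique',     '2004-01-15', '2008-12-08'),
  ('Signal Gift Stores',    '2003-05-20', '2008-10-01'),
  ('Australian Collectors', '2003-02-03', '2008-09-12');
CREATE TABLE data_flow (name TEXT, LSET TEXT, CET TEXT);
INSERT INTO data_flow VALUES ('Customers', '2008-11-11', NULL);
""")

def extract_incremental(conn, table):
    # Save the current extraction time (CET) in the metadata table.
    cet = "2008-12-11"  # in a real run: the current timestamp
    conn.execute("UPDATE data_flow SET CET=? WHERE name=?", (cet, table))
    # Fetch the last successful extraction time (LSET) for this table.
    (lset,) = conn.execute(
        "SELECT LSET FROM data_flow WHERE name=?", (table,)).fetchone()
    # Pick only the records created or updated in the window (LSET, CET].
    # (Table names cannot be parametrized, hence the f-string; fine for a sketch.)
    rows = conn.execute(
        f"""SELECT name FROM {table}
            WHERE (created > ? AND created <= ?)
               OR (lastUpdated > ? AND lastUpdated <= ?)""",
        (lset, cet, lset, cet)).fetchall()
    return [r[0] for r in rows]

print(extract_incremental(conn, "Customers"))  # only 'Atelier graphique' qualifies
```

Only the customer updated after the LSET falls inside the window, matching the Atelier graphique example above.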
Data Mappings
Another important step in the design of the ETL process is data mapping. It is a common occurrence to find that the columns making up a table in the DDS are derived from multiple tables in the source systems. Mapping them helps the ETL process populate the tables with their rightful columns. Columns are mapped to tables in the stage area, and mappings are also done between the stage area and the DDS tables. Below are the data mappings of our source-to-stage ETL process:
Figure 7.4: Column mappings in the ETL process for source system to staging database.
Control Flow
The next figure shows the control flow diagram of our ETL process. It is based on the incremental extraction algorithm that we described earlier in this chapter. The whole procedure is divided into four main steps:

1. Set the current extraction time (CET = current time).
2. Get the last successful extraction time (LSET) of a particular table.
3. Pick the records that were created or updated during the time interval T (LSET < T <= CET) and load them into the stage database.
4. Set the last successful extraction time (LSET = CET).

This ETL process extracts and loads the data for each table in parallel. If an error occurs while executing one of these steps, the operation is halted and marked as failed. However, individual tables are not dependent on each other. If an error occurs while loading the Customers table, then only the procedure that works with this table is marked as failed, while the others continue as usual. Using this approach we are sure that step 4 is executed if and only if there were no errors during the whole procedure, i.e. the extraction was successful.
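The four steps above, together with the per-table failure isolation, can be sketched as follows. This is a conceptual Python sketch, not the SSIS control flow itself; the function names and the metadata dictionary are invented for illustration.

```python
from datetime import datetime

def run_table_etl(table, metadata, extract, load):
    """Run the four-step incremental ETL for one table.
    Returns True on success. On any error the LSET is left untouched,
    so the failed window is retried on the next run."""
    # 1. Set the current extraction time.
    cet = datetime.now()
    try:
        # 2. Get the last successful extraction time for this table.
        lset = metadata[table]
        # 3. Pick records with LSET < timestamp <= CET and load them.
        load(table, extract(table, lset, cet))
        # 4. Only now mark the window as done: LSET = CET.
        metadata[table] = cet
        return True
    except Exception:
        return False  # this table failed; other tables are unaffected

def run_all(tables, metadata, extract, load):
    # Each table is handled independently: a failure in one does not
    # stop the others (the real process runs them in parallel).
    return {t: run_table_etl(t, metadata, extract, load) for t in tables}
```

Because step 4 sits after the load inside the try block, the LSET advances only when the whole per-table procedure succeeded, exactly as required.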
Figure 7.5: ETL process for populating the staging database.
Moving Data to the DDS
Now that we have the data in our staging database, it is time to apply transformations, if needed, and move the data to the DDS. This is the task our second ETL process is responsible for. One of the first things to do when moving data to the DDS is to populate the dimension tables before the fact table. That is because the fact table references every dimension table through a foreign key; trying to populate the fact table first would violate its referential integrity. Having successfully populated the dimension tables, we can safely load the fact table. If no errors occurred and all the data has been successfully transferred to the DDS, we clear the staging database. Since we have that data in the data warehouse, we no longer need to keep a copy of it in the stage database. Our DDS contains four dimensions: Customer, Product, Office and Time. This ETL process will only populate three of them, because the Time dimension is populated only once, when our DDS is created. This dimension rarely changes, except for when an organization needs to redefine its fiscal times or update a holiday; hence, we do not need to update it very often. While designing the control flow architecture of this ETL process in Business Intelligence Development Studio, we placed every dimension-populating procedure in a sequence container and made it the starting point of our ETL process. This loads the dimension tables first and separates this task from loading the fact table.
If an error occurs in one of the procedures that are in the sequence container, all further execution is halted and the entire ETL process results in a failure. The figure below illustrates the control flow diagram from Business Intelligence Development Studio.
Figure 7.6: ETL process for populating the DDS.
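The reason dimensions must be loaded before the fact table can be demonstrated in miniature. The sqlite3 sketch below is only an illustration of the referential-integrity argument; the two-table schema is an invented, stripped-down stand-in for the real DDS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # sqlite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE customerDim (customerDimKey INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE salesFact (
    customerDimKey INTEGER REFERENCES customerDim(customerDimKey),
    profit REAL);
""")

# Loading the fact table first violates its referential integrity...
try:
    conn.execute("INSERT INTO salesFact VALUES (1, 99.0)")
    fact_first_ok = True
except sqlite3.IntegrityError:
    fact_first_ok = False

# ...so the dimension rows must be loaded before the fact rows.
conn.execute("INSERT INTO customerDim VALUES (1, 'Atelier graphique')")
conn.execute("INSERT INTO salesFact VALUES (1, 99.0)")
```

The first insert fails with a foreign-key violation; after the dimension row exists, the same fact row loads cleanly.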
Populating the Dimension tables
While designing this ETL process, one of the most important requirements was to incorporate slowly changing dimensions into our data warehouse. This feature makes the loading in this ETL process not as straightforward as was the case while populating the staging database. For this project we are implementing Type 1 and Type 2 SCD. Slowly changing dimension handling is used only when populating dimension tables, the reason being that dimension records are updated more frequently than fact table records. Implementing SCD using Microsoft Business Intelligence Development Studio is a rather easy task. For that purpose we use the Slowly Changing Dimension data flow transformation, as seen in the next figure. The same data flow architecture is used to populate each dimension in the Populate Dimensions sequence container.
Figure 7.7: Data flow architecture for populating the Customer Dimension.
While populating the Customer dimension we can select columns to correspond to an SCD Type 1 or Type 2 response. The following figure illustrates the SCD response type for those columns in the Customer dimension.
Figure 7.8: Handling SCD in the Customer Dimension.
The Changing attribute represents a Type 1 SCD response: new values overwrite existing values and no history is preserved. The Historical attribute corresponds to a Type 2 SCD response: changes in these column values are saved in new record rows, and previous values are saved in records marked as outdated. To show a record's current status, we use the currentStatus column with possible values of Active, meaning the record is up to date, and Expired, meaning the record is outdated. When data passes through the SCD data flow transformation, it can go to three different outputs:

1. If we receive an updated version of a changing attribute, then the data is directed to the Changing Attribute Updates Output and the old value in the DDS is updated with the new one. We do not need to insert anything into our DDS in this case.
2. If a historical attribute is updated, then the data is directed to the Historical Attribute Inserts Output. Since we want to keep a history of the previous value of this attribute, we do not update this record but instead create a new Customer record with the new data and mark it as Active. The old record is then marked as Expired. Both of those records reside in the DDS.
3. Data is redirected to the New Output when we receive a new record that is not currently in our DDS. Those records are marked as Active by default and inserted into the DDS.
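The three-way routing performed by the SCD transformation can be sketched as follows. This is a conceptual Python sketch of the same decision logic, not the SSIS component; the column names (customerNumber, phone, city) are assumed example attributes.

```python
def apply_scd(dim, incoming, changing_cols, historical_cols, business_key):
    """Route an incoming source record the way the SCD transformation does:
    Type 1 (changing) attributes overwrite in place with no history;
    Type 2 (historical) attributes expire the old row and insert a new
    Active one; unknown business keys are plain inserts."""
    current = next((r for r in dim
                    if r[business_key] == incoming[business_key]
                    and r["currentStatus"] == "Active"), None)
    if current is None:
        # New Output: record not in the DDS yet.
        dim.append({**incoming, "currentStatus": "Active"})
    elif any(current[c] != incoming[c] for c in historical_cols):
        # Historical Attribute Inserts Output: keep history.
        current["currentStatus"] = "Expired"
        dim.append({**incoming, "currentStatus": "Active"})
    else:
        # Changing Attribute Updates Output: overwrite, no history kept.
        for c in changing_cols:
            current[c] = incoming[c]
    return dim
```

A phone change (marked Changing) updates the existing row; a city change (marked Historical) expires it and appends a fresh Active row, so both versions remain queryable.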
Populating the Fact table
The next step after populating the dimension tables is to load data into the fact table. Remember that dimensions are connected to the fact table through surrogate keys. This means that we cannot just load the fact table with the data we get from the source database. First, we need to find each record's matching surrogate key in the dimension table; only then are we able to link those tables together through foreign keys. What distinguishes this procedure from the ones mentioned earlier is that here, the data flow architecture has two data sources. This is the result of our fact table being composed of columns from two different tables in the data source, namely the Orders and OrderDetails tables. This is not uncommon in data warehousing. So when populating the fact table we need to join those two tables first. We do that with the help of the "Merge Join" data flow transformation, as shown below.
Figure 7.9: Joining the Orders and OrderDetails tables with a Merge Join.
Before joining the two data sets, it is required that they be sorted first. Here, Sort 1 is the sorted OrderDetails table and Sort is a sorted version of the Orders table. The orderNumber column naturally forms the join key. After joining the two tables, we now have all the data we need to form our fact table. The only thing left to be done is to find surrogate keys, so that every fact table record can be joined with all of the dimensions. Since our fact table is connected to four dimensions, we need to carry four of those keys in every record of our fact table, very much like a composite key. At this moment our fact table can join directly with the following dimensions:
Product dimension on productCode.
Customer dimension on customerNumber.
Date dimension on orderDate.
At this point, we do not have a column that directly joins the Office dimension. This is the case because neither the Orders nor the OrderDetails tables have a direct link with the Office table in the source database. To work around this situation, we created a view that joins every order to the office where the order was made, through the Sales Rep number. Because the order record carries the employee number of the person who handled the sale, we use that employee number as a handle to get the office code, and voila, problem solved!
Figure 7.10: Joining the Orders, Customers, Employees and Offices tables.
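The bridging view can be sketched as below, following the join path in Figure 7.10 (orders, customers, employees, offices). The schema is an assumed ClassicModels-style layout in which the customer record carries the sales rep's employee number; column names and sample rows are illustrative, not taken from the actual source database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE offices   (officeCode TEXT);
CREATE TABLE employees (employeeNumber INTEGER, officeCode TEXT);
CREATE TABLE customers (customerNumber INTEGER, salesRepEmployeeNumber INTEGER);
CREATE TABLE orders    (orderNumber INTEGER, customerNumber INTEGER);

INSERT INTO offices   VALUES ('4');
INSERT INTO employees VALUES (1370, '4');
INSERT INTO customers VALUES (103, 1370);
INSERT INTO orders    VALUES (10123, 103);

-- Bridge each order to the office of the sales rep who handled it.
CREATE VIEW orderOffice AS
SELECT o.orderNumber, e.officeCode
FROM orders o
JOIN customers c ON c.customerNumber = o.customerNumber
JOIN employees e ON e.employeeNumber = c.salesRepEmployeeNumber
JOIN offices  f ON f.officeCode = e.officeCode;
""")
```

Each order now resolves to exactly one office code, which gives the fact table a joinable handle on the Office dimension.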
Using the data that we now have in our fact table, we can successfully join with each dimension using a business key. However, we need to use surrogate keys for our star join. All the dimensions carry a business key (not used as a primary key), and we can use it to get the surrogate key with the "Lookup" data flow transformation. The Lookup transformation joins the fact table with one of the dimensions on a business key, retrieves the surrogate key, and adds it to the fact table. The figure below shows a Lookup data flow transformation for getting the Customer dimension surrogate key.
Figure 7.11: Retrieving the Customer dimension surrogate key.
The business key in this example is customerNumber. Using this column we join our fact table with the Customer dimension and retrieve its surrogate key, i.e. customerDimKey. We insert this key into our fact table as a new column. Using this same approach, we get the remaining three surrogate keys. After getting the surrogate keys for all four dimensions, we can finally insert our first record into the fact table. The only thing left to do is to get rid of the business keys that we currently store in the data set that is going to be inserted into our fact table. This is easily solved using data mapping: we simply leave the business keys out of the mappings, because we do not want to include them in the fact table now that we have all the surrogate keys.
Figure 7.12: Data mappings for the Fact table.
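The lookup-and-drop step can be sketched in plain Python. This is only a conceptual stand-in for the SSIS Lookup transformation: dictionaries play the role of rows, and the helper name is invented.

```python
def lookup_surrogate(fact_rows, dim_rows, business_key, surrogate_key):
    """Mimic the Lookup transformation: join each fact row to its
    dimension row on the business key, copy in the surrogate key,
    and drop the business key so only surrogate keys reach the fact table."""
    index = {d[business_key]: d[surrogate_key] for d in dim_rows}
    out = []
    for row in fact_rows:
        enriched = dict(row)
        # pop() removes the business key and hands us its value for the lookup
        enriched[surrogate_key] = index[enriched.pop(business_key)]
        out.append(enriched)
    return out
```

Running the same step once per dimension (Customer, Product, Office, Date) leaves each fact row carrying the four surrogate keys and none of the business keys, matching the mappings in Figure 7.12.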
Preparing for the next upload
After populating the fact table with data from the staging database, the data warehouse population task is complete. The last step is to clear the staging database for the next scheduled data extraction. The complete data flow architecture diagram for populating the fact table is depicted in Figure 7.13.
Figure 7.13: Data ﬂow diagram for populating the Fact table.
Scheduling the ETL
To keep our data warehouse up to date, we need to run our ETL processes regularly. Using incremental extraction, the best option would be to run them on a daily basis, usually at a time when business activity is low or at the end of a work day. Admittedly, the latter argument holds less weight in these days of online shopping, where as one part of the world is shutting down, another is resuming work. The general idea is that we do not want to interfere with the smooth running of the OLTP source systems, so it is purely a business decision when an organization would like the ETL processes that connect to the source systems to run. On the technical side, to schedule the ETL and execute all of our previously built SSIS packages, we need to create a new SQL Server Agent Job. It is a multi-step job with one step per ETL process, as Figure 7.14 illustrates.
Figure 7.14: Creating an SQL Agent Job.
The whole job is atomic: first it populates the staging database. If this fails at any point, the job does not continue to step 2, for the obvious reason that it makes no sense trying to load data from the stage database into the DDS if no data was transferred to the staging area. Instead, we just quit the job, reporting failure. If step 1 succeeds, the job moves on to step 2. If both steps succeed, then the whole job is marked as a successful execution of our ETL processes. There are also options for notifying the administrator in case of a failure. These notifications can be sent by mail or network message, or written to the metadata or the Windows Application event log. In order not to interfere with the smooth running of the source systems, we have scheduled our ETL processes to run at 3:00 AM.
Figure 7.15: Scheduling the ETL.
At this point we have an up-to-date data warehouse. Our ETL processes are built, and the data flows into the DDS regularly. The data warehouse is ready and fully functional.
Summary
In this chapter, we explained our implementation in detail with well-illustrated diagrams. This was the part that took the longest to complete, and being able to finish it was a milestone for us. In the next chapter we will look at reporting from our completed data warehouse.
Chapter 8 Building Reports
Now that our data warehouse is up and running, it is ready to be tested. We have chosen to build some sample reports as a means of seeing our data warehouse in action. But our main goal is not to limit our data warehouse to certain pre-built reports. We believe a data warehouse should not be biased towards any particular reporting or analysis tool, and should be flexible enough to handle users slicing and dicing in whatever manner they choose. This, to us, will represent how successful our planning and design process was. One of the most common ways to study data in the data warehouse is by building reports. Using this approach, the data is gathered from the DDS and presented to the user in a very convenient way. Instead of just plain data fields in the DDS, the user can use charts, diagrams and other ways of representing data. This makes it much easier to inspect and analyze the data that resides in the DDS. For creating our reports we are using SQL Server 2008 Reporting Services. Using these services, it is possible to deploy the reports so that they can be accessed via a web site. We are not using this feature because we do not have access to an HTTP server with SQL Server 2008 Enterprise Edition installed.
Selecting the Report fields

The first thing to do when building a report is to decide what tables and columns to include in the report. This of course depends on what kind of information we expect from the report. If we want to build a report that shows the sales of a particular product in different countries, then we should build an SQL query that extracts all the sales and then groups the result by product and country. Before continuing with the reports we need to create another view: a view that relates every order line item of a particular product to a value indicating the profit from that sale. To create such a view we have to collect data from two tables: the fact table and the Product dimension. To calculate the profit for one particular order line in the fact table we use the expression:

Profit = (selling price of one unit - buying price of one unit) * quantity of units sold

We then cast this value as Money and insert this column into our view. The view now contains three columns: orderNumber, profit and productDimKey. Why not use a nested query, one might ask? We are forced to take this approach because grouping by nested queries is forbidden while creating reports in SQL Server 2008.
Figure 8.1: Creating a view to relate each order line with the proﬁt it made
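The per-line profit expression above can be checked with a small sketch. Decimal stands in for the SQL Money cast; the prices and quantities are invented sample values, not figures from the actual database.

```python
from decimal import Decimal

def profit(price_each, buy_price, quantity):
    # Profit = (selling price of one unit - buying price of one unit) * units sold.
    # Decimal mirrors the CAST(... AS Money) in the view definition.
    return (Decimal(price_each) - Decimal(buy_price)) * quantity

line_items = [
    # (orderNumber, priceEach, buyPrice, quantityOrdered) -- illustrative values
    (10100, "136.00", "98.58", 30),
    (10100, "55.09",  "33.30", 50),
]
# The view's shape: one (orderNumber, profit) pair per order line.
view = [(order, profit(p, b, q)) for order, p, b, q in line_items]
```

Using exact decimal arithmetic rather than floats avoids the rounding drift that would otherwise creep into summed monetary values.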
Now that we have this view created, we can build some reports to test our data warehouse. The first report we created shows product sales by country over a period of time. The time period comes from the year and quarter columns of the Date dimension. Two other important columns are profit and country. Figure 8.2 shows how the query groups the data by year, quarter and country and then sums up the profit. The result of this operation is exactly what we need to analyze the company's sales in different countries over time.
Figure 8.2: A query used to provide data to the Sales by country report.
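The grouping performed by the report query can be sketched in plain Python. This is only a conceptual equivalent of the SQL in Figure 8.2; the sample rows and profit values are invented.

```python
from collections import defaultdict

def sales_by_country(rows):
    """Group profit by (year, quarter, country) and sum it,
    mirroring the GROUP BY / SUM in the report query."""
    totals = defaultdict(float)
    for year, quarter, country, profit in rows:
        totals[(year, quarter, country)] += profit
    return dict(totals)

rows = [  # (year, quarter, country, profit) -- illustrative values
    (2004, 1, "France", 500.0),
    (2004, 1, "France", 250.0),
    (2004, 1, "USA",    400.0),
    (2004, 2, "France", 100.0),
]
```

Each (year, quarter, country) cell in the resulting dictionary corresponds to one cell of the report matrix described below.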
The final step is to wrap this data into a nice and readable format, i.e. build a report. Once we have the data we need, building a report is a very straightforward process. In this report we are using a matrix-style layout. This basically means that we have three dimensions to use: columns, rows and details:
The time fields year and quarter are represented as columns.
Countries are displayed as rows.
Profit is shown as details.
The matrix design template is depicted in Figure 8.3.
Figure 8.3: Designing the report matrix.
The main group in the time dimension is the year field. It also contains the quarter field as a child group. This technique groups the data hierarchically, where every year column is a parent of four quarter columns. There is also an additional row, Total per quarter, that shows the total profit made during each quarter. A picture, they say, speaks a thousand words, so to present the data in a more easily readable format we also included a chart in this report. Notice that, having three dimensions (time, profit and country) in the matrix, we also need a three-dimensional chart. A simple column or pie chart would be unsuitable because those only present two-dimensional data. We use a stacked cylinder chart instead. To achieve the third dimension, the cylinder is split into parts that, in our case, represent different countries.
Figure 8.4: The Sales by country report.
Including charts with reports is a very helpful practice. Sometimes, when we have a big matrix and large numbers, it is very hard to analyze the data by just looking at the plain digits. For example, by just taking a quick look at the chart in the previous report, it is very easy to notice that the two most profitable countries in our case are France and the USA. The following two reports were built while testing our data warehouse, to demonstrate its flexibility. These reports were built using the same methods as the previous one. The report in Figure 8.5, Sales by model type, gives us an overview of the sales grouped by model type (cars, planes, ships, etc.), and the report in Figure 8.6, Sales by manufacturer, shows us the performance of the manufacturers over time.
Figure 8.5: Sales by model type report.
Figure 8.6: Sales by manufacturer report.
Summary

In this chapter, we have demonstrated creating reports from our data warehouse. While we only demonstrated by creating three reports, countless reports can be generated according to the users' wants. It is also possible to use third-party reporting and analysis tools to connect to the data warehouse and slice and dice through the data as required. This is based on the solid design principles that were adopted during the implementation of the data warehouse.