Data Warehouse

A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of the decision-making process
- Subject-oriented: modeling and analysis of data for decision makers, not for transaction processing (OLAP vs. OLTP)
- Integrated: constructed by integrating multiple heterogeneous data sources
- Time-variant: keeps data with a historical perspective
- Nonvolatile: permanently stores data imported from the operational data sources

Data Warehousing and OLAP
Dr. Weining Zhang

Wrapper/mediator vs. DW for information integration
- Mediator is query-driven. Data are stored in heterogeneous resources. Mediator distributes queries and integrates answers
- DW is update-driven. Data are loaded to DW for query processing

DW vs. Operational DBMS
- DBMS is for OLTP (on-line transaction processing) and daily operation
- DW is for OLAP (on-line analytical processing) and decision-making

                    OLTP                                  OLAP
users               clerk, IT professional                knowledge worker
function            day to day operations                 decision support
DB design           application-oriented                  subject-oriented
data                current, up-to-date, detailed,        historical, summarized, multidimensional,
                    flat relational, isolated             integrated, consolidated
usage               repetitive                            ad-hoc
access              read/write, index/hash on prim. key   lots of scans
unit of work        short, simple transaction             complex query
# records accessed  tens                                  millions
# users             thousands                             hundreds
DB size             100MB-GB                              100GB-TB
metric              transaction throughput                query throughput, response

Why Separate DW From DBMS?

High performance for both systems
- DBMS is tuned for OLTP
- Warehouse is tuned for OLAP

Special requirements for decision making
- Query historical data not found in typical databases
- Consolidate (aggregate, summarize) data from heterogeneous sources
- Must reconcile inconsistencies in formats, representations, and codes for data from heterogeneous data sources

Data Cube Model

Fact table and dimension tables

Fact table
Time  Product  Location  Amount
d1    p1       c1        12
d1    p1       c3        50
d1    p2       c1        11
d1    p2       c2        8
d2    p1       c1        44
d2    p1       c2        4
…     …        …         …

Dimension table (location)
cid  city     state  country
c1   Dallas   TX     USA
c2   Toronto  ON     Canada
c3   Chicago  IL     USA
…    …        …      …

Data Cube Model

View data as cubes

Time  Product  Location  Amount
d1    p1       c1        12
d1    p1       c3        50
d1    p2       c1        11
d1    p2       c2        8
d2    p1       c1        44
d2    p1       c2        4
…     …        …         …
all   p1       c1        56
…     …        …         …
all   all      all       129

[Figure: the same data drawn as a cube with axes time (d1, d2, all), product (p1, p2, all), and location (c1, c2, c3, all); the cell (all, all, all) holds 129.]

Define DW Schema

Types of schemas
- Star schema. A fact table with a set of simple dimension tables
- Snowflake schema. Refines a star schema by allowing some dimensions to be modeled by a set of tables
- Fact constellation. Multiple fact tables share dimension tables. It can be viewed as a collection of stars, hence the names galaxy schema and fact constellation


Example of Star Schema

Sales Fact Table
- dimensions: time_key, item_key, branch_key, location_key
- facts: units_sold, dollars_sold, avg_sales

Dimension tables
- time: time_key, day, day_of_the_week, month, quarter, year
- item: item_key, item_name, brand, type, supplier_type
- branch: branch_key, branch_name, branch_type
- location: location_key, street, city, province_or_state, country

Defining a Star Schema

    define cube sales_star [time, item, branch, location]:
        dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
    define dimension time as (time_key, day, day_of_week, month, quarter, year)
    define dimension item as (item_key, item_name, brand, type, supplier_type)
    define dimension branch as (branch_key, branch_name, branch_type)
    define dimension location as (location_key, street, city, province_or_state, country)
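As a rough illustration of the star schema above, the following sketch builds the four dimension tables and the sales fact table in an in-memory SQLite database and runs one aggregate query over them. The table names with a `_dim` or `_fact` suffix, the column types, and all inserted rows are assumptions made up for this example; only the column names come from the schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, day INTEGER, day_of_week TEXT,
                           month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                           type TEXT, supplier_type TEXT);
CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                           province_or_state TEXT, country TEXT);
CREATE TABLE sales_fact   (time_key INTEGER REFERENCES time_dim,
                           item_key INTEGER REFERENCES item_dim,
                           branch_key INTEGER REFERENCES branch_dim,
                           location_key INTEGER REFERENCES location_dim,
                           units_sold INTEGER, dollars_sold REAL);
""")

# Made-up sample rows, purely for illustration.
conn.executemany("INSERT INTO time_dim VALUES (?,?,?,?,?,?)",
                 [(1, 1, "Mon", 1, "Q1", 2023), (2, 3, "Wed", 4, "Q2", 2023)])
conn.execute("INSERT INTO item_dim VALUES (1, 'widget', 'Acme', 'tool', 'wholesale')")
conn.execute("INSERT INTO branch_dim VALUES (1, 'B1', 'retail')")
conn.executemany("INSERT INTO location_dim VALUES (?,?,?,?,?)",
                 [(1, "1 Main St", "Dallas", "TX", "USA"),
                  (2, "2 King St", "Toronto", "ON", "Canada")])
conn.executemany("INSERT INTO sales_fact VALUES (?,?,?,?,?,?)",
                 [(1, 1, 1, 1, 2, 20.0), (2, 1, 1, 1, 1, 10.0), (2, 1, 1, 2, 5, 50.0)])

# Star join: the fact table joined to two of its dimensions, then aggregated.
rows = conn.execute("""
    SELECT l.country, t.quarter, SUM(f.dollars_sold), SUM(f.units_sold)
    FROM sales_fact f
    JOIN time_dim t     ON f.time_key = t.time_key
    JOIN location_dim l ON f.location_key = l.location_key
    GROUP BY l.country, t.quarter
    ORDER BY l.country, t.quarter
""").fetchall()
print(rows)
```

The fact table stays narrow (keys plus measures), and each query joins in only the dimensions it groups by.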

Example of Snowflake Schema

Sales Fact Table
- dimensions: time_key, item_key, branch_key, location_key
- measures: units_sold, dollars_sold, avg_sales

Dimension tables
- time: time_key, day, day_of_the_week, month, quarter, year
- item: item_key, item_name, brand, type, supplier_key
- supplier: supplier_key, supplier_type
- branch: branch_key, branch_name, branch_type
- location: location_key, street, city_key
- city: city_key, city, province_or_state, country

Example of Fact Constellation

Sales Fact Table
- dimensions: time_key, item_key, branch_key, location_key
- measures: units_sold, dollars_sold, avg_sales

Shipping Fact Table
- dimensions: time_key, item_key, shipper_key, from_location, to_location
- measures: dollars_cost, units_shipped

Dimension tables
- time: time_key, day, day_of_the_week, month, quarter, year
- item: item_key, item_name, brand, type, supplier_type
- branch: branch_key, branch_name, branch_type
- location: location_key, street, city, province_or_state, country
- shipper: shipper_key, shipper_name, location_key, shipper_type

Types of Aggregate of Measures
Distributive. The result over the whole data set can be obtained by combining the results computed on its partitions (for count(), by summing the partial counts), e.g., count(), sum(), min(), max().
Algebraic. Can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function, e.g., avg(), min_N(), standard_deviation().
Holistic. Has no constant bound on the storage size needed to describe a subaggregate, e.g., median(), mode(), rank().
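The three categories can be checked on a small example. The sketch below splits the six Amount values from the example fact table into two partitions (the split itself is arbitrary and chosen only for illustration) and compares results computed from per-partition aggregates with results computed over all values.

```python
from statistics import median

# Amount values from the example fact table, split into two arbitrary partitions.
partitions = [[12, 50, 11], [8, 44, 4]]
all_values = [v for part in partitions for v in part]

# Distributive: combine per-partition sums and counts directly.
total_sum = sum(sum(p) for p in partitions)
total_count = sum(len(p) for p in partitions)

# Algebraic: avg needs a bounded number (here 2) of distributive results.
avg_from_parts = total_sum / total_count

# Holistic: a per-partition median is not enough to recover the overall median.
median_of_medians = median([median(p) for p in partitions])
true_median = median(all_values)

print(total_sum, total_count, avg_from_parts)   # 129 6 21.5
print(median_of_medians, true_median)           # 10.0 11.5
```

Only the holistic case needs access to the full set of values, which is why median-like measures are expensive to maintain in a cube.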
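
A Sample Concept Hierarchy
[Figure: a concept hierarchy drawn as a tree. Levels from general to specific: all; region (e.g., Europe); below it nodes such as Germany; city (e.g., Vancouver); office (e.g., L. Chan, M. Wind).]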

Lattice of Cuboids

Each cube is a cuboid
Cuboids form a lattice based on dimensions in the cubes

[Figure: lattice of cuboids. 0-D (apex) cuboid: all. 1-D cuboids: product, time, location. 2-D cuboids: e.g., (time, location). 3-D (base) cuboid: (product, time, location).]

Cube-based OLAP Operations

Roll-up, drill-down, dice, slice, pivot, etc.

[Figure: the example cube with axes time (d1, d2, all), product (p1, p2, all), and location (c1, c2, c3, all), shown after roll up on time, roll up on location, roll up on product, and drill down on product. The fully rolled-up cell holds 129.]
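The operations named above can be sketched on the six rows of the example fact table. This is a minimal illustration in plain Python, not how an OLAP server implements them; the function names `roll_up`, `dice`, and `slice_` are hypothetical helpers introduced here.

```python
from collections import defaultdict

DIMS = ("time", "product", "location")
# (time, product, location, amount): the six rows of the example fact table.
FACTS = [
    ("d1", "p1", "c1", 12), ("d1", "p1", "c3", 50), ("d1", "p2", "c1", 11),
    ("d1", "p2", "c2", 8), ("d2", "p1", "c1", 44), ("d2", "p1", "c2", 4),
]

def roll_up(facts, keep):
    """Sum amounts over every dimension not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[i] for i in idx)] += row[3]
    return dict(totals)

def dice(facts, **allowed):
    """Keep rows whose value on each named dimension is in the allowed set."""
    return [r for r in facts
            if all(r[DIMS.index(d)] in vals for d, vals in allowed.items())]

def slice_(facts, dim, value):
    """A slice is a dice that fixes a single value on one dimension."""
    return dice(facts, **{dim: {value}})

print(roll_up(FACTS, ("product", "location")))   # roll up on time
print(roll_up(FACTS, ("product",)))              # then roll up on location
print(slice_(FACTS, "product", "p2"))
print(dice(FACTS, time={"d1", "d2"}, product={"p1", "p2"}, location={"c1", "c2"}))
```

Drill-down is the inverse of roll-up (going back to a cuboid with more dimensions), and pivot only reorders the axes for display, so neither needs a separate computation here.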

Cube-based OLAP Operations

[Figure: the example cube after a pivot, after a slice on p2, and after a dice on d1, d2; p1, p2; and c1, c2.]

Design of a Data Warehouse

A difficult and expensive process
Involves business and technical considerations. Viewed from different perspectives:
- What information is needed
- Models of data sources
- Models of warehouse data
- How to use the data
Needs many experiments and prototypes



Data Warehouse Design Process

- Choose a business process to model, e.g., orders, invoices, etc.
- Choose the grain (atomic level of data) of the business process
- Choose the dimensions that will apply to each fact table record
- Choose the measures that will populate each fact table record

Data Warehouse Architecture

[Figure: data sources (operational DBs, other sources) pass through extract, transform, load, and refresh steps, with a monitor & integrator and metadata, into data storage (data warehouse, data marts). An OLAP server serves analysis, query, reports, and data mining.]

Three Data Warehouse Models

Enterprise warehouse
- Collects all of the information about subjects spanning the entire organization

Data mart
- A subset of corporate-wide data that is of value to a specific group of users
  • Independent vs. dependent (directly from warehouse) data mart

Virtual warehouse
- A set of views over operational databases
- Only some of the possible summary views may be materialized

Data Warehouse Development

[Figure: development path. Define a high-level corporate data model; through model refinement, build data marts and the enterprise data warehouse; these lead to distributed data marts and a multi-tier data warehouse.]

OLAP Server Architectures

Relational OLAP (ROLAP)
- Use RDBMS or ORDBMS to store data and OLAP middleware to support missing pieces
- Greater scalability

Multidimensional OLAP (MOLAP)
- Store cubes in multidimensional arrays
- Fast indexing to summarized data

Hybrid OLAP (HOLAP)
- E.g., low level: relational, high-level: array

Specialized SQL servers
- Support SQL queries over star/snowflake schemas

Compute Cube in ROLAP

Extend SQL to have a cube operator
    define cube sales[item, city, year]: sum(sales_in_dollars)
    compute cube sales

Compute each cuboid using a group by
- Time consuming

Pre-compute cuboids
- For n dimensions, there are at least 2^n cuboids, many more if dimensions have hierarchies
  • Takes too much space, is costly to maintain
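To make the "one group-by per cuboid" idea and the 2^n count concrete, the following sketch computes every cuboid of the three-dimensional example by running a separate group-by for each subset of dimensions. It is a naive illustration of the time-consuming approach, not an optimized cube algorithm; `compute_cube` is a hypothetical helper name.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("time", "product", "location")
# (time, product, location, amount): the six rows of the example fact table.
FACTS = [
    ("d1", "p1", "c1", 12), ("d1", "p1", "c3", 50), ("d1", "p2", "c1", 11),
    ("d1", "p2", "c2", 8), ("d2", "p1", "c1", 44), ("d2", "p1", "c2", 4),
]

def compute_cube(facts, dims):
    """Return {group_by_dims: {cell: total}}, one group-by per subset of dims."""
    cube = {}
    for k in range(len(dims) + 1):
        for group in combinations(range(len(dims)), k):
            totals = defaultdict(int)
            for row in facts:                      # one full scan per cuboid
                totals[tuple(row[i] for i in group)] += row[-1]
            cube[tuple(dims[i] for i in group)] = dict(totals)
    return cube

cube = compute_cube(FACTS, DIMS)
print(len(cube))                  # 8 cuboids for 3 dimensions
print(cube[()])                   # apex cuboid
print(cube[("location",)])
```

Each cuboid costs a full scan of the fact table here, which is the cost that pre-computation and partial materialization try to avoid.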

Partial Materialization

Pre-compute some cuboids and compute other cuboids when they are needed
- Select cuboids for materialization
  • Tradeoff among space limits, maintenance cost, & usefulness to query processing
- Exploit materialized cuboids to answer queries
  • Choose cuboids to use, use indexes, translate cube operations onto the chosen cuboids
- View (cuboid) maintenance
  • Update the materialized cuboids when the source data is changed

Compute Cube in MOLAP

Partition a multidimensional array into chunks that fit in memory
Compress sparse arrays to conserve space
How to compute the entire cube by reading each chunk only once?
- Read chunks in some ordering
- Different orderings need different buffer space (chunk memory)
- A multi-way array aggregation tries to compute all K-D cuboids in parallel
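Returning to the partial-materialization step of choosing which materialized cuboid to use for a query: one simple policy, sketched below, is to pick the smallest materialized cuboid whose dimensions include all dimensions the query groups by. This policy and the helper name `pick_cuboid` are assumptions for illustration; the cell counts are those of the six-row example fact table.

```python
def pick_cuboid(materialized, query_dims):
    """Return the smallest materialized cuboid covering every queried dimension.

    materialized maps a frozenset of dimension names to its number of cells.
    """
    needed = frozenset(query_dims)
    usable = [dims for dims in materialized if needed <= dims]
    if not usable:
        raise LookupError("no materialized cuboid covers the query")
    return min(usable, key=materialized.get)

# Assumed set of materialized cuboids with their cell counts.
MATERIALIZED = {
    frozenset({"time", "product", "location"}): 6,   # base cuboid
    frozenset({"product", "location"}): 5,
    frozenset({"location"}): 3,
}

print(sorted(pick_cuboid(MATERIALIZED, {"location"})))   # ['location']
print(sorted(pick_cuboid(MATERIALIZED, {"product"})))    # ['location', 'product']
print(sorted(pick_cuboid(MATERIALIZED, {"time"})))       # ['location', 'product', 'time']
```

A query on product alone is answered from the (product, location) cuboid with a further roll-up, instead of rescanning the base data.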

Multi-way Array Aggregation

[Figure: a 3-D array with dimensions A (a0 to a3), B (b0 to b3), and C (c0 to c3), partitioned into 64 chunks numbered 1 to 64.]

Assume reading chunks 1 to 64 in order
- For the BC plane, compute 1 chunk at a time. Need to keep partial results for each cell in the chunk
- For the AC plane, compute 4 chunks in parallel. Must keep partial results for each cell in the 4 chunks
- For the AB plane, compute all 16 chunks in parallel. Must keep partial results for all cells in that plane
- Size = AB plane + 1 row of AC plane + 1 chunk of BC plane
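The size formula above depends on the scan order. The sketch below evaluates it for a 3-D array split into 4 chunks per dimension; the dimension sizes |A| = 40, |B| = 400, |C| = 4000 are assumed values chosen for illustration and do not come from the slides, and `buffer_cells` is a hypothetical helper name.

```python
def buffer_cells(scan_order_sizes, chunks_per_dim=4):
    """Cells of partial results held while scanning a chunked 3-D array once.

    scan_order_sizes = (s0, s1, s2): dimension sizes, fastest-varying first.
    """
    s0, s1, s2 = scan_order_sizes
    k = chunks_per_dim
    one_chunk = (s1 // k) * (s2 // k)   # plane without the fastest dimension
    one_row = s0 * (s2 // k)            # plane without the middle dimension
    whole_plane = s0 * s1               # plane without the slowest dimension
    return whole_plane + one_row + one_chunk

A, B, C = 40, 400, 4000                 # assumed sizes, for illustration only

# Chunks read with A varying fastest, then B, then C (the order 1 to 64 above):
# AB plane + 1 row of AC plane + 1 chunk of BC plane.
print(buffer_cells((A, B, C)))          # 156000

# Reversed order, with C varying fastest: the large BC plane must be held whole.
print(buffer_cells((C, B, A)))          # 1641000
```

Under these assumed sizes, scanning so that the smallest plane is the one held entirely in memory needs roughly a tenth of the buffer space of the reversed order.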

Indexing Cube Using Bitmaps

Each value in the column has a bit vector
- Each tuple has a bit in the bit vector
- The i-th bit is set if the i-th tuple of the base table has the value for the indexed column

Use bit operations to search for tuples

Base table
Cust  Region   Type
C1    Asia     Retail
C2    Europe   Dealer
C3    Asia     Dealer
C4    America  Retail
C5    Europe   Dealer

Index on Region
RecID  Asia  Europe  America
1      1     0       0
2      0     1       0
3      1     0       0
4      0     0       1
5      0     1       0

Index on Type
RecID  Retail  Dealer
1      1       0
2      0       1
3      0       1
4      1       0
5      0       1

Indexing Cube Using Join Index

Join index: JI(R-id, S-id) where R (R-id, …) joins with S (S-id, …) on a join condition
Relates the values of the dimensions of a star schema to rows in the fact table
- E.g., fact table Sales and two dimensions location and product
  • A join index on location maintains for each distinct city a list of sales tuples recording the sales in the city
- Join indices can span multiple dimensions
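Both index structures can be sketched in a few lines. The bitmap part uses the five-row base table from the bitmap example and stores each bit vector as a Python integer (bit i stands for row i, counting from 0). The join-index part uses three made-up sales rows and a two-entry location table, which are assumptions for illustration; `bitmap_index` and `matching_rows` are hypothetical helper names.

```python
# Base table from the bitmap example: (Cust, Region, Type).
BASE = [
    ("C1", "Asia", "Retail"), ("C2", "Europe", "Dealer"), ("C3", "Asia", "Dealer"),
    ("C4", "America", "Retail"), ("C5", "Europe", "Dealer"),
]

def bitmap_index(rows, col):
    """Map each distinct value in column `col` to an int with bit i set
    when row i holds that value."""
    index = {}
    for i, row in enumerate(rows):
        index[row[col]] = index.get(row[col], 0) | (1 << i)
    return index

def matching_rows(rows, bits):
    """Return the rows whose bit is set in `bits`."""
    return [row for i, row in enumerate(rows) if (bits >> i) & 1]

region = bitmap_index(BASE, 1)
ctype = bitmap_index(BASE, 2)

# Region = Asia AND Type = Dealer is a single bitwise AND of two vectors.
asia_dealers = matching_rows(BASE, region["Asia"] & ctype["Dealer"])
print(asia_dealers)        # [('C3', 'Asia', 'Dealer')]

# Join index on location: city -> positions of the sales tuples for that city.
LOCATION = {"c1": "Dallas", "c2": "Toronto"}                 # made-up subset
SALES = [("d1", "p1", "c1", 12), ("d1", "p2", "c2", 8), ("d2", "p1", "c1", 44)]
join_index = {}
for rid, (_, _, cid, _) in enumerate(SALES):
    join_index.setdefault(LOCATION[cid], []).append(rid)
print(join_index)          # {'Dallas': [0, 2], 'Toronto': [1]}
```

The join index precomputes the result of joining the dimension to the fact table, so a query on one city can go straight to the matching fact rows.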

Processing OLAP Queries

Determine cube operations to perform on available cuboids:
- transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection
Determine to which materialized cuboid(s) the relevant operations should be applied
Explore indexing structures and compressed vs. dense array structures in MOLAP

Data Warehouse Tools & Utilities

Data extraction:
- get data from heterogeneous external sources
Data cleaning:
- detect & rectify errors in the data
Data transformation:
- convert data from host format to DW format
Load:
- sort, summarize, consolidate, compute views, check integrity, & build indices and partitions
Refresh:
- propagate source updates to the warehouse

DW and Data Mining

DW prepares data for mining
- Integrated, consistent, cleaned
DW provides infrastructure for data collection and analysis
OLAP is a simple form of mining & data exploration
Mining should be a part of DW operations
- Provide more powerful tools to analyze data and extract knowledge

Summary

Data warehouse
- A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process
A multi-dimensional model of a data warehouse
- Star schema, snowflake schema, fact constellations
- A data cube consists of dimensions & measures
OLAP operations: drilling, rolling, slicing, dicing and pivoting
OLAP servers: ROLAP, MOLAP, HOLAP
Efficient computation of data cubes
- Partial vs. full vs. no materialization
- Multi-way array aggregation
- Bitmap index and join index implementations
Further development of data cube technology
- Discovery-driven and multi-feature cubes
- From OLAP to OLAM (on-line analytical mining)

