Data Warehousing and OLAP
Dr. Weining Zhang
A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of the decision-making process
  Modeling and analysis of data for decision makers, not for transaction processing (OLAP vs. OLTP)
  Constructed by integrating multiple heterogeneous data sources
  Keeps data with a historical perspective
  Permanently stores data imported from the operational data sources
W. Zhang
Data Warehousing & OLAP
DW vs. DBMS
Wrapper/mediator vs. DW for information integration
  Mediator is query-driven. Data are stored in heterogeneous resources. The mediator distributes queries and integrates answers
  DW is update-driven. Data are loaded into the DW for query processing
DW vs. Operational DBMS
  DBMS is for OLTP (on-line transaction processing) and daily operation
  DW is for OLAP (on-line analytical processing) and decision-making
OLTP vs. OLAP
                    OLTP                                    OLAP
users               clerk, IT professional                  knowledge worker
function            day to day operations                   decision support
DB design           application-oriented                    subject-oriented
data                current, up-to-date, detailed,          historical, summarized, multidimensional,
                    flat relational, isolated               integrated, consolidated
usage               repetitive                              ad-hoc
access              read/write, index/hash on prim. key     lots of scans
unit of work        short, simple transaction               complex query
# records accessed  tens                                    millions
# users             thousands                               hundreds
DB size             100MB-GB                                100GB-TB
metric              transaction throughput                  query throughput, response
Why Separate DW From DBMS?
High performance for both systems
  DBMS is tuned for OLTP
  Warehouse is tuned for OLAP
Special requirements for decision making
  Query historical data not found in typical databases
  Consolidate (aggregate, summarize) data from heterogeneous sources
  Must reconcile inconsistencies in formats, representations, and codes for data from heterogeneous data sources
Data Cube Model
dimensions
cid  city     state  country
c1   Dallas   TX     USA
c2   Toronto  ON     Canada
c3   Chicago  IL     USA
…    …        …      …
Facts/measure
Data Mining: Concepts and Techniques
Data Cube Model
View data as cubes
Location  Product  Amount
c1        p1       12
c3        p1       50
c1        p2       11
c2        p2       8
c1        p1       44
c2        p1       4
…         …        …
c1        p1       56
…         …        …
all       all      129
Define DW Schema
Types of schemas
  Star schema. A fact table with a set of simple dimension tables
  Snowflake schema. Refines a star schema by allowing some dimensions to be modeled by a set of tables
  Fact constellation. Multiple fact tables share dimension tables. Viewed as a collection of stars, thus called galaxy schema or fact constellation
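A minimal sketch of a star schema, using Python's built-in sqlite3 module and the sample location and amount values from these notes. The table name sales, the column name pid, and the key c3 for Chicago are assumptions made for illustration. The query rolls the amounts up from location keys to countries by joining the fact table with the dimension table.

```python
import sqlite3

# In-memory database: one dimension table and one fact table (a tiny star).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE location (cid TEXT PRIMARY KEY, city TEXT, state TEXT, country TEXT);
CREATE TABLE sales (cid TEXT REFERENCES location(cid), pid TEXT, amount INTEGER);
""")

# Dimension rows (the key c3 for Chicago is an assumption).
conn.executemany(
    "INSERT INTO location VALUES (?, ?, ?, ?)",
    [("c1", "Dallas", "TX", "USA"),
     ("c2", "Toronto", "ON", "Canada"),
     ("c3", "Chicago", "IL", "USA")],
)

# Fact rows: (location key, product key, amount).
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("c1", "p1", 12), ("c3", "p1", 50), ("c1", "p2", 11),
     ("c2", "p2", 8), ("c1", "p1", 44), ("c2", "p1", 4)],
)

# Roll up from location key to country through the dimension table.
rows = conn.execute("""
SELECT l.country, SUM(s.amount)
FROM sales AS s JOIN location AS l ON s.cid = l.cid
GROUP BY l.country
ORDER BY l.country
""").fetchall()

print(rows)
```

A snowflake variant would split location further, for example into separate city and country tables, at the cost of an extra join in the same query.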
Types of Aggregate of Measures
  Distributive. Gives the same result whether it is applied to the fact values directly or to already aggregated values, e.g., count(), sum(), min(), max()
  Algebraic. Can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function, e.g., avg(), min_N(), standard_deviation()
  Holistic. Has no constant bound on the storage size needed to describe a subaggregate, e.g., median(), mode(), rank()
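The three kinds of aggregate can be contrasted with a small sketch. The split of the six sample amounts into two partitions is an assumption made for illustration; only the Python standard library is used.

```python
from statistics import median

# Six fact values, split into two partitions (the split is hypothetical).
partitions = [[12, 50, 11], [8, 44, 4]]
all_values = [v for part in partitions for v in part]

# Distributive: the sum of the partial sums equals the sum over all values.
partial_sums = [sum(part) for part in partitions]
total = sum(partial_sums)

# Algebraic: avg() is derived from two distributive aggregates, sum and count.
partial_counts = [len(part) for part in partitions]
average = sum(partial_sums) / sum(partial_counts)

# Holistic: the median cannot be rebuilt from per-partition medians.
median_of_medians = median([median(part) for part in partitions])
true_median = median(all_values)

print(total, average, median_of_medians, true_median)
```

Here the combined partial medians give 10 while the median of all six values is 11.5, which is why a holistic aggregate needs access to the underlying values.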
A Sample Concept Hierarchy
Levels: all, region, country, city, office
all
  Europe
    Germany
      Frankfurt
      ...
    ...
    Spain
  ...
  North_America
    Canada
      Vancouver
        L. Chan
        ...
        M. Wind
      ...
      Toronto
    ...
    Mexico
Lattice of Cuboids
Each cube is a cuboid
Cuboids form a lattice based on dimensions in the cubes
  0-D (apex) cuboid: all
  1-D cuboids: product; time; location
  2-D cuboids: product,time; product,location; time,location
  3-D (base) cuboid: product,time,location
Cube-based OLAP Operations
Roll-up, drill-down, dice, slice, pivot, etc.
[Figure: a data cube over time (d1, d2), product (p1, p2), and location (c1, c2, c3) with "all" totals. Panels: Roll up; Drill down; Drill down on product; Roll up on time; Dice on d1, d2; p1, p2; and c1, c2]
Designing a data warehouse is a difficult and expensive process
Involves business and technical considerations
Viewed from different perspectives:
  What information is needed
  Models of data sources
  Models of warehouse data
  How to use the data
Needs many experiments and prototypes
Data Warehouse Design Process
Typically
  Choose a business process to model, e.g., orders, invoices, etc.
  Choose the grain (atomic level of data) of the business process
  Choose the dimensions that will apply to each fact table record
  Choose the measure that will populate each fact table record
Data Warehouse Architecture
[Figure: warehouse architecture in four tiers. Data Sources: operational DBs and other sources. Data Storage: data warehouse, data marts, and metadata, fed by extract, transform, load, and refresh through a monitor & integrator. OLAP Engine: OLAP server, served from the data storage. Front-End Tools: analysis, query, reports, data mining]
Three Data Warehouse Models
Enterprise warehouse
  Collects all of the information about subjects spanning the entire organization
Data mart
  A subset of corporate-wide data that is of value to a specific group of users
    • Independent vs. dependent (directly from warehouse) data mart
Virtual warehouse
  A set of views over operational databases
  Only some of the possible summary views may
Data Warehouse Development
[Figure labels: Define a high-level corporate data model; Model refinement; Data Mart; Enterprise Data Warehouse; Distributed Data Marts; Multi-Tier Data Warehouse]
OLAP Server Architectures
Relational OLAP (ROLAP)
  Use RDBMS or ORDBMS to store data and OLAP middleware to support missing pieces
  Greater scalability
Multidimensional OLAP (MOLAP)
  Store cubes in multidimensional arrays
  Fast indexing to summarized data
Specialized SQL servers
  Support SQL queries over star/snowflake schemas
Compute Cube in ROLAP
Extend SQL to have a cube operator
  define cube sales[item, city, year]: sum(sales_in_dollars)
  compute cube sales
Compute each cuboid using a group by
  Time consuming
Pre-compute cuboids
  For n dimensions, there are at least 2^n cuboids, many more if dimensions have hierarchies
    • Takes too much space and is costly to maintain
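The group-by approach to computing a cube in ROLAP can be sketched as follows. This is an illustration, not the cube operator itself: the function name compute_cube is hypothetical, and the data are the six sample amounts over the two dimensions location and product, so the cube has 2^2 = 4 cuboids.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("location", "product")
FACTS = [("c1", "p1", 12), ("c3", "p1", 50), ("c1", "p2", 11),
         ("c2", "p2", 8), ("c1", "p1", 44), ("c2", "p1", 4)]

def compute_cube(facts, dims):
    """Run one group-by per subset of the dimensions (2^n cuboids in total)."""
    cube = {}
    positions = range(len(dims))
    for k in range(len(dims) + 1):
        for group_by in combinations(positions, k):
            cuboid = defaultdict(int)
            for row in facts:
                # Dimensions outside the group-by collapse to "all".
                key = tuple(row[i] if i in group_by else "all" for i in positions)
                cuboid[key] += row[-1]
            cube[tuple(dims[i] for i in group_by)] = dict(cuboid)
    return cube

cube = compute_cube(FACTS, DIMS)
for group_by, cuboid in cube.items():
    print(group_by, cuboid)
```

Each cuboid rescans all fact rows, which is the "time consuming" part; storing every cuboid in advance trades that time for space and maintenance cost.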
Partial Materialization
Pre-compute some cuboids and compute other cuboids when they are needed
  Select cuboids for materialization
    • Tradeoff among space limits, maintenance cost, & usefulness to query processing
  Exploit materialized cuboids to answer queries
    • Choose the cuboids to use, use indexes, and translate cube operations onto the chosen cuboids
  View (cuboid) maintenance
    • Update the materialized cuboids when the source data is changed
Compute Cube in MOLAP
Partition a multidimensional array into chunks that fit in memory
Compress sparse arrays to conserve space
How to compute the entire cube by reading each chunk only once?
  Read chunks in some ordering
  Different orderings need different buffer space (chunk memory)
  Multi-way array aggregation tries to compute all K-D cuboids in parallel
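A minimal sketch of the single-pass idea behind multi-way array aggregation: every chunk is read once, and each cell read updates the AB, AC, and BC aggregates at the same time. The array size (8 cells per dimension in 4 chunks of 2), the cell values, and the function names are assumptions made for illustration; a real MOLAP engine would also release partial results as soon as they are complete.

```python
from collections import defaultdict

def chunk_order(n):
    """Yield chunk coordinates with A varying fastest, then B, then C."""
    for k in range(n):
        for j in range(n):
            for i in range(n):
                yield i, j, k

def multiway_aggregate(cube, chunk_len, n):
    """Scan each chunk of cube[a][b][c] once, summing into AB, AC and BC together."""
    ab, ac, bc = defaultdict(int), defaultdict(int), defaultdict(int)
    chunks_read = 0
    for i, j, k in chunk_order(n):
        chunks_read += 1
        for a in range(i * chunk_len, (i + 1) * chunk_len):
            for b in range(j * chunk_len, (j + 1) * chunk_len):
                for c in range(k * chunk_len, (k + 1) * chunk_len):
                    value = cube[a][b][c]
                    ab[a, b] += value  # aggregate over C
                    ac[a, c] += value  # aggregate over B
                    bc[b, c] += value  # aggregate over A
    return ab, ac, bc, chunks_read

# Hypothetical dense array: 4 chunks of 2 cells per dimension, cell value = a.
N_CHUNKS, CHUNK_LEN = 4, 2
side = N_CHUNKS * CHUNK_LEN
cube = [[[a for _ in range(side)] for _ in range(side)] for a in range(side)]

ab, ac, bc, chunks_read = multiway_aggregate(cube, CHUNK_LEN, N_CHUNKS)
print(chunks_read, ab[3, 0], ac[2, 5], bc[0, 0])
```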
Multi-way Array Aggregation
[Figure: a 3-D array with dimensions A (a0 to a3), B (b0 to b3), and C (c0 to c3), partitioned into 64 chunks numbered 1 to 64]
Assume reading chunks 1 to 64 in order
  For BC plane, compute 1 chunk at a time. Need to keep partial results for each cell in the chunk
  For AC plane, compute 4 chunks in parallel. Must keep partial results for each cell in the 4 chunks
  For AB plane, compute all 16 chunks in parallel. Must keep partial results for all cells in that plane
  Size = AB plane + 1 row of AC plane + 1 chunk of BC plane
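The buffer size rule above can be turned into a small calculation. The dimension sizes used here (40, 400, and 4000 cells, each split into 4 chunks) are assumptions chosen for illustration, and buffer_cells is a hypothetical helper name. Its arguments are given in scan order: the dimension that varies fastest, the next one, and the slowest.

```python
def buffer_cells(fastest, middle, slowest, chunks_per_dim=4):
    """Cells of partial results kept in memory for a given chunk scan order.

    Assumes every dimension size is divisible by chunks_per_dim.
    """
    middle_chunk = middle // chunks_per_dim
    slowest_chunk = slowest // chunks_per_dim
    whole_plane = fastest * middle                 # e.g. the whole AB plane
    one_row = fastest * slowest_chunk              # e.g. one row of the AC plane
    one_chunk = middle_chunk * slowest_chunk       # e.g. one chunk of the BC plane
    return whole_plane + one_row + one_chunk

# Hypothetical sizes: |A| = 40, |B| = 400, |C| = 4000.
a_fastest = buffer_cells(40, 400, 4000)   # scan order A, B, C
c_fastest = buffer_cells(4000, 400, 40)   # scan order C, B, A

print(a_fastest, c_fastest)
```

With these sizes, scanning along the smallest dimension first needs 156,000 cells while the reverse order needs 1,641,000, which illustrates why the chunk ordering matters.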
Indexing Cube Using Bitmaps
Each value in the column has a bit vector
  Each tuple has a bit in the bit vector
  The i-th bit is set if the i-th tuple of the base table has the value for the indexed column
Use bit operations to search for tuples
Base table
Cust  Region
C1    Asia
C2    Europe
C3    Asia
C4    America
C5    Europe
[Figure: bitmap index on Region and bitmap index on Type for the base table]
Indexing Cube Using Join Index
Join index: JI(R-id, S-id) where R(R-id, …) joins with S(S-id, …) on a join condition
Relates the values of the dimensions of a star schema to rows in the fact table
  E.g., fact table Sales and two dimensions, location and product
    • A join index on location maintains, for each distinct city, a list of the sales tuples recording the sales in that city
Join indices can span multiple dimensions
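A minimal sketch of a bitmap index on the Region column of the base table, using Python integers as bit vectors. The function names are hypothetical, and bit i corresponds to the i-th tuple counted from zero.

```python
CUSTOMERS = ["C1", "C2", "C3", "C4", "C5"]
REGIONS = ["Asia", "Europe", "Asia", "America", "Europe"]

def build_bitmap_index(column):
    """Map each distinct value to an integer whose i-th bit marks tuple i."""
    index = {}
    for i, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << i)
    return index

def matching_rows(bits):
    """Return the positions of the set bits, i.e. the matching tuples."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]

region_index = build_bitmap_index(REGIONS)

# "Region is Asia or America" becomes a single bitwise OR of two bit vectors.
asia_or_america = region_index["Asia"] | region_index["America"]
selected = [CUSTOMERS[i] for i in matching_rows(asia_or_america)]

print(selected)
```

A conjunction across two indexed columns would work the same way with a bitwise AND of one bit vector from each index.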
Processing OLAP Queries
Determine cube operations to perform on available cuboids:
  Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection
Determine to which materialized cuboid(s) the relevant operations should be applied
Explore indexing structures and compressed vs. dense array structures in MOLAP
Data Warehouse Tools & Utilities
Data extraction:
  Get data from heterogeneous external sources
Data cleaning:
  Detect & rectify errors in the data
Data transformation:
  Convert data from host format to DW format
Refresh:
  Propagate source updates to the warehouse
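The extraction, cleaning, and transformation steps listed under Data Warehouse Tools & Utilities can be sketched in a few lines. Everything specific here is hypothetical: the two source formats, the country code table, and the function name transform are assumptions for illustration, not the behavior of any particular tool.

```python
# Hypothetical mapping that reconciles country codes from different sources.
COUNTRY_CODES = {"usa": "USA", "u.s.a.": "USA", "us": "USA",
                 "canada": "Canada", "ca": "Canada"}

def transform(city, country, amount):
    """Return a (city, country, amount) row in warehouse format, or None if unusable."""
    canonical = COUNTRY_CODES.get(str(country).strip().lower())
    try:
        value = int(float(amount))
    except (TypeError, ValueError):
        return None  # cleaning: amount is not a number
    if canonical is None or value < 0:
        return None  # cleaning: unknown country code or negative amount
    return (str(city).strip().title(), canonical, value)

# Extraction: two hypothetical sources with different record formats.
source_a = [{"city": "dallas ", "country": "U.S.A.", "amount": "12"},
            {"city": "Chicago", "country": "usa", "amount": "n/a"}]
source_b = [("TORONTO", "CA", 8.0)]

extracted = [(r["city"], r["country"], r["amount"]) for r in source_a] + list(source_b)
cleaned = [transform(*record) for record in extracted]
warehouse_rows = [row for row in cleaned if row is not None]
rejected = len(extracted) - len(warehouse_rows)

print(warehouse_rows, rejected)
```

In this sketch, records that fail cleaning are simply counted; a refresh step would rerun the same pipeline on source updates.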
DW and Data Mining
DW prepares data for mining
  Integrated, consistent, cleaned
DW provides infrastructure for data collection and analysis
OLAP is a simple form of mining & data exploration
Mining should be a part of DW operations
  Provide more powerful tools to analyze data and extract knowledge
Summary
Data warehouse
  A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process
A multi-dimensional model of a data warehouse
  Star schema, snowflake schema, fact constellations
  A data cube consists of dimensions & measures
Summary
OLAP operations: drilling, rolling, slicing, dicing and pivoting
OLAP servers: ROLAP, MOLAP, HOLAP
Efficient computation of data cubes
  Partial vs. full vs. no materialization
  Multi-way array aggregation
  Bitmap index and join index implementations
Further development of data cube technology
  Discovery-driven and multi-feature cubes
  From OLAP to OLAM (on-line analytical mining)