Data Warehousing and OLAP

Published on January 2017

Data Warehouse
- A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of the decision-making process
  - Modeling and analysis of data for decision makers, not for transaction processing (OLAP vs. OLTP)
  - Constructed by integrating multiple heterogeneous data sources
  - Keeps data with a historical perspective
  - Permanently stores data imported from the operational data sources

Data Warehousing and OLAP
Dr. Weining Zhang

W. Zhang

Data Warehousing & OLAP

2

DW vs. DBMS
- Wrapper/mediator vs. DW for information integration
  - A mediator is query-driven. Data are stored in heterogeneous resources. The mediator distributes queries and integrates answers
  - A DW is update-driven. Data are loaded into the DW for query processing
- DW vs. Operational DBMS
  - DBMS is for OLTP (on-line transaction processing) and daily operation
  - DW is for OLAP (on-line analytical processing) and decision-making

OLTP vs. OLAP

                      OLTP                                    OLAP
  users               clerk, IT professional                  knowledge worker
  function            day to day operations                   decision support
  DB design           application-oriented                    subject-oriented
  data                current, up-to-date, detailed,          historical, summarized, multidimensional,
                      flat relational, isolated               integrated, consolidated
  usage               repetitive                              ad-hoc
  access              read/write, index/hash on prim. key     lots of scans
  unit of work        short, simple transaction               complex query
  # records accessed  tens                                    millions
  # users             thousands                               hundreds
  DB size             100MB-GB                                100GB-TB
  metric              transaction throughput                  query throughput, response

Why Separate DW From DBMS?
- High performance for both systems
  - DBMS is tuned for OLTP
  - Warehouse is tuned for OLAP
- Special requirements for decision making
  - Query historical data not found in typical databases
  - Consolidate (aggregate, summarize) data from heterogeneous sources
  - Must reconcile inconsistencies in formats, representations, and codes for data from heterogeneous data sources

Data Cube Model
- Fact table and dimension tables

Sales (fact table; Time, Location, and Product are dimensions, Amount is the fact/measure)

  Time  Location  Product  Amount
  d1    c1        p1       12
  d1    c3        p1       50
  d1    c1        p2       11
  d1    c2        p2       8
  d2    c1        p1       44
  d2    c2        p1       4
  …     …         …        …

Location (dimension table)

  cid  city     state  country
  c1   Dallas   TX     USA
  c2   Toronto  ON     Canada
  c3   Chicago  IL     USA
  …    …        …      …
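The Sales fact table can be summarized along any subset of its dimensions. The following is a minimal sketch, assuming only the six rows listed explicitly (the elided rows are ignored) and using the placeholder value "all" for a dimension that has been aggregated away. The names SALES and cube_cells are illustrative, not part of the slides.

```python
from collections import defaultdict
from itertools import product

# The six fact rows listed in the Sales table: (time, location, product, amount).
SALES = [
    ("d1", "c1", "p1", 12),
    ("d1", "c3", "p1", 50),
    ("d1", "c1", "p2", 11),
    ("d1", "c2", "p2", 8),
    ("d2", "c1", "p1", 44),
    ("d2", "c2", "p1", 4),
]


def cube_cells(rows):
    """Sum the amount into every cell obtained by replacing any subset of
    the three dimension values with the placeholder 'all'."""
    cells = defaultdict(int)
    for time, location, prod, amount in rows:
        # Each dimension is either kept (True) or aggregated away (False).
        for keep in product((True, False), repeat=3):
            key = tuple(
                value if kept else "all"
                for value, kept in zip((time, location, prod), keep)
            )
            cells[key] += amount
    return dict(cells)


cells = cube_cells(SALES)
print(cells[("all", "c1", "p1")])    # p1 sold at c1 over all times
print(cells[("all", "all", "all")])  # grand total over all rows
```

Each fact row contributes to 2^3 = 8 cells, one per subset of dimensions that is kept.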

Data Mining: Concepts and Techniques

1

Data Cube Model
- View data as cubes

Sales (with aggregate rows; "all" marks a dimension that has been summed over)

  Time  Location  Product  Amount
  d1    c1        p1       12
  d1    c3        p1       50
  d1    c1        p2       11
  d1    c2        p2       8
  d2    c1        p1       44
  d2    c2        p1       4
  …     …         …        …
  all   c1        p1       56
  …     …         …        …
  all   all       all      129

[Figure: the Sales data cube with axes time (d1, d2, all), product (p1, p2, all), and location (c1, c2, c3, all); the cell (all, all, all) holds 129.]

Define DW Schema
- Types of schemas
  - Star schema. A fact table with a set of simple dimension tables
  - Snowflake schema. Refines a star schema by allowing some dimensions to be modeled by a set of tables
  - Fact constellation. Multiple fact tables share dimension tables. Viewed as a collection of stars, thus called galaxy schema or fact constellation

Example of Star Schema

Sales Fact Table
  dimension keys: time_key, item_key, branch_key, location_key
  facts: units_sold, dollars_sold, avg_sales

Dimension tables
  time (time_key, day, day_of_the_week, month, quarter, year)
  item (item_key, item_name, brand, type, supplier_type)
  branch (branch_key, branch_name, branch_type)
  location (location_key, street, city, province_or_state, country)

Defining a Star Schema

  define cube sales_star [time, item, branch, location]:
      dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)
  define dimension time as (time_key, day, day_of_week, month, quarter, year)
  define dimension item as (item_key, item_name, brand, type, supplier_type)
  define dimension branch as (branch_key, branch_name, branch_type)
  define dimension location as (location_key, street, city, province_or_state, country)
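The star schema can also be expressed in plain SQL. The following is a minimal sketch using Python's built-in sqlite3 module; the table and column names follow the schema, while every data row is hypothetical and exists only to make the query return something. The avg_sales fact is computed at query time rather than stored, which is a simplification.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "time"   (time_key INTEGER PRIMARY KEY, day INTEGER, day_of_week TEXT,
                       month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                       type TEXT, supplier_type TEXT);
CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                       province_or_state TEXT, country TEXT);
-- The fact table holds one foreign key per dimension plus the measures.
CREATE TABLE sales (
    time_key     INTEGER REFERENCES "time"(time_key),
    item_key     INTEGER REFERENCES item(item_key),
    branch_key   INTEGER REFERENCES branch(branch_key),
    location_key INTEGER REFERENCES location(location_key),
    units_sold   INTEGER,
    dollars_sold REAL
);
""")

# Hypothetical sample rows, not taken from the slides.
conn.executemany('INSERT INTO "time" VALUES (?, ?, ?, ?, ?, ?)', [
    (1, 5, "Thursday", 1, "Q1", 2017),
    (2, 10, "Monday", 4, "Q2", 2017),
])
conn.execute("INSERT INTO item VALUES (1, 'laptop', 'BrandA', 'computer', 'wholesale')")
conn.execute("INSERT INTO branch VALUES (1, 'Downtown', 'retail')")
conn.executemany("INSERT INTO location VALUES (?, ?, ?, ?, ?)", [
    (1, "1 Main St", "Dallas", "TX", "USA"),
    (2, "2 King St", "Toronto", "ON", "Canada"),
])
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 1, 1, 1, 2, 2000.0),
    (1, 1, 1, 2, 1, 1000.0),
    (2, 1, 1, 1, 3, 3000.0),
    (2, 1, 1, 1, 1, 1000.0),
])

# A star join: the fact table joined to two of its dimensions, then aggregated.
rows = conn.execute("""
    SELECT l.city, t.quarter, SUM(s.dollars_sold) AS dollars_sold
    FROM sales AS s
    JOIN "time"   AS t ON t.time_key = s.time_key
    JOIN location AS l ON l.location_key = s.location_key
    GROUP BY l.city, t.quarter
    ORDER BY l.city, t.quarter
""").fetchall()
print(rows)
```

Every query over a star schema has this shape: the fact table in the middle, one join per dimension that the query mentions, and a GROUP BY over dimension attributes.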

Example of Snowflake Schema

Sales Fact Table
  dimension keys: time_key, item_key, branch_key, location_key
  measures: units_sold, dollars_sold, avg_sales

Dimension tables
  time (time_key, day, day_of_the_week, month, quarter, year)
  item (item_key, item_name, brand, type, supplier_key)
  supplier (supplier_key, supplier_type)
  branch (branch_key, branch_name, branch_type)
  location (location_key, street, city_key)
  city (city_key, city, province_or_state, country)

Example of Fact Constellation

Sales Fact Table
  dimension keys: time_key, item_key, branch_key, location_key
  measures: units_sold, dollars_sold, avg_sales

Shipping Fact Table
  dimension keys: time_key, item_key, shipper_key, from_location, to_location
  measures: dollars_cost, units_shipped

Dimension tables
  time (time_key, day, day_of_the_week, month, quarter, year)
  item (item_key, item_name, brand, type, supplier_type)
  branch (branch_key, branch_name, branch_type)
  location (location_key, street, city, province_or_state, country)
  shipper (shipper_key, shipper_name, location_key, shipper_type)

Types of Aggregate of Measures
- Distributive. Obtains an identical result whether it is applied to aggregated values or to fact values, e.g., count(), sum(), min(), max().
- Algebraic. Can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function, e.g., avg(), min_N(), standard_deviation().
- Holistic. Has no constant bound on the storage size needed to describe a subaggregate, e.g., median(), mode(), rank().
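The three kinds of aggregate can be told apart by asking what must be kept per partition in order to combine partial results. The following is a small sketch, assuming the fact values are split into two hypothetical partitions; the variable names are illustrative.

```python
from statistics import median

# Hypothetical split of six fact values into two partitions.
partitions = [[12, 50, 11], [8, 44, 4]]
all_values = [v for part in partitions for v in part]

# Distributive: combining per-partition sums gives the sum over all values.
total_from_parts = sum(sum(part) for part in partitions)
total_direct = sum(all_values)

# Algebraic: avg() is derived from two distributive aggregates, sum and count,
# so a bounded number of values (here 2) per partition is enough.
sub_aggregates = [(sum(part), len(part)) for part in partitions]
avg = sum(s for s, _ in sub_aggregates) / sum(n for _, n in sub_aggregates)

# Holistic: the median of per-partition medians is not the true median,
# so no fixed-size summary per partition is sufficient in general.
median_of_medians = median(median(part) for part in partitions)
true_median = median(all_values)

print(total_from_parts == total_direct, avg, median_of_medians, true_median)
```

This is why distributive and algebraic measures are cheap to roll up from stored cuboids, while holistic measures generally need the underlying fact values.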

A Sample Concept Hierarchy

  all:      all
  region:   Europe, ..., North_America
  country:  Germany, ..., Spain (Europe); Canada, ..., Mexico (North_America)
  city:     Frankfurt, ... (Germany); Vancouver, ..., Toronto (Canada)
  office:   L. Chan, ..., M. Wind
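A concept hierarchy such as the location hierarchy can be stored as parent links and used to climb from one level to a coarser one. The following is a minimal sketch; the names PARENT, LEVELS, ancestor, and roll_up are illustrative, the office level is left out, and the sales figures are hypothetical.

```python
# Parent links for part of the location hierarchy: city -> country -> region -> all.
PARENT = {
    "Frankfurt": "Germany", "Vancouver": "Canada", "Toronto": "Canada",
    "Germany": "Europe", "Spain": "Europe",
    "Canada": "North_America", "Mexico": "North_America",
    "Europe": "all", "North_America": "all",
}
LEVELS = ["city", "country", "region", "all"]


def ancestor(value, from_level, to_level):
    """Follow parent links from a value at from_level up to to_level."""
    steps = LEVELS.index(to_level) - LEVELS.index(from_level)
    if steps < 0:
        raise ValueError("to_level must not be finer than from_level")
    for _ in range(steps):
        value = PARENT[value]
    return value


def roll_up(measures, from_level, to_level):
    """Sum a {value: amount} mapping after mapping each key to its ancestor."""
    totals = {}
    for value, amount in measures.items():
        key = ancestor(value, from_level, to_level)
        totals[key] = totals.get(key, 0) + amount
    return totals


# Hypothetical per-city sales.
city_sales = {"Frankfurt": 10, "Vancouver": 20, "Toronto": 30}
by_country = roll_up(city_sales, "city", "country")
by_region = roll_up(city_sales, "city", "region")
print(by_country, by_region)
```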

Lattice of Cuboids
- Each cube is a cuboid
- Cuboids form a lattice based on the dimensions in the cubes

  0-D (apex) cuboid:  all
  1-D cuboids:        product; time; location
  2-D cuboids:        product, time; product, location; time, location
  3-D (base) cuboid:  product, time, location

Cube-based OLAP Operations
- Roll-up, drill-down, dice, slice, pivot, etc.

[Figure: the Sales cube with roll-up and drill-down. Rolling up on time and product leaves the per-location totals c1 = 67, c2 = 12, c3 = 50, all = 129; rolling up on time and location leaves the per-product totals p1 = 110, p2 = 19; drilling down on product reverses a roll-up on product.]
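Roll-up, slice, and dice can each be written as a few lines over the fact rows. The following is a minimal sketch, assuming the six Sales rows from the data cube example; the function names are illustrative. Drill-down is the reverse of roll-up, that is, calling roll_up again with more dimensions kept.

```python
from collections import defaultdict

DIMS = ("time", "location", "product")

# The six Sales fact rows: (time, location, product, amount).
SALES = [
    ("d1", "c1", "p1", 12),
    ("d1", "c3", "p1", 50),
    ("d1", "c1", "p2", 11),
    ("d1", "c2", "p2", 8),
    ("d2", "c1", "p1", 44),
    ("d2", "c2", "p1", 4),
]


def roll_up(rows, keep):
    """Aggregate away every dimension that is not listed in keep."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[3]
    return dict(totals)


def slice_(rows, dim, value):
    """Slice: fix one dimension to a single value."""
    i = DIMS.index(dim)
    return [row for row in rows if row[i] == value]


def dice(rows, **allowed):
    """Dice: restrict several dimensions to sets of allowed values."""
    return [
        row for row in rows
        if all(row[DIMS.index(d)] in values for d, values in allowed.items())
    ]


by_location = roll_up(SALES, ("location",))
by_product = roll_up(SALES, ("product",))
p2_slice = slice_(SALES, "product", "p2")
small_dice = dice(SALES, time={"d1"}, location={"c1", "c2"})
print(by_location, by_product, p2_slice, small_dice)
```

Pivot is omitted because it only changes how the result is laid out, not which cells are computed.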

Cube-based OLAP Operations

[Figure: the Sales cube with three further operations: slice on p2; dice on d1, d2; p1, p2; and c1, c2; and pivot.]

Design of a Data Warehouse
- A difficult and expensive process
- Involves business and technical considerations. Viewed from different perspectives:
  - What information is needed
  - Models of data sources
  - Models of warehouse data
  - How to use the data
- Needs many experiments and prototypes

Data Warehouse Design Process
- Typically
  - Choose a business process to model, e.g., orders, invoices, etc.
  - Choose the grain (atomic level of data) of the business process
  - Choose the dimensions that will apply to each fact table record
  - Choose the measure that will populate each fact table record

Data Warehouse Architecture

[Figure: a tiered architecture. Data Sources (operational DBs, other sources) feed Extract, Transform, Load, and Refresh, overseen by a Monitor & Integrator with Metadata; Data Storage (data warehouse, data marts) serves the OLAP Engine (OLAP server); Front-End Tools provide analysis, query, reports, and data mining.]

Three Data Warehouse Models
- Enterprise warehouse
  - Collects all of the information about subjects spanning the entire organization
- Data mart
  - A subset of corporate-wide data that is of value to a specific group of users
    • Independent vs. dependent (directly from warehouse) data mart
- Virtual warehouse
  - A set of views over operational databases
  - Only some of the possible summary views may be materialized

Data Warehouse Development

[Figure: development starts by defining a high-level corporate data model; model refinement leads to data marts and to an enterprise data warehouse; these lead to distributed data marts and then to a multi-tier data warehouse.]

OLAP Server Architectures
- Relational OLAP (ROLAP)
  - Use an RDBMS or ORDBMS to store data and OLAP middleware to support missing pieces
  - Greater scalability
- Multidimensional OLAP (MOLAP)
  - Store cubes in multidimensional arrays
  - Fast indexing to summarized data
- Hybrid OLAP (HOLAP)
  - E.g., low level: relational, high level: array
- Specialized SQL servers
  - Support SQL queries over star/snowflake schemas

Compute Cube in ROLAP
- Extend SQL to have a cube operator

    define cube sales[item, city, year]: sum(sales_in_dollars)
    compute cube sales

- Compute each cuboid using a group by
  - Time consuming
- Pre-compute cuboids
  - For n dimensions, there are at least 2^n cuboids, many more if dimensions have hierarchies
    • Takes too much space, is costly to maintain
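Computing the full cube with one group-by per cuboid can be sketched as follows. This is an illustration of the brute-force approach only, assuming the three dimensions of the sales cube (item, city, year) and hypothetical fact rows; group_by, compute_cube, and cuboid_count are illustrative names, not a cube operator from any SQL dialect.

```python
from collections import defaultdict
from itertools import combinations
from math import prod

DIMS = ("item", "city", "year")

# Hypothetical fact rows: (item, city, year, sales_in_dollars).
FACTS = [
    ("tv", "Dallas", 2016, 100),
    ("tv", "Toronto", 2016, 150),
    ("pc", "Dallas", 2017, 200),
    ("pc", "Dallas", 2016, 50),
]


def group_by(rows, dims):
    """One cuboid: sum sales_in_dollars grouped by the given dimensions."""
    idx = [DIMS.index(d) for d in dims]
    out = defaultdict(int)
    for row in rows:
        out[tuple(row[i] for i in idx)] += row[-1]
    return dict(out)


def compute_cube(rows):
    """Full cube: one group-by for every subset of the dimensions."""
    return {
        dims: group_by(rows, dims)
        for k in range(len(DIMS) + 1)
        for dims in combinations(DIMS, k)
    }


def cuboid_count(levels_per_dim):
    """Number of cuboids when dimension i has levels_per_dim[i] hierarchy
    levels (not counting 'all'): the product of (levels + 1)."""
    return prod(levels + 1 for levels in levels_per_dim)


cube = compute_cube(FACTS)
print(len(cube), cube[()], cube[("city",)])
print(cuboid_count([1, 1, 1]), cuboid_count([4, 3, 2]))
```

Each cuboid here rescans all fact rows, which is the "time consuming" part; a practical implementation would compute coarser cuboids from finer ones already computed.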

Partial Materialization
- Pre-compute some cuboids and compute other cuboids when they are needed
  - Select cuboids for materialization
    • Tradeoff among space limits, maintenance cost, & usefulness to query processing
  - Exploit materialized cuboids to answer queries
    • Choose cuboids to use, use indexes, and translate cube operations onto the chosen cuboids
  - View (cuboid) maintenance
    • Update the materialized cuboids when the source data is changed

Compute Cube in MOLAP
- Partition a multidimensional array into chunks that fit in memory
- Compress sparse arrays to conserve space
- How to compute the entire cube by reading each chunk only once?
  - Read chunks in some ordering
  - Different orderings need different buffer space (chunk memory)
  - Multi-way array aggregation tries to compute all K-D cuboids in parallel
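For the partial-materialization step of choosing which stored cuboid answers a query, one simple rule is to take the smallest materialized cuboid that contains every dimension the query needs. The following is a minimal sketch under that assumption; the catalogue, its cell counts, and the function name pick_cuboid are hypothetical, and indexes and hierarchy levels are ignored.

```python
# Hypothetical catalogue of materialized cuboids: dimensions -> stored cell count.
MATERIALIZED = {
    frozenset({"item", "city", "year"}): 6000,  # base cuboid
    frozenset({"item", "city"}): 1200,
    frozenset({"item", "year"}): 300,
    frozenset({"city"}): 40,
}


def pick_cuboid(query_dims, materialized=MATERIALIZED):
    """Return the smallest materialized cuboid whose dimensions include
    every dimension the query groups or filters on."""
    needed = frozenset(query_dims)
    candidates = [
        (size, dims) for dims, size in materialized.items() if needed <= dims
    ]
    if not candidates:
        raise LookupError("no materialized cuboid covers the query")
    return min(candidates, key=lambda candidate: candidate[0])[1]


print(sorted(pick_cuboid({"item"})))          # answered from (item, year)
print(sorted(pick_cuboid({"city", "year"})))  # only the base cuboid covers it
```

The query is then answered by rolling the chosen cuboid up to the requested dimensions instead of scanning the fact table.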

Multi-way Array Aggregation

[Figure: a 3-D array with dimensions A (a0 to a3), B (b0 to b3), and C (c0 to c3), partitioned into 64 chunks numbered 1 to 64, together with the aggregated AB, AC, and BC planes and the "all" cells.]

- Assume reading chunks 1 to 64 in order
  - For the BC plane, compute 1 chunk at a time. Need to keep the partial result for each cell in the chunk
  - For the AC plane, compute 4 chunks in parallel. Must keep partial results for each cell in the 4 chunks
  - For the AB plane, compute all 16 chunks in parallel. Must keep partial results for all cells in that plane
  - Buffer size = whole AB plane + 1 row of the AC plane + 1 chunk of the BC plane
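The buffer-size rule for multi-way array aggregation can be turned into a small calculation. The sketch below assumes each dimension is split into 4 equal chunks and that chunks are scanned with A varying fastest, then B, then C; the dimension sizes 40, 400, and 4000 are assumed for illustration and are not given in the slides.

```python
def buffer_cells(size_a, size_b, size_c, parts=4):
    """Cells of memory needed to aggregate the AB, AC, and BC planes in one
    pass, scanning chunks with A fastest, then B, then C."""
    ab_plane = size_a * size_b                        # whole AB plane stays in memory
    ac_row = size_a * (size_c // parts)               # one row (4 chunks) of the AC plane
    bc_chunk = (size_b // parts) * (size_c // parts)  # one chunk of the BC plane
    return ab_plane + ac_row + bc_chunk


# Assumed sizes: |A| = 40, |B| = 400, |C| = 4000.
smallest_first = buffer_cells(40, 400, 4000)
# Same array, but scanned with the largest dimension varying fastest.
largest_first = buffer_cells(4000, 400, 40)
print(smallest_first, largest_first)
```

Under these assumptions, scanning so that the smallest plane is the one held in full needs roughly a tenth of the memory, which is why the chunk ordering matters.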

Indexing Cube Using Bitmaps
- Each value in the column has a bit vector
  - Each tuple has a bit in the bit vector
  - The i-th bit is set if the i-th tuple of the base table has the value for the indexed column
- Use bit operations to search for tuples

Base table

  Cust  Region   Type
  C1    Asia     Retail
  C2    Europe   Dealer
  C3    Asia     Dealer
  C4    America  Retail
  C5    Europe   Dealer

Index on Region

  RecID  Asia  Europe  America
  1      1     0       0
  2      0     1       0
  3      1     0       0
  4      0     0       1
  5      0     1       0

Index on Type

  RecID  Retail  Dealer
  1      1       0
  2      0       1
  3      0       1
  4      1       0
  5      0       1

Indexing Cube Using Join Index
- Join index: JI(R-id, S-id) where R (R-id, …) joins with S (S-id, …) on a join condition
- Relates the values of the dimensions of a star schema to rows in the fact table
  - E.g., fact table Sales and two dimensions location and product
    • A join index on location maintains for each distinct city a list of the Sales tuples recording the sales in that city
  - Join indices can span multiple dimensions
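A bitmap index over the base table of customers can be sketched with Python integers as bit vectors. This is a minimal illustration using the five rows C1 to C5; bit i of each vector corresponds to row i counted from zero, and the function names are illustrative.

```python
# The base table: (Cust, Region, Type).
BASE = [
    ("C1", "Asia", "Retail"),
    ("C2", "Europe", "Dealer"),
    ("C3", "Asia", "Dealer"),
    ("C4", "America", "Retail"),
    ("C5", "Europe", "Dealer"),
]


def build_bitmap_index(rows, column):
    """Map each distinct column value to an int whose bit i is set
    when row i holds that value."""
    index = {}
    for i, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << i)
    return index


def matching_rows(rows, bits):
    """Return the rows whose bit is set in the given bit vector."""
    return [row for i, row in enumerate(rows) if bits >> i & 1]


region = build_bitmap_index(BASE, 1)
kind = build_bitmap_index(BASE, 2)

# Region = Europe AND Type = Dealer becomes a single bitwise AND.
european_dealers = matching_rows(BASE, region["Europe"] & kind["Dealer"])
# Region = Asia OR Region = America becomes a bitwise OR.
asia_or_america = matching_rows(BASE, region["Asia"] | region["America"])
print(european_dealers, asia_or_america)
```

Bitmap indexes pay off when the indexed column has few distinct values, since one vector is stored per value.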

Processing OLAP Queries
- Determine the cube operations to perform on available cuboids:
  - Transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projection
- Determine to which materialized cuboid(s) the relevant operations should be applied
- Explore indexing structures and compressed vs. dense array structures in MOLAP

Data Warehouse Tools & Utilities
- Data extraction:
  - Get data from heterogeneous external sources
- Data cleaning:
  - Detect & rectify errors in the data
- Data transformation:
  - Convert data from host format to DW format
- Load:
  - Sort, summarize, consolidate, compute views, check integrity, & build indices and partitions
- Refresh:
  - Propagate source updates to the warehouse

DW and Data Mining
- DW prepares data for mining
  - Integrated, consistent, cleaned
- DW provides infrastructure for data collection and analysis
- OLAP is a simple form of mining & data exploration
- Mining should be a part of DW operations
  - Provide more powerful tools to analyze data and extract knowledge

Summary
- Data warehouse
  - A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process
- A multi-dimensional model of a data warehouse
  - Star schema, snowflake schema, fact constellations
  - A data cube consists of dimensions & measures
- OLAP operations: drilling, rolling, slicing, dicing, and pivoting
- OLAP servers: ROLAP, MOLAP, HOLAP
- Efficient computation of data cubes
  - Partial vs. full vs. no materialization
  - Multi-way array aggregation
  - Bitmap index and join index implementations
- Further development of data cube technology
  - Discovery-driven and multi-feature cubes
  - From OLAP to OLAM (on-line analytical mining)
