02-1 DWh Data Marts - Design

Published on June 2016

2 Data Marts – “The Front Room”
Data Warehousing
Spring Semester 2011

R. Marti

Data Marts in the DWh Reference Architecture
[Figure: DWh reference architecture with the components Source Databases, Data Warehouse, Data Marts, and the front-end uses Reports, Interactive Analysis, and Dashboards]

Focus
•  Design of a single “stand-alone” data mart, analyzing the performance of a single business process
•  Typical queries
•  Access paths and special physical data structures for these queries
R. Marti

2 Data Marts – Data Warehousing 2011

Recap: Designing Operational Databases
•  Goals
•  Support execution of business processes, i.e. many short and small
transactions: point and range queries, plus single-row inserts and/or updates
•  Avoid (uncontrolled) redundancies => normalized schemas

•  Typical database design process (see also Appendix)
•  Use a variant of a “semantic” data model, usually a variant of the ER model
•  Transform the result into (normalized) tables
(depending on the “flavor” of the ER model, this involves more or less work)
•  Tune using indexes and other physical design options
•  Maybe denormalize tables if overall throughput / performance improves

•  Many practitioners do not seem to know much normalization theory
(but usually get by fairly well nonetheless!)

Some Personal DB Design Principles & Conventions
•  Use of Barker’s CASE*Method Notation, with these cardinalities:
-  min 0 max 1 T2-s per T1
-  min 1 max 1 T2-s per T1
-  min 0 max ∞ T2-s per T1
-  min 1 max ∞ T2-s per T1
•  On diagrams, an entity type D depending on another entity type E is (usually) shown “below” entity type E (occasional exception: Star Schemas)
•  Subtypes are often used in order to avoid nulls in attributes, especially in foreign key attributes

•  Cyclic dependencies are usually avoided
•  Entity types are usually normalized (meaning they are at least in 3NF)

Recap: Relational Normalization in a Nutshell …
1NF: “atomic” data values, no internal structure, esp. no “repeating fields”.

2NF and 3NF: (look at functional dependencies between key and nonkey attributes)
A nonkey attribute must provide information about the key, the whole key, and nothing but the key – so help me Codd.
(Edgar F. Codd was the “inventor” of the relational model and 1NF – 3NF.)

2NF: Nonkey attributes must not functionally depend on a part of the key.
3NF: Nonkey attributes must not functionally depend on nonkey attributes.
BCNF: (looks at functional dependencies in relations with key attributes only)
4NF: (looks at multivalued dependencies)
5NF: (looks at join dependencies)
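
The 2NF rule above can be illustrated with a small check of whether a functional dependency holds in sample data. This is only a sketch: the `fd_holds` helper and the order-line rows are hypothetical, and a dependency that holds in a sample is evidence, not proof, that it holds in general.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows."""
    seen = {}
    for row in rows:
        determinant = tuple(row[a] for a in lhs)
        dependent = tuple(row[a] for a in rhs)
        # The first value seen for a determinant must match all later ones.
        if seen.setdefault(determinant, dependent) != dependent:
            return False
    return True


# Hypothetical order lines; the key is (order_no, product_no).
order_lines = [
    {"order_no": 1, "product_no": "P1", "product_name": "Shirt", "quantity": 2},
    {"order_no": 1, "product_no": "P2", "product_name": "Socks", "quantity": 5},
    {"order_no": 2, "product_no": "P1", "product_name": "Shirt", "quantity": 1},
]

# product_name depends on only a part of the key: a 2NF violation.
partial_dependency = fd_holds(order_lines, ["product_no"], ["product_name"])
# quantity needs the whole key, so it does not depend on product_no alone.
quantity_on_part = fd_holds(order_lines, ["product_no"], ["quantity"])
print(partial_dependency, quantity_on_part)
```

In a normalized design, `product_name` would move to a separate product table keyed by `product_no`.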

Designing Analytic Databases (i.e., Data Marts)
•  Goals
•  Support analysis of one or more business processes, i.e.,
(1) queries that may need to look at a substantial amount of data –
even if they may return a relatively small (aggregated) amount of data, and
(2) relatively infrequent but often large periodic batch inserts
•  Maximize query performance => use caching mechanisms – a controlled form
of redundancy – including denormalization

•  Typical design process for data marts
•  Separate analytic databases from operational databases
•  Determine how to measure the execution of a given business process,
and look at the context (“dimensions”) of this business process
•  Design a corresponding Star Schema or Snowflake Schema (see later)
•  Tune using indexes and other physical design options

Analysis of Business Processes
Typical Business Questions                                Corresponding Business Process
Gross margin by product category in February 2011?        Sales
Average account balance by education level?               Account Management
Number of sick days by employees in marketing in 2010?    Time Management in Human Resources
Sum of outstanding payables by supplier?                  Supplier Payment Processing
Product return rate by customer?                          Client Returns Processing

Ingredients: measure, context (category), time, + aggregation, selection

Excursion: Different Purposes of Attributes
•  identification
•  categorization
often, objects can naturally be assigned to one of a relatively small number of
categories, based on the value of an attribute
(e.g. an employee can be categorized according to rank or job title)
•  quantification / measurement
•  calculation
•  textual description
[Figure: classification of Attribute into Identification Attribute, Category Attribute, Measure Attribute, Calculation Parameter, Descriptive Attribute, and Other Attribute]

Object Identification Attribute
•  An object identification attribute is an attribute which is used to uniquely identify individual objects such as clients, employees, loss events, contracts etc.
Usually, the value of an object identification attribute is a “meaningless” number which is not necessarily exposed to the user.
Such a “meaningless” number is also called a surrogate because it acts as a placeholder or proxy of a real world object.
A surrogate must not change during the lifetime of the object which it identifies / represents, and after the lifetime of this object, it must not be re-used for another (new) object.
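
The rule that a surrogate is never re-used can be sketched with SQLite, whose `AUTOINCREMENT` keyword keeps issued keys from being handed out again after a delete. The `client` table and its columns are hypothetical names chosen for illustration; other database systems offer sequences or identity columns for the same purpose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table client ("
    " client_id integer primary key autoincrement,"  # surrogate
    " client_no text not null unique,"               # natural key
    " name text not null)"
)

insert = "insert into client (client_no, name) values (?, ?)"
first_id = conn.execute(insert, ("C-001", "Doe")).lastrowid
second_id = conn.execute(insert, ("C-002", "Roe")).lastrowid

# The object identified by second_id ends its lifetime ...
conn.execute("delete from client where client_id = ?", (second_id,))

# ... but its surrogate is not handed out again to a new object.
third_id = conn.execute(insert, ("C-003", "Poe")).lastrowid
print(first_id, second_id, third_id)
```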


Category Attribute
•  A category attribute is an independent attribute that primarily serves to group or segment business data captured as a result of conducting business transactions.
Categories are often arranged in taxonomic hierarchies (see next slide).
Note: This is sometimes also called a dimensional attribute, see later.

•  the “Line of Business” segments business data such as “Premium earned” or
“Combined ratio” as presented in the annual report.
•  the “Line Management Structure” segments the “headcount” for the company.


Category Attribute with Taxonomy (Example)
Set: LineOfBusiness = { ‘property’, ‘engineering’, ‘marine’, ‘casualty’, … }

Tree: [Figure: LineOfBusiness taxonomy rooted at ‘all lines’, with the nodes ‘P&C line’, ‘L&H line’, ‘special line’, ‘property’, ‘casualty’, ‘engineering’, ‘marine’, ‘life’, ‘health’]

Taxonomic Relationship (an edge from A to B in the tree): A is a more specific concept than B

Calculation Parameter
•  A calculation parameter is a numeric (integer, decimal, fractional, floating point) attribute used to compute values of other attributes (typically measures, see below), e.g.
- an index
- a factor
- a rate
Its value is usually (but not always) determined by the values of one or more category attributes and a timestamp.

•  the exchange rate USD → CHF fixed on 5 May 2006
< ‘USD’, ‘CHF’, ‘2006-05-05 04:34:53 GMT’, 1.2313 >
•  the inflation rate in Switzerland expected for 2007.
< ‘SUI’, ‘2007’, 0.023 >
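
As a minimal sketch of how a calculation parameter feeds a computation, the exchange rate tuple above can be stored under its category attributes and timestamp and then used to convert an amount. The `exchange_rates` dictionary, the `convert` function, and the 100 USD amount are assumptions for illustration, and GMT is treated as UTC here.

```python
from datetime import datetime, timezone
from decimal import Decimal

# Key: (source currency, target currency, fixing timestamp); value: the rate.
fixing = datetime(2006, 5, 5, 4, 34, 53, tzinfo=timezone.utc)
exchange_rates = {("USD", "CHF", fixing): Decimal("1.2313")}


def convert(amount, source, target, fixed_at):
    """Convert amount using the rate fixed at the given timestamp."""
    rate = exchange_rates[(source, target, fixed_at)]
    return amount * rate


amount_chf = convert(Decimal("100"), "USD", "CHF", fixing)
print(amount_chf)
```

`Decimal` is used instead of `float` so that monetary amounts are not affected by binary rounding.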


Measure
•  A measure is an attribute used to express quantitative properties of objects such as monetary amounts, magnitudes and so on.
A measure may have an associated unit, e.g., Euros, meters, tons, miles per hour, or it may be a count, a ratio or a percentage without an associated unit.

•  the annual base salary of an employee as recorded in a payroll system is usually
treated as an observed measure.
•  total annual compensation, computed from the observed measures annual base salary,
family and child allowance, and annual bonus, is a derived measure.


Star Schema Diagram: Example

Notes
<pi>: primary identifier
<ai>: alternative identifier
(foreign keys not shown)

Star Schema: Explanations
Dimension Tables (Product, Store, Customer)
•  attributes
-  surrogate (ending in _Id by convention)
-  more or less natural (often mnemonic) key
(as used in operational DB)
-  many descriptive fields, often text
•  may violate 2NF or 3NF (e.g., FD Product_Group → Product_Line, or
FD Manager_EmpNo → Manager_Name)
•  table is usually relatively “short” (few rows) but relatively “wide” (many columns)
•  provide “business context”, used for categorization


Star Schema: Explanations (cont.)
Special dimension time (Day)
•  attributes
-  surrogate (ending in _Id by convention)
-  descriptive fields (also e.g. flags like “is holiday”, “is day before holiday”)

•  Note:
Dates and timestamps are built-in types in SQL, with quite a few built-in functions to
operate on them.
Treating a Day as an entity is a form of caching.
For example, selecting only Friday sales can be done using a simple filter predicate,
without invoking any function!
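
The Friday example can be sketched as follows, assuming a Day dimension that stores the weekday name as a precomputed column. The table layout, the `weekday_name` and `is_weekend` columns, and the sample sales are hypothetical.

```python
import sqlite3
from datetime import date, timedelta

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table day (
    day_id integer primary key,          -- surrogate
    calendar_date text not null unique,
    weekday_name text not null,          -- cached result of a date function
    is_weekend integer not null          -- flag
);
create table sale (
    sale_id integer primary key,
    day_id integer not null references day (day_id),
    amount_received real not null
);
""")

# Populate one week of the Day dimension, starting on Monday 2011-01-03.
start = date(2011, 1, 3)
for offset in range(7):
    d = start + timedelta(days=offset)
    conn.execute(
        "insert into day values (?, ?, ?, ?)",
        (offset + 1, d.isoformat(), WEEKDAYS[d.weekday()], int(d.weekday() >= 5)),
    )

conn.executemany(
    "insert into sale values (?, ?, ?)",
    [(1, 1, 10.0), (2, 5, 20.0), (3, 5, 5.0), (4, 6, 7.5)],
)

# Selecting Friday sales needs only a plain filter predicate on the dimension.
friday_total = conn.execute(
    "select sum(sale.amount_received) from sale"
    " join day on day.day_id = sale.day_id"
    " where day.weekday_name = 'Friday'"
).fetchone()[0]
print(friday_total)
```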


Star Schema: Explanations (cont.)
Fact Table (Sale)
•  attributes
-  measures, either basic measures,
e.g., Quantity, Amount_Per_Item,
or (redundant) derived measures,
e.g., Amount_Received := (Quantity * Amount_Per_Item) * (1 – Discount_Rate)
-  foreign key attributes: surrogates (ending in _Id by convention), each of which serves as a foreign key
to one dimension table, and which collectively establish the business context
-  a primary key, either a surrogate or a more or less mnemonic key
•  table is usually relatively “long” (many rows) but not very “wide”
•  A fact table has a grain defining the most atomic unit of data being captured, e.g.
-  transaction level: measures + their context are captured for each transaction
-  daily / weekly / monthly / quarterly summaries
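
A minimal sketch of a transaction-grain fact table, with the derived measure Amount_Received computed once at load time and stored redundantly. The `load_sale` helper, the two small dimension tables, and the sample rows are assumptions; the Day and Store dimensions are left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table product (
    product_id integer primary key,      -- surrogate
    product_no text not null unique,     -- natural key from the operational DB
    product_name text not null
);
create table customer (
    customer_id integer primary key,
    customer_no text not null unique,
    customer_name text not null
);
create table sale (
    sale_id integer primary key,
    product_id integer not null references product (product_id),
    customer_id integer not null references customer (customer_id),
    quantity integer not null,           -- basic measure
    amount_per_item real not null,       -- basic measure
    discount_rate real not null,
    amount_received real not null        -- redundant derived measure
);
""")
conn.execute("insert into product values (1, 'P-100', 'Polo shirt')")
conn.execute("insert into customer values (1, 'C-001', 'Doe')")


def load_sale(product_id, customer_id, quantity, amount_per_item, discount_rate):
    """Insert one sale and store its derived measure alongside the basic ones."""
    amount_received = (quantity * amount_per_item) * (1 - discount_rate)
    conn.execute(
        "insert into sale (product_id, customer_id, quantity, amount_per_item,"
        " discount_rate, amount_received) values (?, ?, ?, ?, ?, ?)",
        (product_id, customer_id, quantity, amount_per_item,
         discount_rate, amount_received),
    )
    return amount_received


load_sale(1, 1, 4, 25.0, 0.5)
load_sale(1, 1, 2, 10.0, 0.0)
total_received = conn.execute("select sum(amount_received) from sale").fetchone()[0]
print(total_received)
```

Queries can then sum `amount_received` directly instead of repeating the formula.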

Fact + Dimensions result in a multi-dimensional cube

[Figure: cube spanned by the dimensions Customer (Customer_Name, e.g. Doe; customer defined in CRM System), Product (Product_Line, e.g. Clothing; product defined in ERP System), and Day (Year, e.g. 2010); the cells hold the measure Sale.Amount_Received, with sample values 125’948, 188, and 60’709]

Cube (Example showing hierarchies in dimensions)

© IBM SYSTEMS JOURNAL, VOL 41, NO 4, 2002


More on Dimensions
•  Usage in queries
-  Provide context in reports
-  Define “master-detail” organization, i.e., grouping and subgrouping criteria, often
multiple hierarchical levels
-  Filter predicates (points and ranges)
-  Controlling scope of aggregation
-  Ordering criteria

•  Often textual columns, but sometimes numerical
e.g., zip codes, sizes, ages (→ age bands), incomes (→ income bands)
•  Design principle: provide rich set of (sometimes redundant) attributes to
support above usages, and ensure values are easily understood (e.g. flags!)
Example: provide components of multi-part attributes + their combination
e.g. ‘+41 79 999 9999’ as well as ‘+41’, ‘+41 79’
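
The multi-part attribute example can be sketched as a small function that derives the redundant component columns while loading a dimension row. The function name and the column names are hypothetical, and the sketch assumes the parts are separated by spaces as in the example above.

```python
def phone_components(phone_number):
    """Split a space-separated phone number into redundant dimension columns."""
    country_code, area_code, *_subscriber_parts = phone_number.split()
    return {
        "phone_number": phone_number,                            # the combination
        "country_code": country_code,                            # first component
        "country_and_area_code": f"{country_code} {area_code}",  # prefix
    }


components = phone_components("+41 79 999 9999")
print(components)
```

With these columns in place, a report can filter or group by country code with a plain equality predicate instead of string functions.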

More on Facts
•  Usage in queries
-  Subsets of facts are defined by including/excluding dimensions, and filtering
dimensions on specific values (see also below)
-  Aggregations (often summing) based on additive measures
e.g., Sale.Amount_Received is additive while Sale.Discount is not

•  Typical queries
-  slicing: filter condition in one or more dimensions
-  drill-down: going from coarser level of aggregation to finer (more detailed) level
-  roll-up: going from finer level of aggregation to coarser level
-  dicing: e.g. adding / removing / exchanging dimensions in a spreadsheet

•  Design principle:
capture all measures, including redundant (derived) measures
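
Slicing, roll-up, and drill-down can be sketched as variations of one aggregate query that differ only in the filter and in the grouping columns. The `totals` helper and the sample rows are assumptions; the table and column names follow the Sale, Day, and Product tables used in this deck.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table day (day_id integer primary key, year integer, month_code text);
create table product (product_id integer primary key,
                      product_line text, product_group text);
create table sale (day_id integer, product_id integer, amount_received real);
insert into day values (1, 2011, 'Jan'), (2, 2011, 'Feb');
insert into product values (1, 'Clothing', 'Shirts'),
                           (2, 'Clothing', 'Trousers'),
                           (3, 'Food', 'Dairy');
insert into sale values (1, 1, 100.0), (1, 2, 50.0), (1, 3, 30.0), (2, 1, 70.0);
""")


def totals(group_columns, where="1 = 1", params=()):
    """Sum the additive measure, grouped by the given dimension columns."""
    columns = ", ".join(group_columns)
    sql = (
        f"select {columns}, sum(sale.amount_received) from sale"
        " join day on day.day_id = sale.day_id"
        " join product on product.product_id = sale.product_id"
        f" where {where} group by {columns} order by {columns}"
    )
    return conn.execute(sql, params).fetchall()


# Roll-up: the coarser level of aggregation (product line only).
roll_up = totals(["product.product_line"])
# Drill-down: the finer level (product line, then product group).
drill_down = totals(["product.product_line", "product.product_group"])
# Slicing: a filter condition on the Day dimension.
slice_jan = totals(["product.product_line"],
                   "day.month_code = ? and day.year = ?", ("Jan", 2011))
print(roll_up, drill_down, slice_jan)
```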


Typical SQL Query on a Star Schema
select
    PRODUCT.PRODUCT_LINE,                      -- dimensions in result
    PRODUCT.PRODUCT_GROUP,
    sum(SALE.AMOUNT_RECEIVED) as TOT_AMOUNT    -- (aggregated) measure in result
from
    SALE                                       -- fact table
    join DAY                                   -- dimension tables,
        on DAY.DAY_ID = SALE.DAY_ID            -- joined on surrogate FKs / PKs
    join PRODUCT
        on PRODUCT.PRODUCT_ID = SALE.PRODUCT_ID
where
    DAY.MONTH_CODE = 'Jan'                     -- filter on dimension table
    and DAY.YEAR = 2011
group by
    PRODUCT.PRODUCT_LINE,                      -- dimensions over which to aggregate
    PRODUCT.PRODUCT_GROUP
order by
    PRODUCT.PRODUCT_LINE,
    PRODUCT.PRODUCT_GROUP

Snowflake Schema Diagram: Example
Snowflake Schema: (partial or complete) normalization of selected dimension tables of a Star Schema
[Figure: Snowflake Schema diagram in which one table is labeled as an “outrigger” table]

Previous Query on Snowflake Schema
select
    PRODUCT_GROUP.PRODUCT_LINE,
    PRODUCT_GROUP.PRODUCT_GROUP,
    sum(SALE.AMOUNT_RECEIVED) as TOT_AMOUNT
from
    SALE
    join DAY
        on DAY.DAY_ID = SALE.DAY_ID
    join PRODUCT
        on PRODUCT.PRODUCT_ID = SALE.PRODUCT_ID
    join PRODUCT_GROUP                         -- additional join
        on PRODUCT_GROUP.PRODUCT_GROUP_ID = PRODUCT.PRODUCT_GROUP_ID
where
    DAY.MONTH_CODE = 'Jan'
    and DAY.YEAR = 2011
group by
    PRODUCT_GROUP.PRODUCT_LINE,
    PRODUCT_GROUP.PRODUCT_GROUP
order by ...

The need for additional joins is even worse with
•  more dimensions
•  multiple dimension levels
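
To compare the two schema variants, the following sketch builds a snowflaked Product dimension, derives a denormalized star dimension from it by pre-joining, and runs the query in both forms. The table name `product_star` and the sample rows are hypothetical; the other names follow the queries in this deck.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table day (day_id integer primary key, year integer, month_code text);
create table product_group (product_group_id integer primary key,
                            product_group text, product_line text);
create table product (product_id integer primary key, product_name text,
                      product_group_id integer
                          references product_group (product_group_id));
create table sale (sale_id integer primary key, day_id integer,
                   product_id integer, amount_received real);
insert into day values (1, 2011, 'Jan'), (2, 2011, 'Feb');
insert into product_group values (1, 'Shirts', 'Clothing'), (2, 'Dairy', 'Food');
insert into product values (1, 'Polo', 1), (2, 'Tee', 1), (3, 'Milk', 2);
insert into sale values (1, 1, 1, 100.0), (2, 1, 2, 50.0),
                        (3, 1, 3, 30.0), (4, 2, 1, 70.0);

-- Star variant: the join to the outrigger table is done once and cached.
create table product_star as
    select p.product_id, p.product_name, g.product_group, g.product_line
    from product p
    join product_group g on g.product_group_id = p.product_group_id;
""")

star_rows = conn.execute("""
    select p.product_line, p.product_group, sum(s.amount_received)
    from sale s
    join day d on d.day_id = s.day_id
    join product_star p on p.product_id = s.product_id
    where d.month_code = 'Jan' and d.year = 2011
    group by p.product_line, p.product_group
    order by p.product_line, p.product_group
""").fetchall()

snowflake_rows = conn.execute("""
    select g.product_line, g.product_group, sum(s.amount_received)
    from sale s
    join day d on d.day_id = s.day_id
    join product p on p.product_id = s.product_id
    join product_group g on g.product_group_id = p.product_group_id
    where d.month_code = 'Jan' and d.year = 2011
    group by g.product_line, g.product_group
    order by g.product_line, g.product_group
""").fetchall()
print(star_rows == snowflake_rows, star_rows)
```

Both forms return the same rows; the star variant trades redundant storage in the dimension table for one join fewer per query.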