2 Data Marts – “The Front Room”
Data Warehousing
Spring Semester 2011
R. Marti
Data Marts in the DWh Reference Architecture
[Figure: reference architecture showing source databases, the data warehouse, data marts, and the front-end outputs (reports, interactive analysis, dashboards).]
Focus
• Design of a single “stand-alone” data mart, analyzing the performance of a single business process
• Typical queries
• Access paths and special physical data structures for these queries
R. Marti
2 Data Marts – Data Warehousing 2011
Recap: Designing Operational Databases
• Goals
• Support execution of business processes, i.e. many short and small
transactions: point and range queries, plus single-row inserts and/or updates
• Avoid (uncontrolled) redundancies => normalized schemas
• Typical database design process (see also Appendix)
• Use a variant of a “semantic” data model, usually a variant of the ER model
• Transform the result into (normalized) tables
(depending on the “flavor” of the ER model, this involves more or less work)
• Tune using indexes and other physical design options
• Maybe denormalize tables if overall throughput / performance improves
• Many practitioners do not seem to know much normalization theory
(but usually get by fairly well nonetheless!)
Some Personal DB Design Principles & Conventions
• Use of Barker’s CASE*Method Notation, with relationship line styles for
- min 0 max 1 T2-s per T1
- min 1 max 1 T2-s per T1
- min 0 max ∞ T2-s per T1
- min 1 max ∞ T2-s per T1
• On diagrams, an entity type D depending on another entity type E is (usually)
shown “below” entity type E (occasional exception: Star Schemas)
• Subtypes are often used in order to avoid nulls in attributes,
especially in foreign key attributes
• Cyclic dependencies are usually avoided
• Entity types are usually normalized (meaning they are at least in 3NF)
Recap: Relational Normalization in a Nutshell …
1NF:
“atomic” data values, no internal structure, esp. no “repeating fields”.
2NF and 3NF:
(look at functional dependencies between key and nonkey attributes)
A nonkey attribute must provide information about the key, the whole
key, and nothing but the key – so help me Codd.
(Edgar F. Codd was the “inventor” of the relational model and 1NF – 3NF.)
2NF:
Nonkey attributes must not functionally depend on a part of the key.
3NF:
Nonkey attributes must not functionally depend on nonkey attributes.
BCNF: (also looks at functional dependencies on key attributes: the determinant of every nontrivial dependency must be a superkey)
4NF:
(looks at multivalued dependencies)
5NF:
(looks at join dependencies)
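The 2NF and 3NF rules above are stated in terms of functional dependencies. As a minimal sketch, the check below tests whether a dependency is contradicted by a given set of rows. The function names and the sample rows are hypothetical, and a dependency that holds in sample data is only evidence for it, not proof that it holds in the schema.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return the lhs values that occur with more than one rhs value."""
    seen = defaultdict(set)
    for row in rows:
        determinant = tuple(row[a] for a in lhs)
        seen[determinant].add(tuple(row[a] for a in rhs))
    return {k: v for k, v in seen.items() if len(v) > 1}


def fd_holds(rows, lhs, rhs):
    """True if no pair of rows contradicts lhs -> rhs."""
    return not fd_violations(rows, lhs, rhs)


# Hypothetical rows of a product table whose key is product_id.
products = [
    {"product_id": 1, "product_group": "Laptops", "product_line": "Computers"},
    {"product_id": 2, "product_group": "Laptops", "product_line": "Computers"},
    {"product_id": 3, "product_group": "Phones", "product_line": "Telecom"},
]

# product_group -> product_line is not contradicted by the data. Both are
# nonkey attributes, so such a dependency would violate 3NF.
group_determines_line = fd_holds(products, ["product_group"], ["product_line"])
```

If the dependency is real, normalizing to 3NF would move product_group and product_line into a table of their own.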
Designing Analytic Databases (i.e., Data Marts)
• Goals
• Support analysis of one or more business processes, i.e.,
(1) queries that may need to look at a substantial amount of data –
even if they may return a relatively small (aggregated) amount of data, and
(2) relatively infrequent but often large periodic batch inserts
• Maximize query performance => use caching mechanisms – a controlled form
of redundancy – including denormalization
• Typical design process for data marts
• Separate analytic databases from operational databases
• Determine how to measure the execution of a given business process,
and look at the context (“dimensions”) of this business process
• Design a corresponding Star Schema or Snowflake Schema (see later)
• Tune using indexes and other physical design options
Analysis of Business Processes
Typical business questions (corresponding business process in parentheses):
• Gross margin by product category in February 2011? (Sales)
• Average account balance by education level? (Account Management)
• Number of sick days by employees in marketing in 2010? (Time Management in Human Resources)
• Sum of outstanding payables by supplier? (Supplier Payment Processing)
• Product return rate by customer? (Client Returns Processing)
Ingredients: measure, context (category), time, + aggregation, selection
Excursion: Different Purposes of Attributes
• identification
• categorization
often, objects can naturally be assigned to one of a relatively small number of
categories, based on the value of an attribute
(e.g. an employee can be categorized according to rank or job title)
• quantification / measurement
• calculation
• textual description
[Figure: kinds of attributes: identification attribute, category attribute, measure, calculation parameter, descriptive attribute, other attribute]
Object Identification Attribute
• An object identification attribute is an attribute which is used to uniquely identify
individual objects such as clients, employees, loss events, contracts etc.
Usually, the value of an object identification attribute is a “meaningless” number which
is not necessarily exposed to the user.
Such a “meaningless” number is also called a surrogate because it acts as a
placeholder or proxy for a real world object.
A surrogate must not change during the lifetime of the object which it identifies /
represents, and after the lifetime of this object, it must not be re-used for another
(new) object.
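The two surrogate rules (stable during the lifetime of an object, never re-used afterwards) can be sketched with a counter that only moves forward. The class and method names are hypothetical, and a real database would typically use a sequence or identity column instead.

```python
import itertools


class SurrogateAllocator:
    """Hands out meaningless integer surrogates that are never re-used."""

    def __init__(self, start=1):
        self._counter = itertools.count(start)
        self._live = {}  # natural key -> surrogate of the currently live object

    def id_for(self, natural_key):
        """Return the surrogate of a live object, allocating one if needed."""
        if natural_key not in self._live:
            self._live[natural_key] = next(self._counter)
        return self._live[natural_key]

    def retire(self, natural_key):
        """End the lifetime of an object; its surrogate is not handed out again."""
        return self._live.pop(natural_key)


allocator = SurrogateAllocator()
first = allocator.id_for("client C-1")   # new object, new surrogate
again = allocator.id_for("client C-1")   # same object, same surrogate
allocator.retire("client C-1")
other = allocator.id_for("client C-2")   # a different object
reborn = allocator.id_for("client C-1")  # treated as a new object
```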
Category Attribute
• A category attribute is an independent attribute that primarily serves to group or
segment business data captured as a result of conducting business transactions.
Categories are often arranged in taxonomic hierarchies
(see next slide).
Note:
This is sometimes also called a dimensional attribute, see later.
Examples:
• the “Line of Business” segments business data such as “Premium earned” or
“Combined ratio” as presented in the annual report.
• the “Line Management Structure” segments the “headcount” for the company.
Taxonomic relationship: A is a more specific concept than B
Calculation Parameter
• A calculation parameter is a numeric (integer, decimal, fractional, floating point)
attribute used to compute values of other attributes (typically measures, see below),
e.g.
- an index
- a factor
- a rate
Its value is usually (but not always) determined by the values of one or more category
attributes and a timestamp
Examples:
• the exchange rate USD → CHF fixed on 5 May 2006
< ‘USD’, ‘CHF’, ‘2006-05-05 04:34:53 GMT’, 1.2313 >
• the inflation rate in Switzerland expected for 2007.
< ‘SUI’, ‘2007’, 0.023 >
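The exchange-rate example of a calculation parameter can be sketched as a lookup keyed by two category attributes (the currency pair) and a timestamp. The table layout, the function names, and the rule "use the latest fixing at or before the requested time" are assumptions made for illustration. Only the USD → CHF fixing is taken from the slide.

```python
from bisect import bisect_right
from datetime import datetime, timezone
from decimal import Decimal

# (from currency, to currency) -> fixings sorted by timestamp
RATES = {
    ("USD", "CHF"): [
        (datetime(2006, 5, 5, 4, 34, 53, tzinfo=timezone.utc), Decimal("1.2313")),
    ],
}


def rate_at(from_ccy, to_ccy, when):
    """Return the latest rate fixed at or before `when` (assumed rule)."""
    fixings = RATES[(from_ccy, to_ccy)]
    timestamps = [ts for ts, _ in fixings]
    pos = bisect_right(timestamps, when)
    if pos == 0:
        raise LookupError("no rate fixed at or before the requested time")
    return fixings[pos - 1][1]


def convert(amount, from_ccy, to_ccy, when):
    """Use the calculation parameter to compute a measure in another currency."""
    return amount * rate_at(from_ccy, to_ccy, when)


noon = datetime(2006, 5, 5, 12, 0, tzinfo=timezone.utc)
amount_chf = convert(Decimal("100"), "USD", "CHF", noon)
```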
Measure
• A measure is an attribute used to express quantitative properties of objects such as
monetary amounts, magnitudes and so on.
A measure may have an associated unit, e.g., Euros, meters, tons, miles per hour, or it
may be a count, a ratio or a percentage without an associated unit.
Examples:
• the annual base salary of an employee as recorded in a payroll system is usually
treated as an observed measure.
• total annual compensation, computed from the observed measures annual base salary,
family and child allowance, and annual bonus, is a derived measure.
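The distinction between observed and derived measures can be sketched as follows. The class name and the amounts are hypothetical. The derived measure is simply the sum of the three observed measures named above.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Compensation:
    # Observed measures, as recorded in a payroll system
    annual_base_salary: Decimal
    family_and_child_allowance: Decimal
    annual_bonus: Decimal

    @property
    def total_annual_compensation(self) -> Decimal:
        """Derived measure, computed from the observed measures."""
        return (
            self.annual_base_salary
            + self.family_and_child_allowance
            + self.annual_bonus
        )


# Hypothetical amounts
comp = Compensation(Decimal("90000"), Decimal("4800"), Decimal("10000"))
total = comp.total_annual_compensation
```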
Star Schema Diagram: Example
Notes
<pi>: primary identifier
<ai>: alternative identifier
(foreign keys not shown)
Star Schema: Explanations
Dimension Tables (Product, Store, Customer)
• attributes
- surrogate (ending in _Id by convention)
- more or less natural (often mnemonic) key
(as used in operational DB)
- many descriptive fields, often text
• may violate 2NF or 3NF (e.g., FD Product_Group → Product_Line, or
FD Manager_EmpNo → Manager_Name)
• table is usually relatively “short” (few rows) but relatively “wide” (many columns)
• provide “business context”, used for categorization
Star Schema: Explanations (cont.)
Special dimension time (Day)
• attributes
- surrogate (ending in _Id by convention)
- descriptive fields (also e.g. flags like “is holiday”, “is day before holiday”)
• Note:
Dates and timestamps are built-in types in SQL, with quite a few built-in functions to
operate on them.
Treating a Day as an entity is a form of caching.
For example, selecting only Friday sales can be done using a simple filter predicate,
without invoking any function!
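The caching idea can be sketched with SQLite from Python: the day name is computed once when the Day dimension is loaded, so that queries filter on a plain column. The column names CAL_DATE, DAY_NAME and IS_WEEKEND are assumptions made for this sketch and are not taken from the schema diagram.

```python
import sqlite3
from datetime import date, timedelta

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
MONTH_CODES = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def build_day_dimension(conn, first, last):
    """Create and fill a DAY table with one row per calendar day."""
    conn.execute(
        "create table DAY (DAY_ID integer primary key, CAL_DATE text unique, "
        "DAY_NAME text, MONTH_CODE text, YEAR integer, IS_WEEKEND integer)"
    )
    current, day_id = first, 1
    while current <= last:
        conn.execute(
            "insert into DAY values (?, ?, ?, ?, ?, ?)",
            (day_id, current.isoformat(), DAY_NAMES[current.weekday()],
             MONTH_CODES[current.month - 1], current.year,
             int(current.weekday() >= 5)),
        )
        current += timedelta(days=1)
        day_id += 1


conn = sqlite3.connect(":memory:")
build_day_dimension(conn, date(2011, 1, 1), date(2011, 1, 31))

# Simple filter predicate on the cached attribute
fridays = [r[0] for r in conn.execute(
    "select CAL_DATE from DAY where DAY_NAME = 'Friday' order by DAY_ID")]

# Same selection with a date function evaluated for every row
# (SQLite's strftime('%w') yields 0 for Sunday through 6 for Saturday)
fridays_via_function = [r[0] for r in conn.execute(
    "select CAL_DATE from DAY where strftime('%w', CAL_DATE) = '5' order by DAY_ID")]
```

Both queries select the same days. The first one can also use an ordinary index on DAY_NAME.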
Star Schema: Explanations (cont.)
Fact Table (Sale)
• attributes
- measures, either basic measures,
e.g., Quantity, Amount_Per_Item,
or (redundant) derived measures,
e.g., Amount_Received := (Quantity * Amount_Per_Item) * (1 – Discount_Rate)
- foreign key attributes: surrogates (ending in _Id by convention), each of which serves
as a foreign key to one dimension table, and which collectively establish the business context
- a primary key, either a surrogate or a more or less mnemonic key
• table is usually relatively “long” (many rows) but not very “wide”
• A fact table has a grain defining the most atomic unit of data being captured, e.g.
- transaction level: measures + their context are captured for each transaction
- daily / weekly / monthly / quarterly summaries
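A minimal sketch of a transaction-grain fact table in SQLite, reduced to two dimensions. The column types, the columns PRODUCT_CODE and CAL_DATE, and the helper record_sale are assumptions for illustration. The derived measure follows the Amount_Received formula above and is stored redundantly when the row is loaded.

```python
import sqlite3

DDL = """
create table DAY (
    DAY_ID   integer primary key,          -- surrogate
    CAL_DATE text unique not null
);
create table PRODUCT (
    PRODUCT_ID    integer primary key,     -- surrogate
    PRODUCT_CODE  text unique not null,    -- natural key from the operational DB
    PRODUCT_GROUP text,
    PRODUCT_LINE  text
);
create table SALE (
    SALE_ID         integer primary key,
    DAY_ID          integer not null references DAY (DAY_ID),
    PRODUCT_ID      integer not null references PRODUCT (PRODUCT_ID),
    QUANTITY        integer not null,      -- basic measure
    AMOUNT_PER_ITEM real not null,         -- basic measure
    DISCOUNT_RATE   real not null,
    AMOUNT_RECEIVED real not null          -- redundant derived measure
);
"""


def record_sale(conn, day_id, product_id, quantity, amount_per_item, discount_rate):
    """Insert one fact row, computing the derived measure at load time."""
    amount_received = (quantity * amount_per_item) * (1 - discount_rate)
    conn.execute(
        "insert into SALE (DAY_ID, PRODUCT_ID, QUANTITY, AMOUNT_PER_ITEM, "
        "DISCOUNT_RATE, AMOUNT_RECEIVED) values (?, ?, ?, ?, ?, ?)",
        (day_id, product_id, quantity, amount_per_item, discount_rate,
         amount_received),
    )
    return amount_received


conn = sqlite3.connect(":memory:")
conn.execute("pragma foreign_keys = on")  # SQLite checks FKs only when enabled
conn.executescript(DDL)
conn.execute("insert into DAY values (1, '2011-01-03')")
conn.execute("insert into PRODUCT values (1, 'P-100', 'Laptops', 'Computers')")
stored = record_sale(conn, 1, 1, 3, 200.0, 0.25)
```

A fact row whose foreign key does not match a dimension row is rejected, which keeps every measure tied to its business context.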
Fact + Dimensions result in a multi-dimensional cube
More on Dimensions
• Usage in queries
- Provide context in reports
- Define “master-detail” organization, i.e., grouping and subgrouping criteria, often
multiple hierarchical levels
- Filter predicates (points and ranges)
- Controlling scope of aggregation
- Ordering criteria
• Often textual columns, but sometimes numerical
e.g., zip codes, sizes, ages (→ age bands), incomes (→ income bands)
• Design principle: provide rich set of (sometimes redundant) attributes to
support above usages, and ensure values are easily understood (e.g. flags!)
Example: provide components of multi-part attributes + their combination
e.g. ‘+41 79 999 9999’ as well as ‘+41’, ‘+41 79’
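The design principle can be sketched as a small enrichment step run when a dimension row is loaded. The function names, the attribute names, the assumption that phone numbers are written with blanks between their parts, and the band width of 10 years are all hypothetical.

```python
def phone_components(number):
    """Split '<country> <network> <subscriber ...>' into redundant attributes."""
    parts = number.split()
    country, network = parts[0], parts[1]
    return {
        "phone": number,
        "phone_country_code": country,
        "phone_network_prefix": f"{country} {network}",
    }


def age_band(age, width=10):
    """Map a numerical age to a textual band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


customer_attributes = phone_components("+41 79 999 9999")
customer_attributes["age_band"] = age_band(37)
```

Queries can then group or filter on phone_country_code or age_band directly, without string or arithmetic functions.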
More on Facts
• Usage in queries
- Subsets of facts are defined by including/excluding dimensions, and filtering
dimensions on specific values (see also below)
- Aggregations (often summing) based on additive measures
e.g., Sale.Amount_Received is additive while Sale.Discount is not
• Typical queries
- slicing: filter condition in one or more dimensions
- drill-down: going from coarser level of aggregation to finer (more detailed) level
- roll-up: going from finer level of aggregation to coarser level
- dicing: e.g. adding / removing / exchanging dimensions in a spreadsheet
• Design principle:
capture all measures, including redundant (derived) measures
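Why additivity matters can be sketched with two hypothetical sales: amounts can be summed across any set of facts, while a discount expressed as a rate cannot. This sketch assumes the discount is stored as a rate per sale. An overall rate has to be derived from additive measures instead.

```python
# Hypothetical facts: (quantity, amount_per_item, discount_rate)
sales = [
    (1, 100.0, 0.0),
    (3, 200.0, 0.25),
]


def amount_received(quantity, amount_per_item, discount_rate):
    return (quantity * amount_per_item) * (1 - discount_rate)


# Additive measures: summing over facts gives a meaningful total
total_received = sum(amount_received(*s) for s in sales)
total_gross = sum(q * p for q, p, _ in sales)

# Non-additive: neither the sum nor the plain average of the rates
# describes the discount granted on the whole set of sales
sum_of_rates = sum(d for _, _, d in sales)
plain_average_rate = sum_of_rates / len(sales)

# Overall rate derived from the additive measures
overall_discount_rate = 1 - total_received / total_gross
```

This is one reason to store derived additive measures such as Amount_Received in the fact table: they roll up correctly at every level of aggregation.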
Typical SQL Query on a Star Schema
select
    PRODUCT.PRODUCT_LINE,                       -- dimensions in result
    PRODUCT.PRODUCT_GROUP,
    sum(SALE.AMOUNT_RECEIVED) as TOT_AMOUNT     -- (aggregated) measure in result
from
    SALE                                        -- fact table
    join DAY                                    -- dimension tables,
        on DAY.DAY_ID = SALE.DAY_ID             -- joined on surrogate FKs / PKs
    join PRODUCT
        on PRODUCT.PRODUCT_ID = SALE.PRODUCT_ID
where
    DAY.MONTH_CODE = 'Jan'                      -- filter on dimension table
    and DAY.YEAR = 2011
group by
    PRODUCT.PRODUCT_LINE,                       -- dimensions over which to aggregate
    PRODUCT.PRODUCT_GROUP
order by
    PRODUCT.PRODUCT_LINE,
    PRODUCT.PRODUCT_GROUP
Snowflake Schema Diagram: Example
[Figure: snowflake schema diagram, with one table labelled as an “outrigger” table]
Snowflake Schema: (partial or complete) normalization of selected dimension tables of a Star Schema
Previous Query on Snowflake Schema
select
    PRODUCT_GROUP.PRODUCT_LINE,
    PRODUCT_GROUP.PRODUCT_GROUP,
    sum(SALE.AMOUNT_RECEIVED) as TOT_AMOUNT
from
    SALE
    join DAY
        on DAY.DAY_ID = SALE.DAY_ID
    join PRODUCT
        on PRODUCT.PRODUCT_ID = SALE.PRODUCT_ID
    join PRODUCT_GROUP                          -- additional join to the outrigger table
        on PRODUCT_GROUP.PRODUCT_GROUP_ID = PRODUCT.PRODUCT_GROUP_ID
where
    DAY.MONTH_CODE = 'Jan'
    and DAY.YEAR = 2011
group by
    PRODUCT_GROUP.PRODUCT_LINE,
    PRODUCT_GROUP.PRODUCT_GROUP
order by ...
The query needs an additional join compared with the Star Schema. This is even worse with
• more dimensions
• multiple dimension levels
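To compare the two variants, the sketch below loads the same hypothetical data into a star-style PRODUCT table and into a snowflaked pair of tables, then runs both queries in SQLite. The name PRODUCT_SF for the normalized product table is an assumption, needed only so that both variants fit into one database. All rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table DAY (DAY_ID integer primary key, MONTH_CODE text, YEAR integer);
create table PRODUCT (PRODUCT_ID integer primary key,
                      PRODUCT_GROUP text, PRODUCT_LINE text);
create table PRODUCT_GROUP (PRODUCT_GROUP_ID integer primary key,
                            PRODUCT_GROUP text, PRODUCT_LINE text);
create table PRODUCT_SF (PRODUCT_ID integer primary key,
                         PRODUCT_GROUP_ID integer references PRODUCT_GROUP);
create table SALE (DAY_ID integer, PRODUCT_ID integer, AMOUNT_RECEIVED real);

insert into DAY values (1, 'Jan', 2011), (2, 'Feb', 2011);
insert into PRODUCT values (1, 'Laptops', 'Computers'),
                           (2, 'Laptops', 'Computers'),
                           (3, 'Phones', 'Telecom');
insert into PRODUCT_GROUP values (10, 'Laptops', 'Computers'),
                                 (20, 'Phones', 'Telecom');
insert into PRODUCT_SF values (1, 10), (2, 10), (3, 20);
insert into SALE values (1, 1, 100.0), (1, 2, 50.0), (1, 3, 30.0), (2, 1, 999.0);
""")

STAR_QUERY = """
select PRODUCT.PRODUCT_LINE, PRODUCT.PRODUCT_GROUP,
       sum(SALE.AMOUNT_RECEIVED) as TOT_AMOUNT
from SALE
join DAY on DAY.DAY_ID = SALE.DAY_ID
join PRODUCT on PRODUCT.PRODUCT_ID = SALE.PRODUCT_ID
where DAY.MONTH_CODE = 'Jan' and DAY.YEAR = 2011
group by PRODUCT.PRODUCT_LINE, PRODUCT.PRODUCT_GROUP
order by PRODUCT.PRODUCT_LINE, PRODUCT.PRODUCT_GROUP
"""

SNOWFLAKE_QUERY = """
select PRODUCT_GROUP.PRODUCT_LINE, PRODUCT_GROUP.PRODUCT_GROUP,
       sum(SALE.AMOUNT_RECEIVED) as TOT_AMOUNT
from SALE
join DAY on DAY.DAY_ID = SALE.DAY_ID
join PRODUCT_SF on PRODUCT_SF.PRODUCT_ID = SALE.PRODUCT_ID
join PRODUCT_GROUP
  on PRODUCT_GROUP.PRODUCT_GROUP_ID = PRODUCT_SF.PRODUCT_GROUP_ID
where DAY.MONTH_CODE = 'Jan' and DAY.YEAR = 2011
group by PRODUCT_GROUP.PRODUCT_LINE, PRODUCT_GROUP.PRODUCT_GROUP
order by PRODUCT_GROUP.PRODUCT_LINE, PRODUCT_GROUP.PRODUCT_GROUP
"""

star_rows = conn.execute(STAR_QUERY).fetchall()
snowflake_rows = conn.execute(SNOWFLAKE_QUERY).fetchall()
```

Both queries return the same rows. The snowflake variant pays for the removed redundancy with one more join per normalized level.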