Aggregation


Topics: Aggregation, Historical Information, Query Facility, OLAP Functions and Tools, OLAP Servers, ROLAP, MOLAP, HOLAP, Data Mining Interface, Security, Backup and Recovery, Tuning the Data Warehouse, Testing the Data Warehouse.

Aggregation

Aggregations are a way of dividing information so that queries can be run on the aggregated part rather than on the whole set of data. The warehouse manager is responsible for creating aggregations. Most aggregations can be created with a single complex query, but precomputing the result once saves having to rerun that query every day. An example would be an aggregation that keeps track of all the customers who have bought athletic shoes in the past year. This would allow you to run queries about which people have bought Nike shoes or which people have bought shoes over $100. Having the aggregation saves you from waiting while a query searches through every customer to see who bought athletic shoes and then checks the price or brand. Having preaggregated data improves performance and allows users to spot trends that might otherwise have gone unnoticed. [1]

Aggregations also aid in the process of creating summary tables, which speed up query time by storing aggregated values in columns. If a department store wants to keep track of weekly sales, there would be an aggregation of total sales for each product at each store location. The summary table might consist of a product id number, a store id number, the total revenue for the product for that week, and the quantity sold. Using the aggregation to obtain the sales figures quickly saves time and makes updating the summary table easier. The summary tables need a date on them because, as access to a table diminishes, the table will be deleted to save space. Summary tables are always changing along with the needs of the users, so it is important to define the aggregations according to which summary tables might be of use.
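To make the department store example concrete, here is a minimal PostgreSQL-style sketch. All table and column names (sales_fact, weekly_sales_summary, and so on) are illustrative assumptions, not taken from the text above.

    -- Hypothetical weekly sales summary table for the department store example.
    CREATE TABLE weekly_sales_summary (
        product_id     INTEGER       NOT NULL,
        store_id       INTEGER       NOT NULL,
        week_start     DATE          NOT NULL,  -- date stamp so stale tables can be identified and dropped
        total_revenue  DECIMAL(12,2),
        quantity_sold  INTEGER,
        PRIMARY KEY (product_id, store_id, week_start)
    );

    -- Populate it once from the granular fact table; queries then read the
    -- small summary instead of rescanning every individual sale.
    INSERT INTO weekly_sales_summary
    SELECT product_id,
           store_id,
           CAST(DATE_TRUNC('week', sale_date) AS DATE) AS week_start,
           SUM(sale_amount)  AS total_revenue,
           SUM(quantity)     AS quantity_sold
    FROM   sales_fact
    GROUP BY product_id, store_id, DATE_TRUNC('week', sale_date);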

Aggregates

Aggregates are precalculated summaries derived from the most granular fact table. These summaries form a set of separate aggregate fact tables. You may create each aggregate fact table as a specific summarization across any number of dimensions. Let us begin by examining a sample STAR schema. Choose a simple STAR schema with the fact table at the lowest possible level of granularity, and assume there are four dimension tables surrounding this most granular fact table. Figure 11-11 shows the example we want to examine.

When you run a query in an operational system, it produces a result set about a single customer, a single order, a single invoice, a single product, and so on. But, as you know, queries in a data warehouse environment produce large result sets. These queries retrieve hundreds or thousands of table rows, manipulate the metrics in the fact tables, and then produce the result sets. The manipulation of the fact table metrics may be a simple addition, an addition with some adjustments, a calculation of averages, or even an application of complex arithmetic algorithms.

Let us review a few typical queries against the sample STAR schema shown in Figure 11-11.

Query 1: Total sales for customer number 12345678 during the first week of December 2000 for product Widget-1.

Query 2: Total sales for customer number 12345678 during the first three months of 2000 for product Widget-1.

Query 3: Total sales for all customers in the South-Central territory for the first two quarters of 2000 for product category Bigtools.

Scrutinize these queries and determine how the totals will be calculated in each case. The totals are calculated by adding the sales quantities and sales dollars from the qualifying rows of the fact table. In each case, let us review the qualifying rows that contribute to the total in the result set.

Query 1: All fact table rows where the customer key relates to customer number 12345678, the product key relates to product Widget-1, and the time key relates to the seven days in the first week of December 2000. Assuming that a customer may make at most one purchase of a single product in a single day, at most 7 fact table rows participate in the summation.

Query 2: All fact table rows where the customer key relates to customer number 12345678, the product key relates to product Widget-1, and the time key relates to the roughly 90 days of the first quarter of 2000. Under the same assumption, about 90 fact table rows or fewer participate in the summation.

Query 3: All fact table rows where the customer key relates to all customers in the South-Central territory, the product key relates to all products in the product category Bigtools, and the time key relates to about 180 days in the first two quarters of 2000. In this case, clearly a large number of fact table rows participate in the summation.

Obviously, Query 3 will run long because of the large number of fact table rows to be retrieved. What can be done to reduce the query time? This is where aggregate tables can be helpful. Before we discuss aggregate fact tables in detail, let us review the sizes of some typical fact tables in real-world data warehouses.
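As a rough illustration of why Query 3 benefits from an aggregate table, compare the two PostgreSQL-style queries below. The schema names (sales_fact, customer_dim, territory_category_quarter_agg, and so on) are assumptions made for this sketch, not part of the example in the text.

    -- Query 3 against the most granular fact table: a very large number of
    -- fact rows qualify and must be summed.
    SELECT SUM(f.sales_dollars) AS total_sales
    FROM   sales_fact   f
    JOIN   customer_dim c ON c.customer_key = f.customer_key
    JOIN   product_dim  p ON p.product_key  = f.product_key
    JOIN   time_dim     t ON t.time_key     = f.time_key
    WHERE  c.territory        = 'South-Central'
    AND    p.product_category = 'Bigtools'
    AND    t.year = 2000 AND t.quarter IN (1, 2);

    -- The same question against a hypothetical aggregate fact table already
    -- summarized to territory / category / quarter: only a handful of rows.
    SELECT SUM(sales_dollars) AS total_sales
    FROM   territory_category_quarter_agg
    WHERE  territory        = 'South-Central'
    AND    product_category = 'Bigtools'
    AND    year = 2000 AND quarter IN (1, 2);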
Summaries and Aggregates
Data warehouse customers always have a common complaint: performance. Data warehouses always have a common problem: performance. Database tuning, SQL tuning, indexing, and optimizer improvements all increase the performance of a data warehouse. Two methods, though, are applied in almost every data warehouse: Summaries and Aggregates.

A Summary is a table that stores the results of a SQL arithmetic SUM statement that has been applied to a Fact table. The arithmetic portion of a Fact table is summed while, simultaneously, one or more hierarchical levels of detail are removed from the data in the Fact table. For example:

Intraday Fact data is summed at the Day level. The resulting data is stored in a Daily Summary table. For that data, the lowest grain is the Day.

Store Fact data is summed at the Region level. The resulting data is stored in a Region Summary table. For that data, the lowest grain is the Region.

The intention of a Summary table is to perform the summation of arithmetic Fact data only once, rather than many times. By incurring the resource consumption necessary to summarize a Fact table once, data warehouse customers receive the previously summarized data they want quickly.

An Aggregate is a table that stores the results of SQL JOIN statements that have been applied to a set of Dimension tables. The hierarchies and attributes above an entity are prejoined and stored in a table. For example:

The Product entity and the levels of product and management hierarchy above it are prejoined into a single table that stores the result set. The grain of this result set is the Product.

The Facility entity and its levels of geographic and management hierarchy are prejoined into a single table that stores the result set. The grain of this result set is the Facility.

The intention of an Aggregate table is to perform the joins of large sets of Dimension data only once. By incurring the resource consumption necessary to join a series of Dimension tables, data warehouse customers receive data that uses those levels of hierarchy quickly.

An Aggregate is not a pure Dimension table as it would appear in a Dimensional Data Model. An Aggregate is a physical table that holds the result set of join statements that are commonly used by data warehouse customers and are high system resource consumers. The point of an Aggregate is to incur the high system resource consumption once, during off-peak hours, to avoid multiple consumptions of system resources during peak hours. That being the case, an Aggregate table can denormalize along multiple hierarchies. The intersection of those multiple hierarchies is the grain of an Aggregate table. The hierarchical intersection and the lowest level of granular detail must be the same because together they are the grain of an Aggregate table.
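The two techniques can be sketched in SQL. Both statements below are minimal PostgreSQL-style illustrations; every table and column name is an assumption made for the example.

    -- Summary: intraday Fact data summed at the Day level, removing the
    -- intraday level of detail. The lowest grain of the result is the Day.
    INSERT INTO daily_sales_summary (product_key, store_key, sale_date, sales_dollars, sales_units)
    SELECT product_key,
           store_key,
           CAST(sale_timestamp AS DATE) AS sale_date,
           SUM(sales_dollars),
           SUM(sales_units)
    FROM   intraday_sales_fact
    GROUP BY product_key, store_key, CAST(sale_timestamp AS DATE);

    -- Aggregate: the Product entity prejoined with the hierarchy levels above
    -- it, stored once so peak-hour queries avoid repeating the joins.
    -- The grain of the result set is the Product.
    CREATE TABLE product_agg AS
    SELECT p.product_key,
           p.product_name,
           s.subcategory_name,
           c.category_name,
           d.department_name
    FROM   product_dim     p
    JOIN   subcategory_dim s ON s.subcategory_key = p.subcategory_key
    JOIN   category_dim    c ON c.category_key    = s.category_key
    JOIN   department_dim  d ON d.department_key  = c.department_key;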

On-Line Analytical Processing (OLAP) is a category of software technology that enables analysts, managers, and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user. OLAP functionality is characterized by dynamic multi-dimensional analysis of consolidated enterprise data supporting end-user analytical and navigational activities including:

calculations and modeling applied across dimensions, through hierarchies, and/or across members
trend analysis over sequential time periods
slicing subsets for on-screen viewing
drill-down to deeper levels of consolidation
reach-through to underlying detail data
rotation to new dimensional comparisons in the viewing area
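To make the drill-down and consolidation items above concrete, the PostgreSQL-style sketch below uses GROUP BY ROLLUP to compute the consolidated totals a user drills through; the schema names are illustrative assumptions.

    -- One pass produces totals at the month, quarter, year, and grand-total
    -- levels, i.e., the successive levels of consolidation for drill-down.
    SELECT t.year,
           t.quarter,
           t.month,
           SUM(f.sales_dollars) AS total_sales
    FROM   sales_fact f
    JOIN   time_dim   t ON t.time_key = f.time_key
    GROUP BY ROLLUP (t.year, t.quarter, t.month)
    ORDER BY t.year, t.quarter, t.month;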

OLAP is implemented in a multi-user client/server mode and offers consistently rapid response to queries, regardless of database size and complexity. OLAP helps the user synthesize enterprise information through comparative, personalized viewing, as well as through analysis of historical and projected data in various "what-if" data model scenarios. This is achieved through the use of an OLAP Server.

OLAP allows business users to slice and dice data at will. Normally, data in an organization is distributed across multiple data sources that are incompatible with each other. A retail example: point-of-sale data and sales made via the call center or the Web are stored in different locations and formats. It would be a time-consuming process for an executive to obtain OLAP reports such as "What are the most popular products purchased by customers between the ages of 15 and 30?" Part of the OLAP implementation process involves extracting data from the various data repositories and making them compatible. Making data compatible involves ensuring that the meaning of the data in one repository matches all other repositories. An example of incompatible data: customer ages can be stored as a birth date for purchases made over the Web but as age categories (e.g., between 15 and 30) for in-store sales.

It is not always necessary to create a data warehouse for OLAP analysis. Data stored by operational systems, such as point-of-sale systems, is held in a type of database called an OLTP. OLTP (Online Transaction Processing) databases are not structurally different from other databases; the main, and only, difference is the way in which the data is stored. Examples of OLTP systems include ERP, CRM, SCM, point-of-sale, and call center applications. OLTPs are designed for optimal transaction speed. When a consumer makes a purchase online, they expect the transaction to occur instantaneously. With a database design (data model) optimized for transactions, the record 'Consumer Name, Address, Telephone, Order Number, Order Name, Price, Payment Method' is created quickly in the database, and the results can be recalled by managers equally quickly if needed.
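The executive's report above also shows why data must first be made compatible. In the PostgreSQL-style sketch below, all names (web_sales, store_sales, customer_dim, and the '15-30' category label) are assumed for illustration: web purchases carry a birth date while in-store purchases carry an age category, so the two must be reconciled before the popularity question can be answered.

    -- Harmonize the two age representations, then rank products by purchases
    -- made by customers aged 15 to 30.
    SELECT p.product_name,
           COUNT(*) AS purchases
    FROM (
        -- Web sales: derive the age bracket from the stored birth date.
        SELECT w.product_key
        FROM   web_sales w
        JOIN   customer_dim c ON c.customer_key = w.customer_key
        WHERE  DATE_PART('year', AGE(c.birth_date)) BETWEEN 15 AND 30
        UNION ALL
        -- In-store sales: the age category is already stored as a label.
        SELECT product_key
        FROM   store_sales
        WHERE  age_category = '15-30'
    ) unified
    JOIN product_dim p ON p.product_key = unified.product_key
    GROUP BY p.product_name
    ORDER BY purchases DESC;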

OLAP SERVER
An OLAP server is a high-capacity, multi-user data manipulation engine specifically designed to support and operate on multi-dimensional data structures. A multi-dimensional structure is arranged so that every data item is located and accessed based on the intersection of the dimension members that define that item. The design of the server and the structure of the data are optimized for rapid ad-hoc information retrieval in any orientation, as well as for fast, flexible calculation and transformation of raw data based on formulaic relationships. The OLAP Server may physically stage the processed multi-dimensional information to deliver consistent and rapid response times to end users, or it may populate its data structures in real time from relational or other databases, or it may offer a choice of both. Given the current state of technology and the end-user requirement for consistent and rapid response times, staging the multi-dimensional data in the OLAP Server is often the preferred method.
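One relational (ROLAP-style) way to stage processed multi-dimensional information is a materialized view that is rebuilt off-peak. This is only a sketch of the staging idea under that assumption; the names are illustrative, and the syntax is PostgreSQL's.

    -- A precomputed cube slice staged as a materialized view.
    CREATE MATERIALIZED VIEW sales_cube AS
    SELECT c.territory,
           p.product_category,
           t.year,
           t.quarter,
           SUM(f.sales_dollars) AS sales_dollars,
           SUM(f.sales_units)   AS sales_units
    FROM   sales_fact   f
    JOIN   customer_dim c ON c.customer_key = f.customer_key
    JOIN   product_dim  p ON p.product_key  = f.product_key
    JOIN   time_dim     t ON t.time_key     = f.time_key
    GROUP BY c.territory, p.product_category, t.year, t.quarter;

    -- Rebuilt during off-peak hours so end users get consistent, rapid
    -- response times during the day.
    REFRESH MATERIALIZED VIEW sales_cube;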
