Storage Characteristics of Call Data Records in Column Store Databases

Published on February 2017

 

STORAGE CHARACTERISTICS OF CALL DATA RECORDS IN COLUMN STORE DATABASES

DAVID M WALKER
DATA MANAGEMENT & WAREHOUSING

 

OVERVIEW

• This presentation gives a brief overview of the storage characteristics of Call Data Records in Column Store Databases
• It discusses:
  • What are Call Data Records (CDRs)?
  • What is a Column Store Database?
  • How efficient is a column store database for storing CDR and other (similar) machine generated data?
• It does not:
  • Examine performance in any detail
  • Compare column store to traditional row-based databases





Jan 2012

© 2012 Data Management & Warehousing

 

WHAT ARE CALL DATA RECORDS (CDRs)?

• Every time a telephone call is made, data about that call is recorded. At its most basic this will include:
  • The Calling Number (who made the call)
  • The Called Number (who was called)
  • The Start Time
  • The End Time (or the duration)
  • Various pieces of technical information (which switch was used, mobile handset identifier, mobile network, call direction, is it a free x800 type call etc.)
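As a minimal sketch, the basic fields listed above could be modelled as follows. The class name `BasicCDR`, the field names and the sample numbers are illustrative assumptions and do not reflect the format of any particular switch or billing system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class BasicCDR:
    """Hypothetical minimal call data record; field names are illustrative only."""
    calling_number: str   # who made the call
    called_number: str    # who was called
    start_time: datetime
    end_time: datetime

    @property
    def duration(self) -> timedelta:
        # A CDR holds either the end time or the duration; one can be derived from the other.
        return self.end_time - self.start_time


cdr = BasicCDR(
    calling_number="441632960001",   # made-up sample values
    called_number="441632960002",
    start_time=datetime(2012, 1, 9, 10, 15, 0),
    end_time=datetime(2012, 1, 9, 10, 17, 30),
)
print(cdr.duration.total_seconds())  # 150.0
```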


 

CDRs AT MULTIPLE LEVELS

• A CDR is created at the switch. Each switch involved in a call creates its own CDRs; these are often called Network CDRs
• The Network CDRs are joined together into a single end-to-end call record through a process known as mediation. These are Unrated CDRs
• Finally the cost of the call is calculated and added to the Unrated CDRs to create Rated CDRs




 

MORE CDR COMPLEXITY

• There are CDRs that are used for billing the subscriber, often called Retail CDRs
• There are also CDRs that are used to charge other operators when their call travels over your network (e.g. when you make a mobile call that finishes on a land line belonging to another operator). These are known as Interconnect CDRs or Wholesale CDRs
• There are also differences between Mobile and Fixed (Land) Line CDRs
• Finally each Switch Manufacturer (there are over 60) and each Mediation and/or Billing system (again at least 50) uses its own format




 

FOR THIS EXERCISE …

• We are using Mobile Rated Interconnect CDRs from a European Telephone Company (Telco)
• We have 12,902 files, containing 435,242,447 CDRs over a 181 day period from 482,883 subscribers
• Each CDR has 80 fields and 583 characters in a fixed length record format file. We have also added one mandatory field to hold the name of the source file from which the record came




 

DATA DISTRIBUTION IN THE CDR RECORDS (1)

• The structure of the data in the record has a massive impact on its storage. There are a number of factors to look at:
  • Data Types, Padding, Place Holders and Data Cardinality
• The example data we are using has 2 Datetime fields, 11 Char fields, 10 Numeric fields, 33 Integer fields and 25 Varchar fields, which is a fairly typical mix for this type of machine generated data. In the source file these are all held as ASCII text.


 

DATA DISTRIBUTION IN THE CDR RECORDS (2)

• Fixed length records are padded. In our data set the ‘Calling Number’ fixed length field is defined as 24 characters long, however the maximum field length in the actual data is only 11 characters. This means that there are always at least 13 space characters of padding afterwards
• 24 of our 80 fields have no information in them at all, 43 of the fields are mandatory and are 100% populated. The remaining 13 fields have between 25% and 75% of the records filled.
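The effect of padding can be sketched in a few lines. Only the 24-character width of the ‘Calling Number’ field comes from the slide; the field's offset within the record, the function names and the sample value are assumptions made for illustration.

```python
# Hypothetical layout: only the 24-character width of 'Calling Number' comes from
# the text above; its offset within the record is an assumption.
CALLING_NUMBER_OFFSET = 0
CALLING_NUMBER_WIDTH = 24


def read_calling_number(record: str) -> str:
    """Slice the fixed-length field out of a record and drop the space padding."""
    raw = record[CALLING_NUMBER_OFFSET:CALLING_NUMBER_OFFSET + CALLING_NUMBER_WIDTH]
    return raw.rstrip(" ")


def padding_chars(value: str, width: int = CALLING_NUMBER_WIDTH) -> int:
    """Number of padding characters stored for a value in a fixed-length field."""
    return width - len(value)


# Made-up 11-character number followed by the rest of a record.
sample_record = "44163296000".ljust(CALLING_NUMBER_WIDTH) + "rest of the record..."
number = read_calling_number(sample_record)
print(repr(number))            # '44163296000'
print(padding_chars(number))   # 13
```

For an 11-character number, more than half of the 24 stored characters are padding, which is why a store that keeps only the trimmed value saves space before any compression is applied.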


 

DATA DISTRIBUTION IN THE CDR RECORDS (3)

• Finally the number of distinct values (cardinality) a field has affects storage. One flag field has possible values of 0 or 1 and therefore a (low) cardinality of 2; another field has a nearly unique value for every record and therefore a very high cardinality. Of the 57 fields with data there are 20 fields with high cardinality, 5 fields with medium cardinality and the remaining 32 fields have a low cardinality
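Cardinality is simply the count of distinct values per field. The sketch below measures it for a handful of made-up rows; the column names and values are hypothetical, and the thresholds that separate low, medium and high cardinality are not given in the presentation, so none are assumed here.

```python
from collections import defaultdict


def column_cardinality(rows):
    """Return {column name: number of distinct values} for a list of dict records."""
    distinct = defaultdict(set)
    for row in rows:
        for column, value in row.items():
            distinct[column].add(value)
    return {column: len(values) for column, values in distinct.items()}


# Made-up sample rows: a 0/1 flag (low cardinality) and a near-unique reference.
rows = [
    {"free_call_flag": 0, "call_reference": "A0001"},
    {"free_call_flag": 1, "call_reference": "A0002"},
    {"free_call_flag": 0, "call_reference": "A0003"},
    {"free_call_flag": 0, "call_reference": "A0004"},
]
print(column_cardinality(rows))  # {'free_call_flag': 2, 'call_reference': 4}
```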


 

WHAT IS A COLUMN STORE DATABASE?

• Traditionally databases are ‘row-based’, i.e. the fields of data in a record are stored next to each other.

Forename  Surname  Gender
David     Walker   Male
Helen     Walker   Female
Sheila    Jones    Female

• Column store databases store the values in columns and then hold a mapping to form the record
• This is transparent to the user, who queries a table with SQL in exactly the same way as they would a row-based database






 

COLUMN STORAGE

First Name
Value   F Token
David   PPP
Helen   QQQ
Sheila  RRR

Surname
Value   S Token
Jones   XXX
Walker  YYY

Gender
Value   G Token
Female  AAA
Male    BBB

Mapping (one row per record)
F Token  S Token  G Token
PPP      YYY      BBB
QQQ      YYY      AAA
RRR      XXX      AAA

Note: To the user this appears as a conventional row-based table that can be queried by standard SQL; it is only the underlying storage that is different
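The token scheme on this slide is a form of dictionary encoding, and the idea can be sketched as below. This is an illustration of the general technique only, not a description of how Sybase IQ or any other product stores data internally; the function names are hypothetical and integer tokens stand in for the PPP/QQQ-style tokens.

```python
def encode_columns(rows, columns):
    """Dictionary-encode each column: store each distinct value once, plus a token per row."""
    dictionaries = {c: {} for c in columns}   # column -> {value: token}
    mapping = []                              # one tuple of tokens per original row
    for row in rows:
        tokens = []
        for column, value in zip(columns, row):
            lookup = dictionaries[column]
            # Reuse the existing token, or assign the next integer to a new value.
            token = lookup.setdefault(value, len(lookup))
            tokens.append(token)
        mapping.append(tuple(tokens))
    return dictionaries, mapping


def decode_row(dictionaries, mapping, columns, index):
    """Rebuild one record from the mapping table and the per-column dictionaries."""
    row = []
    for column, token in zip(columns, mapping[index]):
        reverse = {t: v for v, t in dictionaries[column].items()}
        row.append(reverse[token])
    return tuple(row)


columns = ("Forename", "Surname", "Gender")
rows = [
    ("David", "Walker", "Male"),
    ("Helen", "Walker", "Female"),
    ("Sheila", "Jones", "Female"),
]
dictionaries, mapping = encode_columns(rows, columns)
print(dictionaries["Surname"])   # {'Walker': 0, 'Jones': 1}
print(mapping)                   # [(0, 0, 0), (1, 0, 1), (2, 1, 1)]
print(decode_row(dictionaries, mapping, columns, 1))  # ('Helen', 'Walker', 'Female')
```

‘Walker’ and ‘Female’ are each stored once even though they appear in two records; with hundreds of millions of records and low cardinality fields the saving becomes substantial.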


 

EFFICIENCIES OF COLUMN STORE DATABASES

• Column store databases offer significant storage optimisation opportunities, especially where there are low or medium cardinality character strings (e.g. the telephone numbers and reference data), because long strings are not repeatedly stored
• In addition it is possible to compress the data in column stores very efficiently
• In some column store implementations the column storage holds additional metadata that can be used to speed up specific queries (e.g. the number of records associated with each value in a column)
• Reducing the data volume stored means reduced I/O when querying the database, which in turn improves query performance










 

INEFFICIENCIES OF COLUMN STORE DATABASES

• In general manipulating individual rows for updates is expensive, as the database has to go to each of the columns and then update the mapping table
• Some column store databases have specific technologies to limit the impact of this by caching updates
• Consequently Column Store Databases are not efficient for OLTP type applications. They are, however, very efficient for DWH/BI/Archive type applications because the data is bulk loaded rather than inserted row by row, is not frequently updated, and is used in large set based queries


 

HOW EFFICIENT IS IT TO STORE THIS DATA?

• What hardware was used and what would be needed for a production environment?
• How was the data loaded?
• What were the storage characteristics?




 

THE TEST ENVIRONMENT

• The test environment was designed to measure storage and not system performance
• This test was done using Sybase IQ 15.4
  • Sybase has had a column storage database called IQ since 1996, and it is one of the most established of the 25 or so currently listed on Wikipedia
• The server was running CentOS 5.7 x64, a Red Hat Linux derivative
• The hardware consisted of:
  • Intel Xeon Quad-Core X3363
  • 16GB Memory
  • Adaptec 5405 RAID Controller with 2x 1TB 7200rpm Hard Disk (RAID1)
• The database was built on file systems rather than raw devices
• Total hardware cost was less than US$3000
• Software licences were provided on evaluation


 

A PRODUCTION ENVIRONMENT?

• What is needed to make this into a production environment would depend on the volume of data per month, the number of months of data to be held and the type of CDR
• The biggest performance driver would be to have more disk spindles, by adding more (faster) drives, or to use solid state disks. This would improve performance as well as adding greater capacity
  • e.g. 16 1TB drives in a RAID10 configuration would provide around 7.75TB of space and store 75 Billion of these CDRs
• Using raw devices instead of file systems would also improve performance
• Other performance enhancements would include:
  • Moving from 1 to 2 or 4 Quad Core CPUs
  • Adding another 16GB of memory


 

LOADING THE DATA

• The data was loaded using PELT, an ETL tool written and used by Data Management & Warehousing
• The loading was done to production level quality
• Data is loaded into a load table (CDR_LOAD) which has a view (CDR_CONVERT) over it that applies data quality checks. The data is then selected from the view and inserted into the main table (CDRs)
• Each step is fully logged and audited
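The load table, view and main table pattern can be sketched with SQLite from the Python standard library standing in for the column store. The three columns, the trimming and casting rules and the rejection rule in the view are all hypothetical; the real CDR_CONVERT view and the PELT tool are not shown in the presentation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Load table: everything arrives as text, exactly as in the source file.
    CREATE TABLE CDR_LOAD (calling_number TEXT, called_number TEXT, duration TEXT);

    -- Main table with typed columns.
    CREATE TABLE CDRS (calling_number TEXT, called_number TEXT, duration_seconds INTEGER);

    -- View applying hypothetical data quality rules: trim padding, cast the
    -- duration, and exclude rows with no calling number.
    CREATE VIEW CDR_CONVERT AS
        SELECT TRIM(calling_number)            AS calling_number,
               TRIM(called_number)             AS called_number,
               CAST(TRIM(duration) AS INTEGER) AS duration_seconds
        FROM CDR_LOAD
        WHERE TRIM(calling_number) <> '';
""")

conn.executemany(
    "INSERT INTO CDR_LOAD VALUES (?, ?, ?)",
    [
        ("44163296000   ", "44163296001   ", "  150"),
        ("              ", "44163296002   ", "   30"),  # fails the quality rule
    ],
)
conn.execute("INSERT INTO CDRS SELECT * FROM CDR_CONVERT")
conn.execute("DELETE FROM CDR_LOAD")  # SQLite has no TRUNCATE; DELETE empties the table
conn.commit()

print(conn.execute("SELECT * FROM CDRS").fetchall())
# [('44163296000', '44163296001', 150)]
```

Keeping the checks in a view means the raw load stays a simple bulk operation, and the conversion logic lives in one place that can be audited.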




 

THE LOADING STEPS

• Copy a compressed (Unix Compress .Z) flat file (as provided) from the incoming directory to the workspace
• Record the size of the .Z file in bytes
• Uncompress the file
• Record the size in bytes and the number of records in the uncompressed file
• Use iSQL ‘Load’ command to insert the data into a CDR_LOAD table
• Record the size of the CDR_LOAD table in kilobytes
• Insert into the main CDR table from the DQ view CDR_CONVERT over the CDR_LOAD table
• Record the size of the CDR table in kilobytes
• Truncate the CDR_LOAD table
• Compress the source file with ‘gzip -9’ (maximum compression, longest execution)
• Record the size of the .gz file in bytes
• Move the compressed .gz file to an archive directory
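The measuring and archiving steps at the end of this list can be sketched with the Python standard library. This is not the PELT implementation: the function name, file names and sample records are made up, the sketch assumes one record per line, and `compresslevel=9` is used as the counterpart of ‘gzip -9’.

```python
import gzip
import os
import shutil
import tempfile


def archive_file(source_path: str, archive_dir: str) -> dict:
    """Compress a flat file at maximum gzip level, record sizes, move it to the archive."""
    raw_bytes = os.path.getsize(source_path)
    with open(source_path, "rb") as src:
        record_count = sum(1 for _ in src)          # assumes one record per line

    gz_path = source_path + ".gz"
    with open(source_path, "rb") as src, gzip.open(gz_path, "wb", compresslevel=9) as dst:
        shutil.copyfileobj(src, dst)                # same compression level as 'gzip -9'
    gz_bytes = os.path.getsize(gz_path)

    os.remove(source_path)                          # gzip also removes the original
    archived = shutil.move(gz_path, archive_dir)
    return {"raw_bytes": raw_bytes, "records": record_count,
            "gz_bytes": gz_bytes, "archived_as": archived}


# Demonstration with made-up, heavily padded records in temporary directories.
workspace = tempfile.mkdtemp()
archive = tempfile.mkdtemp()
sample = os.path.join(workspace, "cdr_sample.dat")
with open(sample, "w", newline="\n") as f:
    for i in range(1000):
        f.write(f"{44163296000 + i:<24}{'0':<8}\n")

stats = archive_file(sample, archive)
print(stats["records"], stats["raw_bytes"], stats["gz_bytes"])
```

Recording the sizes at each step is what makes it possible to report compression ratios for every storage form afterwards.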










 

RESULTS

• 12,902 files were loaded with zero data quality errors
• 435,583,388 CDRs
• 236.50 GB of raw files
• 20.03 GB of storage in the original .Z files
  • 11.8:1 Compression Ratio
• 12.42 GB of storage in the archive .gz files
  • 19.0:1 Compression Ratio
• 27.48 GB of un-indexed storage in the database
  • 8.6:1 Compression Ratio
• 41.47 GB of fully indexed storage in the database
  • 5.7:1 Compression Ratio
• Loading: 33 hours, 22 minutes, 12 seconds
• Indexing: 2 hours, 13 minutes, 9 seconds
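Each compression ratio above is the raw file size divided by the stored size. The short check below reproduces them from the reported figures; the variable names are illustrative.

```python
RAW_GB = 236.50  # size of the raw files, from the results above

stored_gb = {
    "original .Z files": 20.03,
    "archive .gz files": 12.42,
    "database, un-indexed": 27.48,
    "database, fully indexed": 41.47,
}

# Ratio = raw size / stored size, rounded to one decimal place as on the slide.
ratios = {name: round(RAW_GB / size, 1) for name, size in stored_gb.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio}:1")
```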




 

ADDING INDEXES

• By default the table has no indexes
  • This is the same in most databases
• For this test every field was indexed
  • This added 63 indexes that took up an additional 24GB
• The total space used was still 5.7 times smaller than the space used by the raw files
• These indexes would significantly improve query performance
• However not all the indexes would be required in a production system, as not all fields would be actively queried, and this would reduce the space used




 

DISK SPACE USED


 

LOAD PERFORMANCE

• The average file had 33,760 records
• The ETL to load an average file took 11 seconds
  • 2 seconds to copy to the working directory and decompress
  • 3 seconds import into CDR_LOAD table
  • 3 seconds copy from CDR_CONVERT table to CDRS table
  • 2 seconds to gzip -9 and archive
  • 1 second logging and truncating tables
• None of the tables were indexed during the load
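The figures in this presentation allow a rough throughput calculation, sketched below with illustrative variable names. It uses the file count, CDR count and total loading time from the results slide. The per-step timings are whole seconds, so their sum of 11 seconds need not match the average implied by the total load time exactly.

```python
from datetime import timedelta

FILES = 12_902
RECORDS = 435_583_388
load_time = timedelta(hours=33, minutes=22, seconds=12)   # total loading time

# Per-step timings for an average file, in seconds, as listed above.
step_seconds = {"copy and decompress": 2, "import": 3, "convert": 3,
                "gzip and archive": 2, "log and truncate": 1}

load_seconds = load_time.total_seconds()
seconds_per_file = load_seconds / FILES
records_per_second = RECORDS / load_seconds
records_per_file = RECORDS / FILES

print(f"sum of steps: {sum(step_seconds.values())} s")
print(f"{seconds_per_file:.1f} s per file from the total load time")
print(f"{records_per_second:,.0f} records per second")
print(f"{records_per_file:,.0f} records per file")
```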




 

OBSERVATIONS (1)

• The results were approximately in the middle of our expectations and previous experience of other similar data sets, where the raw data has been compressed between 5 and 10 times
• Even low end hardware gives acceptable load performance suitable for archive functionality, but production scale hardware is needed for BI/DWH






 

OBSERVATIONS (2)

• Some database tuning techniques are needed for truly massive data sets but can be designed in from the outset at low cost (e.g. which indexes/index types)
• It is worth considering putting each month (or some other similar date based partitioning) in separate tables for systems management purposes, as it makes it easy to remove the data at the end of the archiving process
• Smaller reference tables added to the schema would have little/no compression, but they are also very small and would therefore not contribute greatly to the space used






 

ALTERNATIVE SCENARIOS

• This presentation uses information gathered on specific data used for a specific purpose by a client
• Companies may wonder how their data would work in both storage and performance terms
• Vendors may also wonder how their technologies compare in both storage and performance terms
• If you are interested in finding out please contact us with these or any other Data Warehousing/Business Intelligence enquiries










 

CONTACT US

• Data Management & Warehousing
  • Website: http://www.datamgmt.com
  • Telephone: +44 (0) 118 321 5930
• David Walker
  • E-Mail: [email protected]
  • Telephone: +44 (0) 7990 594 372
  • Skype: datamgmt
• White Papers: http://scribd.com/davidmwalker




 

ABOUT US

Data Management & Warehousing is a UK based consultancy that has been delivering successful business intelligence and data warehousing solutions since 1995. Our consultants have worked with major corporations around the world including the US, Europe, Africa and the Middle East. We have worked in many industry sectors such as telcos, manufacturing, retail, financial and transport. We provide governance and project management as well as expertise in the leading technologies.


 

THANK YOU

© 2012 - DATA MANAGEMENT & WAREHOUSING
HTTP://WWW.DATAMGMT.COM
