Column-Store Strings


 

Cache Conscious Column Organization in In-Memory Column Stores

David Schwalb, Jens Krüger, Hasso Plattner

Technische Berichte Nr. 67 des Hasso-Plattner-Instituts für Softwaresystemtechnik an der Universität Potsdam

 

 

 


 

 

 

Technische Berichte des Hasso-Plattner-Instituts für Softwaresystemtechnik an der Universität Potsdam | 67

David Schwalb | Jens Krüger | Hasso Plattner

Cache Conscious Column Organization in In-Memory Column Stores

Universitätsverlag Potsdam

 

Bibliographic information published by the Deutsche Nationalbibliothek: the Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available online at http://dnb.de/.

Universitätsverlag Potsdam 2013
http://verlag.ub.uni-potsdam.de/
Am Neuen Palais 10, 14469 Potsdam
Tel.: +49 (0)331 977 2533 / Fax: 2292
E-Mail: [email protected]

The series Technische Berichte des Hasso-Plattner-Instituts für Softwaresystemtechnik an der Universität Potsdam is published by the professors of the Hasso-Plattner-Institut für Softwaresystemtechnik at the Universität Potsdam.

ISSN (print) 1613-5652
ISSN (online) 2191-1665

This manuscript is protected by copyright.

Published online on the publication server of the Universität Potsdam:
URL http://pub.ub.uni-potsdam.de/volltexte/2013/6389/
URN urn:nbn:de:kobv:517-opus-63890
http://nbn-resolving.de/urn:nbn:de:kobv:517-opus-63890

Also published in print by Universitätsverlag Potsdam: ISBN 978-3-86956-228-5

 

Contents

List of Figures  iii
List of Tables  iv
List of Algorithms  iv
Abstract  v

1 Introduction  1
  1.1 Problem Statement  2
  1.2 Assumptions and Simplifications  2
  1.3 Definition of Key Terms  4
  1.4 Structure of this Report  5

2 Background and Related Work  7
  2.1 In-Memory Column Stores  8
  2.2 Index Structures  10
  2.3 Cost Models  11
  2.4 Caches  12

3 System Definition  15
  3.1 Parameters  15
  3.2 Physical Column Organization  17
    3.2.1 Dictionary Encoding  17
    3.2.2 Bit-Packing  19
  3.3 Operations on Data Structures  20
  3.4 Plan Operators  21
    3.4.1 Scan with Equality Selection  21
    3.4.2 Scan with Range Selection  23
    3.4.3 Lookup  25
    3.4.4 Insert  25

4 Parameter Evaluation  27
  4.1 Number of Rows  27
  4.2 Number of Distinct Values  30
  4.3 Value Disorder  31
  4.4 Value Length  32
  4.5 Value Skewness  33
  4.6 Conclusions  35

5 Estimating Cache Misses  37
  5.1 Background on Caches  37
    5.1.1 Memory Cells  37
    5.1.2 Memory Hierarchy  38
    5.1.3 Cache Internals  39
    5.1.4 Address Translation  40
    5.1.5 Prefetching  41
  5.2 Cache Effects on Application Performance  41
    5.2.1 The Stride Experiment  41
    5.2.2 The Size Experiment  43
  5.3 A Cache-Miss Based Cost Model  43
    5.3.1 Scan with Equality Selection  45
    5.3.2 Scan with Range Selection  46
    5.3.3 Lookup  48
    5.3.4 Insert  48

6 Index Structures  51
  6.1 Dictionary Index  52
  6.2 Column Index  54

7 Partitioned Columns  59
  7.1 Merge Process  60
  7.2 Merging Algorithm  61
    7.2.1 Merging Dictionaries  63
    7.2.2 Updating Compressed Values  64
    7.2.3 Initial Performance Improvements  65
  7.3 Merge Implementation  66
    7.3.1 Scalar Implementation  67
    7.3.2 Exploiting Thread-level Parallelism  69
  7.4 Performance Evaluation  71
    7.4.1 Impact of Delta Partition Size  72
    7.4.2 Impact of Value-Length and Percentage of Unique Values  73
  7.5 Merge Strategies  74

8 Conclusions and Future Work  77

References  79

 

List of Figures

3.1 Influence of parameter skewness on value distributions.  16
3.2 Organization of an uncompressed column.  17
3.3 Organization of a dictionary encoded column with an unsorted dictionary.  18
3.4 Organization of a dictionary encoded column with a sorted dictionary.  18
3.5 Extraction of a bit-packed value-id.  19
4.1 Operator performance with varying number of rows.  28
4.2 Operator performance with varying number of distinct values.  29
4.3 Operator performance with varying value disorder.  31
4.4 Operator performance with varying value length.  33
4.5 Operator performance with varying value skewness.  34
5.1 Memory Hierarchy on Intel Nehalem architecture.  38
5.2 Parts of a memory address.  39
5.3 Cycles for cache accesses with increasing stride.  42
5.4 Cache misses for cache accesses with increasing stride.  42
5.5 Cycles and cache misses for cache accesses with increasing working sets.  43
5.6 Evaluation of predicted cache misses.  49
6.1 Example dictionary index.  52
6.2 Operator performance with dictionary index and varying number of distinct values.  53
6.3 Example column index.  55
6.4 Operator performance with column index and varying number of distinct values.  57
7.1 Example showing data structures for partitioned columns.  61
7.2 Example showing steps executed by merging algorithm.  65
7.3 Update Costs for Various Delta Partition Sizes  71
7.4 Update Costs for Various Value-Lengths  73

 

List of Tables

3.1 Complexity of plan operations by data structures.  21
5.1 Cache Parameters  44
7.1 Symbol Definition for partitioned columns.  62

List of Algorithms

3.1 Scan with equality selection on uncompressed columns  22
3.2 Scan with equality selection on columns with sorted dictionaries  22
3.3 Scan with equality selection on columns with unsorted dictionaries  22
3.4 Scan with range selection on uncompressed columns  23
3.5 Scan with range selection on columns with sorted dictionaries  23
3.6 Scan with range selection on columns with unsorted dictionaries  23
3.7 Lookup on uncompressed columns  24
3.8 Lookup on dictionary encoded columns  24
3.9 Insert on columns with sorted dictionaries  25
3.10 Insert on columns with unsorted dictionaries  25
6.1 Insert on columns with dictionary indices  54
6.2 Scan with equality selection on column indices  56
6.3 Scan with range selection on column indices  56

 

Abstract

Cost models are an essential part of database systems, as they are the basis of query performance optimization. Based on predictions made by cost models, the fastest query execution plan can be chosen and executed, or algorithms can be tuned and optimized. In-memory databases shift the focus from disk to main memory accesses and CPU costs, compared to disk based systems where input and output costs dominate the overall costs and other processing costs are often neglected. However, modeling memory accesses is fundamentally different and common models do not apply anymore. This work presents a detailed parameter evaluation for the plan operators scan with equality selection, scan with range selection, positional lookup and insert in in-memory column stores. Based on this evaluation, we develop a cost model based on cache misses for estimating the runtime of the considered plan operators using different data structures. We consider uncompressed columns, bit compressed and dictionary encoded columns with sorted and unsorted dictionaries. Furthermore, we discuss tree indices on the columns and dictionaries. Finally, we consider partitioned columns consisting of one partition with a sorted and one with an unsorted dictionary. New values are inserted in the unsorted dictionary partition and moved periodically by a merge process to the sorted partition. We propose an efficient merge algorithm, supporting the update performance required to run enterprise applications on read-optimized databases, and provide a memory traffic based cost model for the merge process.


 

Chapter 1

Introduction

In-memory column stores are beginning to receive growing attention from the research community. They are traditionally strong in read intensive scenarios with analytical workloads like data warehousing, using techniques like compression, clustering and replication of data. A recent trend introduces column stores as the backbone of business applications, as a combined solution for transactional and analytical processing. This approach introduces high performance requirements on the systems, for read performance as well as for write performance. Typically, optimizing read and write performance of data structures results in trade-offs; e.g., higher compression rates introduce overhead for writing, but increase the read performance. These trade-offs are usually made during the design of the system, although the actual workload the system faces during execution varies significantly. The underlying idea of this report is a database system which supports different physical column organizations with unique performance characteristics, allowing it to switch and choose the used structures at runtime depending on the current, historical or expected future workloads. However, this report will not provide a complete description or design of such a system. Instead, we provide the basis for such a system by focusing on selected data structures for in-memory column stores, presenting a detailed parameter evaluation and providing cache-based cost functions for the discussed plan operators.
The decision which column organization scheme is used has to be made by the system based on knowledge of the performance of individual database operators and data structures. This knowledge is represented by the cost model. A cost model is an abstract and simplified version of the actual system, which allows making predictions about the actual system. A model is always a trade-off between accuracy and the speed with which predictions can be made. The most accurate model is the system itself. Obviously, executing a Query Execution Plan (QEP) in order to determine the time it will take for its execution does not make sense. Models based on simulations also produce very accurate predictions. However, evaluating such a model requires running costly simulations. Due to the performance requirements for

 


query execution, query optimizers usually build on analytical models. Analytical models can be described in closed mathematical terms and are therefore very fast to evaluate. Query optimizers do not require extremely accurate predictions of the model, but they do require that the relations produced by the model reflect the real system. In other words, the order of QEPs sorted by their performance should be the same in the model and in the real system. Additionally, the fastest QEPs usually do not differ significantly in execution performance, but predicting which of those plans will be the fastest requires a very detailed model. The time saved by finding the faster execution plan is then outweighed by the time needed for the additional accuracy of the prediction. Therefore, we are usually not interested in finding the best possible solution, but rather in finding a sufficiently fast solution in a short time frame. Our contributions are i) a detailed parameter discussion and analysis for the operations scan with equality selection, scan with range selection, lookup and insert on different physical column organizations in in-memory column stores, ii) a cache based cost model for each operation and column organization plus dictionary and column indices, iii) an efficient merge algorithm for dictionary encoded in-memory column stores, enabling them to support the update performance required to run enterprise application workloads on read-optimized databases.
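The point that an analytical cost model only needs to preserve the relative order of query execution plans, not predict exact runtimes, can be illustrated with a minimal sketch. The plan names and cost formulas below are purely illustrative assumptions, not taken from this report:

```python
import math

# Minimal sketch: an analytical cost model only needs to rank query
# execution plans (QEPs) as the real system would; absolute values of
# the cost formulas (hypothetical here) do not matter.

def cost_full_scan(rows: int) -> float:
    # assumed: cost proportional to the number of rows scanned
    return 1.0 * rows

def cost_index_scan(rows: int, selectivity: float) -> float:
    # assumed: logarithmic probe overhead plus a higher per-row cost
    # for qualifying rows (random accesses)
    return 50.0 * math.log2(max(rows, 2)) + 4.0 * rows * selectivity

def choose_plan(rows: int, selectivity: float) -> str:
    plans = {
        "full_scan": cost_full_scan(rows),
        "index_scan": cost_index_scan(rows, selectivity),
    }
    # the model's only job: order the plans correctly
    return min(plans, key=plans.get)

print(choose_plan(1_000_000, 0.0001))  # highly selective -> index_scan
print(choose_plan(1_000_000, 0.5))     # unselective -> full_scan
```

Even if both cost functions are off by a constant factor, the optimizer still picks the right plan as long as the ordering is preserved, which is exactly the requirement stated above.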

1.1 Problem Statement

In this work, we focus on in-memory databases [52, 7, 27, 69, 58]. Especially columnar in-memory databases have received recent interest in the research community [25, 44, 59]. A logical column can be represented in main memory in different ways, called physical column organizations or schemes. Essentially, the physical schema describes how a column is encoded and stored in main memory. We focus on estimating query costs for the considered physical column organizations. In order to answer this problem, we assume a closed system in terms of possible queries, physical column organizations and given operator implementations. We want to find an accurate estimation which is still computable in reasonable time. We also take column and dictionary indices into account and want to provide a respective cost model for the use of these indices.

1.2 Assumptions and Simplifications

We will make certain assumptions throughout this report and focus on estimating query costs based on cache misses for the given operators and physical column organizations in a column oriented in-memory storage engine. In particular, we assume the following:

 


Enterprise Environment   We base this work on the assumption of using a columnar in-memory database for enterprise applications in combined transactional and analytical systems. Recent work analyzed the workload of enterprise customer systems, finding that in total more than 80 % of all queries are read accesses – for OLAP systems even over 90 % [46]. While this is the expected result for analytical systems, the high amount of read queries on transactional systems is surprising, as this is not the case in traditional workload definitions. Consequently, the query distribution leads to the concept of using a read-optimized database for both transactional and analytical systems. Even though most of the workload is read-oriented, 17 % (OLTP) and 7 % (OLAP) of all queries are updates. A read-optimized database supporting both workloads has to be able to support this amount of update operations. Additional analyses on the data have shown an update rate varying from 3,000 to 18,000 updates/second.

Insert Only Column Store   In order to achieve high update rates, we chose to model table modifications following the insert-only approach as in [44, 59]. Therefore, updates are always modeled as new inserts, and deletes only invalidate rows. We keep the insertion order of tuples and only the lastly inserted version is valid. We chose this concept because only a small fraction of the expected enterprise workload are modifications [44] and the insert-only approach allows queries to also work on the history of data.
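The insert-only behaviour described above can be sketched in a few lines. This is a minimal illustration under our own assumptions about structure and naming, not the implementation used in the report:

```python
# Minimal sketch of insert-only table semantics: updates append new
# versions, deletes only invalidate rows, insertion order is kept, and
# only the most recently inserted valid version of a key counts.
# Class and field names are illustrative assumptions.

class InsertOnlyTable:
    def __init__(self):
        self.rows = []  # [key, value, valid] triples in insertion order

    def insert(self, key, value):
        self.rows.append([key, value, True])

    def update(self, key, value):
        self._invalidate(key)
        self.rows.append([key, value, True])  # update = new insert

    def delete(self, key):
        self._invalidate(key)                 # no physical removal

    def _invalidate(self, key):
        for row in self.rows:
            if row[0] == key:
                row[2] = False

    def lookup(self, key):
        # the lastly inserted valid version wins
        for k, v, valid in reversed(self.rows):
            if k == key and valid:
                return v
        return None

t = InsertOnlyTable()
t.insert("c1", "red")
t.update("c1", "blue")   # old version invalidated, history kept
print(t.lookup("c1"))    # -> blue
print(len(t.rows))       # -> 2 (the history is preserved)
```

Note that the invalidated rows stay in place, which is what allows queries on the history of data mentioned above.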
Workload   We assume a simplified workload, consisting of the query types scan with equality selection, scan with range selection, positional lookup and insert. A given workload consists of a given distribution of these queries and their parameters. We chose these operators as we identified them as the most basic operators needed by an insert only database system. Additionally, more complex operators can be assembled by combining these basic operators, as e.g. a join consisting of multiple scans.

Isolated Column Consideration   As columnar databases organize their relations in isolated columns, query execution typically is a combination of multiple operators working on single columns and combining their results. For simplicity, we assume queries working on a single column and neglect effects introduced by operators on multiple columns.

Parallelism   With high end database servers developing from multi-core to many-core systems, modern databases are highly parallelized using data level parallelism, instruction level parallelism and thread level parallelism. However, for our cost model we assume the algorithms to execute on a single core without thread level or data level parallelism. For future work, an extension for parallel query processing could be introduced based on the algebra of the generic cost model introduced in [52].

Materialization Strategies   In the process of query execution, multiple columns have to be stitched together, representing tuples of data as output. This process is basically the opposite operation of a projection in a row store. The optimal point in the query plan to do this is not obvious [1]. In addition to the decision when to add columns to the query plan, a column store can work on lists of positions or materialize the actual values, which can be needed by some operators. Abadi et al. present four different materialization strategies – early materialization pipelined, early materialization parallel, late materialization pipelined and late materialization parallel. We refer to the process of adding multiple columns to the query plan as tuple reconstruction. The process of switching the output from lists of positions to lists of position and value pairs is called materialization. Throughout the report, we assume scan operators to create lists of positions as their result. The process of tuple reconstruction can be seen as a collection of positional lookup operations on a column.

Column Indices   Index structures for databases have been intensely studied in literature and various indexing techniques exist. We assume a column index to be a mapping from values to positions, building a classical inverted index known from document retrieval systems. We assume the index to be implemented by a B+-tree structure as described in Section 6.

Bandwidth Boundness   In general, the execution time of an algorithm accessing only main memory as the lowest level in the memory hierarchy and no other I/O devices can be assembled from two parts – a) the time spent doing computations on the data and b) the time the processor is waiting for data from the memory hierarchy. Bandwidth bound algorithms spend a significant amount of time fetching data from memory, so that the actual processing times are comparably small. We assume our considered algorithms to be bandwidth bound; we therefore do not consider T_CPU in our cost model and focus on predicting the memory access costs through the number of cache misses. However, database systems are not always bandwidth bound, and for more accurate predictions T_CPU can be calibrated and considered in the cost model [52].
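The assumed interplay of scans, position lists, materialization and tuple reconstruction can be sketched as follows. This is an illustrative Python sketch under our own naming assumptions; the report does not prescribe this implementation:

```python
# Minimal sketch: scans produce lists of positions; materialization
# switches to (position, value) pairs; tuple reconstruction combines
# positional lookups on several columns. Names are illustrative.

def scan_eq(column, value):
    # scan with equality selection -> list of positions
    return [pos for pos, v in enumerate(column) if v == value]

def scan_range(column, low, high):
    # scan with range selection -> list of positions
    return [pos for pos, v in enumerate(column) if low <= v <= high]

def materialize(column, positions):
    # switch from positions to (position, value) pairs
    return [(pos, column[pos]) for pos in positions]

def reconstruct(columns, positions):
    # tuple reconstruction: one positional lookup per column
    return [tuple(col[pos] for col in columns) for pos in positions]

ages  = [25, 37, 25, 41]
names = ["ann", "bob", "cid", "dan"]

hits = scan_eq(ages, 25)
print(hits)                              # -> [0, 2]
print(reconstruct([names, ages], hits))  # -> [('ann', 25), ('cid', 25)]
```

The scans deliberately return only positions, mirroring the assumption above that materialization and tuple reconstruction are separate, later steps in the plan.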

1.3 Definition of Key Terms

In order to avoid ambiguity, we define the following terms to be used in the rest of the report.

1.   Table:  A relational table or relation with attributes and containing records.

2.   Attribute:  A field of a table with an associated type.

3.   Record:  Elementary units contained in a table, containing one value for each attribute in the table.

4.   Column:  Tables are physically stored in main memory as a collection of columns. A column is the physical representation of an attribute.

5.   Physical Column Organization:  Describes the organization scheme and how a column is physically stored in memory, e.g. if it is stored uncompressed or if compression techniques are applied.

6.   Update:  Any modification operation on the table resulting in a new entry.

 


7.   Partitioned Column:  A partitioned column consists of two partitions – one uncompressed and write optimized delta partition and one compressed and read optimized main partition. All writes go into the delta partition and are merged periodically into the main partition.

8.   Merge:  The merge process periodically combines the read and write optimized partitions of a partitioned column into one read optimized partition.

9.   Column Index:  A column index is a separate data structure accelerating specific operations on the column at the cost of index maintenance when inserting new values.

10.   Dictionary Index:  A dictionary index is a separate tree index on top of an unsorted dictionary, allowing to leverage binary search algorithms.

11.   Dictionary Encoding:  Dictionary encoding is a well known light weight compression technique, reducing redundancy by substituting occurrences of long values with shorter references to these values.

12.   Dictionary and Attribute Vector:  In a dictionary encoded column, the actual column contains two containers: the attribute vector and the value dictionary. The attribute vector is a standard vector of integer values storing only the reference to the actual value, which is the index of the value in the value dictionary and is also called value-id.

13.   Bit-Packing:  A bit-packed vector of integer values uses only as many bits for each value as are required to represent the largest value in the vector.
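Definitions 11–13 can be made concrete with a small sketch of dictionary encoding with a bit-packed attribute vector. Values and helper names below are illustrative assumptions, and a real implementation would pack the bits into machine words rather than one big integer:

```python
# Minimal sketch: the dictionary stores each distinct value once; the
# attribute vector stores only value-ids, bit-packed using just enough
# bits per id to represent the largest value-id.

def dictionary_encode(values):
    dictionary = sorted(set(values))              # a sorted dictionary
    value_id = {v: i for i, v in enumerate(dictionary)}
    attribute_vector = [value_id[v] for v in values]
    return dictionary, attribute_vector

def bits_per_id(dictionary):
    # bits needed to represent the largest value-id
    return max(1, (len(dictionary) - 1).bit_length())

def pack(ids, width):
    # pack all value-ids into one integer, `width` bits per id
    packed = 0
    for i, vid in enumerate(ids):
        packed |= vid << (i * width)
    return packed

def extract(packed, index, width):
    # extraction of a bit-packed value-id: shift, then mask
    return (packed >> (index * width)) & ((1 << width) - 1)

col = ["apple", "cherry", "apple", "banana", "cherry"]
d, av = dictionary_encode(col)
w = bits_per_id(d)   # 3 distinct values -> 2 bits per value-id
p = pack(av, w)
print(d)                                               # -> ['apple', 'banana', 'cherry']
print([d[extract(p, i, w)] for i in range(len(col))])  # reconstructs the column
```

Decoding a position is a pure shift-and-mask operation, which is why the extraction of a bit-packed value-id (cf. the corresponding algorithm discussion in Chapter 3) is cheap despite the compression.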

1.4 Structure of this Report

The remainder of the report is structured as follows: Chapter 2 discusses related work regarding in-memory column stores, cost models for columnar in-memory databases, caches and their effects on application performance, and indices. Chapter 3 gives an overview of the discussed system and introduces the considered physical column organizations and plan operators. Chapter 4 presents an evaluation of parameter influences on the plan operator performance with varying number of rows, number of distinct values, value disorder, value length and value skewness. Our cache miss based cost model is presented in Chapter 5, followed by Chapter 6 introducing column and dictionary indices and their respective costs. Then, Chapter 7 describes partitioned columns and the merge process including a memory traffic based cost model. Finally, the report closes with concluding remarks and future work in Chapter 8.


 

CHAPTER 1. INTRODUCTION


 

Chapter 2

Background and Related Work

This section gives an overview of related work regarding in-memory column stores, followed by work concerning cost models for main memory databases, index structures and cache effects.

Ramakrishnan and Gehrke define a database management system as "a software designed to assist in maintaining and utilizing large collections of data" [61]. Today, a wide range of database systems exists, varying in their intended use cases and workloads, the used data models, their logical and physical data organization or the primary data storage locations. Typical use cases and types of databases are text retrieval systems, stream based databases [14], real-time databases, transaction processing systems, data warehouses or data mining systems.

Text retrieval systems like [73] usually store collections of free text called documents and are optimized to execute queries matching a set of stored documents. Stream based databases are designed to handle constantly changing workloads, e.g. stock market data or data generated through RFID tracking [56]. Real-time databases are specifically designed to guarantee specified time constraints as required by e.g. telephone switching, radar tracking or arbitrage trading applications [39].

Transaction processing systems are usually based on the relational model [12] and are designed to store data generated by events called transactions, supporting atomicity, consistency, isolation and durability. OLTP workloads are characterized by a mix of reads and writes to a few rows at a time, typically leveraging B+-Trees or other index structures.
Usually, OLTP systems store their data in row organized tables, which allow fast retrieval of single rows and are optimized for fast write performance. Typical use cases for transactional databases are in the field of operational data for enterprise resource planning systems or as the backbone of other applications.

Conversely, analytical systems usually work on a copy of the operational data, extracted through extract-transform-load (ETL) processes and stored optimized for analytical queries. OLAP applications are characterized by bulk updates and large sequential scans, spanning few columns but many rows of the database, for example to compute aggregate values.

 


The analytical systems are often organized in star schemas and work with pre-aggregated data. Besides the relational data model, various other data models exist, like the network data model, the hierarchical data model or the object data model. However, this report focusses on relational databases in the context of enterprise applications. Relational databases differ in their intended data schemata (like normalized or star schemata), their physical layouts of data storage (e.g. row or column oriented, compression techniques) or their primary data storage locations (disk or main memory).

Classical disk based transactional databases are IBM's DB2, Oracle Database or SAP's MaxDB. H-Store [38] is a row-based distributed database system storing data in main memory. Besides transactional systems, the need for specialized decision support systems evolved [20, 67], resulting in analytical systems optimized for OLAP processing. SybaseIQ is a column oriented disk-based system explicitly designed for optimizing analytical query performance [50]. C-Store and its commercial version Vertica are also disk based column stores designed for analytics, allowing columns to be stored in multiple sort orders. Writes are accumulated in a writeable store and moved by a tuple-mover into the read optimized store [66]. MonetDB and Vectorwise [9, 75, 74, 10] are column oriented databases targeted to support query-intensive applications like data mining and ad-hoc decision support.
A recent research trend started to reunite the worlds of transactional and analytical processing by introducing and proposing systems designed for mixed workloads, like SanssouciDB [58, 59], Hyrise [26] or HyPer [41, 40].

2.1

In-Memory Column Stores

Businesses use their transactional data as a basis to evaluate their business performance, to gain insights, and for planning and predictions of future events, with the help of separate analytical systems. In the recent past, a desire for more up-to-date analyses developed, requiring to work directly on the transactional data. Plattner describes the separation of transactional and analytical systems and that database structures were designed to support complex business transactions focusing on the transactional processing [58]. Analytical and financial planning systems were moved into separate systems, which were highly optimized for the read intensive workloads, promising more performance and flexibility.

The main reason for separating transactional and analytical systems were their different performance requirements. Actually, the requirements are even contradicting. Transactional workloads were believed to be very write intensive, selecting only individual rows of tables and manipulating them. Analytical workloads usually have a very high selectivity, scanning entire attributes and joining and grouping them. The underlying problem is that data structures can only be optimized up to a certain point for all requirements. After that point, each optimization in one direction becomes a cost factor for another operation. For example, higher compression of a data structure can speed up complex read operations due to less memory that has to be read, but increases the expenses for adding new items to the data structure. In extreme cases, the whole compression scheme even has to be rebuilt.

The flexibility and speed which is gained by separating OLTP and OLAP is bought by introducing high costs for transferring the data (ETL processes) and managing the emerging redundancy. Additionally, data is only periodically transferred between transactional and analytical systems, introducing a delay in reporting. This becomes especially important as, through analytical reporting on their transactional data, companies are much more able to understand and react to events influencing their business. Consequently, there is an increasing demand for “real-time analytics” – that is, up-to-the-moment reporting on business processes that have traditionally been handled by data warehouse systems. Although warehouse vendors are doing as much as possible to improve response times (e.g., by reducing load times), the explicit separation between transaction processing and analytical systems introduces a fundamental bottleneck in analytics scenarios. While the predefinition of data to be extracted and transformed to the analytical system results in business decisions being made on a subset of the potential information, the separation of systems prevents transactional applications from using analytics functionality throughout the transaction processing due to the latency that is inherent in the data transfer. Recent research started questioning this separation of transactional and analytical systems and introduces efforts of uniting both systems again [58, 59, 46, 40, 25, 41].
The goal is, however, not to advocate a complete unification of OLAP and OLTP systems, because the requirements of data cleansing, system consolidation and very high selectivity queries cannot yet be met with the proposed systems and still require additional systems. Still, using a read optimized database system for OLTP allows moving most of the analytical operations into the transactional systems, so that they profit by working directly on the transactional data. Applications like real-time stock level calculation, price calculation and online customer segmentation will benefit from this up-to-date data. A combined database for transactional as well as analytical workloads eliminates the costly ETL process and reduces the level of indirection between different systems in enterprise environments, enabling analytical processing directly on the transactional data.

The authors of [58, 46] pursue the idea of designing a persistence layer which is better optimized for the workload observed from enterprise applications, as today's database systems are designed for a more update intensive workload than they are actually facing. Consequently, the authors start with a read optimized database system and optimize its update performance to support transactional enterprise applications. The backbone of such a system's architecture could be a compressed in-memory column store, as proposed in [58, 25]. Column oriented databases have proven to be advantageous for read intensive scenarios [50, 75], especially in combination with an in-memory architecture. Such a system has to handle contradicting requirements for many performance aspects.

Decisions may be made with the future application in mind or by estimating the expected workload. Nevertheless, the question arises which column oriented data structures should be used in combination with light-weight compression techniques, enabling the system to find a balanced trade-off between the contradicting requirements. This report aims at studying these trade-offs and at analyzing possible data structures.

2.2

Index Structures

As in disk based systems, indexing in in-memory databases remains important, although sequential access speeds allow for fast complete column scans. Disk based systems often assume that index structures are in memory and that accessing them is cheap compared to the main cost factor, which is accessing the relations stored on secondary storage. However, in-memory databases still profit from index structures that allow to answer queries selecting single values or ranges of values. Additionally, translating values into value-ids requires searching on the value domain, which can be accelerated by the use of indices. For in-memory databases the index access performance is even more essential and must be faster than sequentially scanning the complete column.

Lehman and Carey [48] describe in their work from 1986 how in-memory databases shift the focus for data structures and algorithms from minimizing disk accesses to using CPU cycles and main memory efficiently. They present T-Trees as in-memory index structures, based on AVL and B-Trees. T-Trees are binary search trees, containing multiple elements in one node to leverage cache effects.

A recent trend redesigns data structures to be more cache friendly, significantly improving performance by effectively using caches. Lee et al. introduce a Cache Sensitive T-Tree [47], whereas Rao and Ross optimize B+-Trees for cache usage, introducing Cache Sensitive Search Trees [62, 63]. The work of Kim et al. presents a performance study evaluating cache sensitive data structures on multi-core processors [42].
Besides creating indices manually, the field of automatic index tuning promises self-tuning systems reacting dynamically to their workloads. Some techniques analyze the workload and periodically optimize the system, while other techniques constantly adapt the system with every query. Graefe and Kuno propose adaptive merging as an adaptive indexing scheme, exploiting partitioned B-trees [22] by focusing the merge steps only on relevant key ranges requested by the queries [24, 23]. Idreos et al. propose database cracking as another adaptive indexing technique, partitioning the keys into disjoint ranges following a logic similar to a quick-sort algorithm [33, 34]. Finally, Idreos et al. propose a hybrid adaptive indexing technique as a combination of database cracking and adaptive merging [35]. Besides tree or hash based indices, bitmap indices are also widely used in in-memory database systems; e.g., Wu et al. propose the Word-Aligned Hybrid code [72] as a compression scheme for bitmap indices.

 


2.3


Cost Models

Relatively little work has been done on researching main memory cost models. This is probably due to the fact that modeling the performance of queries in main memory is fundamentally different from disk based systems, where IO access is clearly the most expensive part. In in-memory databases, query costs consist of memory and cache access costs on the one hand and CPU costs on the other hand.

In general, cost models differ widely in the costs they estimate. Complexity based models provide a correlation between input sizes and execution time, but normally abstract constant factors as in the big O-notation. Statistical models can be used to estimate the result sizes of operations based on known parameters and statistics of the processed data. Other cost models are based on simulations and completely go through a simplified process mapping the estimated operations. Usually, simulation based models provide very accurate results but are also expensive and complex. Analytical cost models provide closed mathematical formulas, estimating the costs based on defined parameters. Additionally, system architecture aware models take system specific characteristics into account, e.g. cache conscious cost models which rely on parameters of the memory hierarchy of the system.

Based on IBM's research prototype "Office by Example" (OBE), Whang and Krishnamurthy [70] present query optimization techniques based on modeling CPU costs.
As modeling CPU costs is complicated and depends on parameters like the hardware architecture and even different programming styles, Whang and Krishnamurthy experimentally determine bottlenecks in the system and their relative weights and unit costs, building analytical cost formulas.

Listgarten and Neimat [49] compare three different approaches of cost models for in-memory database systems. Their first approach is based on hardware costs, counting CPU cycles. The second approach models application costs, similarly to the method presented by Whang and Krishnamurthy [70]. The third approach is based on execution engine costs, which is a compromise between complex hardware-based models and general application-based models. Listgarten and Neimat argue that only the engine-based approach provides accurate and portable cost models.

Manegold and Kersten [51] describe a generic cost model for in-memory database systems, estimating the execution costs of database queries based on their cache misses. The main idea is to describe and model reoccurring basic patterns of main memory access. More complex patterns are modeled by combining the basic access patterns with a presented algebra. In contrast to the cache-aware cost model from Manegold, which focusses on join operators, we compare scan and lookup operators on different physical column layouts.

Zukowski, Boncz, Nes and Heman [75, 8] describe X100 as an execution engine for MonetDB, optimized for fast execution of in-memory queries. Based on in-cache vectorized processing and the Volcano [21] pipelining model, fast in-cache query execution is enabled.
The X100 cost model is based on the generic cost model of Manegold and Kersten, modeling query costs by estimating cache misses.

Sleiman, Lipsky and Konwar [65] present an analytical model predicting the access times for memory requests in hierarchical multi-level memory environments, based on a linear-algebraic queuing theory approach.

Pirk [57] argues that general purpose database systems usually fall short in leveraging the principle of data locality, because data access patterns cannot be determined while designing the system. He proposes storage strategies which automatically lay out the data in main memory based on the workload of the database. His cost analyses are based on the generic cost model of Manegold and Kersten, but are extended for modern hardware architectures including hardware prefetching.

2.4

Caches

The influences of the cache hierarchy on application performance have been extensively studied in the literature. Various techniques have been proposed to measure the costs of cache misses and pipeline stalling. Most approaches are based on handcrafted micro benchmarks exposing the respective parts of the memory hierarchy. The difficulty is to find accurate measurement techniques, mostly relying on hardware monitors to count CPU cycles and cache misses. Other approaches are based on simulations, allowing detailed statistics of the induced costs.

Drepper [18] describes the concepts of caching, their realization in hardware and summarizes the implications for software design.

Barr, Cox and Rixner [4] study the penalties occurring when missing the translation lookaside buffer (TLB) in systems with radix page tables like the x86-64 system and compare different page table organizations. Due to the page table access, the process of translating a virtual to a physical address can induce additional cache misses, depending on the organization of the page table.

Saavedra and Smith [64] develop a model predicting the runtime of programs on analyzed machines. They present a uniprocessor machine-independent model based on various abstract operations. By combining the measurement of the performance of such abstract operations on a concrete processor with the frequency of these operations in a workload, the authors are able to estimate the resulting execution time. Additionally, penalties of cache misses are taken into account.
However, it is assumed that an underlying miss ratio is known for the respective workload, and no mechanisms for predicting the number of cache misses are provided.

Hristea and Lenoski [31] identify the increasing performance gap between processors and memory systems as an essential bottleneck and introduce a memory benchmarking methodology based on micro benchmarks measuring restart latency, back-to-back latency and pipelined bandwidth. The authors show that the bus utilization in uniform-memory-access (UMA) systems has a significant impact on memory latencies.

 



Babka and Tuma [3] present a collection of experiments investigating detailed parameters and provide a framework measuring performance relevant aspects of the memory architecture of x86-64 systems. The experiments vary from determining the presence and size of caches to measuring the cache line sizes or cache miss penalties.

Puzak et al. [60] propose Pipeline Spectroscopy to measure and analyze the costs of cache misses. It is based on a visual representation of cache misses, their costs and their overlapping. The spectrograms are created by simulating the program, allowing to closely track cache misses and the induced stalling. Due to the simulation approach, the method is cost intensive and not suited for cost models requiring fast decisions.


 


Chapter 3

System Definition

This chapter gives an overview of the considered system. First, Section 3.1 gives a formal definition of the system and the considered parameters. Then, Section 3.2 introduces the considered physical column organization schemes: uncompressed columns, dictionary encoded columns and bit-packing. Section 3.3 follows, discussing operations on columns and their complexities depending on the internal organization. Finally, Section 3.4 introduces the considered plan operators and discusses their theoretical complexities.

3.1

Parameters

We consider a database consisting of a set of tables T. A table t ∈ T consists of a set of attributes A. The number of attributes of a table t will be denoted as |A|. We assume the value domain V of each attribute a ∈ A to be finite and require the existence of a total order ρ over V. In particular, we define c.e as the value length of attribute a and assume V to be the set of alphanumeric strings with the length c.e. An attribute a is a sequence of c.r values v ∈ D with D ⊆ V, where c.r is also called the number of rows of a and D is a set of values D = {v1, ..., vn}, also called the dictionary of a. We define c.d := |D| as the number of distinct values of an attribute. In case the dictionary is sorted, we require ∀ vi ∈ D : vi < vi+1. In case the dictionary is unsorted, v1, ..., vn are in insertion order of the values in attribute a. The position of a value vi in the dictionary defines its value-id id(vi) := i. For bit encoding, the number of values in D is limited to 2^b, with b being the number of bits used to encode values in the value vector. We define c.e_c := b as the compressed value length of a, requiring c.e_c ≥ ⌈log2(c.d)⌉ bits.
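Under these definitions, the parameters of a small example attribute can be derived mechanically; a minimal sketch (the attribute values are illustrative):

```python
import math

# Illustrative attribute a with c.r = 5 rows.
a = ["Germany", "Australia", "United States", "United States", "Australia"]

c_r = len(a)                    # number of rows of a
D = list(dict.fromkeys(a))      # dictionary in insertion order (unsorted case)
c_d = len(D)                    # number of distinct values
c_e = max(len(v) for v in a)    # value length (fixed-length strings)
c_e_c = max(1, math.ceil(math.log2(c_d)))  # compressed value length in bits

value_ids = [D.index(v) for v in a]  # id(v_i) is the position of v_i in D
print(c_r, c_d, c_e, c_e_c, value_ids)  # 5 3 13 2 [0, 1, 2, 2, 1]
```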

The degree of sortedness in a is described by the measure of disorder, denoted by c.u, based on Knuth's measure of disorder, describing the minimum number of elements that need to be removed from a sequence so that the sequence would be sorted [43].

Finally, we define the value skewness c.k, describing the distribution of values of an attribute, as the exponent characterizing a Zipfian distribution. We chose to model the different distributions by a Zipfian distribution, as the authors in [32] state that the majority of columns analyzed from financial, sales and distribution modules of an enterprise resource planning (ERP) system were following a power-law distribution – a small set of values occurs very often, while the majority of values is rare. The frequency of the x-th value in a set of c.d distinct values can be calculated with the skewness parameter c.k as:

    f(x, c.k, c.d) = (1 / x^c.k) / (Σ_{n=1}^{c.d} 1 / n^c.k)        (3.1)

Figure 3.1: Influence of the parameter skewness on the value distribution. The histograms reflect the number of occurrences of each distinct value (for 10 distinct values and 100,000 generated values, skewness varying from 0 to 2).

Intuitively, c.k describes how heavily the distribution is drawn to one value. Figure 3.1 shows the influence of c.k on a value distribution with c.d = 10 and c.r = 100,000 values, displaying the distribution as a histogram for c.k = 0, c.k = 0.5, c.k = 1.0 and c.k = 2.0. In the case of c.k = 0, the distribution equals a uniform distribution and every value occurs equally often. As c.k increases, the figure shows how the distribution is skewed more and more.
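Equation 3.1 can be evaluated directly; a minimal sketch of the frequency function (the function name is illustrative):

```python
def zipf_frequency(x, c_k, c_d):
    """Relative frequency of the x-th of c_d distinct values (Equation 3.1)."""
    return (1.0 / x ** c_k) / sum(1.0 / n ** c_k for n in range(1, c_d + 1))

# With c.k = 0 the distribution is uniform ...
freqs_uniform = [zipf_frequency(x, 0.0, 10) for x in range(1, 11)]
# ... with growing c.k it is drawn more and more to the first value.
freqs_skewed = [zipf_frequency(x, 1.0, 10) for x in range(1, 11)]

print(round(freqs_uniform[0], 3))         # 0.1
print(freqs_skewed[0] > freqs_skewed[9])  # True
```

The frequencies sum to 1 over all c.d distinct values, matching the histograms of Figure 3.1.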

 


[Figure: the logical table (rows 0-4: Germany, Australia, United States, United States, Australia) and its uncompressed column layout, storing each value as a fixed-length string: Germany000000, Australia0000, United States, United States, Australia0000.]

Figure 3.2: Organization of an uncompressed column with value length 13.

3.2

Physical Column Organization

The logical view of a column is a simple collection of values that allows appending new values, retrieving the value from a position and scanning the complete column with a predicate. How the data is actually stored in memory is not specified.

In general, data can be organized in memory in a variety of different ways, e.g. in standard vectors in insertion order, in ordered collections or in collections with tree indices [59]. In addition to the type of organization of the data structures, the used compression techniques are also essential for the resulting performance characteristics. Regarding compression, we will focus on the light weight compression techniques dictionary encoding and bit compression. As concrete combinations, we examine uncompressed columns and dictionary encoded columns with bit compressed attribute vectors, whereas the dictionary can be sorted or unsorted.

Uncompressed columns store the values as they are inserted consecutively in memory, as e.g. used in [46]. This design decision has two important properties that affect the performance of the data structure. First, the memory consumption increases and second, the scan performance decreases due to the lower number of values per cache line and the lack of a sorted dictionary for fast queries. The update performance can be very high, as new values only have to be written to memory and no overhead is necessary to maintain the internal organization or for applying compression techniques.

Figure 3.2 pictures an example of an uncompressed column, showing the logical view of the table on the left and the layout of the column as it is represented in memory on the right side.
As no compression schemes are applied, the logical layout equals the layout in memory, storing the values consecutively as fixed length strings.
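This fixed-length layout can be sketched as follows: the value at position i starts at byte offset i · c.e, so positional access needs no index (an illustrative sketch, padding with '0' characters as in Figure 3.2):

```python
VALUE_LEN = 13  # c.e, the fixed value length as in Figure 3.2

class UncompressedColumn:
    def __init__(self, value_len):
        self.value_len = value_len
        self.buf = bytearray()  # values stored consecutively in memory

    def append(self, value):
        encoded = value.encode("ascii")
        assert len(encoded) <= self.value_len
        # Pad to the fixed length so positional access needs no index.
        self.buf += encoded.ljust(self.value_len, b"0")

    def get(self, pos):
        off = pos * self.value_len  # direct offset computation
        raw = self.buf[off:off + self.value_len]
        return raw.rstrip(b"0").decode("ascii")

col = UncompressedColumn(VALUE_LEN)
for v in ["Germany", "Australia", "United States"]:
    col.append(v)
print(col.get(1))  # Australia
```

The scan cost grows with the value length: fewer values fit into one cache line than with a compressed attribute vector.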

3.2.1

Dictionary Encoding

Dictionary encoding is a well known, light-weight compression technique [66, 5, 68], which reduces redundancy by substituting occurrences of long values with shorter references to these values. In a dictionary encoded column, the actual column contains two containers:

 

[Figure: the logical table (rows 0-4: Germany, Australia, United States, United States, Australia); the column stores the value-ids 00000000, 00000001, 00000010, 00000010, 00000001; the unsorted dictionary maps 00 → Germany000000, 01 → Australia0000, 10 → United States.]

Figure 3.3: Organization of a dictionary encoded column, where the dictionary is unsorted.

[Figure: the same logical table; the column stores the value-ids 00000001, 00000000, 00000010, 00000010, 00000000; the sorted dictionary maps 00 → Australia0000, 01 → Germany000000, 10 → United States.]

Figure 3.4: Organization of a dictionary encoded column, where the dictionary is kept sorted.

the attribute vector and the value dictionary. The attribute vector is a vector storing only references to the actual values of c.e_c bits, which represent the index of the value in the value dictionary and are also called value-ids. For the remainder, we assume c.e_c = 32 bits.

The value dictionary may be an unsorted or an ordered collection. Ordered collections keep their tuples in a sorted order, allowing fast iterations over the tuples in sorted order. Additionally, the search operation can be implemented as a binary search that has logarithmic complexity. This comes at the price of maintaining the sort order, and inserting into an ordered collection is expensive in the general case.

Figure 3.3 gives an example of a dictionary encoded column. The column only stores references to the value dictionary, represented as 1 byte values. The value-ids are stored sequentially in memory in the same order as the values are inserted into the column. The value dictionary is unsorted and stores the values sequentially in memory in the order they are inserted. Figure 3.4 shows the same example with a sorted dictionary. Here, the value dictionary is kept sorted, potentially requiring a re-encoding of the complete column when a new value is inserted into the dictionary.
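A sketch of a dictionary encoded column with a sorted dictionary (as in Figure 3.4); the re-encoding on insert of a new value and the binary search on the dictionary are shown in simplified form, with illustrative names:

```python
import bisect

class DictionaryEncodedColumn:
    def __init__(self):
        self.dictionary = []        # sorted, distinct values
        self.attribute_vector = []  # one value-id per row

    def append(self, value):
        pos = bisect.bisect_left(self.dictionary, value)
        if pos == len(self.dictionary) or self.dictionary[pos] != value:
            # New value: insert into the sorted dictionary and re-encode
            # all value-ids at or above the insert position.
            self.dictionary.insert(pos, value)
            self.attribute_vector = [vid + 1 if vid >= pos else vid
                                     for vid in self.attribute_vector]
        self.attribute_vector.append(pos)

    def get(self, row):
        return self.dictionary[self.attribute_vector[row]]

    def scan_eq(self, value):
        # Binary search the dictionary once, then scan integer ids only.
        pos = bisect.bisect_left(self.dictionary, value)
        if pos == len(self.dictionary) or self.dictionary[pos] != value:
            return []
        return [i for i, vid in enumerate(self.attribute_vector) if vid == pos]

col = DictionaryEncodedColumn()
for v in ["Germany", "Australia", "United States", "United States", "Australia"]:
    col.append(v)
print(col.dictionary)        # ['Australia', 'Germany', 'United States']
print(col.attribute_vector)  # [1, 0, 2, 2, 0]
```

The attribute vector matches Figure 3.4; the scan compares small integer value-ids instead of strings, which is the main source of the read speedup.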

 

3.2. OPERATIONS

3.2.2 Bit-Packing

A dictionary encoded column can either store the value-ids in an array of a native type like integers, or compressed, e.g. in a bit-packed form. When using a native type, the number of representable distinct values in the column is restricted to the size of that type, e.g. with characters of 8 bit to 2^8 = 256 or with integers on a 64 bit architecture to 2^64. The size of the used type is a tradeoff between the amount of representable distinct values and the needed memory space.

When using bit-packing, the value-ids are represented with the minimum number of bits needed to encode the current number of distinct values in the value dictionary. Let d be the number of distinct values; then the number of needed bits is b = ⌈log2(d)⌉. The value-ids are stored sequentially in a pre-allocated area in memory. The space savings are traded for more extraction overhead when accessing the value-ids. Since bytes are normally the smallest addressable unit, each access to a value-id has to fetch the block where the id is stored and correctly shift and mask the block, so that the value-id can be retrieved.


Figure 3.5: Extraction of a bit-packed value-id.

Figure 3.5 outlines an example of how a value-id can be extracted when using bit-packing. Part a) shows the column with the compressed value-ids from the unsorted dictionary example shown in Figure 3.3. As there are only 5 values, the second byte is filled with zeros. The example shows the extraction of the value-id from row 3. In a first step, shown by b), the byte containing the searched value-id has to be identified, here the first of two. In c), a logical right shift of 2 is applied in order to move the searched value-id to the last two bits of the byte. d) extracts the searched value-id by masking the shifted byte with a bit-mask, so that the searched value-id is obtained as shown in e). In order to increase the decompression speed, techniques like Streaming SIMD Extensions (SSE) can be applied to the extraction process [71]. Another way is to extract the compressed values block-wise, thereby amortizing the decompression costs over multiple values.
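The shift-and-mask extraction can be sketched in C++ as follows. This is a minimal illustration rather than the system's code; it assumes a fixed width of b ≤ 8 bits per value-id and packs ids starting at the least significant bits (the figure packs in the opposite bit order), so a value-id may straddle a byte boundary and two adjacent bytes are combined into a 16-bit window before shifting and masking:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Write a b-bit value-id for the given row into the packed byte array,
// growing the array as needed. Least-significant bits are filled first.
void packValueId(std::vector<uint8_t>& packed, std::size_t row,
                 unsigned b, uint8_t id) {
    std::size_t bitPos = row * b;
    std::size_t bytePos = bitPos / 8;
    unsigned offset = bitPos % 8;
    if (bytePos + 1 >= packed.size()) packed.resize(bytePos + 2, 0);
    uint16_t window = uint16_t(packed[bytePos] | (packed[bytePos + 1] << 8));
    window = uint16_t(window | (uint16_t(id) << offset));
    packed[bytePos] = uint8_t(window & 0xFF);
    packed[bytePos + 1] = uint8_t(window >> 8);
}

// Extract the b-bit value-id of the given row: locate the id's first bit,
// load the two bytes covering it, shift the id to the low bits and mask
// away the neighboring ids.
uint8_t extractValueId(const std::vector<uint8_t>& packed, std::size_t row,
                       unsigned b) {
    std::size_t bitPos = row * b;                 // a) locate the id's first bit
    std::size_t bytePos = bitPos / 8;             // b) byte block holding the id
    unsigned offset = bitPos % 8;
    uint16_t window = uint16_t(packed[bytePos] | (packed[bytePos + 1] << 8));
    window = uint16_t(window >> offset);          // c) shift the id to the low bits
    return uint8_t(window & ((1u << b) - 1));     // d) mask out the value-id
}
```

Packing the value-ids 1, 0, 2, 2, 0 of the running example with b = 2 and extracting row 3 returns 2, at the cost of the extra shift and mask per access discussed above.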

 

CHAPTER 3. SYSTEM DEFINITION

3.3 Operations on Data Structures

A column is required to support the two operations getValueAt and getSize in order to allow accessing and iterating over its values. Additionally, the operation appendValue supports inserting new values at the end of the column. A dictionary supports the operations getValueForValueId and getSize in order to allow accessing and iterating over the stored values. Additionally, addValue adds a new value to the dictionary, potentially requiring a rewrite of the complete column. With getValueIdForValue, the dictionary can be searched for a specific value. If the dictionary is sorted, a binary search can be used, otherwise the dictionary has to be scanned linearly.

Dictionary Operations   As both sorted and unsorted dictionaries are implemented as a sequential array of fixed length values, the operation getValueForValueId can be implemented by directly calculating the respective address in memory, reading and returning the requested value. Therefore, the costs only depend on the size of the stored values c.e, resulting in a complexity of O(c.e).

In case a new value is added to a sorted dictionary, the value has to be inserted at the correct position in the collection. In the worst case, when the new value is the smallest, every other value has to be moved in order to free the first position. Therefore, the complexity for addValue in the sorted case is O(c.d · c.e).
In case of an unsorted dictionary, the new value can be directly written to the end of the array, resulting in a complexity of O(c.e).

getValueIdForValue searches the dictionary for a given value, returning its value-id or a not-found value in case the value is not in the dictionary. If the dictionary is sorted, the search can be performed with a binary search algorithm, resulting in a logarithmic complexity of O(log(c.d) · c.e). If the dictionary is unsorted, all values have to be scanned linearly, resulting in a complexity of O(c.d · c.e).

Column Operations   In the case of an uncompressed column, getValueAt can directly return the requested value. If the column is dictionary encoded, the value-id has to be retrieved first and then the requested value is returned. The complexity for the uncompressed case is therefore in O(c.e), and for the dictionary encoded cases in O(c.e_c + c.e).

appendValue on an uncompressed column is trivial, assuming the array has enough free space left, as the value is appended to the array, resulting in a complexity of O(c.e). In case of an unsorted dictionary, we first check whether the appended value is already in the dictionary. This check is performed with the dictionary operation getValueIdForValue and linearly depends on c.e and c.d. If the value is not in the dictionary, it is added with addValue and the value-id is appended to the column.
The total complexity for the unsorted case is therefore in O(c.d + c.e_c + c.e). In case of a sorted dictionary, getValueIdForValue has logarithmic costs and addValue depends linearly on c.d. Additionally, the insertion of a new value-id may invalidate all old value-ids in the column, requiring a rewrite of the complete column and resulting in a complexity of O(c.d + c.r · c.e_c + c.e).

 

3.4. PLAN OPERATORS

               | ScanEqual                       | ScanRange                        | Lookup         | Insert
Uncompressed   | O(c.r · c.e + q.s)              | O(c.r · c.e + q.s)               | O(c.e)         | O(c.e)
Sorted Dict.   | O(log(c.d) + c.r · c.e_c + q.s) | O(log(c.d) + c.r · c.e_c + q.s)  | O(c.e_c + c.e) | O(c.d · c.e + c.r · c.e_c)
Unsorted Dict. | O(c.d + c.r · c.e_c + q.s)      | O(c.r · c.e_c + c.r · c.e + q.s) | O(c.e_c + c.e) | O(c.d · c.e + c.e_c)

Table 3.1: Complexity of plan operations by data structures.

3.4 Plan Operators

We now introduce the plan operators scan with equality selection, scan with range selection, positional lookup and insert, and discuss their theoretical complexity. These operators were chosen, as we identified them as the most basic operators needed by a database system, assuming an insert only system as proposed in [58, 59, 46, 25]. Additionally, more complex operators can be assembled by combining these basic operators, e.g. a nested loop join consisting of multiple scans. We differentiate between equality and range selections as they have different performance characteristics due to differences when performing value comparisons introduced by the dictionary encoding. Table 3.1 gives an overview of the asymptotic complexity of the four discussed operators on the different data structures.

3.4.1 Scan with Equality Selection

A scan with equality selection sequentially iterates through all values of a column and returns a list of positions where the value in the column equals the searched value. Algorithm 3.4.1 shows the implementation as pseudocode for the case of an uncompressed column. As the column is uncompressed, no decompression has to be performed and the values can be compared directly with the searched value. The costs for an equal scan on an uncompressed column are characterized by comparing all c.r values and by building the result set, resulting in O(c.r · c.e + q.s).

 


Algorithm 3.4.1: ScanEqualUncompressed(X, column)
  result ← array[], pos ← 0
  for each value ∈ column
    do if value == X
         then result.append(pos)
       pos ← pos + 1
  return (result)

 

Algorithm 3.4.2: ScanEqualDictSorted(X, column, dictionary)
  result ← array[], pos ← 0
  valueIdX ← dictionary.binarySearch(X)
  for each valueId ∈ column
    do if valueId == valueIdX
         then result.append(pos)
       pos ← pos + 1
  return (result)

Algorithm 3.4.3: ScanEqualDictUnsorted(X, column, dictionary)
  result ← array[], pos ← 0
  valueIdX ← dictionary.scanFor(X)
  for each valueId ∈ column
    do if valueId == valueIdX
         then result.append(pos)
       pos ← pos + 1
  return (result)
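As a concrete illustration, the sorted-dictionary variant (Algorithm 3.4.2) can be sketched in C++. The container types and function name are ours, not the system's actual implementation; the dictionary is assumed to be a sorted std::vector so that the binary search can be done with std::lower_bound:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Sketch of a scan with equality selection on a column with a sorted
// dictionary: one binary search yields the value-id of X, then the
// attribute vector is compared sequentially on value-ids only.
std::vector<std::size_t> scanEqualDictSorted(
        const std::vector<uint32_t>& column,
        const std::vector<std::string>& dict,
        const std::string& X) {
    std::vector<std::size_t> result;
    auto it = std::lower_bound(dict.begin(), dict.end(), X);
    if (it == dict.end() || *it != X) return result;   // X not in dictionary
    uint32_t valueIdX = uint32_t(it - dict.begin());
    for (std::size_t pos = 0; pos < column.size(); ++pos)
        if (column[pos] == valueIdX) result.push_back(pos);
    return result;
}
```

Note that the per-row work is a single integer comparison on c.e_c-sized value-ids; the actual values are never touched during the scan, which is the source of the O(log(c.d) + c.r · c.e_c + q.s) cost.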

 

We do not discuss a scan for inequality, as the implementation and performance characteristics are assumed to be the same as for a scan operator with equality selection.

Algorithm 3.4.2 shows how a scan with equality selection is performed on a column with a sorted dictionary. First, the value-id in the value dictionary of the column for the searched value x is retrieved by performing a binary search for x in the dictionary. Then, the value-ids of the column are scanned sequentially and each matching value-id is added to the set of results. The costs for an equal scan on a column with a sorted dictionary consist of the binary search cost in the dictionary and comparing each value-id, resulting in O(log(c.d) + c.r · c.e_c + q.s).

Algorithm 3.4.3 outlines a scan with equality selection on a column with an unsorted dictionary. As we are scanning the column searching for all occurrences of exactly one value, we can perform the comparison similarly as in the case of a sorted dictionary based only on

 


Algorithm 3.4.4: ScanRangeUncompressed(low, high, column)
  result ← array[], pos ← 0
  for each value ∈ column
    do if (value > low) and (value < high)
         then result.append(pos)
       pos ← pos + 1
  return (result)

 

Algorithm 3.4.5: ScanRangeDictSorted(low, high, column, dictionary)
  result ← array[], pos ← 0
  valueIdLow ← dictionary.binarySearch(low)
  valueIdHigh ← dictionary.binarySearch(high)
  for each valueId ∈ column
    do if (valueId > valueIdLow) and (valueId < valueIdHigh)
         then result.append(pos)
       pos ← pos + 1
  return (result)

Algorithm 3.4.6: ScanRangeDictUnsorted(low, high, column, dictionary)
  result ← array[], pos ← 0
  for each valueId ∈ column
    do value ← dictionary[valueId]
       if (value > low) and (value < high)
         then result.append(pos)
       pos ← pos + 1
  return (result)

 

the value-ids. The difference to the sorted dictionary case is that a linear search has to be performed instead of a binary search to retrieve the value-id of the searched value x. Similar to the costs for a scan with equality selection on a column with a sorted dictionary, the costs on a column with an unsorted dictionary consist of the search costs for the scanned value in the dictionary and comparing each value-id. In contrast to the sorted dictionary case, the search costs are linear, resulting in a complexity of O(c.d + c.r · c.e_c + q.s).

3.4.2 Scan with Range Selection

A scan operator with range selection sequentially iterates through all values of a column and returns a list of positions where the value in the column is between low and high. In contrast to a scan operator with an equality selection, it is searched for a range of values instead of one

 


Algorithm 3.4.7: LookupUncompressed(position, column)
  value ← column[position]
  return (value)

Algorithm 3.4.8: LookupDictionary(position, column, dictionary)
  valueId ← column[position]
  value ← dictionary[valueId]
  return (value)

single value.

Algorithm 3.4.4 outlines the implementation of a range scan in pseudocode for an uncompressed column. Similarly to a scan operator with equality selection, we can perform the comparisons directly on the values while iterating sequentially through the column. Therefore, the costs are determined by the value length c.e, the number of rows c.r and the selectivity q.s of the scan, resulting in O(c.r · c.e + q.s).

Algorithm 3.4.5 shows the implementation for the range scan on a dictionary encoded column with a sorted dictionary. First, the value-ids of low and high are retrieved with a binary search in the dictionary. As the dictionary is sorted, we know that id_low > id_high ⇒ value(id_low) > value(id_high). Therefore, we can scan the value-ids of the column and decide only by comparing with the value-ids of low and high whether the current value-id has to be a part of the result set. The costs are similar to the costs for a scan with equality selection, determined by the binary search costs, the scanning of the column and building the result set, resulting in O(log(c.d) + c.r · c.e_c + q.s).

Finally, Algorithm 3.4.6 shows the implementation of a scan operator with range selection on a column with an unsorted dictionary. As the dictionary is unsorted, we cannot draw any conclusions about the relation between two values based on their value-ids in the dictionary. We iterate sequentially through the value-ids of the column.
For each value-id, we perform a lookup retrieving the actual value stored in the dictionary. The comparison can then be performed on that value with low and high. In contrast to the sorted dictionary case, the costs in the unsorted case are determined by scanning the value-ids, performing the lookup in the dictionary and building the result set, resulting in a complexity of O(c.r · c.e_c + c.r · c.e + q.s).
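The sorted-dictionary variant (Algorithm 3.4.5), which avoids these per-row dictionary lookups, can be sketched in C++ as follows. Names and container types are illustrative; the exclusive predicate low < value < high from the pseudocode is translated into a half-open value-id interval using std::upper_bound and std::lower_bound:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Sketch of a range scan on a column with a sorted dictionary, selecting
// rows with low < value < high. Both bounds are translated to value-ids
// once; the scan itself then touches only the value-ids, never the values.
std::vector<std::size_t> scanRangeDictSorted(
        const std::vector<uint32_t>& column,
        const std::vector<std::string>& dict,
        const std::string& low, const std::string& high) {
    // First value-id strictly greater than low, first value-id not less
    // than high: together they form the half-open interval [idLow, idHigh).
    uint32_t idLow  = uint32_t(std::upper_bound(dict.begin(), dict.end(), low)
                               - dict.begin());
    uint32_t idHigh = uint32_t(std::lower_bound(dict.begin(), dict.end(), high)
                               - dict.begin());
    std::vector<std::size_t> result;
    for (std::size_t pos = 0; pos < column.size(); ++pos)
        if (column[pos] >= idLow && column[pos] < idHigh)
            result.push_back(pos);
    return result;
}
```

Because the dictionary order matches the value order, two binary searches replace the c.r dictionary lookups that the unsorted variant has to perform.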

 


Algorithm 3.4.9: InsertDictSorted(value, column, dictionary)
  valueId ← dictionary.binarySearch(value)
  if valueId = NotInDictionary
    then valueId ← dictionary.insert(value)
         if valueId < dictionary.size() - 1
           then column.reencode()
  column.append(valueId)

  

Algorithm 3.4.10: InsertDictUnsorted(value, column, dictionary)
  valueId ← dictionary.scanFor(value)
  if valueId = NotInDictionary
    then valueId ← dictionary.insert(value)
  column.append(valueId)

3.4.3 Lookup

A positional lookup retrieves the value at a given position from the column. The output is the actual value, as the position is already known. Algorithm 3.4.7 shows the lookup in case of an uncompressed column, where the value can be returned directly. The costs only depend on the value length, resulting in a complexity of O(c.e).

In the case of a dictionary encoded column, the algorithms for a positional lookup do not differ between a sorted and an unsorted dictionary. Therefore, Algorithm 3.4.8 shows the lookup for a dictionary encoded column, where the value-id is retrieved from the requested position and a dictionary lookup is performed in order to retrieve the searched value. The costs depend on the compressed and the uncompressed value length, resulting in a complexity of O(c.e_c + c.e).

3.4.4 Insert

An insert operation appends a new value to the column. As we always keep the values in insertion order, this can be implemented as a trivial append operation, assuming enough free and allocated space to store the inserted value. In the case of a dictionary encoded column, we have to check if the value is already in the dictionary. Algorithm 3.4.9 outlines the insert operation for a column with a sorted dictionary. First, a

 


binary search is performed on the dictionary for value v. If v is not found in the dictionary, it is inserted so that the sort order of the dictionary is preserved. In case v is not inserted at the end of the dictionary, a re-encode of the complete column has to be performed in order to reflect the updated value-ids of the dictionary. After the re-encode, or if v was already found in the dictionary, the value-id is appended to the column. The complexity is in O(c.d · c.e + c.r · c.e_c).

Algorithm 3.4.10 shows the insertion of a new value for a column with an unsorted dictionary. Similarly to the sorted case, we first search for the inserted value in the dictionary by performing a linear search. As the dictionary is not kept in a particular order, the values are always appended to the end of the dictionary. Therefore, no re-encode of the column is necessary. The resulting complexity is O(c.d · c.e + c.e_c).
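The sorted-dictionary insert with its re-encode (Algorithm 3.4.9) can be sketched in C++ as follows. This is an illustrative sketch under our own naming, with the re-encode written as a simple linear pass that shifts every stale value-id by one:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Sketch of an insert into a column with a sorted dictionary: if the new
// value lands anywhere but the end of the dictionary, every stored
// value-id at or above the insertion point is off by one, so the whole
// column has to be re-encoded before the new value-id is appended.
void insertDictSorted(std::vector<uint32_t>& column,
                      std::vector<std::string>& dict,
                      const std::string& v) {
    auto it = std::lower_bound(dict.begin(), dict.end(), v);
    uint32_t id = uint32_t(it - dict.begin());
    if (it == dict.end() || *it != v) {
        dict.insert(it, v);
        if (id < dict.size() - 1)        // not appended at the end:
            for (auto& vid : column)     // re-encode the complete column
                if (vid >= id) ++vid;
    }
    column.push_back(id);
}
```

The re-encode pass over all c.r value-ids is what dominates the O(c.d · c.e + c.r · c.e_c) cost and what the unsorted-dictionary variant avoids entirely.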


 

Chapter 4

Parameter Evaluation

In the previous chapter, we defined plan operators and discussed their implementations and complexity depending on the parameters defined in Section 3.1. This chapter strives to experimentally verify the theoretical discussion of the parameters and their influence on plan operations.

We implemented all operators in a self designed experimental system written in C++. All experiments were conducted on an Intel Xeon X5650, with 2x6 cores, hyper-threading, 2.67 GHz and 48 GB main memory. The system has 32 KB L1 data cache, 256 KB unified L2 cache, 12 MB unified L3 cache and a two level TLB cache with 64 entries in the L1 data TLB and 512 entries in the unified L2 TLB. The TLB caches are 4-way associative, the L1 and L2 caches are 8-way associative and the L3 cache is 16-way associative. The data was generated in such a way that all parameters were fixed except the varied parameter. We compiled our programs using Clang++ version 3.0 with full optimizations. In order to avoid scheduling effects, we pinned the process to a fixed CPU and increased the process priority to its maximum. We used the PAPI framework to measure cache misses and CPU cycles [17].

4.1 Number of Rows

We start by discussing the influence of the number of rows c.r on the plan operator performance.

Scan with Equality and Range Selection   Figure 4.1(a) shows the time needed to perform a scan operator with equality selection on a column with the number of rows c.r varied from 2 million to 20 million. The time for the scan operation increases linearly with the number of rows, whereas the time per row stays constant. Similar to the scan with equality selection, Figure 4.1(b) shows the linear influence of the number of rows on the runtime for a scan with range selection.

 


Figure 4.1: CPU cycles for (a) equal scan, (b) range scan, (c) positional lookup and (d) insert on one column with number of rows c.r varied from 2 million to 20 million, c.d = 200,000, c.u = 0, c.e = 8, c.k = 0 and a query selectivity of q.s = 2,000.

Positional Lookup   For a positional lookup on a column, we expect the number of rows to have no influence on the performance of the lookup operation, as confirmed by Figure 4.1(c). The costs for performing a lookup on a dictionary encoded column are greater compared to an uncompressed column, as the value-id has to be retrieved first.

Insert   For inserting new values into a column, we do not expect the number of rows c.r to have an influence on the time an actual insert operation takes, regardless of whether the column is uncompressed or dictionary encoded. In case the column uses dictionary encoding with a sorted dictionary, we expect a linear influence of the number of rows in the column on the re-encode operation. Figure 4.1(d) shows the number of CPU cycles for an insert operation with a varying number of rows on an uncompressed column, a dictionary encoded column with a sorted dictionary, a dictionary encoded column with a sorted dictionary performing a re-encode,

 


Figure 4.2: CPU cycles for (a) equal scan, (b) range scan, (c) positional lookup and (d) insert on one column with number of distinct values c.d varied from 2^10 to 2^23, c.r = 2^23, c.u = 2^23, c.e = 8, c.k = 0 and a query selectivity of q.s = 2,000.

and a dictionary encoded column with an unsorted dictionary.

Inserting into an uncompressed column is very fast and not influenced by the number of rows in the column. In the case of a dictionary encoded column, the insert takes longer, as it takes time to search the dictionary for the inserted value. The large difference between the sorted and the unsorted dictionary is due to the effect of the binary search, which can be leveraged in the sorted dictionary case. Finally, we can see the linear impact of the number of rows on the insert with the re-encode operation (note the logarithmic scale).

 

4.2 Number of Distinct Values

We now focus on the number of distinct values   c.d   of a column and their influence on the operators scan, insert and lookup. Scan with Equality Equality Selection Selection   When When scanni scanning ng a column column with an equali equality ty sele selecti ction, on, we expect the number of distinct values to influence the dictionary encoded columns, but not the uncompress uncom pressed ed column. Figure 4.2(a) Figure 4.2(a) shows  shows the results of an experiment performing an equal 23 scan on a column with 2 rows and   c.d  varied   varied from 210 to 223 . We chose a selectivity of 2, 2 ,000 rows,, in order to keep the effect of writing the result set minimal. As expected, rows expected, the runtime runtime for the scan on the uncompressed column is not affected and we clearly see the linear impact on the unsorted unsorted dictionary column. Howeve However, r, the logarithmically logarithmically increasing increasing runtime for the column using a sorted dictionary is hard to recognize due to the large scale. Scan with Scan with Range Range Select Selection ion   In contrast contrast to a scan scan operator operator with equal equalit ity y select selection ion,, the implemen imple mentation tation of a range scan only differs in the case for an unsorted unsorted dictionary dictionary. The cases for an uncompressed column and a column with a sorted dictionary are the same as for a scan operator with equality selection, as Figure  4.2(b) shows.  4.2(b)  shows. For an unsorted dictionary encoded column, Figure  Figure   4.2(b)  shows a strong impact of the varied aried number number of distin distinct ct values values on the runtime runtime.. Based Based on our earlier earlier discus discussio sion n of a scan scan operator with range selection on an unsorted dictionary, we would not expect this characteristic. The increase in CPU cycles with increasing distinct values is due to a cache effect. 
As c.u = 2^23, we access the dictionary in a random fashion while iterating over the column. As long as the dictionary is small and fits into the cache, these accesses are relatively cheap. With a growing number of distinct values the dictionary gets too large for the individual cache levels and the number of cache misses per dictionary access increases. Considering a value length of 8 bytes, we can identify jumps slightly before each cache level size of 32 KB, 256 KB and 12 MB (e.g. at c.e · 2^15 = 256 KB), which are discussed in more detail in Section 5.2.

Lookup   Figure 4.2(c) shows the influence of the number of distinct values when performing a positional lookup on one column. As we can see, the time for one single lookup is not influenced by the number of distinct values.

Insert   Figure 4.2(d) shows the influence of the number of distinct values when inserting new values into a column. For an uncompressed column, the insert takes the same time, independent of the number of distinct values. When using dictionary compression and an unsorted dictionary, the time for inserting a new value increases linearly with the number of distinct values, as the dictionary is searched linearly for the new value. In case of a sorted dictionary without re-encoding the column, we notice a logarithmic increase respective to the binary search in the dictionary. If the newly inserted value changed existing value-ids, the column has to be re-encoded. As we can see, the increase for binary searching the dictionary is negligible, as it is shadowed by the high amount of work for re-encoding the column.
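The cache thresholds can be checked with a line of arithmetic: with c.e = 8-byte dictionary values, each cache level is exceeded once c.d grows beyond its capacity divided by the value length. The constants below are the cache sizes of the evaluation machine described at the beginning of this chapter:

```cpp
#include <cassert>
#include <cstdint>

// Back-of-the-envelope check of the cache thresholds discussed above:
// with c.e = 8-byte values, the dictionary outgrows a cache level at
// c.d = cacheSize / c.e distinct values.
constexpr uint64_t valueLen = 8;                         // c.e in bytes
constexpr uint64_t l1Limit = (32ull << 10) / valueLen;   //  32 KB L1 -> 2^12 entries
constexpr uint64_t l2Limit = (256ull << 10) / valueLen;  // 256 KB L2 -> 2^15 entries
constexpr uint64_t l3Limit = (12ull << 20) / valueLen;   //  12 MB L3 -> ~1.5 * 2^20 entries
```

This matches the jump visible slightly before c.e · 2^15 = 256 KB in Figure 4.2(b): once c.d exceeds 2^15 entries, the random dictionary accesses start missing the L2 cache.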

 

Figure 4.3: CPU cycles for (a) equal scan, (b) range scan, (c) positional lookup and (d) insert on one column with value disorder c.u varied from 0 to 20 million, c.r = 20 million, c.d = 2 million, c.e = 8, c.k = 0 and a query selectivity of q.s = 2,000.

4.3 Value Disorder

This section evaluates the influence of the parameter value disorder c.u in a column on the discussed operators.

Scan with Equality Selection   When performing a scan with equality selection, the comparisons can be done directly on the value-ids in case of a dictionary encoded column, or directly on the values in case of an uncompressed column. Therefore, we do not expect the value disorder to influence the performance of an equal scan. Figure 4.3(a) confirms this assumption.

Scan with Range Selection   Figure 4.3(b) shows the performance of a scan operator with range selection with varied disorder of values in the column. As expected, we see no influence

 


in case of an uncompressed column or a column with a sorted dictionary. In case of a dictionary encoded column with an unsorted dictionary, we see an increase in CPU cycles. In contrast to a scan with equality selection, a scan operator with range selection on an unsorted dictionary has to look up the actual values in the dictionary in order to compare them. When the value disorder is low, temporal and spatial locality for the dictionary accesses is high, which results in good cache usage with a high number of cache hits. The greater the disorder, the more random accesses to the dictionary are performed, resulting in more CPU cycles for the scan operation.

Lookup and Insert   We expect the value disorder to have no influence on single positional lookups and inserts. Figures 4.3(c) and 4.3(d) confirm our assumptions.

4.4 Value Length

We now focus on the influence of the value length c.e of the values in a column on the discussed operators.

Scan with Equality Selection   Figure 4.4(a) shows the influence of an increased value length on a scan operator with equality selection. For uncompressed columns we see an increase in cycles for scan operators with longer values, as expected. However, we see a drop for every 16 bytes due to alignment effects, showing how crucial it is to align memory correctly. In case of a dictionary compressed column with a sorted dictionary, we see no significant influence of the value length. Although the costs for the search do increase slightly, the increase is shadowed by the costs for scanning the value-ids. When using an unsorted dictionary, the costs for scanning the dictionary are significantly higher and we see a significant impact of the value length on the total scan costs.

Scan with Range Selection   The impact of an increasing value length on a scan operator with range selection is similar to a scan with equality selection, as shown by Figure 4.4(b). In case of an uncompressed column, we see an increase in costs with larger value lengths and the same alignment effect, as the values are compared directly and larger values result in larger costs for comparing them. We see no significant impact of the value length in the case of a sorted dictionary, but an increase in case of an unsorted dictionary.
Positional Lookup   The costs for a positional lookup increase linearly with increasing size of values, as the actual values are returned by the lookup operation, resulting in more work for longer values, as shown by Figure 4.4(c).

Insert   The costs for inserting new values always increase with larger values, as the costs for writing the values increase. Figure 4.4(d) shows that in case of an uncompressed column the increase is quite small, which probably is due to block writing and write caching effects. The costs on a column with a sorted dictionary also increase slightly, based on the increased costs for the binary search. However, in case a re-encode is necessary, the increase is shadowed

 

by the large costs for re-encoding the column. When using an unsorted dictionary, we also see a linear increase in costs with larger values.

Figure 4.4: CPU cycles for (a) equal scan, (b) range scan, (c) positional lookup and (d) insert on one column with value length c.e varied from 8 to 128 byte, c.r = 512,000, c.d = 100,000, c.u = 512,000, c.k = 0 and a query selectivity of q.s = 1,024.

4.5 Value Skewness

We now focus on the influence of the skewness c.k of the value distribution of the values in a column on the discussed operators.

Scan with Equality Selection, Lookup and Insert   The skewness of values influences the pattern in which the dictionary of a column is accessed when scanning its value-ids and looking them up in the dictionary: the more skewed the value distribution, the fewer cache misses

 

Figure 4.5: CPU cycles for (a) equal scan, (b) range scan, (c) positional lookup and (d) insert on one column with value skewness c.k varied from 0 to 2, c.r = 20 million, c.d = 200,000, c.u = 20 million, c.e = 8 and a query selectivity of q.s = 2,000.

occur. In case of a scan operator with equality selection, positional lookup or insert, we do not have this pattern of scanning the column and accessing the dictionary. Therefore, we do not expect the skewness of values in a column to influence these operators, which is confirmed by the experimental results shown in Figures 4.5(a), 4.5(c) and 4.5(d).

Scan with Range Selection   In contrast, Figure 4.5(b) shows the influence of the value skewness on a scan operator with range selection. In case of a dictionary encoded column with an unsorted dictionary, we scan the value-ids of the column sequentially and randomly access the value dictionary (value disorder c.u = 20 million). The more skewed the value distribution is, the more likely it gets that a value with a high frequency is accessed and is still in the cache. Therefore, the number of cache misses is reduced for skewed value distributions, resulting in a faster execution of the scan operator.

 


4.6 Conclusions

In this section, we presented a detailed parameter evaluation for the algorithms presented in Chapter 3. We presented a set of experimental results, giving an overview of the influences of our defined parameters on the operators scan with equality selection, scan with range selection, positional lookup and insert.

Additionally, we discussed the expected influences based on the complexity discussion of the operators presented in Section 3.4 and confirmed our expectations. However, we found some cases that we did not predict based on the complexity discussion, like the influence of the number of distinct values or the effect of the value skewness on a scan operator with range selection in case an unsorted dictionary is used. These influences are based on caching effects, which will be discussed in detail in the following chapter.


Chapter 5

Estimating Cache Misses

In our experimental validation of the parameter influences on the operators scan with equality selection, scan with range selection, insert and lookup, we found significant influences introduced by cache effects. First, this chapter describes the memory hierarchy of modern computer systems and discusses why the usage of caches is so important for the performance of applications. Second, we introduce a cost model to predict the number of cache misses for plan operators and to estimate their execution time.

5.1 Background on Caches

In early computer systems, the frequency of the Central Processing Unit (CPU) was the same as the frequency of the memory bus, and register access was only slightly faster than memory access. However, CPU frequencies have increased heavily in the last years following Moore's Law [55], while the frequencies of memory buses and the latencies of memory chips did not grow at the same speed. As a result, memory access gets more expensive, as more CPU cycles are wasted while stalling for memory access. This development is not due to the fact that fast memory cannot be built; it is an economical decision, as memory which is as fast as current CPUs would be orders of magnitude more expensive and would require extensive physical space on the boards. In general, memory designers have the choice between Static Random Access Memory (SRAM) and Dynamic Random Access Memory (DRAM).

5.1.1 Memory Cells

SRAM cells are usually built out of six transistors (although variants with only four do exist, they have disadvantages [53, 18]) and can store a stable state as long as power is supplied. Their

 

stored state is immediately available for reading, and no synchronization or wait periods have to be considered. In contrast, DRAM cells can be constructed using a much simpler structure consisting of only one transistor and a capacitor. The state of the memory cell is stored in the capacitor, while the transistor is only used to guard the access to the capacitor. This design is more economical compared to SRAM. However, it introduces a couple of complications. First, the capacitor discharges over time and while reading the state of the memory cell. Therefore, today's systems refresh DRAM chips every 64 ms and after every read of the cell in order to recharge the capacitor [16]. During the refresh, no access to the state of the cell is possible. The charging and discharging of the capacitor takes time, which means that the current cannot be detected immediately after requesting the stored state, therefore limiting the speed of DRAM cells.

In a nutshell, SRAM is fast but expensive, as it requires a lot of space. In contrast, DRAM chips are slower but cheaper, as their simpler structure allows larger chips. For more details regarding the two types of Random Access Memory (RAM) and their physical realization, the interested reader is referred to [18].

Figure 5.1: Memory hierarchy on the Intel Nehalem architecture.

5.1.2 Memory Hierarchy

An underlying assumption of the memory hierarchy of modern computer systems is a principle known as data locality [29]. Temporal data locality indicates that accessed data is likely to be accessed again soon, whereas spatial data locality indicates that data stored together in memory is likely to be accessed together. These principles are leveraged by using caches, combining the best of both worlds: the fast access to SRAM chips and the sizes made possible by DRAM chips. Figure 5.1 shows the memory hierarchy on the example of the Intel Nehalem architecture. Small and fast caches close to the CPUs, built out of SRAM cells, cache accesses to the slower main memory built out of DRAM cells. Therefore, the hierarchy

 

consists of multiple levels with increasing storage sizes but decreasing speed. Each CPU core has its private L1 and L2 cache and one large L3 cache shared by the cores on one socket. Additionally, the cores on one socket have direct access to their local part of main memory through an Integrated Memory Controller (IMC). When accessing other parts than their local memory, the access is performed over a Quick Path Interconnect (QPI) controller coordinating the access to the remote memory.

The first level is the actual registers inside the CPU, used to store inputs and outputs of the processed instructions. Processors usually only have a small number of integer and floating point registers, which can be accessed extremely fast. When working with parts of the main memory, their content first has to be loaded and stored in a register to make it accessible for the CPU. However, instead of accessing the main memory directly, the content is first searched in the Level 1 Cache (L1 Cache). If it is not found in the L1 Cache, it is requested from the Level 2 Cache (L2 Cache). Some systems even make use of a Level 3 Cache (L3 Cache).

Figure 5.2: Parts of a memory address: tag T, set S and offset O.

5.1.3 Cache Internals

Caches are organized in cache lines, which are the smallest addressable unit in the cache. If the requested content cannot be found in any cache, it is loaded from main memory and transferred down the hierarchy. The smallest transferable unit between each level is one cache line. Caches where every cache line of level i is also present in level i + 1 are called inclusive caches, otherwise they are called exclusive caches. All Intel processors implement an inclusive cache model, which is why an inclusive cache model is assumed for the rest of this report.

When requesting a cache line from the cache, the process of determining if the requested line is already in the cache and locating where it is cached is crucial. Theoretically, it is possible to implement fully associative caches, where each cache line can cache any memory location. However, in practice this is only realizable for very small caches, as a search over the complete cache is necessary when searching for a cache line. In order to reduce the search space, the concept of an n-way set associative cache with associativity A_i divides a cache of size C_i into C_i/B_i/A_i sets and restricts the number of cache lines which can hold a copy of a certain memory address to one set, i.e. A_i cache lines. Thus, when determining if a cache line is already present in the cache, only one set with A_i cache lines has to be searched.

For determining if a requested address is already cached, the address is split into three parts as shown by Figure 5.2. The first part is the offset O, whose size is determined by the cache line size of the cache. So with a cache line size of 64 byte, the lower 6 bits of the address would

 


be used as the offset into the cache line. The second part identifies the cache set. The number s of bits used to identify the cache set is determined by the cache size C_i, the cache line size B_i and the associativity A_i of the cache as s = log2(C_i/B_i/A_i). The remaining 64 − o − s bits of the address are used as a tag to identify the cached copy. Therefore, when requesting an address from main memory, the processor can calculate S by masking the address and then search the respective cache set for the tag T. This can be done easily by comparing the tags of the A_i cache lines in the set in parallel.

5.1.4 Address Translation

The operating system provides each process with a dedicated continuous address space, containing an address range from 0 to 2^x. This has several advantages, as the process can address the memory through virtual addresses and does not have to bother with the physical fragmentation. Additionally, memory protection mechanisms can control the access to memory and restrict programs from accessing memory that was not allocated to them. Another advantage of virtual memory is the use of a paging mechanism, which allows a process to use more memory than is physically available by paging pages in and out and saving them on secondary storage.

The continuous virtual address space of a process is divided into pages of size p, which is 4 KB on most operating systems. Those virtual pages are mapped to physical memory. The mapping itself is saved in a page table, which resides in main memory itself. When the process accesses a virtual memory address, the address is translated by the operating system into a physical address with the help of the memory management unit inside the processor.

The address translation is usually done by a multi-level page table, where the virtual address is split into multiple parts which are used as indices into the page directories, resulting in a physical address and a respective offset. As the page table is kept in main memory, each translation of a virtual address into a physical address would require additional main memory accesses or cache accesses in case the page table is cached. In order to speed up the translation process, the computed values are cached in the Translation Look-Aside Buffer (TLB), which is a small and fast cache.
When accessing a virtual address, the respective tag for the memory page is calculated by masking the virtual address, and the TLB is searched for the tag. In case the tag is found, the physical address can be retrieved from the cache. Otherwise, a TLB miss occurs and the physical address has to be calculated, which can be quite costly. Details about the address translation process, TLBs and paging structure caches for Intel 64 and IA-32 architectures can be found in [36]. The costs introduced by the address translation scale linearly with the width of the translated address [29, 15], therefore making it hard or impossible to build large memories with very small latencies.

 


5.1.5 Prefetching

Modern processors try to guess which data will be accessed soon and initiate loads before the data is accessed, in order to reduce the incurring access latencies. Good prefetching can completely hide the latencies, so that the data is already in the cache when accessed. However, if data is loaded which is not accessed later, it can also evict data, inducing additional misses. Processors support software and hardware prefetching. Software prefetching can be seen as a hint to the processor, indicating which addresses are accessed next. Hardware prefetching automatically recognizes access patterns by utilizing different prefetching strategies. The Intel Nehalem architecture contains two second level cache prefetchers – the L2 streamer and the data prefetch logic (DPL) [37]. The prefetching mechanisms only work inside page boundaries of 4 KB, in order to avoid triggering expensive TLB misses.

5.2 Cache Effects on Application Performance

The described caching and virtual memory mechanisms are implemented transparently from the viewpoint of an actual application. However, knowing the system and its characteristics can have crucial implications on application performance.

5.2.1 The Stride Experiment

As the name random access memory suggests, the memory can be accessed randomly and one would expect constant access costs. In order to test this assumption, we ran a simple benchmark accessing a constant number (4,096) of addresses with an increasing stride between the accessed addresses. We implemented this benchmark by iterating through an array, chasing a pointer. The array is filled with structs so that following the pointer of the elements creates a circle through the complete array. The structs consist of a pointer and an additional data attribute realizing the padding in memory, resulting in a memory access with the desired stride when following the pointer chained list. In case of a sequential array, the pointer of element i points to element i + 1 and the pointer of the last element references the first element, so that the circle is closed. In case of a random array, the pointer of each element points to a random element of the array, while ensuring that every element is referenced exactly once.

If the assumption holds and random memory access costs are constant, then the size of the padding in the array and the array layout (sequential or random) should make no difference when iterating over the array. Figure 5.3 shows the result for iterating through a list with 4,096 elements, while following the pointers inside the elements and increasing the padding between the elements. As we can clearly see, the access costs are not constant and increase with an increasing stride. We also see multiple points of discontinuity in the curves, e.g. the

 

Figure 5.3: Cycles for cache accesses with increasing stride.

Figure 5.4: Cache misses for cache accesses with increasing stride.

random access times increase heavily up to a stride of 64 byte and continue increasing with a smaller slope. Figure 5.4 confirms the assumption that an increasing number of cache misses is causing the increase in access times. The first point of discontinuity in Figure 5.3 is roughly the size of the cache lines of the test system. The strong increase is due to the fact that, with a stride smaller than 64 byte, multiple list elements are located on one cache line and the overhead of loading one line is amortized over the multiple elements.

For strides greater than 64 byte, we would expect a cache miss for every single list element and no further increase in access times. However, as the stride gets larger, the array is spread over multiple pages in memory and more TLB misses occur, as the virtual addresses on the new pages have to be translated into physical addresses. The number of TLB misses increases up to the page size of 4 KB and stays at its worst case of one miss per element.

 


With strides greater than the page size, the TLB misses can induce additional cache misses when translating the virtual to a physical address. These cache misses are due to accesses to the paging structures, which reside in main memory [4, 3, 64].

Figure 5.5: Cycles and cache misses for cache accesses with increasing working sets.

5.2.2 The Size Experiment

In a second experiment, we access a constant number of addresses in main memory with a constant stride of 64 byte and vary the size of the working set, i.e. the accessed area in memory. A run with n memory accesses and a working set size of s byte iterates (64 · n)/s times through the array.

Figure 5.5(a) shows that the access costs differ by up to a factor of 100, depending on the working set size. The points of discontinuity correlate with the sizes of the caches in the system. As long as the working set size is smaller than the size of the L1 Cache, only the first iteration results in cache misses and all other accesses can be answered out of the cache. As the working set size increases, the accesses in one iteration start to evict the earlier accessed addresses, resulting in cache misses in the next iteration. Figure 5.5(b) shows the individual cache misses with increasing working set sizes. Up to working sets of 32 KB, the misses for the L1 Cache go up to one per element; the L2 cache misses reach their plateau at the L2 cache size of 256 KB and the L3 cache misses at 12 MB.

5.3 A Cache-Miss Based Cost Model

In traditional disk based systems, IO operations are counted and used as a measure for costs. For in-memory database systems, IO operations are not of interest and the focus shifts to

 

description             unit     symbol
cache level             -        i
cache capacity          [byte]   C_i
cache block size        [byte]   B_i
number of cache lines   -        #_i = C_i / B_i

Table 5.1: Cache Parameters with i with  i  ∈ {1,...,N } main memory accesses. accesses. The initial examples examples from Section Section 5.2,  5.2, measuring  measuring memory access costs for varying strides and working set sizes, showed that memory access costs are not constant and an essential essential factor of the overall overall costs. In general, assuming assuming all data is in main memory, memory, the total execution time of an algorithm can be separated into computing time and time for accessing data in memory [51] [ 51].. T Total  + T   T M em CP P U  + T otal  =  T C

 

(5.1)

In case the considered algorithms are close to bandwidth bound, T_Mem is the dominant factor driving the execution time. Additionally, modeling T_CPU requires internal knowledge of the used processor, is very implementation specific and also depends on the resulting machine code created by the compiler, which makes it extremely hard to model. As our considered operators on the various data structures only perform a small amount of computations while accessing large amounts of data, we assume our algorithms to be bandwidth bound and believe T_Mem to be a good estimation of overall costs.

As discussed in Section 5.2, the costs for accessing memory can vary significantly due to the underlying memory hierarchy and mechanisms like prefetching and virtual address translation. Therefore, the costs for accessing memory can be quantified by the number of cache misses on each level in the memory hierarchy, assuming constant costs for each cache miss.

In this section, we provide explicit functions to calculate the estimated number of cache misses for each operator and column organization. Some cost functions are based on the work presented by Manegold and Kersten [51], who present a generic cost model estimating the execution times of algorithms based on their cache misses by modeling basic access patterns and an algebra combining basic patterns into more complex ones. However, we develop our own cost functions, as they are specifically designed for our operators and column organization schemes. We develop parameterized functions estimating the number of cache misses for each operation. With the specific parameters for each cache level, the cache misses on that level can be predicted.

Furthermore, the total costs can be calculated by multiplying the number of cache misses with the latency of the next level in the hierarchy as proposed in [51]. Measuring the

 


individual cache level latencies requires accurate calibration and is very system specific. As a simpler and more robust estimation, we use the number of cache misses as a direct indicator for the resulting number of cycles, only roughly weighting the different cache levels.

The cache level in the hierarchy is indicated by i, whereas the TLB is treated as an additional level in the hierarchy. The cache line size or block size of a respective level is given by B_i and the size by C_i. The number of cache lines at level i is denoted by #_i. Table 5.1 gives an overview of the required parameters of the memory hierarchy.

The function M_i(o, c) describes the estimated number of cache misses for an operation o on a column c. The operations are escan, rscan, lookup and insert. The respective physical column organization is given by a subscript indicating A) an uncompressed column, B) a dictionary encoded column with a sorted dictionary and C) a dictionary encoded column with an unsorted dictionary.

5.3.1 Scan with Equality Selection

We start by developing functions estimating the number of cache misses for a scan operator with equality selection on an uncompressed column. The scan consists of sequentially iterating over the column, resulting in one random miss and as many sequential misses as the column covers cache lines.

M_i(escan_A, c) = (c.r · c.e) / B_i

In case the column is dictionary encoded with a sorted dictionary, the binary search for the searched value results in log2(c.d) random cache misses, while the sequential scan over the compressed value-ids results in as many cache misses as the compressed column covers cache lines.

M_i(escan_B, c) = (c.r · c.e_c) / B_i + log2(c.d) · (c.e / B_i)

The number of cache misses for an equal scan on a column with an unsorted dictionary is similar as with a sorted dictionary, but instead of a binary search a linear search is performed, scanning on average half the dictionary.

M_i(escan_C, c) = (c.r · c.e_c) / B_i + (c.d · c.e) / (2 · B_i)

Figure 5.6(a) shows the number of L1 cache misses of an experiment performing a scan with equality selection on a column while varying the number of rows from 2 million to 20 million. The experimental results are compared to the predictions calculated based on the presented cost functions. The predicted cache misses closely follow the measured number of misses.

 


5.3.2 Scan with Range Selection

In case of an uncompressed column, a scan with range selection iterates sequentially over the uncompressed values, comparing the values with the requested range. Similarly, in case of a sorted dictionary, the searched values are retrieved from the dictionary with a binary search and the value-ids are scanned sequentially for the searched value-ids. In both cases, the resulting cache misses are the same as for a scan with equality selection.

In case of an unsorted dictionary, the scan operation sequentially iterates over the column and has to perform a random access into the dictionary due to the range selection. Regarding the random access into the dictionary, we assume that every value in the dictionary is accessed at least once. In the best case, the access to the dictionary is sequential, utilizing all values in a cache line. In the worst case, every access to the dictionary may result in a cache miss. The number of cache misses increases with increasing dictionary sizes relative to the cache size and the amount of disorder in the column. Therefore, we model the number of random misses by interpolating between 0 and the number of rows in the column. In order to smoothly interpolate between two values, we define the following helper functions. I_l is a simple linear interpolation function between y0 and y1, whereas t varies from 0 to 1. Furthermore, we define I_d as a decelerating interpolation function.

I_l(y0, y1, t) = y0 + t · (y1 − y0)
I_d(y0, y1, t) = I_l(y0, y1, 1 − (1 − t)^2)

Based on I_l and I_d, we construct I_c as a cosine-based interpolation function to smoothly interpolate between two values, as we found this interpolation type to fit well to the cache characteristics.

I_c(y0, y1, t) = I_l(y0, y1, (1 − cos(π · I_d(0, 1, t))) / 2)

Finally, we introduce I as a helper function modeling a function stepping smoothly from y0 to y1 around a location x0, whereas ρ indicates the range in which the interpolation is performed and τ the degree of how asymmetric the interpolation is. These values might be system specific and can be calibrated as needed.

I(x, x0, y0, y1, ρ, τ) =
  y0                                               : x < 2^(x0 − ρ)
  y1                                               : x ≥ 2^(x0 + ρ·τ)
  I_c(y0, y1, (log2(x) − x0 + ρ) / (ρ · (τ + 1)))  : else

If the number of covered cache lines C_i is smaller than the number of available cache lines #_i, every cache line is loaded at its first access and remains in the cache. For subsequent accesses, this cache line is already in the cache and the access does not create an additional cache miss.
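The four helper functions can be sketched as a direct transcription of the definitions above; the function names are our own, the math follows the text.

```python
# Interpolation helpers for the cache-miss model.
from math import cos, log2, pi

def i_lin(y0, y1, t):          # I_l: linear interpolation, t in [0, 1]
    return y0 + t * (y1 - y0)

def i_dec(y0, y1, t):          # I_d: decelerating interpolation
    return i_lin(y0, y1, 1 - (1 - t) ** 2)

def i_cos(y0, y1, t):          # I_c: cosine-based smooth interpolation
    return i_lin(y0, y1, (1 - cos(pi * i_dec(0, 1, t))) / 2)

def i_step(x, x0, y0, y1, rho, tau):
    """I: steps smoothly from y0 to y1 around 2**x0 on a log2 scale."""
    if x < 2 ** (x0 - rho):
        return y0
    if x >= 2 ** (x0 + rho * tau):
        return y1
    t = (log2(x) - x0 + rho) / (rho * (tau + 1))
    return i_cos(y0, y1, t)
```

At the lower boundary x = 2^(x0 − ρ) the interpolation argument t is 0 and the function returns y0; at x = 2^(x0 + ρ·τ) it is 1 and returns y1, so the piecewise definition is continuous.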

 


If C_i > #_i, then already loaded cache lines may be evicted from the cache by loading other cache lines. Subsequent accesses then have to load the same cache line again, producing more cache misses. The worst case is that every access to a cache line has to load the line again, because it was already evicted, resulting in c.r cache misses. Assuming randomly distributed values in a column, how often cache lines are evicted depends on the ratio of the number of cache lines #_i and the number of covered cache lines C_i. With increasing C_i the probability that cache lines are evicted before they are accessed again increases. This results in the number of random cache misses of:

M^r_i(rscan_C, c) = I(c.d, log2(C_i), 0, c.r, ρ, τ)

The number of sequential cache misses depends on the success of the prefetcher. In case no or only a few random cache misses occur, the prefetcher has not enough time to load the requested cache lines, resulting in sequential misses. With an increasing number of random cache misses, the time window for prefetching increases, resulting in fewer sequential cache misses. Assuming a page size of 4 KB, we found M^s_i to be a good estimation, as a micro benchmark showed that every three random cache misses when accessing the dictionary leave the prefetcher enough time to load subsequent cache lines. We calculate the number of sequential cache misses as follows:

M^s_i(rscan_C, c) = max( C(i, c) − M^r_i(rscan_C, c) / 3, C(i, c) / 4096 )

Additionally, we also have to consider extra penalties paid for TLB misses. In case an address translation misses the TLB and the requested page table entry is not present in the respective cache level, another cache miss occurs. In the worst case, this can introduce an additional cache miss for every dictionary lookup. Therefore, we calculate the number of TLB-induced misses by:

M^tlb_i(rscan_C, c) = I(c.d, log2(C_tlb · 4), 0, c.r, ρ, τ)

Finally, the total number of cache misses for a scan operation with range selection on a column with an unsorted dictionary is given by adding random, sequential and TLB misses. Figure 5.6(b) shows a comparison of the measured effect of an increasing number of distinct values on a range scan on a column with an unsorted dictionary with the predictions based on the provided cost functions. The figure shows the number of cache misses for each level and the model correctly predicts the jumps in the number of cache misses.

 


5.3.3 Lookup

A lookup on an uncompressed column results in as many cache misses as one value covers cache lines on the respective cache level.

M_i(lookup_A, c) = c.e / B_i

In case the column is dictionary encoded, it makes no difference if the lookup is performed on a column with a sorted or an unsorted dictionary, hence we provide one function M_i(lookup_{B/C}) for both cases.

M_i(lookup_{B/C}, c) = c.e_c / B_i + c.e / B_i
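As a sketch, the two lookup cost functions reduce to a single small helper; parameter names are illustrative (value_size = c.e, compressed_size = c.e_c, block_size = B_i).

```python
# Estimated cache misses at one cache level for a positional lookup.
def lookup_misses(dictionary_encoded, value_size, compressed_size, block_size):
    if not dictionary_encoded:
        return value_size / block_size        # read the value directly
    # read the value-id from the column, then the value from the dictionary
    return compressed_size / block_size + value_size / block_size
```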

Figure 5.6(c) shows a comparison of the predicted number of L1 cache misses and the experimental results for a lookup operation while varying the number of rows. The predicted number of cache misses closely matches the experimental results.

5.3.4 Insert

The insert operation is the only operation we consider writing to main memory. Although it is not quite accurate, we will treat write accesses similar to reading from main memory and consider the resulting cache misses.

M_i(insert_A, c) = c.e / B_i

In case we perform an insert into a column with a sorted dictionary, we first perform a binary search determining if the value is already in the dictionary, before writing the value and value-id, assuming the value was not already in the dictionary.

M_i(insert_B, c) = c.e / B_i + c.e_c / B_i + log2(c.d) · (c.e / B_i)

The number of cache misses in the unsorted dictionary case is similar to the sorted dictionary case, although the cache misses for the search depend linearly on the number of distinct values.

M_i(insert_C, c) = c.e / B_i + c.e_c / B_i + (c.d / 2) · (c.e / B_i)

Figure 5.6(d) shows the predicted number of cache misses for an insert operation; the prediction closely matches the experimental results.
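The three insert cost functions above can be sketched in a few lines; names are illustrative stand-ins for the report's symbols and sizes are in bytes.

```python
# Estimated cache misses at one cache level for inserting one value.
from math import log2

def insert_misses(org, value_size, compressed_size, distinct, block_size):
    """org: 'A' uncompressed, 'B' sorted dictionary, 'C' unsorted dictionary."""
    write_value = value_size / block_size           # write the (new) value
    if org == 'A':
        return write_value
    write_id = compressed_size / block_size         # append the value-id
    if org == 'B':                                  # binary search in dictionary
        search = log2(distinct) * value_size / block_size
    else:                                           # 'C': scan half on average
        search = (distinct / 2) * value_size / block_size
    return write_value + write_id + search
```

The linear term for the unsorted dictionary dominates quickly: already at a few thousand distinct values the search misses dwarf the two write terms, which is exactly the motivation for the dictionary index discussed in Chapter 6.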

 


Figure 5.6: Evaluation of predicted cache misses. Predicted cache misses for (a) equal scan, (b) range scan, (c) lookup and (d) insert. For (a), (c) and (d), the number of rows c.r was varied from 2 million to 20 million, c.d = 200,000, c.u = 0, c.e = 8, c.k = 0 and a query selectivity of q.s = 2,000. For (b), the number of distinct values was varied from 1024 to 2^23, c.r = 2^23, c.u = 2^23, c.e = 8, c.k = 0.

 


 

Chapter 6

Index Structures

This chapter discusses the influence of index structures on top of the evaluated data structures and their influence on the discussed plan operators. First, we extend the unsorted dictionary case by adding a tree structure on top, keeping a sorted order and allowing binary searches. Second, we discuss the influence of inverted indices on columns and extend our model to reflect these changes.

Indices have been studied by many researchers and database literature uses the term index in many ways. Some disk based database systems define an index as an alternative sort order of a table on one or more attributes. Thus, by leveraging the index structure, a query can search a table in logarithmic complexity with binary search. In disk based systems, it is often assumed that the index structure is in memory and accessing it is cheap, as accessing the relation stored on secondary storage is the main cost factor.

In the field of text search and search engines, an inverted index maps words to documents, so for every word a list of matching document-ids is maintained. The index is called inverted, as a document traditionally is a list of words and the index enables a mapping from words to documents. Recent literature mentions inverted indices [59] for main memory column stores. However, we think the term inverted index is misleading at that point as a classical index also provides the mapping from values to record-ids. In this report, we assume an index to be a data structure that allows us to efficiently find tuples satisfying a given search condition.
As tuples are stored in insertion order, we assume an index to be a separate auxiliary data structure on top of a column, not affecting the placement of values inside the column. Furthermore, we distinguish between column indices and dictionary indices. A column index is built on top of the values of one column, e.g. by creating a tree structure to enable binary search on the physically unsorted values in the column. In contrast, a dictionary index is a B+-Tree built only on the distinct values of a column, enabling binary searching an unsorted dictionary in order to find the position of a given value in the dictionary.

 

Figure 6.1: Example dictionary index.

Column and dictionary indices are assumed to be implemented as B+-Tree structures. A B+-Tree is a tree structure supporting insertions, deletions and searches in logarithmic time. B+-Trees are optimized for systems reading blocks of data, as is the case for disk based systems but also when accessing main memory and reading cache lines. In contrast to B-Trees, the internal nodes of B+-Trees store copies of the keys and the information is stored exclusively in the leaves, including a pointer to the next leaf node for sequential access. We denote the fan out of a tree index structure with I_f and the number of nodes needed to store c.d keys with I_n. The fan out constrains the number n of child nodes of all internal nodes to I_f/2 ≤ n ≤ I_f. I_{B_i} denotes the number of cache lines covered per node at cache level i. The number of matching keys for a scan with a range selection is denoted by q.n_k and q.n_v denotes the average number of occurrences of a key in the column.

6.1 Dictionary Index

A dictionary index is defined as a B+-Tree structure [13] on top of an unsorted dictionary. Figure 6.1 shows an example of an uncompressed column, which is encoded with an unsorted dictionary. The compressed column and the unsorted dictionary are stored as described in Section 3.2. Additionally, a B+-Tree with a branching factor I_f = 2 is maintained as a dictionary index. We now discuss the influence of a dictionary index on our examined operations and compare a column using an unsorted dictionary with and without a dictionary index.

Scan with Equality Selection   The algorithm on a column with a dictionary index is similar to Algorithm 3.4.3, except that for retrieving the value-id we leverage the dictionary index and perform a binary search. The number of cache misses regarding the sequential scanning of the value-ids of the column stays the same as in Equation 5.3.1 and the costs for the binary search

 

logarithmically depend on the number of distinct values of the column.

Figure 6.2: Influence of a dictionary index on CPU cycles for (a) a scan with an equality selection and (b) inserting new values into a column with an unsorted dictionary while varying the number of distinct values. c.r = 10M, c.e = 8, c.u = 0, c.k = 0.5, q.s = 1000.

Figure 6.2(a) shows how the costs for a scan with equality selection develop with an increasing number of distinct values for a column using an unsorted dictionary without a dictionary index compared to a column with a dictionary index. We notice similar costs for the scan operation on columns with few distinct values. However, as the dictionary grows, the costs for linearly scanning the dictionary increase linearly in case of not using a dictionary index, while the costs with an index only increase slightly due to the logarithmic cost for the binary search, resulting in better performance when using a dictionary index.

Scan with Range Selection   Although the dictionary index maintains a sort order over the dictionary, the value-ids of two values still allow no conclusions about which value is larger or smaller, as would be the case with a sorted dictionary. Therefore, a scan with a range selection still needs to look up and compare the actual values as described in Algorithm 3.4.6.

Insert a Record   One main cost factor for inserting new values into a column with an unsorted dictionary is the linear search determining if the value is already in the dictionary. This can be accelerated through the dictionary index, although it comes with the costs of maintaining the tree structure as outlined in Algorithm 6.1.1.
Assuming the new value is not already in the dictionary, the costs for inserting it are writing the new value in the dictionary, writing the compressed value-id, performing the binary search in the index and adding the new value to the index.

Lookup a Record   Looking up a record is not affected by a dictionary index as the index can not be leveraged for performing the lookup and does not have to be maintained. Therefore

 


Algorithm 6.1.1: InsertDictIndex(column, dictionary, value)
  valueId ← dictionary.index.binarySearch(value)
  if valueId = NotInDictionary
    then valueId ← dictionary.insert(value)
         dictionary.index.insert(value, valueId)
  column.append(valueId)

the algorithm as outlined in 3.4.8 is also applicable when using a dictionary index.
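The insert path of Algorithm 6.1.1 can be sketched as follows. This is an illustrative stand-in, not the report's implementation: a sorted Python list of (value, value-id) pairs plays the role of the B+-tree dictionary index, and all class and method names are our own.

```python
# Unsorted dictionary plus a sorted index over its distinct values.
import bisect

class DictIndexColumn:
    def __init__(self):
        self.column = []      # compressed column: value-ids in insertion order
        self.dictionary = []  # unsorted dictionary: distinct values
        self.index = []       # sorted (value, value-id) pairs, B+-tree stand-in

    def insert(self, value):
        pos = bisect.bisect_left(self.index, (value,))   # binary search
        if pos < len(self.index) and self.index[pos][0] == value:
            value_id = self.index[pos][1]                # value already known
        else:                                            # new distinct value
            value_id = len(self.dictionary)
            self.dictionary.append(value)
            self.index.insert(pos, (value, value_id))
        self.column.append(value_id)

    def scan_equal(self, value):
        """Equality scan: binary search for the value-id, then scan the column."""
        pos = bisect.bisect_left(self.index, (value,))
        if pos == len(self.index) or self.index[pos][0] != value:
            return []
        vid = self.index[pos][1]
        return [i for i, v in enumerate(self.column) if v == vid]
```

Note that the dictionary itself keeps insertion order, so existing value-ids in the column stay valid; only the auxiliary index is kept sorted.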

Conclusions

Considering the discussed operations, a dictionary encoded column always profits from using a dictionary index. Therefore, we do not provide adapted cost functions for a dictionary index, as we do not have to calculate in which cases it is advisable to use one. Even insert operations profit from the index, as the dictionary can be searched with logarithmic costs, which outweighs the additional costs of index maintenance, as shown by Figure 6.2(b). The costs for index maintenance overshadow the saved costs only in cases with very few distinct values, where scanning the dictionary is similarly fast as a binary search. However, the additional costs in these cases are very small and a dictionary index for unsorted dictionaries is still advisable.

6.2 Column Index

A column index can be any auxiliary data structure accelerating the search of tuples given a value or range of values for an attribute. The most popular methods are hash-based indexing and tree-based indexing as described in [61] [ 61].. Hash-based indexing groups all values in a column into buckets  into  buckets , based on a hash function determining which value value belongs into which bucket. A bucket consists of a linked chain of v values alues and can hold a variable number of values. When searching for a specific value, calculating the value of the hash function and accessing the resulting bucket can be achieved in constant time, assumi ass uming ng a good distri distribut bution ion ov over er all buckets buckets.. Howe Howeve ver, r, support support for select selection ionss with with range range queries is not given using hash-based indexing. Tree-based indexing allows for fast selections using equality predicates and also supports efficient efficie nt range queries. queries. Therefore, Therefore, we assume a column index to be b e a B+ -Tree structure, similar to the dictionary index index described above. above. Howeve However, r, the index is built on top of the complete complete column col umn and not only on the distin distinct ct values values.. Theref Therefore ore,, the index index does not only store one 54

 

position, but has to store a list of positions for every value. A column index can be added to any column, regardless of the physical organization of the column. Figure 6.3 shows an example of a column index on an uncompressed column.

Figure 6.3: Example column index.

Lookup   Performing a positional lookup on a column can not profit from a column index and also does not require any maintenance operations for the index. Therefore the algorithm as outlined in 3.4.8 is still applicable.

Search with Equality Selection   A search with equality selection can be answered entirely by using the column index. Therefore, the costs do not depend on the physical layout of the column and the same algorithm can be used for all column organizations, as outlined by Algorithm 6.2.1. First, the index is searched for value X by binary searching the tree structure, resulting in a list of positions. If the value is not found, an empty list is returned. The resulting list of positions then has to be converted into the output format by adding all positions to the result array. Locating the leaf node for the searched key requires reading log_{I_f}(I_n) · I_{B_i} cache lines for reading every node from the root node to the searched leaf node, assuming each accessed node lies on a separate cache line.
Then, iterating through the list of positions and adding every position to the result array requires reading and writing q.n_v / B_i cache lines, assuming the positions are placed sequentially in memory.

M_i(escan_I) = log_{I_f}(I_n) · I_{B_i} + 2 · q.n_k · (q.n_v / B_i)    (6.1)
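Equation (6.1) can be sketched directly; all parameter names are our own stand-ins for the report's symbols (fan_out = I_f, n_nodes = I_n, node_lines = I_{B_i}, n_keys = q.n_k, n_values = q.n_v, block_size = B_i).

```python
# Estimated cache misses for an equality search via a column index.
from math import log

def index_escan_misses(fan_out, n_nodes, node_lines, n_keys, n_values, block_size):
    tree_descent = log(n_nodes, fan_out) * node_lines    # root-to-leaf walk
    emit = 2 * n_keys * (n_values / block_size)          # read + write positions
    return tree_descent + emit
```

The first term is independent of the result size, so for highly selective queries the tree descent dominates, while for frequent keys the cost is governed by emitting the position lists.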

Search with Range Selection   Similarly to the search with equality selection, a search with range selection can be answered entirely by using the column index, as outlined by Algorithm 6.2.2. Assuming the range selection matches any values, we locate the node with the first matching value by performing a binary search on the column index. The number of cache misses for the binary search is log_{I_f}(I_n) · I_{B_i}. Then, we sequentially retrieve the next nodes by following the next pointer of each node until we find a node with a key greater or equal to high. Assuming completely filled nodes, this requires reading all nodes containing the q.n_k

 


Algorithm 6.2.1: SearchEqualIndex(X, columnIndex)
  result ← array[]
  node ← columnIndex.binarySearch(X)
  if node not NotFound
    then for each pos ∈ node.positionList
           do result.append(pos)
  return (result)

Algorithm 6.2.2: SearchRangeIndex(X, Y, columnIndex)
  result ← array[]
  node ← columnIndex.binarySearch(X)
  if node not NotFound
    then while node.key < Y
           do for each pos ∈ node.positionList
                do result.append(pos)
              node ← node.next
  return (result)

matching keys, resulting in q.n_k / I_f nodes. For all matching nodes the positions are added to the result array, requiring reading and writing q.n_v / B_i cache lines per key.

M_i(rscan_I) = log_{I_f}(I_n) · I_{B_i} + I_{B_i} · (q.n_k / I_f) + 2 · q.n_k · (q.n_v / B_i)    (6.2)

Insert   Inserting new values into the physical organization of a column is not affected by a column index. However, the new value also has to be inserted into the column index. The costs incurred for the index maintenance are independent of the physical organization of the column. This requires searching the tree structure for the inserted value, reading log_{I_f}(I_n) · I_{B_i} cache lines. If the value already exists, the newly inserted position is added to the list of positions of the respective node; otherwise the value is inserted and the tree potentially has to be rebalanced. The costs for rebalancing are on average log_{I_f}(I_n) · I_{B_i}.

M_i(insert_I) = 2 · log_{I_f}(I_n) · I_{B_i}    (6.3)
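As a quick plausibility check of Equation 6.3 (with illustrative numbers, not measurements from the report):

```python
import math

# Estimated cache misses for an insert into the column index, following
# Equation 6.3: M_i(insert_I) = 2 * log_{I_f}(I_n) * I_{B_i}.
def insert_cache_misses(fanout, distinct_values, lines_per_node):
    return 2 * math.log(distinct_values, fanout) * lines_per_node

# A tree with fan-out 16 over one million distinct values and one cache
# line per node costs roughly 10 cache misses per index insert.
```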

Evaluation   Figure 6.4(a) shows a comparison for a range scan on a column index compared to a column with a sorted dictionary and without an index. The figure shows the resulting CPU cycles for

 

[Figure 6.4, four panels: (a) range scan performance with and without column index; (b) cache misses for range scan with column index, with model predictions for L1, L2, L3 and TLB; (c) insert performance with and without column index; (d) L1 and TLB misses for insert with column index, with predictions.]

Figure 6.4: CPU cycles for (a) scan with a range selection and (c) inserting new values into a column with and without a column index. (b) and (d) show the respective cache misses for the case of using a column index. c.r = 10M, c.d = 1M, c.e = 8, c.u = 0, c.k = 0

the scan operation with increasing result sizes. For small results the index performs better, but around a selectivity of roughly 4 million the complete scan performs better due to its sequential access pattern. Figure 6.4(b) shows the resulting cache misses for the scan operation using the column index and the predictions based on the defined model. Figure 6.4(c) shows the costs for inserting a new value into a column using dictionary encoding with an unsorted dictionary plus dictionary index and no column index, compared to a column with a column index. As expected, the costs for inserting into a column with a column index are approximately twice as high. Figure 6.4(d) shows the resulting L1 and TLB cache misses for inserting into a column with a column index and the respective predictions.


Chapter 7

Partitioned Columns

This chapter introduces a partitioned column organization, consisting of a main and a delta partition, and outlines the merge process. We present an efficient merge algorithm, enabling dictionary encoded in-memory column stores to support the update performance required to run enterprise application workloads on read-optimized databases. The chapter is based on the work presented in [46].

Traditional read-optimized databases often use a compressed column oriented approach to store data [66, 50, 75]. Performing single inserts in such a compressed persistence can be as complex as inserting into a sorted list [28]. One approach to handle updates in a compressed storage is a technique called differential updates, maintaining a write-optimized delta partition that accumulates all data changes. Periodically, this delta partition is combined with the read-optimized main partition. We refer to this process as merge throughout the report, also referred to as checkpointing by others [28]. This process involves uncompressing the compressed main partition, merging the delta and main partitions and recompressing the resulting main partition. In contrast to existing approaches, the complete process is required to be executed during regular system load without downtime. The update performance of such a system is limited by two factors – (a) the insert rate for the write-optimized structure and (b) the speed with which the system can merge the accumulated updates back into the read-optimized partition.
Inserting into the write-optimized structure can be performed fast if the size of the structure is kept small enough. As an additional benefit, this also ensures that the read performance does not degrade significantly. However, keeping this size small implies merging frequently, which increases the overhead of updates.

 


7.1 7. 1

Me Merg rge e Pr Proc oces esss

For supporting efficient updates, we want the supported update rate to be as high as possible. Furthermore, the update rate should also be greater than the minimum sustained update rate required by the specific application scenario. When the number of updates against the system durably exceeds the supported update rate, the system will be rendered incapable of processing new inserts and other queries, leading to failure of the database system. An important performance parameter for a partitioned column is the frequency at which the merging of the partitions must be executed. The frequency of executing the merging of partitions affects the size (number of tuples) of the delta partition. Computing the appropriate size of the delta partition before executing the merge operation is dictated by the following two conflicting choices:

1. Small delta partition   A small delta partition implies a relatively low overhead to the read query, implying a small reduction in the read performance. Furthermore, the insertion into the delta partition will also be fast. This means however that the merging step needs to be executed more frequently, thereby increasing the impact on the system.

2. Large delta partition   A large delta partition implies that the merging is executed less frequently and therefore adds only a little overhead to the system. However, increasing the delta partition size implies a slower read performance due to the fact that the delta partition stores uncompressed values, which consume more compute resources and memory bandwidth, thereby appreciably slowing down read queries (scan, index lookup, etc.).
Also, while comparing values in the main partition with those in the delta partition, we need to look up the dictionary for the main partition to obtain the uncompressed value (forced materialization), thereby adding overhead to the read performance. In our system, we trigger the merging of partitions when the number of tuples N_D in the delta partition is greater than a certain pre-defined fraction of the number of tuples in the main partition N_M.

Figure 7.1 shows an example of a column with its main and delta partitions. Note that the other columns of the table would be stored in a similar fashion. The main partition has a dictionary consisting of its sorted unique values (6 in total). Hence, the encoded values are stored using 3 = ⌈log2 6⌉ bits. The uncompressed values (in gray) are not actually stored, but shown for illustration purposes. The compressed value for a given value is its position in the dictionary, which is also called value-id, stored using the appropriate number of bits (in this case 3 bits). The delta partition stores the uncompressed values themselves. In this example, there are five tuples with the shown uncompressed values. In addition, a Cache Sensitive B+-Tree (CSB+-Tree) [63] containing all the unique uncompressed values is maintained. Each value in the tree also stores a pointer to the list of tuple ids where the value was inserted. For example, the

 


[Figure 7.1: the main partition (compressed) with its sorted main dictionary, the delta partition (uncompressed) with its CSB+ tree and per-value position lists (e.g. "charlie" at positions 1 and 3), and the resulting compressed partition with the merged dictionary after the merge of the two partitions.]

Figure 7.1: Example showing the data structures maintained for each column. The main partition is stored compressed together with the dictionary. The delta partition is stored uncompressed, along with the CSB+ tree. After the merging of the partitions, we obtain the concatenated compressed main column and the updated dictionary.

value "charlie" is inserted at positions 1 and 3. Upon insertion, the value is appended to the delta partition and the CSB+ tree is updated accordingly. After the merging of the partitions has been performed, the main and delta partitions are concatenated and a new dictionary for the concatenated partition is created. Furthermore, the compressed values for each tuple are also updated. For example, the encoded value for "hotel" was 4 before merging and is 6 after merging. Furthermore, it is possible that the number of bits that are required to store the compressed value will increase after merging. Since the number of unique values in this example is increased to 9 after merging, each compressed value is now stored using ⌈log 9⌉ = 4 bits.

7.2 Merging Algorithm

We now describe the merge algorithm in detail and enhance the naïve merge implementation by applying optimizations known from join processing. Furthermore, we will parallelize our implementation and make it architecture-aware to achieve the best possible throughput. For the remainder of the report, we refer to the symbols explained in Table 7.1. We use a compression scheme wherein the unique values for each column are stored in a

 


Description                                   Unit    Symbol
Number of columns in the table                -       N_C
Number of tuples in the main table            -       N_M
Number of tuples in the delta table           -       N_D
Number of tuples in the updated table         -       N'_M
For a given column j; j ∈ [1 . . . N_C]:
Main partition of the j-th column             -       M^j
Merged column                                 -       M'^j
Sorted dictionary of the main partition       -       U^j_M
Sorted dictionary of the delta partition      -       U^j_D
Updated main dictionary                       -       U'^j_M
Delta partition of the j-th column            -       D^j
Uncompressed value-length                     bytes   E^j
Compressed value-length                       bits    E^j_C
Compressed value-length after merge           bits    E'^j_C
Fraction of unique values in delta            -       λ^j_D
Fraction of unique values in main             -       λ^j_M
Merge auxiliary structure for the main        -       X^j_M
Merge auxiliary structure for the delta       -       X^j_D
Memory traffic                                bytes   M_T
Number of available parallel threads          -       N_T

Table 7.1: Symbol definition. Entities annotated with ' represent the merged (updated) entry.

separate dictionary structure consisting of the uncompressed values stored in a sorted order. Hence, |U^j_M| = λ^j_M · N_M, with |X| denoting the number of elements in the set X. By definition, λ^j_M, λ^j_D ∈ [0 . . . 1]. Furthermore, the compressed value stored for a given value is its index in the (sorted) dictionary structure, thereby requiring ⌈log |U^j_M|⌉ bits¹ to store it. Hence, E^j_C = ⌈log |U^j_M|⌉.

Input(s) and Output(s) of the Algorithm: For each column of the table, the merging algorithm combines the main and delta partitions of the column into a single (modified) main partition and creates a new empty delta partition. In addition, the dictionary U^j_M maintained for each column of the main table is also updated to reflect the modified merged column. This also includes modifying the compressed values stored for the various tuples in the merged column. For the j-th column, the input for the merging algorithm consists of M^j, D^j and U^j_M, while

¹ Unless otherwise stated, log refers to logarithm with base 2 (log2).


the output consists of M'^j and U'^j_M. Furthermore, we define the cardinality N'_M of the output and the size of the merged dictionary |U'^j_M| as shown in Equations 7.1 and 7.2.

N'_M = N_M + N_D    (7.1)

|U'^j_M| = |U^j_D ∪ U^j_M|    (7.2)

We perform the merge using the following two steps:

1. Merging Dictionaries: This step consists of the following two sub-steps: a) Extracting the unique values from the delta partition D^j to form the corresponding sorted dictionary, denoted as U^j_D. b) Merging the two sorted dictionaries U^j_M and U^j_D, creating the sorted dictionary U'^j_M without duplicate values.

2. Updating Compressed Values: This step consists of appending the delta partition D^j to the main partition M^j and updating the compressed values for the tuples, based on the new dictionary U'^j_M. Since D^j may have introduced new values, this step requires: a) Computing the new compressed value-length. b) Updating the compressed values for all tuples with the new compressed value, using the index of the corresponding uncompressed value in the new dictionary U'^j_M.

We now describe the above two steps in detail and also compute the order of complexity for each of the steps. As mentioned earlier, the merging algorithm is executed separately for each column of the table.

7.2.1 Merging Dictionaries

The basic outline of step one is similar to the algorithm of a sort-merge-join [54]. However, instead of producing pairs as an output of the equality comparison, the merge will only generate a list of unique values. The merging of the dictionaries is performed in two steps (a) and (b).

Step 1(a)   This step involves building the dictionary for the delta partition D^j. Since we maintain a CSB+ tree to support efficient insertions into D^j, extracting the unique values in a sorted order involves a linear traversal of the leaves of the underlying tree structure [63]. The output of Step 1(a) is a sorted dictionary for the delta partition U^j_D, with a run-time complexity of O(|U^j_D|).

Step 1(b)   This step involves a linear traversal of the two sorted dictionaries U^j_M and U^j_D to produce a sorted dictionary U'^j_M. In line with a usual merge operation, we maintain two pointers, called iterator_M and iterator_D, to point to the values being compared in the two dictionaries U^j_M and U^j_D, respectively. Both are initialized to the start of their respective dictionaries. At each step, we compare the current values being pointed to and append the

 


smaller value to the output. Furthermore, the pointer with the smaller value is also incremented. This process is carried out until the end of one of the dictionaries is reached, after which the remaining dictionary values from the other dictionary are appended to the output dictionary. In case both values are identical, the value is appended to the dictionary once and the pointers for both dictionaries are incremented. The output of Step 1(b) is a sorted dictionary U'^j_M for the merged column, with |U'^j_M| denoting its cardinality. The run-time complexity of this step is O(|U^j_M| + |U^j_D|).

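The two-pointer merge of Step 1(b) can be sketched as follows (illustrative Python, not the report's implementation):

```python
def merge_dictionaries(u_main, u_delta):
    """Step 1(b): merge two sorted, duplicate-free dictionaries into one
    sorted dictionary without duplicate values."""
    merged = []
    i = j = 0  # iterator_M and iterator_D
    while i < len(u_main) and j < len(u_delta):
        if u_main[i] < u_delta[j]:
            merged.append(u_main[i]); i += 1
        elif u_delta[j] < u_main[i]:
            merged.append(u_delta[j]); j += 1
        else:                         # identical values: append once,
            merged.append(u_main[i])  # increment both iterators
            i += 1; j += 1
    merged.extend(u_main[i:])         # append the remainder of whichever
    merged.extend(u_delta[j:])        # dictionary is not yet exhausted
    return merged
```

With the dictionaries of Figure 7.1, merging ["apple", "charlie", "delta", "frank", "hotel", "inbox"] and ["bravo", "charlie", "golf", "young"] yields the nine-value merged dictionary, with "charlie" appended only once.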
7.2.2 Updating Compressed Values

The compressed values are updated in two steps – (a) computing the new compressed value-length and (b) writing the new main partition and updating the compressed values.

Step 2(a)   The new compressed value-length is computed as shown in Equation 7.3. Note that the length for storing the compressed values in the column may have increased from the one used for storing the compressed values before the merging algorithm. Since we use the same length for all the compressed values, this step executes in O(1) time.

E'^j_C = ⌈log(|U'^j_M|)⌉ bits    (7.3)

Step 2(b)   We need to append the delta partition to the main partition and update the compressed values. As far as the main partition M^j is concerned, we use the following methodology. We iterate over the compressed values and for a given compressed value K^i_C, we compute the corresponding uncompressed value K^i by performing a lookup in the dictionary U^j_M. We then search for K^i in the updated dictionary U'^j_M and store the resultant index as the encoded value, using the appropriate number of E'^j_C bits in the output. For the delta partition D^j, we already store the uncompressed values, hence it requires a search in the updated dictionary to compute the index, which is then stored. Since the dictionary is sorted on the values, we use a binary search algorithm to search for a given uncompressed value. The resultant run-time of the algorithm is

O(N_M + (N_M + N_D) · log(|U'^j_M|))    (7.4)
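The binary-search-based re-encoding of the main partition can be sketched as follows (illustrative Python, not the report's implementation); it is this per-tuple log-factor that the auxiliary structures introduced later remove:

```python
import bisect

# Naive Step 2(b) sketch: uncompress each value-id through the old
# dictionary, then binary-search the value in the merged dictionary to
# obtain the new value-id.
def update_main_naive(main_vals, old_dict, merged_dict):
    return [bisect.bisect_left(merged_dict, old_dict[v]) for v in main_vals]
```

With the dictionaries of Figure 7.1, the old main value-ids [4, 2, 3, 2] (hotel, delta, frank, delta) become [6, 3, 4, 3] under the merged dictionary.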

To summarize the above, the total run-time for the merging algorithm is dominated by Step 2(b) and depends heavily on the search run-time. As shown in Section 7.4, this makes the merging algorithm prohibitively slow and infeasible for current configurations of tables. We now present an efficient variant of Step 2(b), which performs the search in linear time at the expense of using an auxiliary data structure. Since merging is performed on every column separately, we expect the overhead of storing the auxiliary structure to be very small, as compared to the total storage, and independent of the number of columns in a table and the number of tables residing in the main memory.

 


[Figure 7.2: Step 1(a) builds the delta dictionary U_D and compresses the delta partition; Step 1(b) merges the dictionaries and fills the auxiliary translation tables X_M and X_D; Step 2(b) uses the old compressed value (e.g. 100, i.e. 4, for "hotel") as an index into the auxiliary structure to obtain the new compressed value (0110, i.e. 6) for the merged partition.]

Figure 7.2: Example showing the various steps executed by our linear-time merging algorithm. The values in the column are similar to those used in Figure 7.1.

7.2.3 Initial Performance Improvements

Based on the previously described naïve algorithm, we significantly increase the merge performance by adding an additional auxiliary data structure per main and per delta partition, denoted as X^j_M and X^j_D respectively. The reasoning for the auxiliary data structure is to provide a translation table with constant access cost during the expensive Step 2(b). We now describe the modified Steps 1(a), 1(b) and 2(b) to linearize the update algorithm of compressed values and improve the overall performance.

Modified Step 1(a)   In addition to computing the sorted dictionary for the delta partition, we also replace the uncompressed values in the delta partition with their respective indices in the dictionary. By this approach, lookup indices for Step 2 are changed to fixed width and allow better utilization of cache lines and CPU architecture aware optimizations like SSE. Since our CSB+ tree structure for the delta partition also maintains a list of tuple ids with each value, we access these values while performing the traversal of the tree leaves and replace them by their newly computed index into the sorted dictionary. Although this involves non-contiguous access of the delta partition, each tuple is only accessed once, hence the runtime is O(N_D). For example, consider Figure 7.2, borrowing the main/delta partition values depicted in Figure 7.1. As shown in Step 1(a), we create the dictionary for the delta partition

 


(with 4 distinct values) and compute the compressed delta partition using 2 bits to store each compressed value.

Modified Step 1(b)   In addition to appending the smaller value (of the two input dictionaries) to U'^j_M, we also maintain the index to which the value is written. This index is used to incrementally map each value from U^j_D and U^j_M to U'^j_M in the selected mapping table X^j_M or X^j_D. If both compared values are equal, the same index will be added to the two mapping tables. At the end of Step 1(b), each entry in X^j_M corresponds to the location of the corresponding uncompressed value of U^j_M in the updated U'^j_M. Similar observations hold true for X^j_D (w.r.t. U^j_D). Since this modification is performed while building the new dictionary and both X^j_M and X^j_D are accessed in a sequential fashion while populating them, the overall run-time of Step 1(b) remains as noted in Equation 7.2 – linear in the sum of the number of entries in the two dictionaries. Step 1(b) in Figure 7.2 depicts the corresponding auxiliary structure for the example in Figure 7.1.
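The modified Step 1(b) can be sketched by extending the dictionary merge to also fill the two translation tables (illustrative Python; the report's implementation writes bit-packed entries):

```python
def merge_with_aux(u_main, u_delta):
    """Merge two sorted dictionaries and record, for every old dictionary
    position, its new position in the merged dictionary (the auxiliary
    tables X_M and X_D of the modified Step 1(b))."""
    merged, x_main, x_delta = [], [], []
    i = j = 0
    while i < len(u_main) or j < len(u_delta):
        from_main = j == len(u_delta) or (i < len(u_main) and u_main[i] <= u_delta[j])
        from_delta = i == len(u_main) or (j < len(u_delta) and u_delta[j] <= u_main[i])
        merged.append(u_main[i] if from_main else u_delta[j])
        if from_main:                       # old main index i now maps to
            x_main.append(len(merged) - 1)  # the position in the merged dict
            i += 1
        if from_delta:                      # equal values update both tables
            x_delta.append(len(merged) - 1)
            j += 1
    return merged, x_main, x_delta
```

For the dictionaries of Figure 7.1 this yields X_M = [0, 2, 3, 4, 6, 7] and X_D = [1, 2, 5, 8], matching the auxiliary structures shown in Figure 7.2.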
Therefore, Therefore, a lookup and binary search search in the original algorithm algorithm descrip description tion is replaced by a lookup in the new algorithm, reducing the run-time to  O (NM   + ND ). To summarize, the modifications described above result in a merging algorithm with overall run-time run-t ime of  (7.5) O (NM  + ND  +  |U jM| + |U jD |) which is linear in terms of the total number of tuples and a significant improvement compared to Equation 7.4. Equation  7.4.
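With the translation tables in place, the modified Step 2(b) is a single array lookup per tuple (illustrative Python; value-ids are plain integers here rather than bit-packed fields):

```python
def update_compressed_values(main_vals, delta_vals, x_main, x_delta):
    """Modified Step 2(b): rewrite each old value-id via one constant-time
    lookup in its translation table, then concatenate main and delta."""
    return [x_main[v] for v in main_vals] + [x_delta[v] for v in delta_vals]
```

Replaying Figure 7.1 (main value-ids [4, 2, 3, 2], compressed delta [0, 1, 2, 1, 3]) with X_M = [0, 2, 3, 4, 6, 7] and X_D = [1, 2, 5, 8] produces the merged partition [6, 3, 4, 3, 1, 2, 5, 2, 8].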

7.3 Merge Implementation

In this section, we describe our optimized merge algorithm on modern CPUs in detail and provide an analytical model highlighting the corresponding compute and memory traffic requirements. We first describe the scalar single-threaded algorithm and later extend it to exploit the multiple cores present on current CPUs.

 


The model serves the following purposes: (1) Computing the efficiency of our implementation. (2) Analyzing the performance and projecting performance for varying input parameters and underlying architectural features like varying core and memory bandwidth.

7.3.1 Scalar Implementation

Based on the modified Step 1(a) (Section 7.2.3), extracting the unique values from D^j involves an in-order traversal of the underlying tree structure. We perform an efficient CSB+ tree traversal using the cache-friendly algorithm described by Rao et al. [63]. The number of elements in each node (of cache-line size) depends on the size of the uncompressed values c.e^j in the j-th column of the delta partition. For example, with c.e^j = 16 bytes, each node consists of a maximum of 3 values. In addition to appending the value to the dictionary during the in-order traversal, we also traverse the list of tuple-ids associated with that value and replace the tuples with the newly computed index into the sorted dictionary. Since the delta partition is not guaranteed to be cache-resident at the start of this step (irrespective of the tree sizes), the run-time of Step 1(a) depends on the available external memory bandwidth. As far as the total amount of data read from the external memory is concerned, the total amount of memory required to store the tree is around 2X the total amount of memory consumed by the values themselves [63]. In addition to traversing the tree, writing the dictionary U^j_D involves fetching the data for write. Therefore, the total amount of bandwidth required for the dictionary computation is around 4 · c.e bytes per value (3 · c.e bytes read and 1 · c.e bytes written) for the column. Updating the tuples involves reading their tuple id and a random access into D^j to update the tuple.
Since each access would read a cache line (B_i bytes wide), the total amount of bandwidth required would be (2 · B_i + 4) bytes per tuple (including the read for the write component). This results in the total required memory traffic for this operation as shown by Equation 7.7. Note that at the end of Step 1(a), D^j also consists of compressed values (based on its own dictionary U^j_D). Applying the modified Step 1(b) (as described in Section 7.2.3), the algorithm iterates over the two dictionaries and produces the output dictionary with the auxiliary structures. As far as the number of operations is concerned, each element appended to the output dictionary involves around 12 ops² [11]. As far as the required memory traffic is concerned, both U^j_M and U^j_D are read sequentially, while U'^j_M, X^j_M and X^j_D are written in a sequential order. Note that the compressed value-length used for each entry in the auxiliary structures is

e'_c = ⌈log(|U'^j_M|)⌉    (7.6)

Hence, the total amount of required read memory traffic can be calculated as shown in Equation 7.8. The required write memory traffic for building the new dictionary and generating the translation tables is calculated as shown in Equation 7.9.

² 1 op implies 1 operation or 1 executed instruction.


M_T = 4 · c.e · |U^j_D| + (2 · B_i + 4) · N_D    (7.7)

M_T = c.e · (|U^j_M| + |U^j_D| + |U'^j_M|) + e'_c · (|X^j_M| + |X^j_D|) / 8    (7.8)

M_T = c.e · |U'^j_M| + e'_c · (|X^j_M| + |X^j_D|) / 8    (7.9)

As shown for the modified Step 2(b) in Section 7.2.3, the algorithm iterates over the compressed values in M^j and D^j to produce the compressed values in the output table M'^j. For each compressed input value, the new compressed value is computed using a lookup into the auxiliary structure X^j_M for the main partition or X^j_D for the delta partition, with an offset equal to the stored compressed value itself. The function shown in Equation 7.10 is executed for each element of the main partition, and similarly for the delta partition.

M'^j[i] ← X^j_M[M^j[i]]    (7.10)

As far as the memory access pattern is concerned, updating each successive element in the main or delta partition may access a random location in the auxiliary data structure (depending on the stored compressed value). Since there may not exist any coherence in the values stored in consecutive locations, each access can potentially access a different cache line (of size B_i bytes). For scenarios where the complete auxiliary structure cannot fit in the on-die caches, the amount of read memory traffic to access the auxiliary data structure is approximated by Equation 7.11.

M_T = B_i · (N_M + N_D)    (7.11)

In additi addition, on, reading reading the main/delt main/deltaa partiti partition on require requiress a read read memory memory traffic as shown shown in Equation   7.12,  while writing out the concatenated output column requires a total memory traffic that can be calculated as in Equation  7.13.  7.13. M T  T    =   c.e C  ·  (  (N NM   + ND ) /8

(7.12)

 (N NM   + ND ) /8 M T  T    = 2e c  (

(7.13)

In case X_M^j (or X_D^j) fits in the on-die caches, their access will be bound by the computation rate of the processor, and only the main/delta partitions will be streamed in and the concatenated table is written (streamed) out. As far as the relative time spent in each of these steps is concerned, Step 1 takes about 33 % of the total merge time (with c.e = 8 bytes and 50 % unique values) and Step 2 takes the remainder. In terms of evidence of compute and bandwidth boundedness, our analytical model defines upper bounds on the performance if the implementation were indeed bandwidth bound (and a different bound if compute bound). Our experimental evaluations show that our performance closely matches the lower of these upper bounds for compute and bandwidth resources, which shows that our performance is bound by the resources as predicted by the model. (Section 7.4 gives more details.)

7.3.2 Exploiting Thread-level Parallelism

We now present the algorithms for exploiting the multiple cores/threads available on modern CPUs. N_T denotes the number of available processing threads.

Parallelization of Step 1

Recalling Step 1(a), we perform an in-order traversal of the CSB+ tree and simultaneously update the tuples in the delta partition with the newly assigned compressed values. There exist two different strategies for parallelization:

1. Dividing the columns within a table amongst the available threads: Since the time spent on each column varies based on the number of unique values, dividing the columns evenly amongst threads may lead to load imbalance between threads. Therefore, we use a task queue based parallelization scheme [2] and enqueue each column as a separate task. If the number of tasks is much larger than the number of threads (as in our case, with only a few tens to hundreds of columns and a few threads), the task queue mechanism of migrating tasks between threads works well in practice to achieve a good load balance.

2. Parallelizing the execution of Step 1(a) on each column amongst the available threads: Since a small portion of the run-time is spent in computing the dictionary, we execute it on a single thread and keep a cumulative count of the number of tuples that need to be updated as we create the dictionary. We parallelize the next phase, where these tuples are evenly divided amongst the threads and each thread scatters the compressed values to the delta partition.

For a table with very few columns, scheme (ii) performs better than scheme (i). We implemented both (i) and (ii), and since our input table consisted of a few tens to hundreds of columns, we achieved similar scaling for both schemes on current CPUs.
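Scheme (i) can be sketched with a shared atomic counter standing in for the task queue (the report uses a work-stealing queue [2]; this simplified variant only illustrates the load-balancing idea, and all names are illustrative):

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Columns are enqueued as tasks; each worker repeatedly pulls the next column
// index from a shared atomic counter until the queue is drained. Fast threads
// simply pop more columns, which balances the load automatically.
struct ColumnTaskQueue {
    std::atomic<std::size_t> next{0};
    std::size_t num_columns;
    explicit ColumnTaskQueue(std::size_t n) : num_columns(n) {}

    // Returns the next column index to process, or num_columns when drained.
    std::size_t pop() {
        std::size_t i = next.fetch_add(1);
        return i < num_columns ? i : num_columns;
    }
};
```

Each worker thread would loop on `pop()` and run Step 1(a) for the returned column until it receives the drained sentinel.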
In Section 7.4.2, we report the results for (i); the results for (ii) would be similar. Step 1(b) involves merging the two sorted dictionaries U_M^j and U_D^j with duplicate removal and simultaneously populating the auxiliary structures X_M^j and X_D^j. There is an inherent sequential dependency in this merge process, and the merge process also has to remove duplicates. Also note that for tables with large fractions of unique values or large value-lengths (≥ 8 bytes) a significant portion

 


of the total run-time is spent in Step 1(b). Therefore, it is imperative to parallelize it well in order to achieve speedups in the overall run-time of the merging algorithm. We now describe our parallelization scheme in detail, which in practice achieves a good load balance. Let us first consider the problem of parallelizing the merging of U_M^j and U_D^j without duplicate removal. In order to evenly distribute the work among the N_T threads it is required to partition both dictionaries into N_T-quantiles. Since both dictionaries are sorted, this can be achieved in N_T log(|U_M^j| + |U_D^j|) steps [19]. Furthermore, we can also compute the indices in the two dictionaries for the i-th thread ∀i ∈ N_T following the same algorithm as presented in [11]. Thus, each thread can compute its start and end indices in the two dictionaries and proceed with the merge operation. In order to handle duplicate removal while merging, we use the following technique consisting of three phases. We additionally maintain an array counter of size (N_T + 1) elements.

Phase 1   Each thread computes its start and end indices in the two dictionaries and writes the merged output, while locally removing duplicates. Since the two dictionaries consisted of unique elements to start with, the only case where this can create duplicate elements is when the last element produced by the previous thread matches the first element produced by the current thread.
This case is checked for by comparing the start elements in the two dictionaries with the previous elements in the respectively other dictionary. In case there is a match, the corresponding pointer is incremented before starting the merge process. Once a thread (say the i-th thread) completes the merge execution, it stores the number of unique elements produced by that thread to the corresponding location in the counter array (i.e. counter[i]). There is an explicit global barrier at the end of phase 1.

Phase 2   In the second phase, we compute the prefix sum of the counter array, so that counter[i] corresponds to the total number of unique values produced by the previous i threads. Additionally, counter[N_T] is the total number of unique values that the merge operation will produce. We parallelize the prefix sum computation using the algorithm by Hillis et al. [30].

Phase 3   The counter array produced at the end of phase 2 also provides the starting index at which a thread should start writing the locally computed merged dictionary. Similar to phase 1, we recompute the start and end indices for the two dictionaries. Now consider the main partition. The range of indices computed by a thread for U_M^j also corresponds to the range of indices for which the thread can populate X_M^j with the new indices for those values. Similar observations hold for the delta partition. Each thread performs the merge operation within the computed range of indices to produce the final merged dictionary and auxiliary data structures.
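A sequential sketch of the Step 1(b) merge with duplicate removal, producing the merged dictionary together with both translation tables, may help; the parallel version additionally partitions both inputs into N_T quantiles and uses the prefix-summed counter array to find each thread's output offset (names and types are illustrative):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Merge two sorted, duplicate-free dictionaries into one, recording for every
// old code its new code in x_main / x_delta (the auxiliary structures X).
void merge_dictionaries(const std::vector<int>& u_main,
                        const std::vector<int>& u_delta,
                        std::vector<int>& merged,
                        std::vector<uint32_t>& x_main,
                        std::vector<uint32_t>& x_delta) {
    x_main.resize(u_main.size());
    x_delta.resize(u_delta.size());
    std::size_t i = 0, j = 0;
    while (i < u_main.size() || j < u_delta.size()) {
        bool take_main = j == u_delta.size() ||
                         (i < u_main.size() && u_main[i] <= u_delta[j]);
        int v = take_main ? u_main[i] : u_delta[j];
        if (merged.empty() || merged.back() != v)
            merged.push_back(v);                       // duplicate removal
        uint32_t code = static_cast<uint32_t>(merged.size() - 1);
        if (i < u_main.size() && u_main[i] == v) x_main[i++] = code;
        if (j < u_delta.size() && u_delta[j] == v) x_delta[j++] = code;
    }
}
```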

Summary   In comparison to the single-threaded implementation, the parallel implementation reads the dictionaries twice and also writes the output dictionary one additional time, thereby increasing the total memory traffic by

c.e · (|U_M^j| + |U_D^j|) + 2 · c.e · |U_M'^j|   (7.14)


[Figure: bar chart of update cost in cycles per tuple, broken down into Update Delta, Merge Step 1 and Merge Step 2, for delta partition sizes from 100K to 8M tuples, each with unoptimized (UnOpt) and optimized (Opt) merge.]
Figure 7.3: Update Costs for Various Delta Partition Sizes with a main partition size of 100 million tuples with 10 % unique values using 8-byte values. Both optimized (Opt) and unoptimized (UnOpt) merge implementations were parallelized.

The overhead of the start and end index computation is also very small as compared to the total computation performed by Step 1(b). The resultant parallel algorithm evenly distributes the total amount of data read/written to each thread, thereby completely exploiting the available memory bandwidth.

Parallelization of Step 2

To parallelize the updating of compressed values, we evenly divide the total number of tuples N_M amongst the available threads. Specifically, each thread is assigned N_M/N_T tuples to operate upon. Since each thread reads/writes from/to independent chunks of the tables, this parallelization approach works well in practice. Note that in case any of X_M^j or X_D^j can completely fit in the on-die caches, this parallelization scheme still exploits the caches to read the new index for each tuple, and the run-time is proportional to the amount of bandwidth required to stream the input and output tables.
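The per-thread work division of Step 2 reduces to computing contiguous index ranges; a minimal sketch (illustrative names, with the remainder tuples spread over the first threads):

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Split n_tuples into n_threads near-equal contiguous chunks, one per thread,
// so every thread streams an independent region of the column.
std::vector<std::pair<std::size_t, std::size_t>>
chunk_ranges(std::size_t n_tuples, std::size_t n_threads) {
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    std::size_t base = n_tuples / n_threads;
    std::size_t rest = n_tuples % n_threads;
    std::size_t begin = 0;
    for (std::size_t t = 0; t < n_threads; ++t) {
        std::size_t len = base + (t < rest ? 1 : 0);  // spread the remainder
        ranges.emplace_back(begin, begin + len);      // [begin, end) for thread t
        begin += len;
    }
    return ranges;
}
```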

7.4 Performance Evaluation

We now evaluate the performance of our algorithm on a dual-socket six-core Intel Xeon X5680 CPU with 2-way SMT per core, each core operating at a frequency of 3.3 GHz. Each socket has 32 GB of DDR (for a total of 64 GB of main memory). The peak external memory bandwidth on each socket is 30 GB/sec. We used SUSE SLES 11 as the operating system, the pthread library and Intel ICC 11.1 as the compiler. As far as the input data is concerned, the number of columns in the partition N_C varies from 20 to 300. The value-length of the uncompressed value c.e for a column is fixed and chosen from 4 bytes to 16 bytes. The number

 


of tuples in the main partition N_M varies from 1 million to 1 billion, while the number of tuples in the delta partition N_D varies from 500,000 to 50 million, with a maximum of around 5 % of N_M. Since the focus of the report is on in-memory databases, the number of columns is chosen so that the overall data completely fits in the available main memory of the CPU. The fraction of unique values λ_M^j and λ_D^j varies from 0.1 % to 100 % to cover the spectrum of scenarios in real applications. For all experiments, the values are generated uniformly at random. We chose uniform value distributions, as this represents the worst possible cache utilization for the values and auxiliary structures. Different value distributions can only improve cache utilization, leading to better merge times. However, differences in merge times due to different value distributions are expected to be very small and are therefore negligible. We first show the impact of varying N_D on the merge performance. We then vary c.e from 4–16 bytes to analyze the effect of varying value-lengths on merge operations. We finally vary the percentage of unique values (λ_M^j, λ_D^j) and the size of the main partition N_M. In order to normalize performance w.r.t. varying input parameters, we introduce the term update cost. Update cost is defined as the amortized time taken per tuple per column (in cycles/tuple), where the total time is the sum of the time taken to update the delta partition T_U and the time to perform the merging of main and delta partitions T_M, while the total number of tuples is N_M + N_D.

7.4.1 Impact of Delta Partition Size

Figure 7.3 shows the update cost for varying numbers of tuples in the delta partition. In addition to the delta partition update cost, we also show the run-times for the unoptimized and optimized Steps 1 and 2 in the graph. N_M is fixed to be 100 million tuples, while N_D is varied from 500,000 (0.5 %) to 8 million (8 %) tuples. λ_M^j and λ_D^j are fixed to be around 10 %. The uncompressed value-length c.e is 8 bytes. We fix the number of columns N_C to 300. Note that the run-times are on parallelized code for both implementations on our dual-socket multi-core system.

As far as the unoptimized merge algorithm is concerned, Step 2 (updating the compressed values) takes up the majority of the run-time and does not change (per tuple) with the varying number of tuples in the delta partition. The optimized Step 2 algorithm drastically reduces the time spent in the merge operation (by 9–10 times) as compared to the unoptimized algorithm. Considering the optimized code, as the delta partition size increases, the percentage of the total time spent on delta updates increases and is 30 % – 55 % of the total time. This signifies that the overhead of merging contributes a relatively small percentage to the run-time, thereby making our scheme of maintaining separate main and delta partitions with the optimized merge an attractive option for performing updates without a significant overhead.

The update rate in tuples/second is computed by dividing the total number of updates by the time taken to perform the delta updates and the merging of the main and delta partitions for the N_C columns. As an example, for N_D = 4 million and N_C = 300, an update cost of

 


[Figure: two bar charts of update cost in cycles per tuple, broken down into Update Delta, Merge Step 1 and Merge Step 2, for value-lengths of 4, 8 and 16 bytes with 1M and 3M delta tuples; panel (a) 1 % unique values, panel (b) 100 % unique values.]
Figure 7.4: Update Costs for Various Value-Lengths for two delta sizes with 100 million tuples in the main partition for 1 % and 100 % unique values.

13.5 cycles per tuple (from Figure 7.3) evaluates to

4,000,000 · 3.3 · 10^9 / (13.5 · 104,000,000 · 300) ≈ 31,350 updates/second.   (7.15)

7.4.2 Impact of Value-Length and Percentage of Unique Values
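The arithmetic of Equation 7.15 can be checked with a one-line helper (illustrative names; the constants are those from the example above):

```cpp
#include <cassert>

// Equation 7.15: N_D updates amortized over processing N_M + N_D tuples across
// N_C columns at the measured update cost (cycles/tuple/column) on a 3.3 GHz core.
double update_rate(double updates, double hz, double cycles_per_tuple,
                   double total_tuples, double columns) {
    return updates * hz / (cycles_per_tuple * total_tuples * columns);
}
```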

Figure 7.4 shows the impact of varying uncompressed value-lengths on the update cost. The uncompressed value-lengths are varied between 4, 8 and 16 bytes. We show two graphs with the percentage of unique values fixed at (a) 1 % and (b) 100 % respectively. N_M is fixed to be 100 million tuples for this experiment, and the breakdown of update cost for N_D equal to 1 million and 3 million tuples is shown. We fix the number of columns N_C to 300.

As the value-length increases, the time taken per tuple to update the delta partition increases and becomes a major contributor to the overall run-time. This time also increases as the size of the delta partition increases. For example, in Figure 7.4(a), for an uncompressed value-length of 16 bytes, the delta update time increases from about 1.0 cycles per tuple for N_D = 1 million to 3.3 cycles per tuple for N_D = 3 million. This time increases as the percentage of unique values increases. The corresponding numbers in Figure 7.4(b) for 100 % unique values are 5.1 cycles for N_D = 1 million and 12.9 cycles for N_D = 3 million.

As far as Step 2 of the merge is concerned, the run-time depends on the percentage of unique values. For 1 % unique values, the auxiliary structures fit in cache. As described in Section 7.3.2, the auxiliary structures being gathered fit in cache and the run-time is bound by the time required to read the input partitions and write the updated partitions. We get a run-time of 1.0 cycles per tuple (around 1.8 cycles per tuple on 1 socket), which is close to the bandwidth bound computed in Section 7.3.2. For 100 % unique values, the auxiliary structures do not fit in cache and must be gathered from memory. The time taken is then

 


around 8.3 cycles (15.0 cycles on 1 socket), which closely matches (within 10 %) the analytical model developed in Section 7.3. The time for Step 2 mainly depends on whether the auxiliary structures can fit in cache and is therefore constant with small increases in the delta size from 1 million to 3 million.

As far as Step 1 is concerned, for a given delta partition size N_D, the time spent in Step 1 increases sub-linearly with the increase in value-length (Section 7.3.1). For a fixed value-length, the time spent increases marginally with the increase in N_D, due to the fact that this increase in partition size only changes the number of unique values by a small amount and hence the compressed value-length also changes slightly, resulting in a small change in the run-time of Step 1. With larger changes in the percentage of unique values from 1 % to 100 %, the run-time increases. For instance, for 8-byte values and a 1 million tuple delta partition, the Step 1 time increases from 0.1 cycles per tuple at 1 % unique values to 3.3 cycles per tuple at 100 % unique values. Finally, the percentage of time spent in updating the tuples as compared to the total time increases both with increasing value-lengths for fixed N_D and with increasing N_D for fixed value-lengths.

7.5 Merge Strategies

An important performance parameter for a system with a merge process as described in the preceding sections is the frequency at which the merging of the partitions is executed. This frequency is given by the size of the delta partition at which a merge process is initiated. The appropriate size is dictated by two conflicting choices: a) a small delta partition means fast insertions into the delta partition and only a small overhead for read queries, but implies a more frequently executed merge process with a higher impact on the system; b) a large delta partition reduces the overhead of the merge process, but implies higher insertion costs in the delta partition and a higher overhead for read queries. In our system, we trigger the merging of partitions when the number of tuples N_D in the delta partition is greater than a certain pre-defined fraction of the number of tuples in the main partition N_M.

In contrast to the previously introduced partitioned column with one main and one delta partition, we extend the concept of a partitioned column and allow an arbitrary number of partitions. A partition can be either a write optimized store (WOS) or a read optimized store (ROS). A WOS is optimized for write access similar to the delta partition, whereas a ROS is optimized for read queries similar to the main partition. In order to further balance this tradeoff between merge costs and query performance, merge strategies can be applied, similar to tradeoffs in the context of index structures between index maintenance costs and query performance [6]. A merge strategy defines which partitions are merged together when a merge process is executed. We now discuss the strategies immediate merge, no merge and logarithmic merge.
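The size-based merge trigger described above is a one-line predicate; a minimal sketch with an assumed fraction parameter:

```cpp
#include <cassert>
#include <cstddef>

// Start a merge once the delta partition exceeds a pre-defined fraction of the
// main partition. The fraction is a tuning knob balancing merge overhead
// against insert and read-query overhead (name is illustrative).
bool should_merge(std::size_t n_delta, std::size_t n_main, double max_fraction) {
    return static_cast<double>(n_delta) >
           max_fraction * static_cast<double>(n_main);
}
```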

 


Immediate Merge   Every write optimized store is merged immediately with the existing read optimized store, hence the strategy is called Immediate Merge. Note that this means that there is exactly one WOS and one ROS at all times and that the one ROS is constantly growing and potentially very large. This results in high costs for the merge process, because for every merge all tuples are touched. For query processing this is an advantage, because only one read and one write optimized storage have to be considered instead of combining the results of multiple storages. The immediate merge strategy can be classified as an extreme strategy, with an unbalanced trade-off between query and merge costs.

No Merge   The counterpart strategy of the immediate merge is called No Merge. In order to compact one WOS, the strategy always creates one new ROS out of the WOS. Therefore, only tuples from the small WOS have to be considered, which makes the merge very fast. However, the result is a growing collection of equally sized ROS, which all have to be queried in order to answer one single query. Although the queries against the storages can be parallelized, the conflation of the results is an additional overhead. The No Merge strategy can be considered an extreme strategy as well, shifting the costs from the merge process to the query processing.

Logarithmic Merge   The idea behind the logarithmic merge strategy is to find a balance between query and merge costs. This is accomplished by allowing multiple ROS to exist, but also merging several storages into one ROS from time to time. In a b-way logarithmic merge strategy each storage is assigned a generation g.
If a new ROS is created exclusively from one WOS, it is assigned g = 0. When b storages with g = x exist, they are merged into one storage with g = x + 1. The number of ROS is bounded by O(log n), where n is the total size of the collection.

With an Immediate Merge strategy the storage maintenance costs grow linearly with every applied merge, whereas the query costs stay low. When a No Merge strategy is applied the contrary image appears, where merge costs stay low and query costs grow linearly. With a 2-way Logarithmic Merge strategy, query and merge costs grow between these two extremes. The merge costs vary depending on how many storages have to be touched: in every 2^x-th case all storages have to be merged into one. The query costs depend on the number of existing storages and are therefore lowest when all storages have been merged into one.

In order to choose a merge strategy, the workload has to be analyzed and characterized. In a second step the resulting costs for each merge strategy can be calculated, and the most promising strategy should be chosen based on an analytical cost model predicting the cost differences for each merge strategy. Building on top of the presented cache-miss-based cost model, our goal is to roughly estimate the ratio between the costs for read queries and write queries. In order to do so, we calculate the number of select statements compared to insert statements and weigh them according to their complexity and estimated selectivity.
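The generation bookkeeping of the b-way logarithmic merge can be sketched as follows; the actual merging of storage contents is elided, and all names are illustrative:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Each storage carries a generation g and a tuple count. A new ROS enters with
// g = 0; whenever b storages of the same generation exist, they are merged into
// one storage of generation g + 1, possibly cascading upward.
struct Storage { std::size_t generation; std::size_t tuples; };

void add_ros(std::vector<Storage>& storages, std::size_t tuples, std::size_t b) {
    Storage s{0, tuples};  // freshly compacted WOS becomes a generation-0 ROS
    while (true) {
        std::size_t same = 0, merged_tuples = s.tuples;
        for (const Storage& o : storages)
            if (o.generation == s.generation) { ++same; merged_tuples += o.tuples; }
        if (same + 1 < b) break;          // fewer than b storages of this generation
        std::vector<Storage> rest;        // drop the merged peers, keep the others
        for (const Storage& o : storages)
            if (o.generation != s.generation) rest.push_back(o);
        storages.swap(rest);
        s = Storage{s.generation + 1, merged_tuples};  // cascade to g + 1
    }
    storages.push_back(s);
}
```

With b = 2, the number of storages after n equal-sized insertions equals the number of one-bits in n, matching the O(log n) bound.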
This results in an abstract ratio q between read and write query costs for the workload. This number is used as an input parameter for the calculated cost functions for each merge strategy.

The actual total costs for a workload depend on several parameters. The costs are added up from the costs for the read queries, the write queries and the performed merges. The costs for the read queries highly depend on the number of queries, their selectivity, their complexity and the size of the tables they are accessing. The costs for the write queries are given by the number of write queries and the number of distinct values in the write optimized storage. The merge costs depend on the number of touched tuples per write and the characteristics of the tables' distinct values (intersection, tail). However, we are not interested in modeling the actual total costs for one merge strategy. Instead, we are interested in modeling only the costs that are influenced by the merge strategy, in order to determine the merge strategy with the lowest total costs. The total costs for one merge strategy comprise the read, write and merge costs. The cost estimations are very similar to the index maintenance case presented in [6].

Chapter 8

Conclusions and Future Work

This last chapter presents the conclusions of this report. We summarize the content of this work and give an outlook on future work.

In this report, we presented a cost model for estimating cache misses for the plan operators scan with equality selection, scan with range selection, positional lookup and insert in a column-oriented in-memory database. We outlined functions estimating the cache misses for columns using different data structures and covered uncompressed columns, bit-packed and dictionary encoded columns with sorted and unsorted dictionaries. We presented detailed cost functions predicting cache misses and TLB misses.

Chapter 3 gave an overview of the discussed system and introduced the considered physical column organizations and plan operators. As expected, uncompressed columns are well suited if fast insertions are required and also support fast single lookups. However, scan performance is slow compared to dictionary encoded columns. Especially scan operators with equality selection profit from scanning only the compressed attribute vector.

Chapter 4 presented an evaluation of parameter influences on the plan operator performance with varying number of rows, number of distinct values, value disorder, value length and value skewness. For dictionary encoded columns using an unsorted dictionary, we identified, besides the number of rows and the value lengths, three additional parameters influencing the performance of scan operations.
The number of distinct values has a strong impact on range and equal scans, and renders unsorted dictionaries unusable for columns with a large number of distinct values and dictionaries larger than the available cache sizes. However, if the disorder in the column is low, the penalties paid for range scans are manageable. Additionally, the skewness of values in a column can influence the performance of range scan operators, although the impact is small unless the distribution is extremely skewed. Regarding single lookup operations, the physical column organizations do not differ largely, except that the lookup in uncompressed columns is slightly faster.

In a nutshell, uncompressed columns seem to be well suited for classical OLTP workloads

 

CHAPTE CHA PTER R 8.

CON CONCLUS CLUSIONS IONS AND FUT FUTURE URE WOR WORK  K 

with a high number of inserts and mainly single lookups. As the number of scan operations and especially range scans increases, the additional insert expenses pay off, rendering dictionary encoded columns suitable for analytical workloads. Considering mixed workloads as in [46, 45], there is no optimal column organization, as the result depends on the concrete query distribution in the workload. Especially in the cases of mixed workloads, where the optimal column organization is unclear, our analytical cost model presented in Chapter 5 and extended for index structures in Chapter 6 allows to roughly estimate the costs and decide for a suited column organization.

Chapter 7 introduced partitioned columns and proposed an optimized online merge algorithm for dictionary encoded in-memory column stores, enabling them to support the update performance required to run enterprise application workloads on read-optimized databases. Additionally, we introduced a memory traffic based cost model for the merge process and proposed merge strategies to further balance the tradeoff between merge costs and query performance.

Possible directions of future work could be the extension of the model for non-uniform memory access systems, or taking more sophisticated algorithms for the discussed plan operations or additional operators into account. For example, the range scan operation on an unsorted dictionary can be implemented by first scanning the dictionary to build a bitmap index and then iterating over the value-ids, performing lookups into the bitmap index instead of accessing the dictionary.
Such Such an algorit alg orithm hm performs performs well well in case case the bitmap bitmap index index fits into into the cache. cache. A simila similarr direct direction ion of  research would be to extend the model for various index structures.
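The bitmap-based range scan described above could be sketched roughly as follows. This is a minimal illustrative sketch, not the report's implementation; the function name, data layout, and example data are all hypothetical:

```python
# Hedged sketch of a bitmap-based range scan on a dictionary-encoded column.
# Assumed layout: `dictionary` maps value-id -> value (unsorted),
# `attribute_vector` holds one value-id per row.

def range_scan_bitmap(attribute_vector, dictionary, low, high):
    # Pass 1: scan the (small) dictionary once and mark every value-id
    # whose value falls inside [low, high].
    in_range = [low <= value <= high for value in dictionary]

    # Pass 2: iterate over the value-ids; each check is now a single
    # bitmap lookup instead of a dictionary access per row. This pays off
    # as long as the bitmap stays cache resident.
    return [row for row, vid in enumerate(attribute_vector) if in_range[vid]]

# Hypothetical column with 8 rows and 4 distinct values:
dictionary = [42, 7, 19, 100]            # value-id -> value (unsorted)
attribute_vector = [0, 1, 3, 2, 1, 0, 2, 3]
print(range_scan_bitmap(attribute_vector, dictionary, 10, 50))
# prints [0, 3, 5, 6] -- the rows whose value (42 or 19) lies in [10, 50]
```

The point of the two-pass structure is that the expensive per-row dictionary access is replaced by a cheap bitmap probe, which is why the approach degrades once the bitmap exceeds the cache size.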


References

[1] D. Abadi, D. Myers, D. DeWitt, and S. Madden. Materialization strategies in a column-oriented DBMS. IEEE 23rd International Conference on Data Engineering (ICDE), pages 466–475, 2007.
[2] U. A. Acar, G. E. Blelloch, and R. D. Blumofe. The data locality of work stealing. In SPAA, pages 1–12, 2000.
[3] V. Babka and P. Tůma. Investigating cache parameters of x86 family processors. Computer Performance Evaluation and Benchmarking, pages 77–96, 2009.
[4] T. Barr, A. Cox, and S. Rixner. Translation caching: skip, don't walk (the page table). ACM SIGARCH Computer Architecture News, 38(3):48–59, 2010.
[5] J. Bernet. Dictionary Compression for a Scan-Based, Main-Memory Database System. PhD thesis, ETH Zurich, Apr. 2010.
[6] S. Blum. A generic merge-based dynamic indexing framework for iMeMex. Master's thesis, ETH Zurich, 2008.
[7] P. Boncz, S. Manegold, and M. L. Kersten. Database architecture optimized for the new bottleneck: memory access. page 12, Nov. 1999.
[8] P. Boncz and M. Zukowski. MonetDB/X100: hyper-pipelining query execution. Proc. CIDR, 2005.
[9] P. A. Boncz, M. L. Kersten, and S. Manegold. Breaking the memory wall in MonetDB. Communications of the ACM, 51(12):77, Dec. 2008.
[10] P. A. Boncz. Monet: A Next-Generation DBMS Kernel For Query-Intensive Applications. PhD thesis, CWI Amsterdam, 2002.
[11] J. Chhugani, A. D. Nguyen, V. W. Lee, W. Macy, M. Hagog, Y.-K. Chen, A. Baransi, S. Kumar, and P. Dubey. Efficient implementation of sorting on multi-core SIMD CPU architecture. In VLDB, pages 1313–1324, 2008.
[12] E. Codd. A relational model of data for large shared data banks. Communications of the ACM, 13(6), 1970.
[13] D. Comer. The Ubiquitous B-Tree. CSUR, 1979.
[14] C. Cranor, T. Johnson, O. Spataschek, and V. Shkapenyuk. Gigascope: a stream database for network applications. SIGMOD, 2003.
[15] V. Cuppu, B. Jacob, B. Davis, and T. Mudge. A performance comparison of contemporary DRAM architectures. Proceedings of the 26th annual international symposium on …, 1999.
[16] V. Cuppu, B. Jacob, B. Davis, and T. Mudge. High-performance DRAMs in workstation environments. IEEE Transactions on Computers, 50(11):1133–1153, 2001.
[17] J. Dongarra, K. London, S. Moore, P. Mucci, and D. Terpstra. Using PAPI for hardware performance monitoring on Linux systems. Conference on Linux Clusters: The HPC Revolution, 2001.
[18] U. Drepper. What every programmer should know about memory. http://people.redhat.com/drepper/cpumemory.pdf, 2007.
[19] R. S. Francis and I. D. Mathieson. A benchmark parallel sort for shared memory multiprocessors. IEEE Trans. Computers, 37(12):1619–1626, 1988.
[20] C. D. French. "One size fits all" database architectures do not work for DSS. SIGMOD, 1995.
[21] G. Graefe. Volcano — an extensible and parallel query evaluation system. IEEE Transactions on Knowledge and Data Engineering, 1994.
[22] G. Graefe. Sorting and indexing with partitioned B-trees. Proc. of the 1st Int'l Conference on Innovative Data Systems Research (CIDR), Asilomar, CA, USA, 2003.
[23] G. Graefe and H. Kuno. Adaptive indexing for relational keys. Data Engineering Workshops (ICDEW), 2010 IEEE 26th International Conference on, pages 69–74, 2010.
[24] G. Graefe and H. Kuno. Self-selecting, self-tuning, incrementally optimized indexes. Proceedings of the 13th International Conference on Extending Database Technology, pages 371–381, 2010.
[25] M. Grund et al. HYRISE — a main memory hybrid storage engine. VLDB, 2010.
[26] M. Grund et al. Optimal query operator materialization strategy for hybrid databases. DBKDA, 2010.
[27] B. He, Y. Li, Q. Luo, and D. Yang. EaseDB: a cache-oblivious in-memory query processor. SIGMOD, 2007.
[28] S. Héman, M. Zukowski, N. J. Nes, L. Sidirourgos, and P. A. Boncz. Positional update handling in column stores. SIGMOD, 2010.
[29] J. Hennessy and D. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 2003.
[30] W. D. Hillis and G. L. Steele, Jr. Data parallel algorithms. Commun. ACM, 29(12):1170–1183, 1986.
[31] C. Hristea, D. Lenoski, and J. Keen. Measuring memory hierarchy performance of cache-coherent multiprocessors using micro benchmarks. ACM/IEEE Supercomputing Conference, 1997.
[32] F. Hübner, J. Böse, and J. Krüger. A cost-aware strategy for merging differential stores in column-oriented in-memory DBMS. BIRTE, 2011.
[33] S. Idreos, M. Kersten, and S. Manegold. Database cracking. Proceedings of CIDR, 2007.
[34] S. Idreos, M. Kersten, and S. Manegold. Self-organizing tuple reconstruction in column-stores. SIGMOD Conference, pages 297–308, 2009.
[35] S. Idreos, S. Manegold, H. Kuno, and G. Graefe. Merging what's cracked, cracking what's merged: adaptive indexing in main-memory column-stores. VLDB, 2011.
[36] Intel Inc. TLBs, Paging-Structure Caches, and Their Invalidation, 2008.
[37] Intel Inc. Intel 64 and IA-32 Architectures Optimization Reference Manual, 2011.
[38] R. Kallman, H. Kimura, J. Natkins, A. Pavlo, A. Rasin, S. Zdonik, E. Jones, S. Madden, M. Stonebraker, and Y. Zhang. H-store: a high-performance, distributed main memory transaction processing system. Proceedings of the VLDB Endowment, 1(2):1496–1499, 2008.
[39] B. Kao and H. Garcia-Molina. An Overview of Real-Time Database Systems. Prentice-Hall, Inc, 1995.
[40] A. Kemper and T. Neumann. HyPer: a hybrid OLTP&OLAP main memory database system based on virtual memory snapshots. ICDE, 2011.
[41] A. Kemper, T. Neumann, F. Funke, V. Leis, and H. Muehe. HyPer: adapting columnar main-memory data management for transactional AND query processing. Bulletin of the Technical Committee on Data Engineering, 2012.
[42] K. Kim, J. Shim, and I.-h. Lee. Cache conscious trees: how do they perform on contemporary commodity microprocessors? ICCSA'07, 2007.
[43] D. E. Knuth. Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley Professional, 1973.
[44] J. Krueger, M. Grund, C. Tinnefeld, H. Plattner, A. Zeier, and F. Faerber. Optimizing write performance for read optimized databases. Database Systems for Advanced Applications, pages 291–305, 2010.
[45] J. Krueger, C. Tinnefeld, M. Grund, A. Zeier, and H. Plattner. A case for online mixed workload processing. DBTest, 2010.
[46] J. Krüger, K. Changkyu, M. Grund, N. Satish, D. Schwalb, and C. Jatin. Fast updates on read optimized databases using multi core CPUs. VLDB, 2011.
[47] I. Lee, S. Lee, and J. Shim. Making T-trees cache conscious on commodity microprocessors. Journal of Information Science and Engineering, 27:143–161, 2011.
[48] T. J. Lehman and M. J. Carey. A study of index structures for main memory database management systems. VLDB, 1986.
[49] S. Listgarten. Modelling costs for a MM-DBMS. RTDB, 1996.
[50] R. MacNicol and B. French. Sybase IQ Multiplex — designed for analytics. VLDB, 2004.
[51] S. Manegold, P. A. Boncz, and M. L. Kersten. Generic database cost models for hierarchical memory systems. VLDB, 2002.
[52] S. Manegold and M. Kersten. Generic database cost models for hierarchical memory systems. Proceedings of the 28th International Conference on Very Large Data Bases, pages 191–202, 2002.
[53] A. Mazreah, M. Sahebi, M. Manzuri, and S. Hosseini. A novel zero-aware four-transistor SRAM cell for high density and low power cache application. In Advanced Computer Theory and Engineering, 2008. ICACTE '08. International Conference on, pages 571–575, 2008.
[54] P. Mishra and M. H. Eich. Join processing in relational databases. CSUR, 1992.
[55] G. Moore. Cramming more components onto integrated circuits. Electronics, 38:114 ff., 1965.
[56] J. Müller. A Real-Time In-Memory Discovery Service. PhD thesis, Hasso Plattner Institute, 2012.
[57] H. Pirk. Cache conscious data layouting for in-memory databases. Master's thesis, Humboldt-Universität zu Berlin, 2010.
[58] H. Plattner. A common database approach for OLTP and OLAP using an in-memory column database. ACM Sigmod Records, 2009.
[59] H. Plattner and A. Zeier. In-Memory Data Management: An Inflection Point for Enterprise Applications. Springer, 2011.
[60] T. Puzak, A. Hartstein, P. Emma, V. Srinivasan, and A. Nadus. Analyzing the cost of a cache miss using pipeline spectroscopy. Journal of Instruction-Level Parallelism, 10:1–33, 2008.
[61] R. Ramakrishnan and J. Gehrke. Database Management Systems. McGraw-Hill Science/Engineering/Math, 3rd edition, 1999.
[62] J. Rao and K. A. Ross. Cache conscious indexing for decision-support in main memory. VLDB, 1999.
[63] J. Rao and K. A. Ross. Making B+-trees cache conscious in main memory. In SIGMOD, pages 475–486, 2000.
[64] R. Saavedra and A. Smith. Measuring cache and TLB performance and their effect on benchmark runtimes. IEEE Transactions on Computers, 1995.
[65] M. Sleiman, L. Lipsky, and K. Konwar. Performance modeling of hierarchical memories. CAINE, 2006.
[66] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. J. O'Neil, P. E. O'Neil, A. Rasin, N. Tran, and S. B. Zdonik. C-Store: a column-oriented DBMS. VLDB, 2005.
[67] M. Stonebraker and U. Cetintemel. "One size fits all": an idea whose time has come and gone. ICDE, 2005.
[68] J. A. Storer. Data Compression: Methods and Theory. Computer Science Press, 1988.
[69] C. T. Team. In-memory data management for consumer transactions: the TimesTen approach. SIGMOD, 1999.
[70] K. Whang. Query optimization in a memory-resident domain relational calculus database system. TODS, 1990.
[71] T. Willhalm, N. Popovici, Y. Boshmaf, H. Plattner, A. Zeier, and J. Schaffner. SIMD-scan: ultra fast in-memory table scan using on-chip vector processing units. Proceedings of the VLDB Endowment, 2(1):385–394, 2009.
[72] K. Wu, E. J. Otoo, and A. Shoshani. Optimizing bitmap indices with efficient compression. TODS, 2006.
[73] N. Zhang, A. Mukherjee, T. Tao, T. Bell, R. VijayaSatya, and D. Adjeroh. A flexible compressed text retrieval system using a modified LZW algorithm. Data Compression Conference, 2005.
[74] M. Zukowski. Balancing Vectorized Query Execution with Bandwidth-Optimized Storage. PhD thesis, CWI Amsterdam, 2009.
[75] M. Zukowski, P. Boncz, N. Nes, and S. Héman. MonetDB/X100 — a DBMS in the CPU cache. IEEE Data Engineering Bulletin, 2005.

 

 

 

 

 

Aktuelle Technische Berichte des Hasso-Plattner-Instituts

Band | ISBN | Titel | Autoren / Redaktion
66 | 978-3-86956-227-8 | Model-Driven Engineering of Adaptation Engines for Self-Adaptive Software | Thomas Vogel, Holger Giese
65 | 978-3-86956-226-1 | Scalable Compatibility for Embedded Real-Time Components via Language Progressive Timed Automata | Stefan Neumann, Holger Giese
64 | 978-3-86956-217-9 | Cyber-Physical Systems with Dynamic Structure: Towards Modeling and Verification of Inductive Invariants | Basil Becker, Holger Giese
63 | 978-3-86956-204-9 | Theories and Intricacies of Information Security Problems | Anne V. D. M. Kayem, Christoph Meinel (Eds.)
62 | 978-3-86956-212-4 | Covering or Complete? Discovering Conditional Inclusion Dependencies | Jana Bauckmann, Ziawasch Abedjan, Ulf Leser, Heiko Müller, Felix Naumann
61 | 978-3-86956-194-3 | Vierter Deutscher IPv6 Gipfel 2011 | Christoph Meinel, Harald Sack (Hrsg.)
60 | 978-3-86956-201-8 | Understanding Cryptic Schemata in Large Extract-Transform-Load Systems | Alexander Albrecht, Felix Naumann
59 | 978-3-86956-193-6 | The JCop Language Specification | Malte Appeltauer, Robert Hirschfeld
58 | 978-3-86956-192-9 | MDE Settings in SAP: A Descriptive Field Study | Regina Hebig, Holger Giese
57 | 978-3-86956-191-2 | Industrial Case Study on the Integration of SysML and AUTOSAR with Triple Graph Grammars | Holger Giese, Stephan Hildebrandt, Stefan Neumann, Sebastian Wätzoldt
56 | 978-3-86956-171-4 | Quantitative Modeling and Analysis of Service-Oriented Real-Time Systems using Interval Probabilistic Timed Automata | Christian Krause, Holger Giese
55 | 978-3-86956-169-1 | Proceedings of the 4th Many-core Applications Research Community (MARC) Symposium | Peter Tröger, Andreas Polze (Eds.)
54 | 978-3-86956-158-5 | An Abstraction for Version Control Systems | Matthias Kleine, Robert Hirschfeld, Gilad Bracha
53 | 978-3-86956-160-8 | Web-based Development in the Lively Kernel | Jens Lincke, Robert Hirschfeld (Eds.)
52 | 978-3-86956-156-1 | Einführung von IPv6 in Unternehmensnetzen: Ein Leitfaden | Wilhelm Boeddinghaus, Christoph Meinel, Harald Sack
51 | 978-3-86956-148-6 | Advancing the Discovery of Unique Column Combinations | Ziawasch Abedjan, Felix Naumann
50 | 978-3-86956-144-8 | Data in Business Processes | Andreas Meyer, Sergey Smirnov, Mathias Weske
49 | 978-3-86956-143-1 | Adaptive Windows for Duplicate Detection | Uwe Draisbach, Felix Naumann, Sascha Szott, Oliver Wonneberg

ISBN 978-3-86956-228-5 ISSN 1613-5652
