Information about All About Bitmap Indexes... And Sorting Them

A review of bitmap index from an academic perspective. Several theoretical results are presented. The talk also discuss technical issues regarding sorting the tables prior to indexing, as a way to improve the indexes.

Much of the talk is based on the following preprint:

Daniel Lemire, Owen Kaser, Kamel Aouiche, Sorting improves word-aligned bitmap indexes.

http://arxiv.org/abs/0901.3751

Much of the talk is based on the following preprint:

Daniel Lemire, Owen Kaser, Kamel Aouiche, Sorting improves word-aligned bitmap indexes.

http://arxiv.org/abs/0901.3751

Database Indexes Databases use precomputed indexes (auxiliary data structures) to speed processing. An index costs memory, can hurt update speed. Improving indexes is practically important. Daniel Lemire All About Bitmap Indexes. . . And Sorting Them

What make indexes fast? We are going to use these three ideas: Expect speciﬁc queries? Avoid a full scan! Data is not random? Compress it! A speciﬁc computer architecture? taylor your code for it! Daniel Lemire All About Bitmap Indexes. . . And Sorting Them

Bitmap indexes Bitmap indexes have a long SELECT * FROM history. (1972 at IBM.) T WHERE x=a Long history with DW & OLAP. AND y=b; (Sybase IQ since mid 1990s). Main competition: B-trees. Above, compute {r | r is the row id of a row where x = a} ∩ {r | r is the row id of a row where y = b} Daniel Lemire All About Bitmap Indexes. . . And Sorting Them

Bitmaps and fast AND/OR operations Computing the union of two sets of integers between 1 and 64 (eg row ids, trivial table). . . E.g., {1, 5, 8} ∪ {1, 3, 5}? Can be done in one operation by a CPU: BitwiseOR( 10001001, 10101000) Extend to sets from 1..N using N/64 operations. To compute [a0 , . . . , aN−1 ] ∨ [b0 , b1 , . . . , bN−1 ] : a0 , . . . , a63 BitwiseOR b0 , . . . , b63 ; a64 , . . . , a127 BitwiseOR b64 , . . . , b127 ; a128 , . . . , a192 BitwiseOR b128 , . . . , b192 ; ... Daniel Lemire All About Bitmap Indexes. . . And Sorting Them

Common applications of the bitmaps The Java language has had a bitmap class since the beginning: java.util.BitSet. (Sun’s implementation is based on 64-bit words.) Search engines use bitmaps to ﬁlter queries, e.g. Apache Lucene Daniel Lemire All About Bitmap Indexes. . . And Sorting Them

Bitmap compression A column with n rows and L distinct column index bitmaps values ⇒ nL bits x=3 x=1 x=2 x E.g., n = 106 , L = 104 → 10 Gbits 1 1 0 0 Uncompressed bitmaps are often 3 1 0 0 impractical n 1 1 0 0 Moreover, bitmaps often contain long 2 1 0 0 streams of zeroes. . . ... ... ... ... Logical operations over these zeroes is a L waste of CPU cycles. Daniel Lemire All About Bitmap Indexes. . . And Sorting Them

How to compress bitmaps? Must handle long streams of zeroes eﬃciently ⇒ Run-length encoding? (RLE) Bitmap: a run of 0s, a run of 1s, a run of 0s, a run of 1s, . . . So just encode the run lengths, e.g., 0001111100010111 → 3, 5, 3, 1,1,3 Daniel Lemire All About Bitmap Indexes. . . And Sorting Them

Compressing better with delta codes RLE can make things worse. E.g., Use 8-bit counters, then 11 may become 000000101. How many bits to use for the counters? Universal coding like delta codes use no more than c log x bits to represent value x. Recall Gamma codes: 0 is 0, 1 is 1, 01 is 2, 001 is 3, 0001 is 4, etc. Delta codes build on Gamma codes. Has two steps: x = 2N + (x mod 2N ). Write N − 1 as gamma code; write x mod 2N as an N − 1-bit number. E.g. 17 = 24 + 1, 0010001 Daniel Lemire All About Bitmap Indexes. . . And Sorting Them

RLE with delta codes is pretty good In some (weak) sense, RLE compression with delta codes is optimal! Theorem A bitmap index over an N-value column of length n, compressed with RLE and delta codes, uses O(n log N) bits. Daniel Lemire All About Bitmap Indexes. . . And Sorting Them

Is the compression rate what matters? There is endless debate about whether more compression is better: Solid-State Drives (SSD) have 10× the bandwidth? All problems are CPU-bound! Multi-core CPUs? All problems I/O-bound! Store your indexes in RAM? All problems are CPU-bound! ... No deﬁnitive answer on whether more compression is better. It depends! Daniel Lemire All About Bitmap Indexes. . . And Sorting Them

Byte/Word-aligned RLE RLE variants can focus on runs that align with machine-word boundaries. Trade compression for speed. That is what Oracle is doing. Variants: BBC (byte aligned), WAH Our EWAH extends Wu et al.’s (was known to Wu as WBC) word-aligned hybrid. 0101000000000000 000. . . 000 000. . . 000 0011111111111100 . . . ⇒ dirty word, run of 2 “clean 0” words, dirty word. . . Daniel Lemire All About Bitmap Indexes. . . And Sorting Them

Computational and storage bounds n → number of rows, c → number of 1s per row; Model storage cost as #(dirty words) + #(clean words, 0x00) Storage is in O(nc); Bounds do not depend on the number of bitmaps. (Assuming O(n) bitmaps). Construction time is proportional to index size. (Data is written sequentially on disk.) Implementation scales to millions of bitmaps. Daniel Lemire All About Bitmap Indexes. . . And Sorting Them

What about other compression types? Why not compress using other techniques (Huﬀman, LZ77, Arithmetic Coding, . . . )? With RLE-like compression we have B1 ∨ B2 or B1 ∧ B2 in time O(|B1 | + |B2 |). We don’t know how to do this using the other compression techniques! Hence, with RLE, compress saves both storage and CPU cycles!!!! Daniel Lemire All About Bitmap Indexes. . . And Sorting Them

What happens when you have many bitmaps? Consider B1 ∨ B2 ∨ . . . ∨ BN . First compute the ﬁrst two : B1 ∨ B2 in time O(|B1 | + |B2 |). |B3 ∨ B4 | is in O(|B3 | + |B4 |). Thus (B1 ∨ B2 ) ∨ (B3 ∨ B4 ) takes O(2 |Bi |). . . i N i=1 |Bi | log N) Total is in O( [Lemire et al., 2009]. Daniel Lemire All About Bitmap Indexes. . . And Sorting Them

Improving compression by sorting the table RLE, BBC, WAH, EWAH are order-sensitive: they compress sorted tables better; But ﬁnding the best row ordering is NP-hard [Lemire et al., 2009]. Lexicographic row sorting is fast, even for very large tables. easy: sort is a Unix staple. Substantial index-size reductions (often 2.5 times) Daniel Lemire All About Bitmap Indexes. . . And Sorting Them

Improving compression via k-of-N encoding With L bitmaps, you can represent L values by mapping each value to one bitmap; Alternatively, you can represent L 1-of-N value 2-of-N 2 = L(L − 1)/2 values by mapping each 100000 cat 1100 value to a pair of bitmaps; 010000 dog 1010 L More generally, you can represent k values 001000 dish 1001 by mapping each value to a k-tuple of 000100 ﬁsh 0110 bitmaps; 000010 cow 0101 At query time, you need to load k bitmaps 100000 cat 1100 in a look-up for one value; 000001 pony 0011 You trade query-time performance for fewer bitmaps; Often, fewer bitmaps translates into a smaller index, created faster. Daniel Lemire All About Bitmap Indexes. . . And Sorting Them

Encode then sort? Or vice versa? Two diﬀerent conceptual approaches: Encode attributes in table, obtaining an uncompressed index 1 Sort the index rows Compress each column Sort the table rows 2 Encode attributes in table, build compressed index on-the-ﬂy. paint maker 1 1 0 1 0 1 0 1 1 1 0 1 red ford ⇒ ⇒ 1 0 1 0 1 1 1 0 1 0 1 1 blue honda 0 1 1 1 0 1 1 1 0 1 0 1 green ford ... ... ... ... ... ... paint maker paint maker 1 0 1 0 1 1 red ford blue honda ⇒ ⇒ 1 1 0 1 0 1 blue honda green ford 0 1 1 1 0 1 green ford red ford ... ... ... ... ... ... Daniel Lemire All About Bitmap Indexes. . . And Sorting Them

Gray-code order Gray-code (GC) order is an Gray-code Lex. order alternative to lexicographical 011 order (deﬁned only for bit 011 011 arrays); 011 101 110 May improve compression more 101 110 than lex. sort (k > 1); 110 111 [Pinar et al., 2005] process an 110 111 uncompressed bitmap index. 111 111 Slow, if uncompressed index 111 101 does not ﬁt in RAM. 111 101 GC order is not supported by DBMSes or Unix utilities. Daniel Lemire All About Bitmap Indexes. . . And Sorting Them

Gray-code sorting, cheaply Size improvement is small (usually < 4%), but it’s essentially free: What Pinar et al. do: expensive GC sort after encoding 1 eg: [Tax, Cat, Girl, Cat] → sort([1100, 0110, 1001, 0110]); Instead, sort the table lexicographically—comparing values 2 alphabetically or by frequency (easy); eg: [Tax, Cat, Girl, Cat] → [Cat, Cat, Girl, Tax] Map ordered values to k-tuples of bitmaps ordered as Gray 3 codes: Cat: 0011, Dog: 0110, Girl: 0101, Tax: 1100; Lex ascending sequence: Cat, Dog, Girl, Tax. GC ascending sequence: 0011, 0110, 0101, 1100 for codes eg: [Cat, Cat, Girl, Tax] → [0011, 0011, 0101, 1100] (generates a GC-sorted result without expensive GC sorting). Easily extended for > 1 columns. 4 In our tests, this is as good as a Gray-code bitmap index sort [Pinar et al., 2005], but technically much easier. Daniel Lemire All About Bitmap Indexes. . . And Sorting Them

What about “other” Gray-codes? Deﬁne Gray-code to be a way to list all bitvectors while minimizing Hamming distances [Knuth, 2005, § 7.2.1.1] There are other alternatives [Goddyn and Gvozdjak, 2003, Savage and Winkler, 1995]. Our tests suggest traditional Gray codes are best. Daniel Lemire All About Bitmap Indexes. . . And Sorting Them

Test data sets Previous studies used data sets where the uncompressed index would ﬁt in RAM. Do their results apply to more realistic data sets? Our tests: Mix of real and synthetic data, up to 877 M rows, 22 GB, 4 M attribute values. using 4–10 columns/dimensions Daniel Lemire All About Bitmap Indexes. . . And Sorting Them

When sorting, column order matters The ﬁrst column(s) gain Compression on TWEED-4d more from the sort (column 1 is primary sort 1 Gray Random-sort key); 0.8 Its bitmaps (ﬁrst 11 in example) are compressed 0.6 1-C/N well, compared to a 0.4 “randomsort”. (Red above green) 0.2 Least important column’s 0 11 18 43 49 bitmaps (43–49) don’t rang des bitmaps gain much (red vs green) Daniel Lemire All About Bitmap Indexes. . . And Sorting Them

When sorting, column order matters Netﬂix: 24 column orderings 5.5e+08 1-of-N encoding Conceptually, we may 4-of-N encoding 5e+08 4.5e+08 wish to reorder columns, 4e+08 index size eg swap columns 1 & 3. 3.5e+08 Column order is crucial 3e+08 2.5e+08 (to successful sorting). 2e+08 Finding the best ordering 1.5e+08 1e+08 quickly remains open. 432 4311 4232 4211 4133 4122 3423 3411 3242 3211 3144 3122 2434 2411 2343 2311 2144 2133 1434 1422 1343 1322 1244 1233 4 column permutation Daniel Lemire All About Bitmap Indexes. . . And Sorting Them

Progress toward choosing column order Paper models “gain” of putting a given column ﬁrst. Idea: order columns greedily (by max gain). Experimentally, this approach is not promising: the best orderings don’t seem to depend on gain. Factors: skews of columns number of distinct values k density of column’s bitmaps Daniel Lemire All About Bitmap Indexes. . . And Sorting Them

What usually works for dimension ordering?: k=1 For 1-of-N bitmaps, a density-based approach was okay: Ordering rule, k = 1 : “sparse but not too sparse” Order columns by decreasing 1 1 − 1/ni min , , where ni 4w − 1 50 100 150 200 250 300 distinct values in column ni → the number of distinct values in column i, w → the word size. See 30–40% size reduction, merely knowing dimension sizes (ni ). Daniel Lemire All About Bitmap Indexes. . . And Sorting Them

What usually works for dimension ordering?: k > 1 √ Density formula (ni → k ni ) recommends poorly when k > 1. Our experiments on synthetic data give some guidance: When k > 1, order columns by descending skew 1 descending size 2 (And do the reverse when k = 1.) Open issues, k > 1 How do we balance skew & size factors? 1 What other properties of the histograms are needed? 2 Daniel Lemire All About Bitmap Indexes. . . And Sorting Them

Bitmap-by-bitmap reordering One might instead make the index, reorder its columns, then apply GC sort [Canahuate et al., 2006]. Our best implementation of this is ≈ 100 times slower, cannot handle larger data sets. We tried several bitmap orders on DBGEN and Census. Out of 8 cases, only one gained, and only by 3%. Canahaute suggests ordering does not matter much, but we see factor-of-2 diﬀerences (??) Seems suﬃcient (and much faster) to work with groups of bitmaps (reorder attributes, not bitmaps) Daniel Lemire All About Bitmap Indexes. . . And Sorting Them

Index size versus block-wise sorting Netﬂix Instead of fully sorting the 900 k=1 table, we sorted it k=2 800 k=3 block-wise; k=4 700 taille de l’index (Mo) Fewer blocks means a 600 500 more complete sort; 400 Larger k means smaller 300 index (in this case); 200 Index size diminishes 100 0 100 200 300 400 500 600 700 drastically with sorting. # de blocs Daniel Lemire All About Bitmap Indexes. . . And Sorting Them

How do 64-bit words compare to 32-bit words? We implemented EWAH using 16-bit, 32-bit and 64-bit words; Only 32-bit and 64-bit are eﬃcient; 64-bit indexes are nearly twice as large; 64-bit indexes are between 5%-40% faster (despite higher I/O costs). Daniel Lemire All About Bitmap Indexes. . . And Sorting Them

Open Source Software? Lemur Bitmap Index C++ Library: http://code.google.com/p/lemurbitmapindex/. JavaEWAH: A compressed alternative to the Java BitSet class http://code.google.com/p/javaewah/. Daniel Lemire All About Bitmap Indexes. . . And Sorting Them

Future direction? Need better mathematical modelling of bitmap compressed size in sorted tables; Daniel Lemire All About Bitmap Indexes. . . And Sorting Them

Questions? ? Daniel Lemire All About Bitmap Indexes. . . And Sorting Them

Canahuate, G., Ferhatosmanoglu, H., and Pinar, A. (2006). Improving bitmap index compression by data reorganization. http: //hpcrd.lbl.gov/~apinar/papers/TKDE06.pdf (checked 2008-12-15). Goddyn, L. and Gvozdjak, P. (2003). Binary gray codes with long bit runs. Electronic Journal of Combinatorics, 10(R27):1–10. Knuth, D. E. (2005). The Art of Computer Programming, volume 4, chapter fascicle 2. Addison Wesley. Lemire, D., Kaser, O., and Aouiche, K. (2009). Sorting improves word-aligned bitmap indexes. available from http://arxiv.org/abs/0901.3751. Pinar, A., Tao, T., and Ferhatosmanoglu, H. (2005). Daniel Lemire All About Bitmap Indexes. . . And Sorting Them

Compressing bitmap indices by data reorganization. In ICDE’05, pages 310–321. Savage, C. and Winkler, P. (1995). Monotone gray codes and the middle levels problem. Journal of Combinatorial Theory, A, 70(2):230–248. Wu, K., Otoo, E. J., and Shoshani, A. (2006). Optimizing bitmap indices with eﬃcient compression. ACM Transactions on Database Systems, 31(1):1–38. Daniel Lemire All About Bitmap Indexes. . . And Sorting Them

Bitmap indexes SELECT*FROM TWHEREx=a ANDy=b; Bitmap indexes have a long history. (1972 at IBM.) Long history with DW & OLAP. (Sybase IQ since mid 1990s).

Read more

... many of them are ... Software can compress each bitmap in a bitmap index to ... the total number of distinct indexes to satisfy all ...

Read more

View 492 Bitmap posts, presentations, experts, and more. ... Articles, experts, jobs, and more: get all the professional insights you need on LinkedIn.

Read more

The mythical bitmap index. ... that WAH is the best of all kinds of bitmap indices? ... slideshare.net/lemire/all-about-bitmap-indexes-and-sorting-them.

Read more

View 366753 Sorting posts, presentations, ... get all the professional insights you need on LinkedIn. ... All About Bitmap Indexes... And Sorting Them ...

Read more

... 00:00:00.00 SQL> create bitmap index idx_emp ... I am sorting them by ... into test_bitmap_index select 2010,2,rownum+10 from all_objects ...

Read more

The Secrets of Oracle Bitmap Indexes ... the overhead on maintaining them is enormous. A modification to a bitmap index ... all _objects ...

Read more

You also may see these indexes, or want to use them from time to time. ... bitmap indexes. ... , from ones that index across all the partitions ...

Read more

Social graphs using bitsets. ... graph in high-performing persistent compressed bitmap ... net/lemire/all-about-bitmap-indexes-and-sorting-them.

Read more

## Add a comment