Optimising Column stores with statistical analysis

50 %
50 %
Information about Optimising Column stores with statistical analysis
Technology

Published on March 8, 2014

Author: tkejser1

Source: slideshare.net

Description

A presentation about column stores, how they work and how you can optimise compression with them

Turning Rows into Columns Sales Product Customer I D Value I D Customer 1 Beer 1 Thomas 2 GBP 2 Beer 2 Thomas 2011-1125 10 GBP 3 Vodka 3 Thomas 4 Whiskey 4 Christian 5 Whiskey 5 Christian 6 Vodka 6 Alexei 7 Vodka 7 Alexei Product Customer Date Sale Beer Thomas 2011-1125 2 GBP Beer Thomas 2011-1125 Vodka Thomas And so on… until… Whiskey Christian 2011-1125 5 GBP Whiskey Christian 2011-1125 5 GBP Vodka Alexei 2011-1125 10 GBP

And we get… Product Sale Date Customer I D Value I D Customer I D Date I D Sale 1 Beer 1 Thomas 1 2011-11-25 1 2 GBP 2 Beer 2 Thomas 2 2011-11-25 2 2 GBP 3 Vodka 3 Thomas 3 2011-11-25 3 10 GBP 4 Whiskey 4 Christian 4 2011-11-25 4 5 GBP 5 Whiskey 5 Christian 5 2011-11-25 5 5 GBP 6 Vodka 6 Alexei 6 2011-11-25 6 10 GBP 7 Vodka 7 Alexei 7 2011-11-25 7 10 GBP

And what now? Product I D Value 1 Beer 2 Beer 3 Vodka 4 Whiskey 5 Whiskey 6 Vodka 7 Vodka Product’ ID 1-2 Run length Encode Value Beer 3 Vodka 4-5 Whiskey 6-7 Vodka

Applying Compression Product’ Customer’ Sale’ Date’ ID Value ID Customer ID Date ID Sale 1-2 Beer 1-3 Thomas 1-7 2011-11-25 1-2 2 GBP 3 Vodka 4-5 Christian 3 10 GBP 4-5 Whiskey 6-7 Alexei 4-5 5 GBP 6-7 Vodka 6-7 10 GBP

Insights Product’ ID One RL Value 12 3 Beer 45 Whiskey 67 Vodka Vodka

Ordering Example Product Customer Product Customer Product Customer Beer Thomas Beer Thomas Beer Christian Whiskey Christian Beer Thomas Vodka Thomas Vodka Thomas Vodka Thomas Whiskey Christian Whiskey Christian Whiskey Christian Beer Thomas Beer Thomas Whiskey Christian Vodka Alexei Vodka Alexei Vodka Alexei Vodka Alexei Customer Thomas Whiskey Product Vodka Alexei Beer Thomas Vodka Whiskey Christian Vodka Alexei

There is some overhead… Cluster on ID Heap Data Size 327MB 327MB Column Index Size 59MB 142MB

Rule of Thumb? Lowest first is worse!

OK, so what about highest first? Loose correlation Highest first is worse!

What are we looking for?

Just Read the Magic Code?

Coming to Terms with Entropy SKEW SPLAT ID 10001 10001 1000000 0.01 0.01 1 Histogram COUNT DISTINCT DISTINCT / COUNT

SKEW SPLAT ID ≈ 0.21 ≈ 13 ≈ 20

You will NEVER win

Column that “cluster” with other columns?

I(X;Y)

d(A, B) is zero!

Reflecting on Information Distance

There are MORE than n! routes http://arxiv.org/pdf/1207.2189.pdf

Heuristics are your Best Bet

Add a comment

Related presentations

Related pages

Optimising Column stores with statistical analysis - HubSlide

A presentation about column stores, how they work and how you can optimise compression with them
Read more

Thomas Kejser - HubSlide

Optimising Column stores with statistical analysis A presentation about column stores, how they work and how y...
Read more

PASS SQLRally 2012 Nordic > Agenda > Speaker Details

Optimising Column Stores with Statistical Analysis (Database Administration (DBA)) Big Data Customer Solution Architectures (Business Intelligence (BI))
Read more

Sessões do PASS SQLRally Nordic 2012 disponíveis | SQLPort

Thomas Kejser - Optimising Column Stores with Statistical Analysis ... Joe Chang - Automating DMV performance collection, analysis and archival
Read more

Eventos | SQLPort

Thomas Kejser - Optimising Column Stores with Statistical Analysis Conor Cunningham - SQL Server 2012 Query Optimizer deep dive Michael Frandsen ...
Read more

Excel stats nicar2013 - IRE

UsingExcel&forStatistical&Analysis& ... From&the&listof&tools,&choose&DESCRIPTIVE&STATISTICS:& Highlightthe&column&containing&the ... Excel_stats_nicar2013 ...
Read more

SQLPort - Eventos

Thomas Kejser - Optimising Column Stores with Statistical Analysis Conor Cunningham - SQL Server 2012 Query Optimizer deep dive Michael Frandsen ...
Read more

PASS SQLRally 2012 Nordic > Agenda > Build My Schedule

PASS SQLRally 2012 Nordic - Build Your Schedule. Keyword: Category:
Read more

PASS SQLRally 2012 Nordic > Agenda > Main conference ...

... be awkward and cumbersome. It is much more effective to do this in a proper programming language. Much of the tedious analysis work can be ...
Read more