Bulk Loading Data into Cassandra


Published on March 7, 2014

Author: DataStax

Source: slideshare.net

Description

Whether running load tests or migrating historic data, loading data directly into Cassandra can be very useful to bypass the system’s write path.

In this webinar, we will look at how data is stored on disk in sstables, how to generate these structures directly, and how to load this data rapidly into your cluster using sstableloader. We'll also review different use cases for when you should and shouldn't use this method.

Planet Cassandra 2014: Bulk-Loading Data into Cassandra. Patricia Gorla (@patriciagorla), Cassandra Consultant, www.thelastpickle.com

About Us • Work with clients to deliver and improve Apache Cassandra services • Apache Cassandra committer, DataStax MVP, Hector maintainer, Apache Usergrid committer • Based in New Zealand & USA

Why is bulk loading useful? • Performance tests • Migrating historical data • Changing topologies

• How Data is Stored • Case Studies - Generating Dummy Data - Backfilling Historical Data - Changing Topologies • Conclusion

Cassandra Write Path • write[0]: writes are written to both the commit log and the memtable. • The memtable is sorted. [diagram: commitlog, memtable]

Cassandra Write Path • The memtable is flushed out to sstables. [diagram: commitlog, memtable, sstable[0], sstable[1], sstable[2]]

Cassandra Write Path • Compaction helps keep the read latency low. [diagram: commitlog, memtable, sstable[0], sstable[1], sstable[2], …, sstable[n]]
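To watch the flush step happen on disk, you can trigger it by hand; a minimal sketch, assuming a locally running node and the default data directory layout (keyspace, table, and path names are illustrative):

# Force the memtable for one table to be written out as sstables,
# then list the freshly created files.
bin/nodetool flush mykeyspace mycf
ls data/mykeyspace/mycf/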

Sorted String Tables

mykeyspace-mycf-jb-1-CompressionInfo.db
mykeyspace-mycf-jb-1-Data.db            Contains all data needed to regenerate the other components
mykeyspace-mycf-jb-1-Filter.db          Bloom filter over the sstable
mykeyspace-mycf-jb-1-Index.db           Index of row keys
mykeyspace-mycf-jb-1-Statistics.db
mykeyspace-mycf-jb-1-Summary.db         Index summary from the Index.db file
mykeyspace-mycf-jb-1-TOC.txt            Table of contents of all components
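Of these components, the TOC file is plain text; a quick way to see which pieces make up a given sstable generation is simply to print it (the path is illustrative):

# Lists the component names belonging to generation 1 of mykeyspace/mycf.
cat mykeyspace-mycf-jb-1-TOC.txt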

• How Data is Stored • Case Studies - Generating Dummy Data - Backfilling Historical Data - Changing Topologies • Conclusion

create keyspace test
  with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
  and strategy_options = {replication_factor:1};

create column family test
  with comparator = 'AsciiType'
  and default_validation_class = 'AsciiType'
  and key_validation_class = 'AsciiType';

Set up keyspace and column family
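One way to apply this schema is to put the statements in a file and feed it to cassandra-cli; a sketch, assuming a node listening on localhost and a hypothetical schema.cli file holding the statements above:

# Run the keyspace and column family definitions against the local node.
bin/cassandra-cli -h localhost -f schema.cli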

AbstractSSTableSimpleWriter writer = new SSTableSimpleUnsortedWriter(
    directory,
    partitioner,
    keyspace,
    columnFamily,
    AsciiType.instance,
    null,                 // subcomparator for super columns
    size_per_sstable_mb);

SStableGen.java


ByteBuffer randomBytes = ByteBufferUtil.bytes(randomAscii(1024));
KeyGenerator keyGen = new KeyGenerator();
long dataSize = 0;
writer = new SSTableSimpleUnsortedWriter(…);

while (dataSize < max_data_bytes) {
    writer.newRow(key);                        // key comes from keyGen (bookkeeping elided on the slide)
    for (int j = 0; j < num_cols; j++) {
        ByteBuffer colName = ByteBufferUtil.bytes("col_" + j);
        ByteBuffer colValue = ByteBuffer.wrap(new byte[20]);
        randomBytes.get(colValue.array());     // copy 20 random bytes into the column value
        colValue.position(0);
        writer.addColumn(colName, colValue, timestamp);
        // Recycle the random buffer once it runs low.
        if (randomBytes.remaining() < colValue.limit()) {
            randomBytes.position(0);
        } else {
            randomBytes.position(randomBytes.position() + colValue.limit());
        }
    }
}
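The slide elides the key generation and dataSize bookkeeping; whatever those look like, the writer must be closed once the loop finishes so the buffered rows are flushed out. A minimal sketch:

// Flush any buffered rows and finish writing the sstable components
// into the output directory before handing them to sstableloader.
writer.close();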

patricia@dev:~/../data$ ls -lh mykeyspace/mycf
total 64
-rw-r--r-- 1 patricia staff  43B Feb 2 15:31 mykeyspace-mycf-jb-1-CompressionInfo.db
-rw-r--r-- 1 patricia staff  79K Feb 2 15:31 mykeyspace-mycf-jb-1-Data.db
-rw-r--r-- 1 patricia staff  16B Feb 2 15:31 mykeyspace-mycf-jb-1-Filter.db
-rw-r--r-- 1 patricia staff  36B Feb 2 15:31 mykeyspace-mycf-jb-1-Index.db
-rw-r--r-- 1 patricia staff 4.3K Feb 2 15:31 mykeyspace-mycf-jb-1-Statistics.db
-rw-r--r-- 1 patricia staff  80B Feb 2 15:31 mykeyspace-mycf-jb-1-Summary.db
-rw-r--r-- 1 patricia staff  79B Feb 2 15:31 mykeyspace-mycf-jb-1-TOC.txt

Examining sstable output

$ bin/sstableloader Keyspace1/ColFam1

patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d localhost
Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db to [/127.0.0.1]
progress: [/127.0.0.1 1/1 (100)] [total: 100 - 0MB/s (avg: 0MB/s)]


$ bin/sstableloader Keyspace1/ColFam1 • Run the command on a separate server • Throttle the command • Parallelise processes
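Throttling is built into the loader; a hedged example, assuming the -t/--throttle option (in Mbit/s) available in sstableloader of this era (check bin/sstableloader --help for your version):

# Stream at roughly 50 Mbit/s so the receiving cluster is not overwhelmed.
bin/sstableloader -d cass1,cass2,cass3 -t 50 mykeyspace/mycf

Parallelising is then a matter of running one loader process per sstable directory, ideally from separate machines.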

• How Data is Stored • Case Studies - Generating Dummy Data - Backfilling Historical Data - Changing Topologies • Conclusion

// list of orders by user
customerOrders = new SSTableSimpleUnsortedWriter(…);
// orders by order id
orders = new SSTableSimpleUnsortedWriter(…);

// assume orders are in date order
for (Order order : oldOrders) {
    customerOrders.newRow(ByteBufferUtil.bytes(order.customerId));
    customerOrders.addColumn(ByteBufferUtil.bytes(order.orderId),
            ByteBufferUtil.EMPTY_BYTE_BUFFER, timestamp);

    orders.newRow(ByteBufferUtil.bytes(order.orderId));   // "userId" on the original slide; the row key here is the order id
    orders.addColumn(ByteBufferUtil.bytes("customer_id"),
            ByteBufferUtil.bytes(order.customerId), timestamp);
    orders.addColumn(ByteBufferUtil.bytes("date"),
            ByteBufferUtil.bytes(order.date), timestamp);
    orders.addColumn(ByteBufferUtil.bytes("total"),
            ByteBufferUtil.bytes(order.total), timestamp);
}

customerOrders.close();
orders.close();

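The loop above assumes an Order record roughly like the following; the field names come from the slide, everything else (types, class shape) is a hypothetical sketch:

// Hypothetical shape of the historical order record iterated over above.
public class Order {
    public String customerId;   // row key for the customerOrders writer
    public String orderId;      // row key and column name for the orders data
    public String date;         // e.g. an ISO-8601 date string
    public String total;        // kept as text in this sketch
}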

• How Data is Stored • Case Studies - Generating Dummy Data - Backfilling Historical Data - Changing Topologies • Conclusion

$ bin/sstableloader Keyspace1/ColFam1

patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d cass1,cass2,cass3

Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db to [/cass1,cass2,cass3,cass4,cass5,cass6]

progress: [/cas1 3/3 (100)] [/cas2 0/4 (0)] [/cas3 0/0 (0)] [/cas4 0/0 (0)] [/cas5 0/0 (0)] [/cas6 1/2 (50)] [total: 50 - 0MB/s (avg: 5MB/s)]


$ bin/sstableloader Keyspace1/ColFam1

patricia@dev:~/.../cassandra-2.0.4$ bin/nodetool compactionstats
pending tasks: 30
Active compaction remaining time : n/a
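Those pending tasks are the cluster compacting the freshly streamed sstables; if the hardware has headroom, you can let it catch up faster. A hedged example (the value is in MB/s; 0 removes the cap):

# Temporarily raise the compaction throughput cap while the backlog drains.
bin/nodetool setcompactionthroughput 64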

• How Data is Stored • Case Studies - Generating Dummy Data - Backfilling Historical Data - Changing Topologies • Conclusion

cqlsh> CREATE KEYSPACE "test" WITH replication =
         {'class': 'SimpleStrategy', 'replication_factor': 1 };

cqlsh> CREATE COLUMNFAMILY "test" (id text PRIMARY KEY);

CQL: Keep the schema consistent
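Once the sstables are streamed in, a quick sanity check from cqlsh confirms the data landed in the table this schema describes (keyspace and table names as above):

cqlsh> SELECT * FROM "test"."test" LIMIT 5;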

CQL3 Considerations • Uses CompositeType comparator
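Because CQL3 tables store clustering columns and column names inside a composite comparator, sstables written directly for such a table need the writer constructed with that CompositeType rather than a simple type. A minimal sketch under that assumption, reusing the constructor shown earlier (the two UTF8 components stand in for one clustering column plus the CQL column name; adjust to your table and verify the API against your Cassandra version):

import java.util.Arrays;
import org.apache.cassandra.db.marshal.AbstractType;
import org.apache.cassandra.db.marshal.CompositeType;
import org.apache.cassandra.db.marshal.UTF8Type;

// Composite comparator: (clustering column value, CQL column name).
AbstractType<?> comparator = CompositeType.getInstance(
        Arrays.<AbstractType<?>>asList(UTF8Type.instance, UTF8Type.instance));

AbstractSSTableSimpleWriter writer = new SSTableSimpleUnsortedWriter(
        directory,
        partitioner,
        keyspace,
        columnFamily,
        comparator,
        null,                   // subcomparator (super columns only)
        size_per_sstable_mb);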

Planet Cassandra 2014: Q&A. Patricia Gorla (@patriciagorla), Cassandra Consultant, www.thelastpickle.com
