Published on March 10, 2014
1 Parquet data format & Impala overview
2 Agenda • Objective • Various data formats • Use case • Parquet • Impala
3 Objective • Twofold: • Quest for a more performant data format than Avro for nested data • Understand and test new data formats in general
4 Hadoop data formats • Sequence file. It stores key-value pairs of data in a flat binary file. Rows stored as values. • ORC. Stores column oriented data. Added RLE and Dictionary encoding, and statistics, single file output. Will add Bloom filter. • Avro. Data serialization framework: serialization format & exchange service, for any language. Data accompanied by schema (in JSON). Supports schema evolution.
5 Parquet • Columnar storage • Automatic dictionary encoding and run-length encoding. Separation of encoding vs compression. • Run-length encoding: replaces sequences ("runs") of consecutive repeated characters (or other units of data) with a single character and the length of the run. • Dictionary encoding takes the different values present in a column, and represents each one in compact 2-byte form
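The two encodings named on this slide can be sketched in a few lines of Python. This is only an illustration of the ideas, not Parquet's actual on-disk layout; the function names and the sample column are made up for the example.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse each run of consecutive equal values into (value, run_length)."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def dictionary_encode(values):
    """Map each distinct value to a small integer id, in order of first appearance."""
    dictionary = {}
    ids = []
    for value in values:
        if value not in dictionary:
            dictionary[value] = len(dictionary)
        ids.append(dictionary[value])
    return dictionary, ids


# Hypothetical column of country codes (sample data, not from the slides).
column = ["US", "US", "US", "FR", "FR", "US"]

runs = run_length_encode(column)
dictionary, ids = dictionary_encode(column)

# The two encodings compose: the dictionary ids can be run-length encoded in turn.
encoded_ids = run_length_encode(ids)
```

Because a columnar layout places values of the same column next to each other, long runs and small dictionaries are much more likely than in a row-oriented file, which is why these encodings pay off here.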
6 Parquet • Parquet can handle multiple schemas. Supports schema evolution. • LogType A : organizationId, userId, timestamp, recordId, cpuTime • LogType V : userId, organizationId, timestamp, foo, bar • Can be used by any project in the Hadoop ecosystem. Integrations provided for M/R, Pig, Hive, Cascading and Impala.
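One way to picture what handling the two log types together means is a merged schema holding the union of their columns, with columns missing from a record read back as null. The Python below is a conceptual sketch under that assumption only; `merge_schemas` and `conform` are hypothetical helpers, not Parquet APIs, and the record values are invented.

```python
def merge_schemas(*schemas):
    """Return the union of column names, keeping first-seen order."""
    merged = []
    for schema in schemas:
        for column in schema:
            if column not in merged:
                merged.append(column)
    return merged


def conform(record, merged_schema):
    """Project a record onto the merged schema; absent columns become None."""
    return {column: record.get(column) for column in merged_schema}


# Column lists taken from the slide.
LOG_TYPE_A = ["organizationId", "userId", "timestamp", "recordId", "cpuTime"]
LOG_TYPE_V = ["userId", "organizationId", "timestamp", "foo", "bar"]

merged = merge_schemas(LOG_TYPE_A, LOG_TYPE_V)

# A LogType V record (made-up values) viewed through the merged schema.
row = conform(
    {"userId": 7, "organizationId": 1, "timestamp": 1394409600, "foo": "x", "bar": "y"},
    merged,
)
```

With columnar storage, a query that touches only the shared columns (organizationId, userId, timestamp) does not need to read the columns that are specific to one log type.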
7 Parquet • SELECT vs INSERT. • Parquet tables require relatively little memory to query, because a query reads and decompresses data in 8 MB chunks. • Inserting into a Parquet table is a more memory-intensive operation because the data for each data file (with a maximum size of 1 GB) is stored in memory until encoded, compressed, and written to disk.
8 Parquet • Memory issues (heap space error) resolved by: • Reducing parquet.block.size. The block size is the size of a row group being buffered in memory, and its default value is 256 MB. • The total memory allocated was around 1 GB. • Using multiple Hive partitions -> multiple buffers were being created (one for writing into each partition). • So writing data using Parquet will always have a high memory requirement. • Hive’s DISTRIBUTE BY was the workaround for the memory issues!
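The arithmetic behind the heap space error can be sketched as below. It assumes, as the slide describes, that each open partition writer buffers one full row group; the count of 4 simultaneously open partitions and the reduced 64 MB block size are hypothetical values chosen for illustration, and real usage also includes encoding and compression overhead.

```python
MB = 1024 * 1024


def estimated_write_buffer_bytes(open_partitions, block_size_bytes):
    """Rough lower bound: one buffered row group per open partition writer."""
    return open_partitions * block_size_bytes


# Hypothetical: 4 partitions written at once with the 256 MB block size from the slide.
default_estimate = estimated_write_buffer_bytes(4, 256 * MB)

# Same writers after reducing parquet.block.size to a hypothetical 64 MB.
reduced_estimate = estimated_write_buffer_bytes(4, 64 * MB)

print(default_estimate // MB, "MB vs", reduced_estimate // MB, "MB")
```

This is also why DISTRIBUTE BY on the partition column helps: rows for a given partition are routed to the same reducer, so each task keeps fewer writers (and buffers) open at the same time.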
9 Parquet vs other formats • Performance test with 100 GB of data over multiple queries • Parquet wins
10 Impala overview • MPP implementation of a query engine • Impala vs Hive: Impala runs SQL queries for interactive exploratory analytics on large data sets, whereas Hive runs as batch. • Does not use M/R, but uses HDFS • Not CEP; closer to an RDBMS. • Impala uses the same metadata store as Hive to record information about table structure and properties
11 Impala overview • Can create a table in Hive, and use it in Impala • E.g. Impala doesn’t support Avro, but Hive does • Language is a mix between SQL & HiveQL • Requires a lot of memory (128 GB min. per node) • Initial load of data via REFRESH can take a lot of time • It loads the block location data for newly added data files
12 Impala overview • Shortcomings • Impala doesn’t support nested types at this point (version 1.2.3): a table is usable as long as it contains only Impala-compatible data types, so it cannot contain nested types such as array, map, or struct. • Impala currently does not "spill to disk" if intermediate results being processed on a node exceed the memory reserved for Impala on that node. • No custom Serializer/Deserializer classes (SerDes) • Impala cancels a running query if any host on which that query is executing fails
13 Impala overview • Example. There are 3 ways to create a PARQUET table in IMPALA: • -> Use a PARQUET table created in HIVE (with no nested data types). • -> Create a normal text table in IMPALA and load it with data: • IMPALA> create table parquet_table_name LIKE text_table_name STORED AS PARQUET LOCATION '/user/hdfs/..'; • Create a Parquet format table and then insert into it from the normal text table. • IMPALA> insert overwrite table parquet_table_name select * from text_table_name;
14 Use Case • Can't query the Avro table in Impala because it has nested columns. • An Avro table created through Hive can be used in Impala as long as it contains only Impala-compatible data types • (it cannot contain nested types such as array, map, or struct).
15 Use Case • How to deal with nested XML data in Hadoop? • There is no direct mapping from XML to Avro. The process goes: • Parse XML and convert to Avro: parse the XML using XMLStreamReader, perform JAXB unmarshalling, and create Avro records from the JAXB objects. A Java class needs to be written for this. • Tried using Parquet/Avro: • Tested: process XML, first converting it into Avro and then storing it in Parquet format using the parquet-avro APIs. • The problem is that the schema provided has some arrays which are a union of types string and null. • Currently AvroSchemaConverter is not able to handle such an Avro schema and throws an exception. • Tested: Impala 1.2.3 on CDH 4.5 • Impala doesn’t support nested types at this point
16 Thank you