Parquet and impala overview external


Published on March 10, 2014

Author: mattlieber



Evaluation of Hadoop formats along with Parquet.
Brief intro and overview of Impala.

1 Parquet data format & Impala overview

2 Agenda
• Objective
• Various data formats
• Use case
• Parquet
• Impala

3 Objective
• Twofold:
  • Find a more performant data format than Avro for nested data
  • Understand and test new data formats in general

4 Hadoop data formats
• Sequence file. Stores key-value pairs of data in a flat binary file. Rows are stored as values.
• ORC. Stores column-oriented data. Adds RLE and dictionary encoding, statistics, and single-file output. Bloom filters are planned.
• Avro. Data serialization framework: a serialization format and exchange service, usable from any language. Data is accompanied by its schema (in JSON). Supports schema evolution.

5 Parquet
• Columnar storage
• Automatic dictionary encoding and run-length encoding. Encoding is kept separate from compression.
• Run-length encoding replaces sequences ("runs") of consecutive repeated characters (or other units of data) with a single character and the length of the run.
• Dictionary encoding takes the different values present in a column and represents each one in a compact 2-byte form.
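The two encodings described on this slide can be illustrated with a minimal Python sketch. It is not Parquet's actual on-disk format: the function names and the sample column are assumptions made for illustration, and the dictionary codes are plain Python integers rather than a fixed-width form.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse each run of equal consecutive values into (value, run_length)."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


def dictionary_encode(values):
    """Replace each value with a small integer code.

    Returns (dictionary, codes), where dictionary[code] is the original value.
    """
    code_of = {}
    codes = []
    for value in values:
        # First time a value is seen, it gets the next free code.
        codes.append(code_of.setdefault(value, len(code_of)))
    return list(code_of), codes


# Hypothetical column of country codes.
column = ["US", "US", "US", "DE", "DE", "US"]

dictionary, codes = dictionary_encode(column)
runs = rle_encode(codes)  # the two encodings compose: RLE over the codes

print(dictionary, codes, runs)
```

Running dictionary encoding first and run-length encoding second, as in the last lines, shows why the two work well together on columnar data: a column with few distinct values becomes a short list of small codes with long runs.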

6 Parquet
• Parquet can handle multiple schemas and supports schema evolution. For example:
  • LogType A: organizationId, userId, timestamp, recordId, cpuTime
  • LogType V: userId, organizationId, timestamp, foo, bar
• Can be used by any project in the Hadoop ecosystem. Integrations are provided for M/R, Pig, Hive, Cascading and Impala.
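As a rough illustration of what handling two schemas means for a reader, the sketch below merges the two log types from this slide into one column list and reads records against it, returning None for columns a record does not have. This is a conceptual sketch in plain Python, not the Parquet API; the helper names and the sample record are hypothetical.

```python
LOG_TYPE_A = ["organizationId", "userId", "timestamp", "recordId", "cpuTime"]
LOG_TYPE_V = ["userId", "organizationId", "timestamp", "foo", "bar"]


def merged_schema(*schemas):
    """Union of the column names, keeping first-seen order."""
    merged = []
    for schema in schemas:
        for column in schema:
            if column not in merged:
                merged.append(column)
    return merged


def project(record, schema):
    """Read a record against a schema; missing columns come back as None."""
    return {column: record.get(column) for column in schema}


schema = merged_schema(LOG_TYPE_A, LOG_TYPE_V)

# Hypothetical LogType V record: it has no recordId or cpuTime.
record_v = {"userId": 7, "organizationId": 42, "timestamp": 1394409600,
            "foo": "x", "bar": "y"}
print(project(record_v, schema))
```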

7 Parquet
• SELECT vs INSERT
• Parquet tables require relatively little memory to query, because a query reads and decompresses data in 8 MB chunks.
• Inserting into a Parquet table is a more memory-intensive operation, because the data for each data file (with a maximum size of 1 GB) is held in memory until it is encoded, compressed, and written to disk.

8 Parquet
• Memory issues (heap space errors) were resolved by:
  • Reducing parquet.block.size. The block size is the size of a row group being buffered in memory; its default value is 256 MB.
  • The total memory allocated was around 1 GB.
• With multiple Hive partitions, multiple buffers were created (one for writing into each partition).
• So writing data with Parquet will always have a high memory requirement.
• Hive's DISTRIBUTE BY was the workaround for the memory issues.
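The memory pressure described on this slide can be approximated with simple arithmetic, assuming one row-group buffer of parquet.block.size per partition being written at the same time. The partition counts below are hypothetical; the slide does not say how many partitions were open, and real writers use some memory beyond the buffer itself.

```python
MB = 1024 * 1024


def estimated_write_buffer_bytes(open_partitions, block_size_bytes):
    """Rough lower bound: one row-group buffer per partition being written."""
    return open_partitions * block_size_bytes


# Hypothetical task writing into 4 partitions at once with 256 MB blocks.
print(estimated_write_buffer_bytes(4, 256 * MB) // MB, "MB")

# Same task after reducing the block size to 64 MB.
print(estimated_write_buffer_bytes(4, 64 * MB) // MB, "MB")

# If rows are distributed so that each task writes a single partition,
# only one buffer is open per task.
print(estimated_write_buffer_bytes(1, 256 * MB) // MB, "MB")
```

Under this assumption, both remedies from the slide act on one of the two factors: a smaller block size shrinks each buffer, and distributing rows by partition key reduces how many buffers a single task holds open.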

9 Parquet vs other formats
• Performance test with 100 GB of data over multiple queries: Parquet wins

10 Impala overview
• MPP implementation of a query engine
• Impala vs Hive: Impala runs SQL queries for interactive, exploratory analytics on large data sets, whereas Hive runs as batch.
• Does not use M/R, but does use HDFS
• Not CEP; closer to an RDBMS
• Impala uses the same metadata store as Hive to record information about table structure and properties

11 Impala overview
• Can create a table in Hive and use it in Impala
  • E.g. Impala cannot create an Avro table, but Hive can
• Language is a mix between SQL and HiveQL
• Requires a lot of memory (128 GB min. per node)
• Initial load of data via REFRESH can take a lot of time
  • It loads the block location data for newly added data files

12 Impala overview
• Shortcomings
• Impala doesn't support nested types at this point (version 1.2.3). A table is usable only as long as it contains Impala-compatible data types; it cannot contain nested types such as array, map, or struct.
• Impala currently does not "spill to disk" if intermediate results being processed on a node exceed the memory reserved for Impala on that node.
• No custom Serializer/Deserializer classes (SerDes)
• Impala cancels a running query if any host on which that query is executing fails

13 Impala overview
• Example: there are 3 ways to create a PARQUET table in Impala:
• -> Use a PARQUET table created in Hive (with no nested data types).
• -> Create a normal text table in Impala and load it with data, then create the Parquet table from it:
• IMPALA> create table parquet_table_name LIKE text_table_name STORED AS PARQUET LOCATION '/user/hdfs/..';
• -> Create a Parquet format table and then insert into it from a normal text table:
• IMPALA> insert overwrite table parquet_table_name select * from text_table_name;

14 Use Case
• Can't query the Avro table in Impala because it has nested columns.
• An Avro table created through Hive can be used in Impala as long as it contains only Impala-compatible data types
  (it cannot contain nested types such as array, map, or struct).

15 Use Case
• How to deal with nested XML data in Hadoop?
• There is no direct mapping from XML to Avro. The process goes:
  • Parse the XML and convert to Avro: parse the XML using XMLStreamReader, perform JAXB unmarshalling, and create Avro records from the JAXB objects. A Java class needs to be written for this.
• Tried using Parquet/Avro:
  • Tested: process the XML by first converting it into Avro and then storing it in Parquet format using the parquet-avro APIs.
  • The problem is that the schema provided has some arrays that are a union of the types string and null.
  • Currently AvroSchemaConverter is not able to handle such an Avro schema and throws an exception.
• Tested: Impala 1.2.3 on CDH 4.5
  • Impala doesn't support nested types at this point
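To make the problematic schema shape concrete, here is one possible reading of "arrays that are a union of string and null": an Avro array whose items may each be either null or a string. The record and field names are made up, and the actual schema used in the test is not shown in the slides. The small helper lists the fields of that shape so they can be spotted before conversion is attempted.

```python
import json

# Hypothetical Avro schema; "Order", "id" and "tags" are illustrative names.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "Order",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "tags", "type": {"type": "array", "items": ["null", "string"]}}
  ]
}
"""


def array_fields_with_nullable_items(schema):
    """Names of top-level array fields whose item type is a union containing null."""
    found = []
    for field in schema["fields"]:
        field_type = field["type"]
        if isinstance(field_type, dict) and field_type.get("type") == "array":
            items = field_type["items"]
            # In Avro's JSON form a union is written as a JSON list of types.
            if isinstance(items, list) and "null" in items:
                found.append(field["name"])
    return found


print(array_fields_with_nullable_items(json.loads(SCHEMA_JSON)))
```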

16 Thank you
