Advanced HDF5 Features

Published on February 17, 2014

Author: HDFEOS

Source: slideshare.net

Description

This tutorial is designed for HDF5 users with some HDF5 experience.

It will cover advanced features of the HDF5 library for achieving better I/O performance and efficient storage. The following HDF5 features will be discussed: partial I/O, chunked storage layout, compression and other filters including new n-bit and scale+offset filters. Significant time will be devoted to the discussion of complex HDF5 datatypes such as strings, variable-length datatypes, array and compound datatypes.

HDF5 Advanced Topics October 15, 2008 HDF and HDF-EOS Workshop XII 1

Outline
• Part I
  • Overview of HDF5 datatypes
• Part II
  • Partial I/O in HDF5
  • Hyperslab selection
  • Dataset region references
  • Chunking and compression
• Part III
  • Performance issues (how to do it right)

Part I: HDF5 Datatypes
Quick overview of the most difficult topics

HDF5 Datatypes
• HDF5 has a rich set of pre-defined datatypes and supports the creation of an unlimited variety of complex user-defined datatypes.
• Datatype definitions are stored in the HDF5 file with the data.
• Datatype definitions include information such as byte order (endianness), size, and floating point representation to fully describe how the data is stored and to ensure portability across platforms.
• Datatype definitions can be shared among objects in an HDF5 file, providing a powerful and efficient mechanism for describing data.

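Example
• Array of integers on a Linux platform: native integer is little-endian, 4 bytes. Written with H5Dwrite using memory type H5T_NATIVE_INT.
• Array of integers on a Solaris platform: native integer is big-endian; the Fortran compiler uses the -i8 flag to set integer to 8 bytes. Written with H5Dwrite using memory type H5T_NATIVE_INT.
• In the file the data is stored as little-endian 4-byte integers (H5T_STD_I32LE) and is read back with H5Dread.
[Figure: two platforms writing to, and one reading from, the same dataset; the figure also carries the label "VAX G-floating"]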

Storing Variable Length Data in HDF5

HDF5 Fixed and Variable Length Array Storage
[Figure: fixed-length and variable-length arrays of data laid out along a time axis]

Storing Strings in HDF5
• Array of characters
  • Access to each character
  • Extra work to access and interpret each string
• Fixed length
    string_id = H5Tcopy(H5T_C_S1);
    H5Tset_size(string_id, size);
  • Overhead for short strings
  • Can be compressed
• Variable length
    string_id = H5Tcopy(H5T_C_S1);
    H5Tset_size(string_id, H5T_VARIABLE);
  • Overhead as for all VL datatypes
  • Compression will not be applied to actual data

Storing Variable Length Data in HDF5
• Each element is represented by a C structure
    typedef struct {
        size_t len;
        void *p;
    } hvl_t;
• Base type can be any HDF5 type
    H5Tvlen_create(base_type)

Example
    hvl_t data[LENGTH];
    for (i = 0; i < LENGTH; i++) {
        /* Element i holds i+1 unsigned integers */
        data[i].p = malloc((i + 1) * sizeof(unsigned int));
        data[i].len = i + 1;
    }
    tvl = H5Tvlen_create(H5T_NATIVE_UINT);
[Figure: the data array, with data[0].p pointing to the first element's buffer and data[4].len giving the length of the fifth]

Reading HDF5 Variable Length Array
On read, the HDF5 Library allocates the memory that receives the data; the application only needs to allocate an array of hvl_t elements (pointers and lengths).
    hvl_t rdata[LENGTH];
    /* Create the VL memory datatype (it could also be discovered from the file) */
    tvl = H5Tvlen_create(H5T_NATIVE_UINT);
    ret = H5Dread(dataset, tvl, H5S_ALL, H5S_ALL, H5P_DEFAULT, rdata);
    /* Reclaim the read VL data */
    H5Dvlen_reclaim(tvl, H5S_ALL, H5P_DEFAULT, rdata);

Storing Tables in HDF5 file

Example
a_name (integer)   b_name (float)   c_name (double)
0                  0.               1.0000
1                  1.               0.5000
2                  4.               0.3333
3                  9.               0.2500
4                  16.              0.2000
5                  25.              0.1667
6                  36.              0.1429
7                  49.              0.1250
8                  64.              0.1111
9                  81.              0.1000
Multiple ways to store a table
• Dataset for each field
• Dataset with compound datatype
• If all fields have the same type:
  • 2-dim array
  • 1-dim array of array datatype
• continued…..
Choose to achieve your goal!
• How much overhead will each type of storage create?
• Do I always read all fields?
• Do I need to read some fields more often?
• Do I want to use compression?
• Do I want to access some records?

HDF5 Compound Datatypes
• Compound types
  • Comparable to C structs
  • Members can be atomic or compound types
  • Members can be multidimensional
  • Can be written/read by a field or set of fields
  • Not all data filters can be applied (shuffling, SZIP)

HDF5 Compound Datatypes
• Which APIs to use?
  • H5TB APIs
    • Create, read, get info and merge tables
    • Add, delete, and append records
    • Insert and delete fields
    • Limited control over table’s properties (e.g. only GZIP compression, level 6, default allocation time for table, extendible, etc.)
  • PyTables http://www.pytables.org
    • Based on H5TB
    • Python interface
    • Indexing capabilities
  • HDF5 APIs
    • H5Tcreate(H5T_COMPOUND), H5Tinsert calls to create a compound datatype
    • H5Dcreate, etc.
    • See H5Tget_member* functions for discovering properties of the HDF5 compound datatype

Creating and Writing Compound Dataset
h5_compound.c example
    typedef struct s1_t {
        int a;
        float b;
        double c;
    } s1_t;
    s1_t s1[LENGTH];

Creating and Writing Compound Dataset
    /* Create datatype in memory. */
    s1_tid = H5Tcreate(H5T_COMPOUND, sizeof(s1_t));
    H5Tinsert(s1_tid, "a_name", HOFFSET(s1_t, a), H5T_NATIVE_INT);
    H5Tinsert(s1_tid, "c_name", HOFFSET(s1_t, c), H5T_NATIVE_DOUBLE);
    H5Tinsert(s1_tid, "b_name", HOFFSET(s1_t, b), H5T_NATIVE_FLOAT);
Note:
• Use HOFFSET macro instead of calculating offset by hand.
• Order of H5Tinsert calls is not important if HOFFSET is used.

Creating and Writing Compound Dataset
    /* Create dataset and write data */
    dataset = H5Dcreate(file, DATASETNAME, s1_tid, space, H5P_DEFAULT);
    status = H5Dwrite(dataset, s1_tid, H5S_ALL, H5S_ALL, H5P_DEFAULT, s1);
Note:
• In this example memory and file datatypes are the same.
• Type is not packed.
• Use H5Tpack to save space in the file. H5Tpack packs a datatype in place, so pack a copy of the memory type:
    s2_tid = H5Tcopy(s1_tid);
    status = H5Tpack(s2_tid);
    dataset = H5Dcreate(file, DATASETNAME, s2_tid, space, H5P_DEFAULT);

File Content with h5dump
    HDF5 "SDScompound.h5" {
    GROUP "/" {
       DATASET "ArrayOfStructures" {
          DATATYPE {
             H5T_STD_I32BE "a_name";
             H5T_IEEE_F32BE "b_name";
             H5T_IEEE_F64BE "c_name";
          }
          DATASPACE { SIMPLE ( 10 ) / ( 10 ) }
          DATA {
             {
                [ 0 ],
                [ 0 ],
                [ 1 ]
             },
             {
                [ 1 ],
                …

Reading Compound Dataset
    /* Discover the file datatype, derive the memory datatype, and read data. */
    dataset = H5Dopen(file, DATASETNAME);
    s2_tid = H5Dget_type(dataset);
    mem_tid = H5Tget_native_type(s2_tid, H5T_DIR_ASCEND);
    /* Size the buffer from the memory datatype, not from the identifier */
    s1 = malloc(H5Tget_size(mem_tid) * number_of_elements);
    status = H5Dread(dataset, mem_tid, H5S_ALL, H5S_ALL, H5P_DEFAULT, s1);
Note:
• We could construct the memory type as we did in the writing example.
• For general applications we need to discover the type in the file, find the corresponding memory type, allocate space, and do the read.

Reading Compound Dataset by Fields
    typedef struct s2_t {
        double c;
        int a;
    } s2_t;
    s2_t s2[LENGTH];
    …
    s2_tid = H5Tcreate(H5T_COMPOUND, sizeof(s2_t));
    H5Tinsert(s2_tid, "c_name", HOFFSET(s2_t, c), H5T_NATIVE_DOUBLE);
    H5Tinsert(s2_tid, "a_name", HOFFSET(s2_t, a), H5T_NATIVE_INT);
    …
    status = H5Dread(dataset, s2_tid, H5S_ALL, H5S_ALL, H5P_DEFAULT, s2);

New Way of Creating Datatypes
Another way to create a compound datatype:
    #include "H5LTpublic.h"
    …..
    s2_tid = H5LTtext_to_dtype(
        "H5T_COMPOUND {H5T_NATIVE_DOUBLE \"c_name\"; H5T_NATIVE_INT \"a_name\"; }",
        H5LT_DDL);

Need Help with Datatypes?
Check our support web pages:
http://www.hdfgroup.uiuc.edu/UserSupport/example

Part II: Working with subsets

Collect data one way…
Array of images (3D)

Display data another way…
Stitched image (2D array)

Data is too big to read…

Refer to a region…
Need to select and access the same elements of a dataset

HDF5 Library Features
• HDF5 Library provides capabilities to
  • Describe subsets of data and perform write/read operations on subsets
    • Hyperslab selections and partial I/O
  • Store descriptions of the data subsets in a file
    • Object references
    • Region references
  • Use efficient storage mechanism to achieve good performance while writing/reading subsets of data
    • Chunking, compression

Partial I/O in HDF5

How to Describe a Subset in HDF5?
• Before writing and reading a subset of data one has to describe it to the HDF5 Library.
• HDF5 APIs and documentation refer to a subset as a “selection” or “hyperslab selection”.
• If specified, HDF5 Library will perform I/O on a selection only and not on all elements of a dataset.

Types of Selections in HDF5
• Two types of selections
  • Hyperslab selection
    • Regular hyperslab
    • Simple hyperslab
    • Result of set operations on hyperslabs (union, difference, …)
  • Point selection
• Hyperslab selection is especially important for doing parallel I/O in HDF5 (See Parallel HDF5 Tutorial)

Regular Hyperslab
Collection of regularly spaced equal size blocks

Simple Hyperslab
Contiguous subset or sub-array

Hyperslab Selection
Result of union operation on three simple hyperslabs

Hyperslab Description
• Offset - starting location of a hyperslab (1,1)
• Stride - number of elements that separate each block (3,2)
• Count - number of blocks (2,6)
• Block - block size (2,1)
• Everything is “measured” in number of elements

Simple Hyperslab Description
• Two ways to describe a simple hyperslab
  • As several blocks
    • Stride: (2,1)
    • Count: (2,6)
    • Block: (2,1)
  • As one block
    • Stride: (1,1)
    • Count: (1,1)
    • Block: (4,6)
No performance penalty for one way or another

H5Sselect_hyperslab Function
space_id   Identifier of dataspace
op         Selection operator: H5S_SELECT_SET or H5S_SELECT_OR
offset     Array with starting coordinates of hyperslab
stride     Array specifying which positions along a dimension to select
count      Array specifying how many blocks to select from the dataspace, in each dimension
block      Array specifying size of element block (NULL indicates a block size of a single element in a dimension)

Reading/Writing Selections
Programming model for reading from a dataset in a file
1. Open a dataset.
2. Get file dataspace handle of the dataset and specify subset to read from.
   a. H5Dget_space returns file dataspace handle. File dataspace describes array stored in a file (number of dimensions and their sizes).
   b. H5Sselect_hyperslab selects elements of the array that participate in I/O operation.
3. Allocate data buffer of an appropriate shape and size.

Reading/Writing Selections
Programming model (continued)
4. Create a memory dataspace and specify subset to write to.
   a. Memory dataspace describes data buffer (its rank and dimension sizes).
   b. Use H5Screate_simple function to create memory dataspace.
   c. Use H5Sselect_hyperslab to select elements of the data buffer that participate in I/O operation.
5. Issue H5Dread or H5Dwrite to move the data between file and memory buffer.
6. Close file dataspace and memory dataspace when done.

Example: Reading Two Rows
Data in a file (4x6 matrix):
     1  2  3  4  5  6
     7  8  9 10 11 12
    13 14 15 16 17 18
    19 20 21 22 23 24
Buffer in memory (1-dim array of length 14):
    -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1

Example: Reading Two Rows
File selection (second and third rows of the 4x6 matrix):
     1  2  3  4  5  6
     7  8  9 10 11 12
    13 14 15 16 17 18
    19 20 21 22 23 24
    offset = {1,0}
    count  = {2,6}
    block  = {1,1}
    stride = {1,1}
    filespace = H5Dget_space(dataset);
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, count, NULL);

Example: Reading Two Rows
Memory selection (12 elements of the 14-element buffer, starting at index 1):
    -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
    offset = {1}
    count  = {12}
    /* H5Screate_simple takes an array of dimension sizes */
    hsize_t dims[1] = {14};
    memspace = H5Screate_simple(1, dims, NULL);
    H5Sselect_hyperslab(memspace, H5S_SELECT_SET, offset, NULL, count, NULL);

Example: Reading Two Rows
    H5Dread(…, …, memspace, filespace, …, …);
Data in the file:
     1  2  3  4  5  6
     7  8  9 10 11 12
    13 14 15 16 17 18
    19 20 21 22 23 24
Buffer in memory after the read:
    -1 7 8 9 10 11 12 13 14 15 16 17 18 -1

Things to Remember
• Number of elements selected in a file and in a memory buffer should be the same
  • H5Sget_select_npoints returns number of selected elements in a hyperslab selection
• HDF5 partial I/O is tuned to move data between selections that have the same dimensionality; avoid choosing subsets that have different ranks (as in example above)
• Allocate a buffer of an appropriate size when reading data; use H5Tget_native_type and H5Tget_size to get the correct size of the data element in memory.

Things to Remember
• When selecting hyperslabs in a loop, close every dataspace handle obtained inside the loop before the next iteration to avoid application memory growth.
• In the case illustrated, only the offset parameter changes from one iteration to the next; block and stride parameters stay the same.

Example
    offset[0] = 0;
    offset[1] = 0;
    fspace_id = H5Dget_space(...);
    for (k = 0; k < DIM3; k++) { /* Start for loop */
        offset[2] = k;
        …
        /* H5Sselect_hyperslab returns a status, not a handle, so the
           selection is made on a copy that is closed in the same iteration */
        tmp_id = H5Scopy(fspace_id);
        H5Sselect_hyperslab(tmp_id, …, offset, …);
        H5Dwrite(dset_id, type_id, H5S_ALL, tmp_id, ..);
        H5Sclose(tmp_id);
        …
    } /* End for loop */

HDF5 Region References and Selections

Saving Selected Region in a File
Need to select and access the same elements of a dataset

Reference Datatype
• Reference to an HDF5 object
  • Pointer to a group or a dataset in a file
  • Predefined datatype H5T_STD_REF_OBJ to describe object references
• Reference to a dataset region (or to selection)
  • Pointer to the dataspace selection
  • Predefined datatype H5T_STD_REF_DSETREG to describe regions

Reference to Dataset Region
[Figure: file REF_REG.h5; the root group contains the objects labeled "Matrix" and "Object References"]
Matrix:
    1 1 2 3 3 4 5 5 6
    1 2 2 3 4 4 5 6 6

Reference to Dataset Region
Example
    dsetr_id = H5Dcreate(file_id, "REGION_REFERENCES", H5T_STD_REF_DSETREG, …);
    H5Sselect_hyperslab(space_id, H5S_SELECT_SET, start, NULL, …);
    H5Rcreate(&ref[0], file_id, "MATRIX", H5R_DATASET_REGION, space_id);
    H5Dwrite(dsetr_id, H5T_STD_REF_DSETREG, H5S_ALL, H5S_ALL, H5P_DEFAULT, ref);

Reference to Dataset Region
    HDF5 "REF_REG.h5" {
    GROUP "/" {
       DATASET "MATRIX" {
          ……
       }
       DATASET "REGION_REFERENCES" {
          DATATYPE H5T_REFERENCE
          DATASPACE SIMPLE { ( 2 ) / ( 2 ) }
          DATA {
             (0): DATASET /MATRIX {(0,3)-(1,5)},
             (1): DATASET /MATRIX {(0,0), (1,6), (0,8)}
          }
       }
    }
    }

Chunking in HDF5

HDF5 Chunking
• Dataset data is divided into equally sized blocks (chunks).
• Each chunk is stored separately as a contiguous block in HDF5 file.
[Figure: in application memory, the metadata cache holds the dataset header (datatype, dataspace, attributes, …) and the chunk index, and the dataset data is split into chunks A, B, C, D; in the file, the header and chunk index are stored along with the chunks, which appear in the order A, C, D, B]

HDF5 Chunking
• Chunking is needed for
  • Enabling compression and other filters
  • Extendible datasets

HDF5 Chunking
• If used appropriately chunking improves partial I/O for big datasets
[Figure: a selection for which only two chunks are involved in I/O]

HDF5 Chunking
• Chunk has the same rank as a dataset
• Chunk’s dimensions do not need to be factors of dataset’s dimensions

Creating Chunked Dataset
1. Create a dataset creation property list.
2. Set property list to use chunked storage layout.
3. Create dataset with the above property list.
    crp_id = H5Pcreate(H5P_DATASET_CREATE);
    rank = 2;
    ch_dims[0] = 100;
    ch_dims[1] = 100;
    H5Pset_chunk(crp_id, rank, ch_dims);
    dset_id = H5Dcreate(…, crp_id);
    H5Pclose(crp_id);

Writing or Reading Chunked Dataset
1. Chunking mechanism is transparent to application.
2. Use the same set of operations as for a contiguous dataset, for example,
    H5Dopen(…);
    H5Sselect_hyperslab(…);
    H5Dread(…);
3. Selections do not need to coincide precisely with the chunk boundaries.

HDF5 Filters
• HDF5 filters modify data during I/O operations
• Available filters:
  1. Checksum (H5Pset_fletcher32)
  2. Shuffling filter (H5Pset_shuffle)
  3. Data transformation (in 1.8.*)
  4. Compression
     • Scale + offset (in 1.8.*)
     • N-bit (in 1.8.*)
     • GZIP (deflate), SZIP (H5Pset_deflate, H5Pset_szip)
     • User-defined filters (BZIP2)
       • Example of a user-defined compression filter can be found at http://www.hdfgroup.uiuc.edu/papers/papers/bzip2/

Creating Compressed Dataset
1. Create a dataset creation property list
2. Set property list to use chunked storage layout
3. Set property list to use filters
4. Create dataset with the above property list
    crp_id = H5Pcreate(H5P_DATASET_CREATE);
    rank = 2;
    ch_dims[0] = 100;
    ch_dims[1] = 100;
    H5Pset_chunk(crp_id, rank, ch_dims);
    H5Pset_deflate(crp_id, 9);
    dset_id = H5Dcreate(…, crp_id);
    H5Pclose(crp_id);

Writing Compressed Dataset
[Figure: chunks A, B, C of a chunked dataset pass through the chunk cache (per dataset) and the filter pipeline before being written to the file]
• Default chunk cache size is 1MB.
• Filters including compression are applied when chunk is evicted from cache.
• Chunks in the file may have different sizes

Chunking Basics to Remember
• Chunking creates storage overhead in the file.
• Performance is affected by
  • Chunking and compression parameters
  • Chunking cache size (H5Pset_cache call)
• Some hints for getting better performance
  • Use chunk size not smaller than block size (4k) on a file system.
  • Use compression method appropriate for your data.
  • Avoid using selections that do not coincide with the chunking boundaries.

Example
Creates a compressed 1000x20 integer dataset in a file
    % h5dump -p -H zip.h5
    HDF5 "zip.h5" {
    GROUP "/" {
       GROUP "Data" {
          DATASET "Compressed_Data" {
             DATATYPE H5T_STD_I32BE
             DATASPACE SIMPLE { ( 1000, 20 )………
             STORAGE_LAYOUT {
                CHUNKED ( 20, 20 )
                SIZE 5316
             }

Example (continued)
             FILTERS {
                COMPRESSION DEFLATE { LEVEL 6 }
             }
             FILLVALUE {
                FILL_TIME H5D_FILL_TIME_IFSET
                VALUE 0
             }
             ALLOCATION_TIME {
                H5D_ALLOC_TIME_INCR
             }
          }
       }
    }
    }

Example (bigger chunk)
Creates a compressed 1000x20 integer dataset in a file; a better compression ratio is achieved.
    % h5dump -p -H zip.h5
    HDF5 "zip.h5" {
    GROUP "/" {
       GROUP "Data" {
          DATASET "Compressed_Data" {
             DATATYPE H5T_STD_I32BE
             DATASPACE SIMPLE { ( 1000, 20 )………
             STORAGE_LAYOUT {
                CHUNKED ( 200, 20 )
                SIZE 2936
             }

Part III: Performance Issues (How to Do it Right)

Performance of Serial I/O Operations
• Next slides show the performance effects of using different access patterns and storage layouts.
• We use three test cases which consist of writing a selection to an array of characters.
• Data is stored in a row-major order.
• Tests were executed on THG Linux x86_64 box using h5perf_serial and HDF5 version 1.8.0

Serial Benchmarking Tool
• Benchmarking tool, h5perf_serial, introduced in 1.8.1 release.
• Features include:
  • Support for POSIX and HDF5 I/O calls.
  • Support for datasets and buffers with multiple dimensions.
  • Entire dataset access using a single or several I/O operations.
  • Selection of contiguous and chunked storage for HDF5 operations.

Contiguous Storage (Case 1)
• Rectangular dataset of size 48K x 48K, with write selections of 512 x 48K.
• HDF5 storage layout is contiguous.
• Good I/O pattern for POSIX and HDF5 because each selection is contiguous.
• POSIX: 5.19 MB/s
• HDF5: 5.36 MB/s
[Figure: selections 1 to 4 are full-width bands of rows, stored one after another in the file]

Contiguous Storage (Case 2)
• Rectangular dataset of 48K x 48K, with write selections of 48K x 512.
• HDF5 storage layout is contiguous.
• Bad I/O pattern for POSIX and HDF5 because each selection is noncontiguous.
• POSIX: 1.24 MB/s
• HDF5: 0.05 MB/s
[Figure: selections 1 to 4 are full-height bands of columns, so their pieces are interleaved in the file as 1 2 3 4 1 2 3 4 …]

Chunked Storage
• Rectangular dataset of 48K x 48K, with write selections of 48K x 512.
• HDF5 storage layout is chunked. Chunks and selections sizes are equal.
• Bad I/O case for POSIX because selections are noncontiguous.
• Good I/O case for HDF5 since selections are contiguous due to chunking layout settings.
• POSIX: 1.51 MB/s
• HDF5: 5.58 MB/s
[Figure: for POSIX the pieces of selections 1 to 4 are interleaved in the file as 1 2 3 4 1 2 3 4 …; for HDF5 each selection is one chunk, stored as 1 2 3 4]

Conclusions
• Access patterns made up of many small I/O operations pay the latency and overhead costs many times over.
• Chunked storage may improve I/O performance by affecting the contiguity of the data selection.

Writing Chunked Dataset
• 1000x100x100 dataset
  • 4 byte integers
  • Random values 0-99
• 50x100x100 chunks (20 total)
  • Chunk size: 2 MB
• Write the entire dataset using 1x100x100 slices
  • Slices are written sequentially

Test Setup
• 20 Chunks
• 1000 slices
• Chunk size is 2MB

Test Setup (continued)
• Tests performed with 1 MB and 5MB chunk cache size
• Cache size set with H5Pset_cache function
    H5Pget_cache(fapl, NULL, &rdcc_nelmts, &rdcc_nbytes, &rdcc_w0);
    H5Pset_cache(fapl, 0, rdcc_nelmts, 5*1024*1024, rdcc_w0);
• Tests performed with no compression and with gzip (deflate) compression

Effect of Chunk Cache Size on Write
No compression
    Cache size       I/O operations   Total data written             File size
    1 MB (default)   1002             75.54 MB                       38.15 MB
    5 MB             22               38.16 MB                       38.15 MB
Gzip compression
    Cache size       I/O operations   Total data written             File size
    1 MB (default)   1982             335.42 MB (322.34 MB read)     13.08 MB
    5 MB             22               13.08 MB                       13.08 MB

Effect of Chunk Cache Size on Write
• With the 1 MB cache size, a chunk will not fit into the cache
  • All writes to the dataset must be immediately written to disk
  • With compression, the entire chunk must be read and rewritten every time a part of the chunk is written to
    • Data must also be decompressed and recompressed each time
    • Non sequential writes could result in a larger file
  • Without compression, the entire chunk must be written when it is first written to the file
    • If the selection were not contiguous on disk, it could require as much as 1 I/O operation for each element

Effect of Chunk Cache Size on Write
• With the 5 MB cache size, the chunk is written only after it is full
  • Drastically reduces the number of I/O operations
  • Reduces the amount of data that must be written (and read)
  • Reduces processing time, especially with the compression filter

Conclusion
• It is important to make sure that a chunk will fit into the raw data chunk cache
• If you will be writing to multiple chunks at once, you should increase the cache size even more
• Try to design chunk dimensions to minimize the number you will be writing to at once

Reading Chunked Dataset
• Read the same dataset, again by slices, but the slices cross through all the chunks
• 2 orientations for read plane
  • Plane includes fastest changing dimension
  • Plane does not include fastest changing dimension
• Measure total read operations, and total size read
• Chunk sizes of 50x100x100, and 10x100x100
• 1 MB cache

Test Setup
• Chunks
• Read slices
  • Vertical and horizontal

Results
• Read slice includes fastest changing dimension
    Chunk size   Compression   I/O operations   Total data read
    50           Yes           2010             1307 MB
    10           Yes           10012            1308 MB
    50           No            100010           38 MB
    10           No            10012            3814 MB

Results (continued)
• Read slice does not include fastest changing dimension
    Chunk size   Compression   I/O operations   Total data read
    50           Yes           2010             1307 MB
    10           Yes           10012            1308 MB
    50           No            10000010         38 MB
    10           No            10012            3814 MB

Effect of Cache Size on Read
• When compression is enabled, the library must always read each entire chunk once for each call to H5Dread.
• When compression is disabled, the library’s behavior depends on the cache size relative to the chunk size.
  • If the chunk fits in cache, the library reads each entire chunk once for each call to H5Dread
  • If the chunk does not fit in cache, the library reads only the data that is selected
    • More read operations, especially if the read plane does not include the fastest changing dimension
    • Less total data read

Conclusion
• In this case cache size does not matter when reading if compression is enabled.
• Without compression, a larger cache may not be beneficial, unless the cache is large enough to hold all of the chunks.
• The optimum cache size depends on the exact shape of the data, as well as the hardware.

Questions?

Acknowledgement
• This Tutorial is based upon work supported in part by a Cooperative Agreement with the National Aeronautics and Space Administration (NASA) under NASA Awards NNX06AC83A and NNX08AO77A. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Aeronautics and Space Administration.
