HDF5 in Bioinformatics


Published on February 18, 2014

Author: HDFEOS



DNA sequencing workflows can be very complex and face a number of data management challenges. Typical workflows are characterized by diverse formats, highly redundant data, multiple levels of information, complex associations, repeated file processing, non-scalable storage, and a lack of persistence. Recent work has investigated the use of HDF5 to manage such data.

Two strengths of HDF5 in particular are exploited in these studies: the ability of HDF5 to store and access very large arrays efficiently, and the ability of HDF5 to serve as a container for heterogeneous data. A possible data model was developed for describing the objects involved in a genome experiment, and some experiments were conducted to investigate the use of HDF5 for three applications. One is the use of HDF5 as a project file containing all data involved in a genome experiment. The second is for storing very large tables of haplotype data. The third is for creating, storing and accessing a very large "linkage disequilibrium" matrix.
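The two strengths named above can be illustrated with a minimal sketch, assuming the h5py and NumPy Python packages. The file layout, the group and dataset names (`reads/bases`, `haplotypes/alleles`), the toy dimensions, and the helper functions are all hypothetical and are not taken from the studies described here. The sketch writes one file holding both variable-length sequence strings and a chunked, compressed haplotype table, then reads back only a slice of that table.

```python
import os
import tempfile

import h5py
import numpy as np


def write_project_file(path, n_samples=8, n_snps=1000):
    """Write a toy 'project file' (hypothetical layout, illustrative only)."""
    rng = np.random.default_rng(0)
    with h5py.File(path, "w") as f:
        f.attrs["description"] = "toy genome project"

        # Heterogeneous content, part 1: variable-length strings for reads.
        reads = f.create_group("reads")
        bases = np.array(["caacaagcc", "aaaactcgtacaa"], dtype=object)
        reads.create_dataset("bases", data=bases, dtype=h5py.string_dtype())

        # Heterogeneous content, part 2: a 2-D table of 0/1 alleles,
        # stored in compressed chunks of up to 100 SNP columns each.
        hap = rng.integers(0, 2, size=(n_samples, n_snps), dtype=np.uint8)
        f.create_dataset(
            "haplotypes/alleles",
            data=hap,
            chunks=(n_samples, min(100, n_snps)),
            compression="gzip",
        )
    return hap


def read_snp_block(path, start, stop):
    """Read only the SNP columns in [start, stop) from the file."""
    with h5py.File(path, "r") as f:
        return f["haplotypes/alleles"][:, start:stop]


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "project.h5")
    write_project_file(path)
    print(read_snp_block(path, 200, 250).shape)
```

Because the table is chunked, slicing a range of columns only needs the chunks that overlap that range, which is the property that makes very large tables and matrices practical to access piecewise.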

Bioinformatics: Managing genomic data

DNA sequencing workflows
• Diverse formats
• Redundant data
• Repeated file processing
• In-core processing models
• Lack of persistence

Multiple Levels of Information
Figure labels: SNP Score Contig Summaries Discrepancies Contig Qualities Coverage Depth Trace Reads Aligned bases Read quality Contig Percent match

HDF5 as format for bioinformatics

