GA4GH meeting at the Sanger Institute


Published on March 5, 2014

Author: mattmassie


ADAM: Fast, Scalable Genome Analysis
Matt Massie
University of California, Berkeley
Twitter: @matt_massie

Design
• Create a platform with an easy programming environment for developers
• Provide both single and multi-sample methods that are fast and scalable for whole genome, high-coverage data
• Allow for multiple views of the same data, e.g. SQL/Table, Graph Analysis, Iterator on Records, Resilient Distributed Datasets
• Leverage existing open-source systems and plug into current “Big Data” ecosystems
• Deployable on an in-house cluster or any cloud vendor: Amazon EC2, Google Compute Engine or Microsoft Azure
• Everything is a file - bulk data transfer only requires standard tools like rsync, scp, distcp, S3sync, etc.

Implementation
• Accelerated work began September 2013
• Built using Apache Spark execution engine and Apache Avro and Parquet for file formats
• 20K lines of Scala code
• Nine contributors from Mt. Sinai, GenomeBridge, The Broad Institute and others
• Apache-licensed open-source
[Chart: Commits]

Features
[Diagram: Read Pre-Processing - Raw Reads, Mapping, Sorted Mapping, Local Alignment, Mark Duplicates, Base Quality Score Recalibration, Calling-Ready Reads]
• ADAM
  • Read pre-processing: sort, mark dups, BQSR
  • Read comparison across multiple covariates
  • Converters between legacy and ADAM formats
• Avocado - A distributed variant caller
  • SNP caller
  • Local assembler
  • Fully configurable pipeline via a config file
  • Support for integrating aligners in M/R frameworks

Avro
• Serialization system similar to Google Protobuf and Apache Thrift
• Data formats are fully described with a schema
• Datafile format is self-descriptive and record-oriented
• Bindings for Java, C, C++, C#, JavaScript, Python, Ruby, PHP and Perl (R in the works)
• Provides schema evolution, resolution and projection
• Numerous conversion utilities to print Avro as JSON, extract schema from JAXB, turn XSD/XML to Avro


Parquet
• Based on Google Dremel design
• Columnar File Format
• Created by Twitter and Cloudera with contributions from dozens of open-source developers
• Limits I/O to only data that is needed
• Fast scans - load only columns you need, e.g. scan a read flag on a whole genome, high-coverage file in less than a minute
• Compresses very well - ADAM files are 5-25% smaller than BAM files without loss of data
• Integrates easily with Avro, Hadoop, Hive, Shark, Impala, Pig, Jackson/JSON, Scrooge and others

Read Data Example

Records (Projection selects columns, Predicate selects rows):
chrom20   TCGA    4M
chrom20   GAAT    4M1D
chrom20   CCGAT   5M

Row Oriented:
chrom20 TCGA 4M | chrom20 GAAT 4M1D | chrom20 CCGAT 5M

Column Oriented:
chrom20 chrom20 chrom20 | TCGA GAAT CCGAT | 4M 4M1D 5M
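The difference between the two layouts can be sketched with plain Scala collections. This is only an illustration of the idea, not Parquet's or ADAM's actual API: the `Read` and `Columns` types and their field names are assumptions made for the example, and the three records are the ones shown on the slide.

```scala
object LayoutSketch {
  // Hypothetical record type; the field names are assumptions, not ADAM's schema.
  final case class Read(contig: String, sequence: String, cigar: String)

  // Column-oriented layout: one contiguous vector per field.
  final case class Columns(
      contig: Vector[String],
      sequence: Vector[String],
      cigar: Vector[String])

  // Row-oriented layout: the three reads from the slide, one record after another.
  val rows: Vector[Read] = Vector(
    Read("chrom20", "TCGA", "4M"),
    Read("chrom20", "GAAT", "4M1D"),
    Read("chrom20", "CCGAT", "5M")
  )

  // Pivot the rows into per-field columns.
  def toColumns(rs: Vector[Read]): Columns =
    Columns(rs.map(_.contig), rs.map(_.sequence), rs.map(_.cigar))

  // Projection: only the cigar column is touched; the other columns are never read.
  def projectCigar(cols: Columns): Vector[String] = cols.cigar

  // Predicate: indices of rows whose CIGAR string contains a deletion,
  // again evaluated by scanning a single column.
  def rowsWithDeletion(cols: Columns): Vector[Int] =
    cols.cigar.zipWithIndex.collect { case (c, i) if c.contains("D") => i }

  def main(args: Array[String]): Unit = {
    val cols = toColumns(rows)
    println(projectCigar(cols).mkString(" "))
    println(rowsWithDeletion(cols).map(i => rows(i)).mkString(", "))
  }
}
```

In the row layout, answering either question means walking every full record; in the column layout only the `cigar` vector is scanned, which is the property the "load only columns you need" bullet relies on.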

Apache Spark
• Grew out of Berkeley AMPLab research - now a top-level Apache project, commercially-supported
• Ease of Use - Spark offers over 80 high-level operators that make it easy to build parallel apps using Scala, Java, Python or R
• Easy to test code in “local” mode
• Speed - Spark has an advanced DAG execution engine that is 10-100x faster than Hadoop M/R
• Runs well on in-house clusters, Amazon EC2 and Google Compute Engine
• Can use it interactively for ad-hoc analysis from the Scala, Python and R shells or using iPython notebook

Performance as Proof
[Bar chart: run time in hours (axis 0 to 24) for Sort, Mark Duplicates and BQSR, comparing Picard, ADAM Single Node and ADAM 100 EC2 Nodes. Bar labels as extracted: 20.37, 17.73, 8.93, 0.33, 0.47, 0.75]
Dataset: 1000g NA12878 Whole Genome, 60x Coverage
For comparison, Bina Technologies quotes 0.94 hours for BQSR at only 37x coverage

Summary
• Schema-driven design allows developers to think at the logical layer
• Well-designed execution systems allow developers to focus on science and algorithms instead of implementation details
• Modern data formats enable distributed, fast computation and easier integration
• Moving computation to the data reduces transfers and improves performance

Thank you

Extra slides

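Rank variants by read depth and print the top 100

    // Pair variants with reads via partitionAndJoin
    val join: RDD[(ADAMVariant, ADAMRecord)] = partitionAndJoin(sc, dict, variants, reads)
    // Count the reads paired with each variant
    val readCounts = join.map(p => (p._1, 1)).reduceByKey(_ + _)
    // Key by count and sort descending so the deepest variants come first
    val sorted = readCounts.map(p => (p._2, p._1)).sortByKey(ascending = false)
    val top100 = sorted.take(100)
    top100.foreach {
      case (count, variant) => println("%d\t%s".format(count, variant.getId))
    }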
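The same count, sort and take pattern can be tried without a cluster. The sketch below uses only the Scala standard library in place of Spark's RDD operations; the variant identifiers and read names are made up for illustration, and `topByDepth` is a hypothetical helper, not part of ADAM.

```scala
object TopVariantsSketch {
  // Each pair is (variantId, readName): one entry per read paired with a variant.
  // Returns up to n (depth, variantId) pairs, deepest first, ties broken by id.
  def topByDepth(pairs: Seq[(String, String)], n: Int): Seq[(Int, String)] =
    pairs
      .groupBy(_._1)                                // variantId -> its pairs
      .toSeq                                        // leave Map before re-keying by depth
      .map { case (id, hits) => (hits.size, id) }   // (depth, variantId)
      .sortBy { case (depth, id) => (-depth, id) }
      .take(n)

  // Made-up sample data: rsA is covered by three reads, rsB by two, rsC by one.
  val sample: Seq[(String, String)] = Seq(
    "rsA" -> "read1", "rsB" -> "read1", "rsA" -> "read2",
    "rsC" -> "read3", "rsA" -> "read4", "rsB" -> "read5"
  )

  def main(args: Array[String]): Unit =
    topByDepth(sample, 2).foreach {
      case (depth, id) => println("%d\t%s".format(depth, id))
    }
}
```

Converting to a sequence before re-keying matters: mapping a `Map` to `(depth, id)` pairs directly would build a new map keyed by depth and silently drop variants that share the same depth.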

Flagstat

    $ time adam flagstat NA12878.mapped.ILLUMINA.bwa.CEU.high_coverage_pcr_free.20130906.adam
    757704193 + 0 in total (QC-passed reads + QC-failed reads)
    8158052 + 0 primary duplicates
    7594332 + 0 primary duplicates - both read and mate mapped
    563720 + 0 primary duplicates - only read mapped
    10344 + 0 primary duplicates - cross chromosome
    10227903 + 0 secondary duplicates
    10142158 + 0 secondary duplicates - both read and mate mapped
    85745 + 0 secondary duplicates - only read mapped
    4026853 + 0 secondary duplicates - cross chromosome
    750027254 + 0 mapped (98.99%:0.00%)
    757704193 + 0 paired in sequencing
    377464374 + 0 read1
    380239819 + 0 read2
    724651663 + 0 properly paired (95.64%:0.00%)
    745340038 + 0 with itself and mate mapped
    4687216 + 0 singletons (0.62%:0.00%)
    11135947 + 0 with mate mapped to a different chr
    5557972 + 0 with mate mapped to a different chr (mapQ>=5)

    real    1m58.688s
    user    25m52.453s
    sys     0m43.879s

Would take 40 minutes just to read from a single disk (assuming 100 MB/s)
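The percentages in the output and the single-disk estimate can be checked with a few lines of arithmetic. The sketch below copies the counts from the flagstat output; the helper names are made up for the example. The slide does not state the file size, so the last function only derives the data volume that a 40-minute read at 100 MB/s would imply (decimal units assumed).

```scala
object FlagstatArithmetic {
  // Counts copied from the flagstat output above.
  val total: Long = 757704193L
  val mapped: Long = 750027254L
  val properlyPaired: Long = 724651663L
  val singletons: Long = 4687216L

  // Share of `whole` as a percentage with two decimals (locale-independent).
  def percent(part: Long, whole: Long): String =
    "%.2f%%".formatLocal(java.util.Locale.ROOT, 100.0 * part / whole)

  // Data volume in GB implied by reading for `minutes` at `megabytesPerSecond`.
  def impliedGigabytes(minutes: Double, megabytesPerSecond: Double): Double =
    minutes * 60 * megabytesPerSecond / 1000

  def main(args: Array[String]): Unit = {
    println("mapped:          " + percent(mapped, total))
    println("properly paired: " + percent(properlyPaired, total))
    println("singletons:      " + percent(singletons, total))
    println("implied volume:  " + impliedGigabytes(40, 100) + " GB")
  }
}
```

Under these assumptions the estimate corresponds to roughly 240 GB read sequentially, against a measured wall-clock time of just under two minutes for the distributed run.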

Concordance between ADAM and GATK BQSR
RMSE: 1.48
Exact Matches: 50.06%
[Scatter plot: ADAM (vertical axis, 0 to 50) against GATK (horizontal axis, 0 to 50)]

Hadoop Distributed File System (HDFS)
• Based on GoogleFS
• Single namespace across entire cluster
• Uses commodity hardware - JBOD
• Files are broken into blocks (e.g. 128MB)
• Blocks replicated for durability and performance
• Write-once, read-many access pattern
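As a rough illustration of the block and replication bullets, the sketch below computes how many blocks a file occupies and how much raw storage its replicas use. The 1000 MiB file size is hypothetical, the 128 MiB block size follows the example on the slide, and a replication factor of 3 is assumed here; both block size and replication are configurable per cluster.

```scala
object HdfsBlockSketch {
  val MiB: Long = 1024L * 1024L

  // Number of blocks for a file: ceiling division by the block size.
  // The last block may be only partially filled.
  def blockCount(fileBytes: Long, blockBytes: Long): Long =
    (fileBytes + blockBytes - 1) / blockBytes

  // Raw bytes stored across the cluster when every block is kept `replication` times.
  def rawBytes(fileBytes: Long, replication: Int): Long =
    fileBytes * replication

  def main(args: Array[String]): Unit = {
    val fileBytes = 1000L * MiB   // hypothetical file size
    val blockBytes = 128L * MiB   // block size from the slide's example
    val replication = 3           // assumed replication factor

    println("blocks:      " + blockCount(fileBytes, blockBytes))
    println("raw storage: " + rawBytes(fileBytes, replication) / MiB + " MiB")
  }
}
```

Because each block is an independent unit with several copies, separate workers can read different blocks of the same file in parallel from local disks, which is what makes the "move computation to the data" approach practical.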

    $ adam
    [ASCII-art ADAM banner]

    Choose one of the following commands:

            transform : Convert SAM/BAM to ADAM format and optionally perform read pre-processing transformations
           print_tags : Prints the values and counts of all tags in a set of records
             flagstat : Print statistics on reads in an ADAM file (similar to samtools flagstat)
            reads2ref : Convert an ADAM read-oriented file to an ADAM reference-oriented file
              mpileup : Output the samtool mpileup text from ADAM reference-oriented data
                print : Print an ADAM formatted file
    aggregate_pileups : Aggregate pileups in an ADAM reference-oriented file
             listdict : Print the contents of an ADAM sequence dictionary
              compare : Compare two ADAM files based on read name
     compute_variants : Compute variant data from genotypes
             bam2adam : Single-node BAM to ADAM converter (Note: the 'transform' command can take SAM or BAM as input)
             adam2vcf : Convert an ADAM variant to the VCF ADAM format
             vcf2adam : Convert a VCF file to the corresponding ADAM format
            findreads : Find reads that match particular individual or comparative criteria
           fasta2adam : Converts a text FASTA sequence file into an ADAMNucleotideContig Parquet file which represents assembled sequences.
               plugin : Executes an AdamPlugin
