Adam bosc-071114


fnothaft


What is in ADAM/BDG?
• ADAM: Core API + CLIs

• bdg-formats: Data schemas

• RNAdam: RNA analysis on ADAM

• avocado: Distributed local assembler

• Guacamole: Distributed somatic caller

• xASSEMBLEx: GraphX-based de novo assembler

• bdg-services: ADAM clusters

Design Goals
• Develop processing pipeline that enables
efficient, scalable use of cluster/cloud

• Provide data format that has efficient
parallel/distributed access across platforms

• Enhance semantics of data and allow more
flexible data access patterns

Implementation Overview
• 27K lines of Scala code

• 100% Apache-licensed open-source

• 21 contributors from 8 institutions

• Working towards a production-quality release in late 2014

ADAM Stack
Physical
‣ Commodity Hardware

‣ Cloud Systems - Amazon, GCE, Azure

File/Block
‣ Hadoop Distributed Filesystem

‣ Local Filesystem

Record/Split
‣ Schema-driven records w/ Apache Avro

‣ Store and retrieve records using Parquet

‣ Read BAM Files using Hadoop-BAM

In-Memory RDD
‣ Transform records using Apache Spark

‣ Query with SQL using Shark

‣ Graph processing with GraphX

‣ Machine learning using MLBase

Design Principles
• Abstract as much as possible: schema-oriented design makes format easy to evolve

• Provide rich and scalable APIs for manipulating and transforming genomic data and regions

• Don’t lock data in: play nicely with other tools

Parquet
• OSS created by Twitter and Cloudera, based on Google Dremel, just entered Apache Incubator

• Columnar File Format:

• Limits I/O to only data that is needed

• Compresses very well - ADAM files are 5-25% smaller than BAM files without loss of data

• Fast scans - load only columns you need, e.g. scan a read flag on a whole genome, high-coverage file in less than a minute
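
To illustrate why a columnar layout makes a read-flag scan cheap, here is a minimal pure-Python sketch. It does not use Parquet or ADAM; the record fields (`name`, `flag`, `sequence`, `qualities`) and the helper names are hypothetical stand-ins, not the bdg-formats schema. The only fact borrowed from outside the slides is the SAM convention that flag bit 0x400 marks a duplicate read.

```python
from typing import Dict, List

# Hypothetical read records (illustrative fields, not the real ADAM schema).
ROWS = [
    {"name": "read1", "flag": 99, "sequence": "ACGT", "qualities": "IIII"},
    {"name": "read2", "flag": 1171, "sequence": "TTGA", "qualities": "II#I"},
    {"name": "read3", "flag": 147, "sequence": "GGCA", "qualities": "IIII"},
]

# SAM flag bit that marks a PCR/optical duplicate.
DUPLICATE_BIT = 0x400


def to_columns(rows: List[dict]) -> Dict[str, list]:
    """Pivot row-oriented records into one list per field (columnar layout)."""
    columns: Dict[str, list] = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns


def count_duplicates(flag_column: List[int]) -> int:
    """Scan only the flag column; sequences and qualities are never touched."""
    return sum(1 for flag in flag_column if flag & DUPLICATE_BIT)


columns = to_columns(ROWS)
n_duplicates = count_duplicates(columns["flag"])
```

In a real columnar file each column is stored contiguously on disk, so a scan like `count_duplicates` reads only the bytes of the flag column. The in-memory lists here only mimic that access pattern.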

Scaling Genomics: BQSR
• Broadcast 3 GB table of
variants, used for masking

• Break reads down to
bases and map bases to
covariates

• Calculate empirical values
per covariate

• Broadcast observation,
apply across reads
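
The four steps above can be sketched on a single machine as follows. This is a simplified illustration under stated assumptions, not ADAM's implementation: the covariate is reduced to a (reported quality, cycle) pair, reads are assumed to align to the reference without indels, the empirical quality uses a simple +1/+2 smoothing that I chose for the example, and the names `recalibrate` and `phred` are hypothetical. In the distributed setting described on the slide, `known_variants` and the resulting table would be broadcast to the workers; here they are ordinary Python objects.

```python
import math
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def phred(error_probability: float) -> int:
    """Convert an error probability to a rounded Phred-scaled quality."""
    return round(-10 * math.log10(error_probability))


def recalibrate(reads: List[dict], reference: str,
                known_variants: Set[int]) -> List[List[int]]:
    """Return recalibrated quality lists, one per read.

    Each read is a dict with "start" (0-based reference position),
    "bases" (str), and "quals" (list of int); this layout is an assumption.
    """
    # Steps 1-2: break reads into bases, skip masked sites, and count
    # observations and mismatches per covariate (reported quality, cycle).
    table: Dict[Tuple[int, int], List[int]] = defaultdict(lambda: [0, 0])
    for read in reads:
        for cycle, (base, qual) in enumerate(zip(read["bases"], read["quals"])):
            position = read["start"] + cycle
            if position in known_variants:
                continue  # known variant site: not evidence of sequencing error
            entry = table[(qual, cycle)]
            entry[0] += 1
            if base != reference[position]:
                entry[1] += 1

    # Step 3: empirical quality per covariate, smoothed so that zero
    # mismatches does not produce an infinite quality.
    empirical = {
        covariate: phred((mismatches + 1) / (observations + 2))
        for covariate, (observations, mismatches) in table.items()
    }

    # Step 4: apply the table across all reads; covariates that were never
    # observed (for example, only seen at masked sites) keep their quality.
    return [
        [empirical.get((qual, cycle), qual)
         for cycle, qual in enumerate(read["quals"])]
        for read in reads
    ]
```

For example, with reference `"ACGTACGT"`, two reads starting at position 0 with bases `"ACGT"` and `"ACTT"`, all reported qualities 30, and position 3 masked as a known variant, the mismatch at cycle 2 lowers the quality there more than at cycles 0 and 1, while the masked cycle 3 keeps its reported value.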

Performance/Accuracy
• Fully concordant with Picard for MarkDup, >99% concordant with GATK for BQSR

[Figure: plot with axes labeled ADAM and GATK, each ranging from 0 to 50]

[Figure: bar chart of runtime in hours (0 to 24) for Sort, Mark Duplicates, and BQSR; legend: Picard, ADAM 100 EC2 Nodes]

Future Work
• Pushing hard towards production release

• Building out a complete analysis pipeline

• Plan to release Python bindings

• Work on interoperability with the Global Alliance for Genomics and Health API (http://genomicsandhealth.org/)

Call for contributions
• As an open source project, we welcome
contributions

• We maintain a list of open enhancements at our GitHub issue trackers

• GitHub: https://www.github.com/bdgenomics

• UC Berkeley is looking to hire two full-time engineers to support this work

Acknowledgements
• UC Berkeley: Matt Massie, André Schumacher, Jey Kottalam, Christos Kozanitis

• Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Michael Linderman, Jeff Hammerbacher

• GenomeBridge: Timothy Danford, Carl Yeksigian

• The Broad Institute: Chris Hartl

• Cloudera: Uri Laserson

• Microsoft Research: Jeremy Elson, Ravi Pandya

• Michael Heuer

• And other open source contributors!

Acknowledgements
This research is supported in part by NSF CISE Expeditions Award CCF-1139158, LBNL Award 7076018, and DARPA XData Award FA8750-12-2-0331, and gifts from Amazon Web Services, Google, SAP, The Thomas and Stacey Siebel Foundation, Apple, Inc., C3Energy, Cisco, Cloudera, EMC, Ericsson, Facebook, GameOnTalis, Guavus, HP, Huawei, Intel, Microsoft, NetApp, Pivotal, Splunk, Virdata, VMware, WANdisco and Yahoo!.


ADAM: Fast, Scalable Genome Analysis
Frank Austin Nothaft
AMPLab, University of California, Berkeley, @fnothaft
with: Matt Massie, André Schumacher, Timothy Danford, Carl Yeksigian, Chris Hartl, Jey Kottalam, Arun Ahuja, Neal Sidhwaney, Michael Linderman, Jeff Hammerbacher, Anthony Joseph, and Dave Patterson
https://github.com/bigdatagenomics
http://www.bdgenomics.org
