ADAM

https://github.com/massie/adam
Matt Massie
University of California, Berkeley
massie@berkeley.edu

Saturday, November 2, 13
SAM

BAM

ADAM

Sequence Alignment Map (SAM)
Binary Alignment Map (BAM)
Avro Data Alignment Map (ADAM)

Pipeline Issues Today:
Time and Scale
• The time to go from reads to answers is
too long

• Processing thousands of BAM files for
statistical analysis doesn’t scale

ADAM:
Speed and Scale
• Read BAM once, perform transformations
(e.g. sort, mark duplicates, BQSR) in
distributed memory, write the analysis-ready ADAM file once

• Use a distributed filesystem (HDFS), a fast
execution system (Spark) and a columnar
data format (Parquet) to scale
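The "read once, transform in memory, write once" idea can be sketched in a few lines. This is a minimal single-machine illustration, not ADAM's implementation: the `Read` class, the stage functions and `run_pipeline` are hypothetical names, and the duplicate rule is a deliberately simplified stand-in for real duplicate marking.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Read:
    # Hypothetical, trimmed-down stand-in for an alignment record.
    reference_name: str
    start: int
    mapq: int
    negative_strand: bool = False
    duplicate: bool = False


def sort_reads(reads: List[Read]) -> List[Read]:
    """Order reads by reference name, then by alignment start."""
    return sorted(reads, key=lambda r: (r.reference_name, r.start))


def mark_duplicates(reads: List[Read]) -> List[Read]:
    """Toy rule (an assumption of this sketch): reads sharing reference,
    start and strand are duplicates; the one with the highest mapq
    (the first one on ties) stays unmarked."""
    def key(r: Read):
        return (r.reference_name, r.start, r.negative_strand)

    best = {}
    for r in reads:
        k = key(r)
        if k not in best or r.mapq > best[k].mapq:
            best[k] = r
    return [replace(r, duplicate=best[key(r)] is not r) for r in reads]


def run_pipeline(reads: Iterable[Read],
                 stages: List[Callable[[List[Read]], List[Read]]]) -> List[Read]:
    """Load the input once, apply every stage in memory, return one result."""
    data = list(reads)        # the single "read" step
    for stage in stages:
        data = stage(data)    # no intermediate files between stages
    return data               # the caller performs the single "write" step


example = [Read("chrom20", 200, 30), Read("chrom20", 100, 20), Read("chrom20", 100, 40)]
result = run_pipeline(example, [sort_reads, mark_duplicates])
```

The point of the design is that stages exchange in-memory collections rather than files, so adding a stage does not add another pass over the data on disk.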

Unlocking Genomic Data
[Diagram: Shark (SQL), Hadoop M/R, Spark and Impala (SQL) sit on top of ADAM files stored in the Hadoop Distributed File System (HDFS) or on a local filesystem; a BAM file is shown alongside.]
record ADAMRecord {
  union { null, string } referenceName = null;
  union { null, int } referenceId = null;
  union { null, long } start = null;
  union { null, int } mapq = null;
  union { null, string } readName = null;
  union { null, string } sequence = null;
  union { null, string } mateReference = null;
  union { null, long } mateAlignmentStart = null;
  union { null, string } cigar = null;
  union { null, string } qual = null;
  union { null, string } recordGroupId = null;

  union { boolean, null } readPaired = false;
  union { boolean, null } properPair = false;
  union { boolean, null } readMapped = false;
  union { boolean, null } mateMapped = false;
  union { boolean, null } readNegativeStrand = false;
  union { boolean, null } mateNegativeStrand = false;
  union { boolean, null } firstOfPair = false;
  union { boolean, null } secondOfPair = false;
  union { boolean, null } primaryAlignment = false;
  union { boolean, null } failedVendorQualityChecks = false;
  union { boolean, null } duplicateRead = false;

  union { null, string } mismatchingPositions = null;
  union { null, string } attributes = null;

  union { null, string } recordGroupSequencingCenter = null;
  union { null, string } recordGroupDescription = null;
  union { null, long } recordGroupRunDateEpoch = null;
  union { null, string } recordGroupFlowOrder = null;
  union { null, string } recordGroupKeySequence = null;
  union { null, string } recordGroupLibrary = null;
  union { null, int } recordGroupPredictedMedianInsertSize = null;
  union { null, string } recordGroupPlatform = null;
  union { null, string } recordGroupPlatformUnit = null;
  union { null, string } recordGroupSample = null;

  union { null, int } mateReferenceId = null;
}

http://avro.apache.org/
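The eleven boolean fields in the record line up with the bits of the SAM FLAG field. The slide does not say how ADAM fills them in, so the mapping below is an assumption of this sketch: the bit values come from the SAM specification, while the pairing with ADAMRecord field names, the inverted bits and the gating of mate-related fields on `readPaired` are guesses. `decode_flag` is a hypothetical helper, not part of ADAM.

```python
# Bit values of the SAM FLAG field, taken from the SAM specification.
# Pairing each bit with an ADAMRecord field name is an assumption of this sketch.
FLAG_BITS = {
    "readPaired": 0x1,
    "properPair": 0x2,
    "readMapped": 0x4,            # the SAM bit means "read unmapped"
    "mateMapped": 0x8,            # the SAM bit means "mate unmapped"
    "readNegativeStrand": 0x10,
    "mateNegativeStrand": 0x20,
    "firstOfPair": 0x40,
    "secondOfPair": 0x80,
    "primaryAlignment": 0x100,    # the SAM bit means "secondary alignment"
    "failedVendorQualityChecks": 0x200,
    "duplicateRead": 0x400,
}

# Fields whose meaning is the negation of the corresponding SAM bit.
INVERTED = {"readMapped", "mateMapped", "primaryAlignment"}

# Bits that the SAM specification only defines for paired reads.
PAIR_ONLY = {"properPair", "mateMapped", "mateNegativeStrand",
             "firstOfPair", "secondOfPair"}


def decode_flag(flag: int) -> dict:
    """Expand an integer SAM FLAG into named boolean fields."""
    paired = bool(flag & FLAG_BITS["readPaired"])
    fields = {}
    for name, bit in FLAG_BITS.items():
        value = bool(flag & bit)
        if name in INVERTED:
            value = not value
        if name in PAIR_ONLY and not paired:
            value = False     # undefined for unpaired reads; report False
        fields[name] = value
    return fields


# FLAG 99 = 0x1 + 0x2 + 0x20 + 0x40: paired, proper pair,
# mate on the negative strand, first read of the pair.
example = decode_flag(99)
```

Storing each flag as its own nullable column, rather than one packed integer, lets a columnar reader filter on a single flag (for example `duplicateRead`) without decoding the others.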

Parquet
http://parquet.io

Column-oriented layout
Row-oriented layout

https://blog.twitter.com/2013/dremel-made-simple-with-parquet
Genomic Data Example
chrom20    TCGA     4M
chrom20    GAAT     4M1D
chrom20    CCGAT    5M

Column Oriented
chrom20 chrom20 chrom20 TCGA GAAT CCGAT 4M 4M1D 5M

Row Oriented
chrom20 TCGA 4M chrom20 GAAT 4M1D chrom20 CCGAT 5M
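The two layouts of the three example reads can be reproduced with a short script. This is an illustrative sketch only: the function names are hypothetical, and the run-length encoding stands in for Parquet's real encodings, which are more elaborate. It shows why grouping a column's values together helps: the repeated `chrom20` values become one run.

```python
from itertools import groupby

# The three reads from the slide: reference name, sequence, CIGAR string.
reads = [
    ("chrom20", "TCGA", "4M"),
    ("chrom20", "GAAT", "4M1D"),
    ("chrom20", "CCGAT", "5M"),
]


def row_oriented(records):
    """Store each record's fields next to each other."""
    return [value for record in records for value in record]


def column_oriented(records):
    """Store all values of one field next to each other."""
    return [value for column in zip(*records) for value in column]


def run_length_encode(values):
    """Collapse consecutive equal values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


rows = row_oriented(reads)
columns = column_oriented(reads)
row_runs = run_length_encode(rows)        # nine runs of length 1
column_runs = run_length_encode(columns)  # starts with ("chrom20", 3)
```

A columnar layout also allows a query that needs only one field, such as the CIGAR strings, to skip the other columns entirely.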
http://spark.incubator.apache.org/

Low-Coverage BAM
Experiment
• 14 GB low-coverage BAM with 145M reads
• 10-node EC2 cluster (m2.4xlarge)
• Reduced to 13 GB with ADAM
• Conversion/upload to HDFS: 22 minutes
• Sorted in 7 minutes
High-Coverage BAM
Experiment
• Input: 237 GB NA12878 high-coverage,
PCR-free, whole-genome BAM

• Conversion took 4 hours on EC2 m2.4xlarge
(8 CPUs, 68.4 GB memory)

• Output size: 237 GB BAM reduced to
212 GB ADAM
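The figures from the two experiments can be turned into relative numbers with a little arithmetic. The inputs below are the rounded values from the slides, so the results are approximate; the function name is hypothetical.

```python
def size_reduction_percent(bam_gb: float, adam_gb: float) -> float:
    """Percentage by which the ADAM file is smaller than the BAM file."""
    return 100.0 * (bam_gb - adam_gb) / bam_gb


# Low-coverage experiment: 14 GB BAM reduced to 13 GB ADAM.
low_coverage = size_reduction_percent(14, 13)      # roughly 7.1 percent

# High-coverage experiment: 237 GB BAM reduced to 212 GB ADAM.
high_coverage = size_reduction_percent(237, 212)   # roughly 10.5 percent

# Low-coverage sort: 145M reads in 7 minutes across the 10-node cluster.
sort_reads_per_second = 145_000_000 / (7 * 60)     # roughly 345,000 reads/s
```

Under these numbers the ADAM files are about 7 to 11 percent smaller than the BAM inputs.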

Current Features
• Convert BAM to ADAM (read-oriented)
• Sort an ADAM file by reference
• Generate ADAMPileups
• Print mpileup output
• Very soon ADAM will be able to mark duplicates
(initial benchmarks look good)
In progress...
• Frank is working on a distributed variant caller
(https://github.com/fnothaft/avocado), local realignment, adam2bam

• Chris Hartl is integrating ADAM with GATK
(https://github.com/chartl/GAParquet) DiagnoseTargets, adding new
VCF formats to ADAM, BQSR

• Christos Kozanitis has been working on Shark and Impala
integration for ad-hoc SQL read queries

• Collaborations with Mt. Sinai, GenomeBridge and the Broad
Institute, who are interested in using ADAM
