Hadoop 101 for 
bioinformaticians 
Attila Csordas 
Anybody used Hadoop before?
Aim 
1 hour:
theoretical introduction -> be able to assess whether your
problem fits MR/Hadoop
practical session -> be able to start developing code
Hadoop & me 
mostly non-coding bioinformatician at PRIDE
Cloudera Certified Hadoop Developer
~1300 hadoop jobs ~ 4000 MR jobs
3+ yrs of operator experience
Linear scalability 
Search time = f(job complexity)
spectra: 5000, 100,000, 200,000
6+ days on a 4-core local machine
Db build time = f(# of peptides)
# of proteins: 4 mill (quarter nr), 8 mill (half nr), 16 mill (nr)
Theoretical introduction 
• Big Data
• Data Operating System
• Hadoop 1.0
• MapReduce
• Hadoop 2.0
Big Data 
difficult to process on a single machine
Volume: low TB - low PB
Velocity: generation rate
Variety: tab separated, machine data, documents
biological repositories fit the bill: e.g. PRIDE
Data Operating System 
features -> components
distributed -> storage
scalable -> resource management
fault-tolerant -> processing engine 1
redundant -> processing engine 2
Hadoop 1.0 
storage: Hadoop Distributed File System, aka HDFS
redundant, reliable, distributed
processing engine / cluster resource management: MapReduce
distributed, fault-tolerant resource management, scheduling & processing engine
single use data platform
MapReduce 
programming model for large scale, distributed, fault 
tolerant, batch data processing 
execution framework, the “runtime” 
software implementation
Stateless algorithms 
output only depends on the current input but not on 
previous inputs 
stateful counterexample: Fibonacci series: F(k+2) = F(k+1) + F(k)
Parallelizable problems 
easy to separate to parallel tasks 
might need to maintain a global, shared state
Embarrassingly/pleasingly
parallel problems 
easy to separate to parallel tasks 
no dependency or communication between those tasks
Functional programming 
roots 
higher-order functions that accept other functions as arguments
map: applies its argument f() to all elements of a list
fold: takes g() + initial value -> final value
sum of squares: map(x^2), fold(+) with 0 as initial value
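The sum-of-squares example above can be sketched with Java streams, where `map` plays the role of map and `reduce` the role of fold. This is a minimal illustration, not code from the talk; the class and method names are made up for this sketch.

```java
import java.util.List;

class SumOfSquares {
    // map squares every element; reduce folds them with + starting from 0
    static int sumOfSquares(List<Integer> xs) {
        return xs.stream()
                 .map(x -> x * x)           // map(x^2)
                 .reduce(0, Integer::sum);  // fold(+), 0 as initial value
    }

    public static void main(String[] args) {
        // 1 + 4 + 9 + 16
        System.out.println(sumOfSquares(List.of(1, 2, 3, 4))); // prints 30
    }
}
```

Neither step keeps state between elements, which is what makes the map part easy to spread over many machines.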
MapReduce 
map(key, value) : [(key, value)]
shuffle&sort: values grouped by key
reduce(key, iterator<value>) : [(key, value)]
Word Count 
Map(String docid, String text):
  for each word w in text:
    Emit(w, 1);

Reduce(String term, Iterator<Int> values):
  int sum = 0;
  for each v in values:
    sum += v;
  Emit(term, sum);

source: Jimmy Lin and Chris Dyer:
Data-Intensive Text Processing with MapReduce
Morgan & Claypool Publishers, 2010.
Count amino acids in peptides 
Map(byte offset of the line, peptide sequence):
  for each amino acid in peptide:
    Emit(amino acid, 1);

Reduce(amino acid, Iterator<Int> values):
  int sum = 0;
  for each v in values:
    sum += v;
  Emit(amino acid, sum);
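The pseudocode above can be tried out without a Hadoop installation. The sketch below simulates the three phases (map, shuffle & sort, reduce) in plain Java on an in-memory list; it is not the AminoAcidCounter from the repo, and the class and method names are hypothetical. The byte offset is computed under the assumption of one peptide per line with a single newline character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class AminoAcidCountSketch {
    // map phase: emit one (amino acid, 1) pair per residue; the offset key is ignored
    static List<Map.Entry<Character, Integer>> map(long byteOffset, String peptide) {
        List<Map.Entry<Character, Integer>> out = new ArrayList<>();
        for (char aminoAcid : peptide.toCharArray()) {
            out.add(Map.entry(aminoAcid, 1));
        }
        return out;
    }

    // shuffle & sort: group values by key; TreeMap keeps the keys sorted
    static Map<Character, List<Integer>> shuffle(List<Map.Entry<Character, Integer>> pairs) {
        Map<Character, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<Character, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // reduce phase: sum all values seen for one amino acid
    static int reduce(char aminoAcid, Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // runs the three phases in sequence on an in-memory list of peptides
    static Map<Character, Integer> count(List<String> peptides) {
        List<Map.Entry<Character, Integer>> intermediates = new ArrayList<>();
        long offset = 0;
        for (String peptide : peptides) {
            intermediates.addAll(map(offset, peptide));
            offset += peptide.length() + 1; // assumed: one newline byte per line
        }
        Map<Character, Integer> result = new TreeMap<>();
        for (Map.Entry<Character, List<Integer>> e : shuffle(intermediates).entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(count(List.of("AGELTEDEVER", "DTNGSQFFITTVK", "IEVEKPFAIAKE")));
    }
}
```

On a cluster the map calls run on different machines and the framework does the grouping; here everything happens in one JVM only to make the data flow visible.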
[Diagram: Mappers -> intermediates -> Partitioners -> intermediates -> Reducers (combiners omitted here)]
Source: redrawn from a slide by Cloudera, cc-licensed
source: Jimmy Lin and Chris Dyer:
Data-Intensive Text Processing with MapReduce
Morgan & Claypool Publishers, 2010.
input: (k1, v1) (k2, v2) (k3, v3) (k4, v4) (k5, v5) (k6, v6)
map: (a, 1) (b, 2) | (c, 3) (c, 6) | (a, 5) (c, 2) | (b, 7) (c, 8)
combine: (a, 1) (b, 2) | (c, 9) | (a, 5) (c, 2) | (b, 7) (c, 8)
partition
Shuffle and Sort: aggregate values by keys: a -> [1, 5], b -> [2, 7], c -> [2, 9, 8]
reduce: (r1, s1) (r2, s2) (r3, s3)
source: Jimmy Lin and Chris Dyer:
Data-Intensive Text Processing with MapReduce
Morgan & Claypool Publishers, 2010.
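The effect of the combine step in the figure above can be reproduced with a small plain-Java simulation. It assumes that both the combiner and the reducer compute a sum (the figure leaves the reduce function abstract as r/s), and the class and method names are hypothetical. Each inner list stands for the output of one mapper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CombinerSketch {
    // the four mapper outputs shown in the figure
    static List<List<Map.Entry<String, Integer>>> diagramInput() {
        return List.of(
            List.of(Map.entry("a", 1), Map.entry("b", 2)),
            List.of(Map.entry("c", 3), Map.entry("c", 6)),
            List.of(Map.entry("a", 5), Map.entry("c", 2)),
            List.of(Map.entry("b", 7), Map.entry("c", 8)));
    }

    // combine: pre-aggregate values per key inside a single mapper's output
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapperOutput) {
        Map<String, Integer> local = new TreeMap<>();
        for (Map.Entry<String, Integer> p : mapperOutput) {
            local.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return local;
    }

    // shuffle and sort: collect the combined values of all mappers by key
    static Map<String, List<Integer>> shuffle(List<Map<String, Integer>> combined) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> partial : combined) {
            partial.forEach((k, v) -> grouped.computeIfAbsent(k, x -> new ArrayList<>()).add(v));
        }
        return grouped;
    }

    // full pipeline: combine per mapper, shuffle, then reduce by summing (assumed)
    static Map<String, Integer> run(List<List<Map.Entry<String, Integer>>> mapperOutputs) {
        List<Map<String, Integer>> combined = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> out : mapperOutputs) {
            combined.add(combine(out));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(combined).forEach(
            (k, vs) -> result.put(k, vs.stream().mapToInt(Integer::intValue).sum()));
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(diagramInput())); // prints {a=6, b=9, c=19}
    }
}
```

The second mapper sends one pair (c, 9) over the network instead of two, which is the whole point of a combiner; the final sums are the same with or without it because addition is associative and commutative.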
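input (byte offset, peptide): (0, AGELTEDEVER) (12, DTNGSQFFITTVK) (26, IEVEKPFAIAKE)
map: (A, 1) (G, 1) (E, 1) | (D, 1) (T, 1) (N, 1) | (I, 1) (E, 1) (G, 1)
Shuffle & sort: aggregate values by keys: A -> [1], D -> [1], E -> [1, 1], G -> [1, 1], I -> [1], N -> [1], T -> [1]
reduce: (E, 2) (G, 2) (N, 1)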
Hadoop 1.0: single use 
Pig: scripting for Hadoop (Yahoo!, 2006)
Hive: SQL queries for Hadoop (Facebook)
hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING); 
hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
Hadoop 2.0 
processing engines: MapReduce, Tez, HBase, Storm, Giraph, Spark…
batch, interactive, online, streaming, graph …
cluster resource management: YARN
distributed, fault-tolerant resource management, scheduling & processing engine
storage: HDFS2
redundant, reliable, distributed
multi use data platform
Practical session 
Objective: up & running 
w/ Hadoop in 30 mins
requirements 
• Java
• IntelliJ IDEA
• Maven
Outline 
• check out https://github.com/attilacsordas/hadoop_introduction
• elements of a MapReduce app, basic Hadoop API & 
data types 
• bioinformatics toy example: counting amino acids in 
sequences 
• executing a hadoop job locally 
• 2 more examples building on top of AminoAcidCounter
https://github.com/attilacsordas/hadoop_for_bioinformatics
MapReduce app 
components 
• driver 
• mapper 
• reducer 
repo: bioinformatics.hadoop.AminoAcidCounter.java
Inputformats 
specifies data passed to Mappers
specified in driver
determines input splits and RecordReaders extracting key-value pairs
default formats: TextInputFormat, KeyValueTextInputFormat, SequenceFileInputFormat …
Data types 
everything implements Writable: de/serialization
all keys are WritableComparable: sorting keys
box classes for primitive data types: Text, IntWritable, LongWritable …
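To make the serialization contract concrete, here is a sketch of a custom key type written against the plain JDK. `SketchWritable` is a hypothetical stand-in that mirrors the two methods of Hadoop's `Writable` interface (`write(DataOutput)` and `readFields(DataInput)`); in a real job the class would implement Hadoop's `WritableComparable` instead. The `ResiduePosition` type and the `roundTrip` helper are made up for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// hypothetical stand-in with the same two methods as Hadoop's Writable
interface SketchWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// hypothetical key: an amino acid and its position; Comparable so keys can be sorted
class ResiduePosition implements SketchWritable, Comparable<ResiduePosition> {
    char aminoAcid;
    int position;

    ResiduePosition() {} // empty instance to be filled by readFields

    ResiduePosition(char aminoAcid, int position) {
        this.aminoAcid = aminoAcid;
        this.position = position;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeChar(aminoAcid);
        out.writeInt(position);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        aminoAcid = in.readChar(); // fields are read in the order they were written
        position = in.readInt();
    }

    @Override
    public int compareTo(ResiduePosition other) {
        int byResidue = Character.compare(aminoAcid, other.aminoAcid);
        return byResidue != 0 ? byResidue : Integer.compare(position, other.position);
    }

    // serialize to bytes and read back into a fresh object
    static ResiduePosition roundTrip(ResiduePosition original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            ResiduePosition copy = new ResiduePosition();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ResiduePosition copy = roundTrip(new ResiduePosition('R', 2));
        System.out.println(copy.aminoAcid + "_" + copy.position); // prints R_2
    }
}
```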
bioinformatics toy example 
Our task is to prepare data for a statistical model of which
physicochemical properties of the constituent amino acid
residues affect the visibility of high-flyer peptides to
a mass spectrometer.
AGELTEDEVER
QSVHLVENEIQASIDQIFSHLER
DTNGSQFFITTVK
IEVEKPFAIAKE
repo: headpeptides.txt in input folder
run the code! 
local 
pseudodistributed 
cluster
Debugging 
do it locally if possible
print statements
write to map output
mapred.map.child.log.level=DEBUG
mapred.reduce.child.log.level=DEBUG
-> syslog task logfile
running debuggers remotely is hard
JVM debugging options
task profilers
Counters 
track records for statistics, malformed records 
AminoAcidCounterwithMalformedCounter
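The idea of tracking malformed records can be sketched in plain Java. This is not the AminoAcidCounterwithMalformedCounter class from the repo: the enum, the method names and the validity rule (a non-empty line made only of the 20 standard one-letter amino acid codes) are assumptions for illustration. In a Hadoop mapper the increment would go through the job's counter facility rather than a local map.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

class MalformedCounterSketch {
    // hypothetical counter names, one per record category
    enum RecordCounter { VALID, MALFORMED }

    // one-letter codes of the 20 standard amino acids
    static final String AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY";

    // assumed rule: non-empty and made of standard residues only
    static boolean isWellFormed(String peptide) {
        if (peptide.isEmpty()) {
            return false;
        }
        for (char c : peptide.toCharArray()) {
            if (AMINO_ACIDS.indexOf(c) < 0) {
                return false;
            }
        }
        return true;
    }

    // classify every input line and bump the matching counter
    static Map<RecordCounter, Integer> countRecords(List<String> lines) {
        Map<RecordCounter, Integer> counters = new EnumMap<>(RecordCounter.class);
        for (RecordCounter c : RecordCounter.values()) {
            counters.put(c, 0);
        }
        for (String line : lines) {
            RecordCounter c = isWellFormed(line.trim())
                    ? RecordCounter.VALID
                    : RecordCounter.MALFORMED;
            counters.merge(c, 1, Integer::sum);
        }
        return counters;
    }

    public static void main(String[] args) {
        System.out.println(countRecords(List.of("AGELTEDEVER", "AGX1", "")));
    }
}
```

A mapper would typically skip the malformed line after counting it, so one bad record does not fail the whole job.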
• Task 1: modify mapper to count positions too: R_2 -> AminoAcidPositionCounter
• Task 2: count positions & normalise to peptide length -> AminoAcidPositionCounterNormalizedToPeptideLength
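One possible starting point for Task 1, in plain Java rather than the Hadoop API: the only change to the counting logic is the key, which becomes the amino acid joined to its position. The sketch assumes 1-based positions, so that R_2 means an arginine at the second residue; the slides do not say which convention the repo uses. Class and method names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PositionKeySketch {
    // counts keys of the form <amino acid>_<position>, positions assumed 1-based
    static Map<String, Integer> countPositions(Iterable<String> peptides) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String peptide : peptides) {
            for (int i = 0; i < peptide.length(); i++) {
                String key = peptide.charAt(i) + "_" + (i + 1);
                counts.merge(key, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countPositions(List.of("AR", "GR"))); // prints {A_1=1, G_1=1, R_2=2}
    }
}
```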
Run jar on the cluster 
set up an account on a hadoop cluster
move input data into HDFS
set mainClass in pom.xml
mvn clean install
cp jar from target to servers
ssh into hadoop
run hadoop command line
homework, next steps 
count all the 3 outputs for 1 input k,v pair
count all amino acid pairs
run FastaAminoAcidCounter & check FastaInputFormat
run the jar on a cluster
https://github.com/lintool/MapReduce-course-2013s/tree/master/slides
n * (trial, error) -> success

 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 

Hadoop 101 for bioinformaticians

  • 1. Hadoop 101 for bioinformaticians Attila Csordas Anybody used Hadoop before?
  • 2. Aim (1 hour): theoretical introduction -> be able to assess whether your problem fits MR/Hadoop; practical session -> be able to start developing code
  • 3. Hadoop & me: mostly non-coding bioinformatician at PRIDE; Cloudera Certified Hadoop Developer; ~1300 Hadoop jobs ~ 4000 MR jobs; 3+ yrs of operator experience
  • 4. Linear scalability: Search time = f(job complexity); 6+ days on a 4 core local machine; Db build time = f(# of peptides); # of proteins: 4 mill (quarter nr), 8 mill (half nr), 16 mill (nr); 5000 spectra, 100,000 spectra, 200,000 spectra
  • 5. Theoretical introduction: Big Data; Data Operating System; Hadoop 1.0; MapReduce; Hadoop 2.0
  • 6. Big Data: difficult to process on a single machine; Volume: low TB - low PB; Velocity: generation rate; Variety: tab separated, machine data, documents; biological repositories fit the bill, e.g. PRIDE
  • 7. Data Operating System (features, components): distributed storage; scalable resource management; fault-tolerant processing engine 1; redundant processing engine 2
  • 8. Hadoop 1.0: storage: Hadoop Distributed File System, aka HDFS (redundant, reliable, distributed); processing engine / cluster resource management: MapReduce (distributed, fault-tolerant resource management, scheduling & processing engine); single use data platform
  • 9. MapReduce: programming model for large scale, distributed, fault tolerant, batch data processing; execution framework, the “runtime”; software implementation
  • 10. Stateless algorithms: output depends only on the current input, not on previous inputs; state-dependent counterexample: Fibonacci series, F(k+2)=F(k+1)+F(k)
  • 11. Parallelizable problems: easy to separate into parallel tasks; might need to maintain a global, shared state
  • 12. Embarrassingly/pleasingly parallel problems: easy to separate into parallel tasks; no dependency or communication between those tasks
  • 13. Functional programming roots: higher-order functions that accept other functions as arguments; map: applies its argument f() to all elements of a list; fold: takes g() + initial value -> final value; sum of squares: map(x^2), fold(+) with 0 as initial value
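The sum-of-squares example can be sketched with Java streams, where `map` plays the role of map and `reduce` plays the role of fold. This is only an illustration under that assumption; the class and method names (`SumOfSquares`, `sumOfSquares`) are hypothetical and do not come from the slides or the course repository.

```java
import java.util.List;

// Sum of squares expressed as map followed by fold (Java streams call fold "reduce").
// Hypothetical illustration; not part of the course repository.
class SumOfSquares {

    static int sumOfSquares(List<Integer> xs) {
        return xs.stream()
                 .map(x -> x * x)           // map: apply f(x) = x^2 to every element
                 .reduce(0, Integer::sum);  // fold: combine with g = +, initial value 0
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(List.of(1, 2, 3))); // 1 + 4 + 9 = 14
    }
}
```

Because the squaring function looks at one element at a time, the map step could run on any number of machines; only the fold has to bring values together.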
  • 14. MapReduce: map(key, value) : [(key, value)]; shuffle & sort: values grouped by key; reduce(key, iterator<value>) : [(key, value)]
  • 15. Word Count
        Map(String docid, String text):
          for each word w in text:
            Emit(w, 1);

        Reduce(String term, Iterator<Int> values):
          int sum = 0;
          for each v in values:
            sum += v;
          Emit(term, sum);

        source: Jimmy Lin and Chris Dyer: Data-Intensive Text Processing with MapReduce, Morgan & Claypool Publishers, 2010.
  • 16. Count amino acids in peptides
        Map(byte offset of the line, peptide sequence):
          for each amino acid in peptide:
            Emit(amino acid, 1);

        Reduce(amino acid, Iterator<Int> values):
          int sum = 0;
          for each v in values:
            sum += v;
          Emit(amino acid, sum);
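The map, shuffle & sort, and reduce steps of the amino acid count can be imitated in a single Java process, which may help before touching the Hadoop API. The sketch below uses no Hadoop classes at all; `AminoAcidCountSketch` and its method names are hypothetical and are not the `AminoAcidCounter` class from the repository.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process imitation of the three MapReduce phases; no Hadoop classes involved.
class AminoAcidCountSketch {

    // Map phase: emit (amino acid, 1) for every residue of one peptide.
    static List<Map.Entry<Character, Integer>> map(String peptide) {
        List<Map.Entry<Character, Integer>> emitted = new ArrayList<>();
        for (char aminoAcid : peptide.toCharArray()) {
            emitted.add(Map.entry(aminoAcid, 1));
        }
        return emitted;
    }

    // Shuffle & sort: group the emitted values by key, with keys in sorted order.
    static Map<Character, List<Integer>> shuffleAndSort(List<Map.Entry<Character, Integer>> pairs) {
        Map<Character, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<Character, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum all values that arrived for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: run the three phases in sequence over all input peptides.
    static Map<Character, Integer> count(List<String> peptides) {
        List<Map.Entry<Character, Integer>> intermediates = new ArrayList<>();
        for (String peptide : peptides) {
            intermediates.addAll(map(peptide));
        }
        Map<Character, Integer> totals = new TreeMap<>();
        shuffleAndSort(intermediates).forEach((aminoAcid, ones) -> totals.put(aminoAcid, reduce(ones)));
        return totals;
    }

    public static void main(String[] args) {
        System.out.println(count(List.of("AGELTEDEVER", "DTNGSQFFITTVK")));
    }
}
```

In a real job the framework performs the grouping between mappers and reducers; here it is written out only to make the data flow visible.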
  • 17. [Diagram: Mappers -> Intermediates -> Partitioners -> Intermediates -> Reducers (combiners omitted here)] Source: redrawn from a slide by Cloudera, cc-licensed; source: Jimmy Lin and Chris Dyer: Data-Intensive Text Processing with MapReduce, Morgan & Claypool Publishers, 2010.
  • 18. [Diagram] input: k1 v1, k2 v2, k3 v3, k4 v4, k5 v5, k6 v6 -> map (x4): (a 1, b 2), (c 3, c 6), (a 5, c 2), (b 7, c 8) -> combine (x4): (a 1, b 2), (c 9), (a 5, c 2), (b 7, c 8) -> partition (x4) -> Shuffle and Sort: aggregate values by keys: a [1 5], b [2 7], c [2 9 8] -> reduce (x3): r1 s1, r2 s2, r3 s3. source: Jimmy Lin and Chris Dyer: Data-Intensive Text Processing with MapReduce, Morgan & Claypool Publishers, 2010.
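The combine step in this diagram can be read as a local pre-aggregation of one map task's output before anything is sent over the network. A minimal sketch, assuming the combiner sums values per key (which matches (c 3), (c 6) becoming (c 9) in the diagram); `CombinerSketch` is a hypothetical name and the code does not use the Hadoop API.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Local aggregation of a single map task's output, as a combiner would do.
// Assumption: the combiner sums values per key.
class CombinerSketch {

    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            // Add the value to the running total for this key.
            combined.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        // Output of the second map task in the diagram: (c, 3), (c, 6)
        System.out.println(combine(List.of(Map.entry("c", 3), Map.entry("c", 6)))); // {c=9}
    }
}
```

A map task whose keys are all distinct, such as (a 1), (b 2), passes through unchanged, which is why only one of the four map outputs shrinks in the diagram.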
  • 19. [Diagram] input (byte offset, peptide): (0, AGELTEDEVER), (12, DTNGSQFFITTVK), (26, IEVEKPFAIAKE) -> map (x3): (A 1, G 1, E 1), (D 1, T 1, N 1), (I 1, E 1, G 1) -> Shuffle & sort: aggregate values by keys: A [1], D [1], E [1 1], G [1 1], I [1], N [1], T [1] -> reduce (x3): E 2, G 2, N 1
  • 20. Hadoop 1.0: single use. Pig: scripting for Hadoop (Yahoo!, 2006); Hive: SQL queries for Hadoop (Facebook)
        hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
        hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
  • 21. Hadoop 2.0: processing engines: MapReduce, Tez, HBase, Storm, Giraph, Spark… (batch, interactive, online, streaming, graph …); cluster resource management: YARN (distributed, fault-tolerant resource management, scheduling & processing engine); storage: HDFS2 (redundant, reliable, distributed); multi use data platform
  • 22. Practical session Objective: up & running w/ Hadoop in 30 mins
  • 23. requirements: Java, IntelliJ IDEA, Maven
  • 24. Outline: check out https://github.com/attilacsordas/hadoop_introduction; elements of a MapReduce app, basic Hadoop API & data types; bioinformatics toy example: counting amino acids in sequences; executing a Hadoop job locally; 2 more examples building on top of AminoAcidCounter
  • 26. MapReduce app components: driver, mapper, reducer; repo: bioinformatics.hadoop.AminoAcidCounter.java
  • 27. InputFormats: specifies data passed to Mappers; specified in driver; determines input splits and RecordReaders extracting key-value pairs; default formats: TextInputFormat, KeyValueTextInputFormat, SequenceFileInputFormat … , LongWritable …
  • 28. Data types: everything implements Writable (de/serialization); all keys are WritableComparable (sorting keys); box classes for primitive data types: Text, IntWritable, LongWritable …
  • 29. bioinformatics toy example: Our task is to prepare data for a statistical model of which physicochemical properties of the constituent amino acid residues affect the visibility of the high-flyer peptides to a mass spectrometer. Peptides: AGELTEDEVER, QSVHLVENEIQASIDQIFSHLER, DTNGSQFFITTVK, IEVEKPFAIAKE; repo: headpeptides.txt in input folder
  • 30. run the code: local, pseudodistributed, cluster
  • 31. Debugging: do it locally if possible; print statements; write to map output; mapred.map.child.log.level=DEBUG, mapred.reduce.child.log.level=DEBUG -> syslog task logfile; running debuggers remotely is hard; JVM debugging options; task profilers
  • 32. Counters: track records for statistics, malformed records; AminoAcidCounterwithMalformedCounter
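The idea behind counters can be sketched without Hadoop as a tally kept next to the main computation. In a Hadoop mapper the same bookkeeping is typically done with `context.getCounter(...).increment(1)`; the version below is plain Java so it runs anywhere. The enum, the class name `MalformedCounterSketch`, and the rule for what counts as malformed are all assumptions made for illustration; the repository's own class may define these differently.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for job counters that track valid and malformed records.
class MalformedCounterSketch {

    enum RecordCounter { VALID, MALFORMED }

    // Assumption: a record is malformed unless it is a non-empty run of uppercase letters A-Z.
    static Map<RecordCounter, Long> tally(List<String> records) {
        Map<RecordCounter, Long> counters = new EnumMap<>(RecordCounter.class);
        for (RecordCounter c : RecordCounter.values()) {
            counters.put(c, 0L); // start every counter at zero so it always shows up
        }
        for (String record : records) {
            RecordCounter which = record.matches("[A-Z]+")
                    ? RecordCounter.VALID
                    : RecordCounter.MALFORMED;
            counters.merge(which, 1L, Long::sum); // increment the chosen counter
        }
        return counters;
    }

    public static void main(String[] args) {
        System.out.println(tally(List.of("AGELTEDEVER", "12 34", "")));
    }
}
```

Counting malformed records instead of failing on them lets a long job finish while still reporting how much of the input was skipped.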
  • 33. Task 1: modify mapper to count positions too: R_2 -> AminoAcidPositionCounter; Task 2: count positions & normalise to peptide length -> AminoAcidPositionCounterNormalizedToPeptideLength
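One possible starting point for Task 1 is to change only the key the mapper emits, from the amino acid alone to the amino acid plus its position. The sketch below builds such keys in plain Java. It assumes 1-based positions and the `<amino acid>_<position>` key format suggested by the `R_2` example; the slides do not state the indexing convention, and `PositionKeySketch` is a hypothetical name, not the repository's `AminoAcidPositionCounter`.

```java
import java.util.ArrayList;
import java.util.List;

// Builds the keys a position-aware mapper could emit for one peptide.
class PositionKeySketch {

    // Assumption: positions are 1-based and keys look like "R_2".
    static List<String> positionKeys(String peptide) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < peptide.length(); i++) {
            keys.add(peptide.charAt(i) + "_" + (i + 1)); // e.g. 'A' at index 0 -> "A_1"
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(positionKeys("AGE")); // [A_1, G_2, E_3]
    }
}
```

With keys of this shape the reducer from the amino acid counter can stay as it is, since it only sums the values that arrive for each key.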
  • 34. Run jar on the cluster: set up an account on a Hadoop cluster; move input data into HDFS; set mainClass in pom.xml; mvn clean install; cp jar from target to servers; ssh into hadoop; run hadoop command line
  • 35. homework, next steps: count all the 3 outputs for 1 input k,v pair; count all amino acid pairs; run FastaAminoAcidCounter & check FastaInputFormat; run the jar on a cluster; https://github.com/lintool/MapReduce-course-2013s/tree/master/slides; n * (trial, error) -> success