SeqPig
A simple and scalable scripting language for
large sequencing data sets in Hadoop
arian pasquali
june 6, 2014
/me
Arian Pasquali
Master's student in Data Mining
Data engineer at Semasio
background
- engineering
- cloud computing
- data mining on big data
- social networks
case study
SeqPig: simple and scalable scripting for large
sequencing data sets in Hadoop.
Schumacher A, Pireddu L, Niemenmaa M, Kallio A, Korpelainen E,
Zanetti G, Heljanko K.
Bioinformatics. 2014 Jan 1;30(1):119-20.
doi: 10.1093/bioinformatics/btt601. Epub 2013 Oct 22.
http://www.ncbi.nlm.nih.gov/pubmed/24149054
but first, some background
● Real world bioinformatics datasets are huge
● Gigabytes/Petabytes are hard to handle on a
single computer
● in order to handle big data sets we have to
master parallel programming models
Parallel programming models
some high-performance programming models
- Serial (doesn’t scale)
- MPI (expensive)
- MapReduce on Hadoop (cheap and scalable)
hadoop
Hadoop is an open source implementation of the MapReduce
model that enables you to run MapReduce programs.
It is designed to process huge volumes of data, terabytes
or petabytes, which fits many bioinformatics scenarios well.
http://hadoop.apache.org/
how mapreduce works on hadoop
Hadoop provides a framework for MapReduce, a fault-tolerant
parallel programming model
- easier to write programs than in other paradigms
- easier means cheaper
- runs on clusters of commodity hardware
- scales horizontally: need more power? just add more nodes
an application: BLAST algorithm
MapReduce Tasks
- load data
- map sequences
- partition
- reduce (merge)
- output results
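The five tasks above can be sketched in plain Python, without Hadoop. This is a single-process toy and not BLAST: exact k-mer lookup stands in for the alignment step, and every name here (`map_read`, `partition`, `reduce_hits`, `K`, `NUM_REDUCERS`) as well as the sample sequences are made up for illustration.

```python
from collections import defaultdict

K = 4               # seed length (illustrative choice)
NUM_REDUCERS = 2    # number of simulated reduce tasks


def load_data():
    # Step 1: load data. A real job would read input splits from HDFS.
    reference = {"ref1": "ACGTACGTGG", "ref2": "TTGACGTTCC"}
    reads = {"readA": "ACGTAC", "readB": "GGGGGG"}
    return reference, reads


def build_index(reference):
    # Record where every k-mer of every reference sequence occurs.
    index = defaultdict(list)
    for name, seq in reference.items():
        for i in range(len(seq) - K + 1):
            index[seq[i:i + K]].append((name, i))
    return index


def map_read(read_id, seq, index):
    # Step 2: map sequences. Emit one (key, value) pair per seed hit.
    # The value is (reference name, implied start of the read on the reference).
    for j in range(len(seq) - K + 1):
        for ref_name, i in index.get(seq[j:j + K], []):
            yield read_id, (ref_name, i - j)


def partition(key):
    # Step 3: partition. A deterministic toy hash, so that all values
    # for one key end up in the same reducer.
    return sum(key.encode()) % NUM_REDUCERS


def reduce_hits(read_id, hits):
    # Step 4: reduce (merge). Count seeds that agree on the same placement
    # and keep the placement with the most support.
    votes = defaultdict(int)
    for hit in hits:
        votes[hit] += 1
    best = max(votes, key=lambda h: (votes[h], h))
    return read_id, best[0], best[1], votes[best]


def run_job():
    reference, reads = load_data()
    index = build_index(reference)
    buckets = [defaultdict(list) for _ in range(NUM_REDUCERS)]
    for read_id, seq in reads.items():
        for key, value in map_read(read_id, seq, index):
            buckets[partition(key)][key].append(value)
    results = []
    for bucket in buckets:
        for key in sorted(bucket):
            results.append(reduce_hits(key, bucket[key]))
    # Step 5: output results. A real job would write them back to HDFS.
    return sorted(results)


if __name__ == "__main__":
    for row in run_job():
        print(row)
```

Reads without any seed hit (here `readB`) emit nothing in the map step, so they never reach a reducer. On a cluster, each bucket would be processed by a separate reduce task on a different node.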
MapReduce is easier, but not trivial
Apache Pig tries to solve that
Apache Pig addresses that.
Under the hood it applies the MapReduce
paradigm.
It hides the pitfalls of writing
MapReduce code.
Pig version of the same code
Apache Pig in Bioinformatics
It is a platform for analyzing large data sets that consists of
a high-level language for expressing data analysis
programs.
It can be easier
SeqPig
Scalable scripting language based on
Apache Pig for large scale sequence
analysis
SeqPig
● a scripting language,
● a library,
● and a collection of tools to manipulate,
analyze and query sequencing datasets in a
scalable and simple manner
http://seqpig.sourceforge.net/
SeqPig and data format support
Currently it supports
● BAM, SAM, FastQ and Qseq (input and output)
● FASTA (input only)
possible use cases
● converting data formats
● filtering regions of a chromosome
● computing base frequencies
● alignments
● collecting read mapping quality statistics
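To make the first use case concrete, here is a rough single-machine sketch of a format conversion in Python. It is not SeqPig code: the function name `fastq_to_fasta` is hypothetical, and it assumes the common unwrapped FASTQ layout of exactly four lines per record.

```python
def fastq_to_fasta(fastq_text):
    """Convert 4-line-per-record FASTQ text to FASTA text.

    Assumes sequences and qualities are not wrapped across lines.
    """
    lines = [ln for ln in fastq_text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected 4 lines per FASTQ record")

    out = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record at line %d" % (i + 1))
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # FASTA keeps the identifier and the bases and drops the qualities.
        out.append(">" + header[1:])
        out.append(seq)
    return "".join(line + "\n" for line in out)


if __name__ == "__main__":
    print(fastq_to_fasta("@r1\nACGT\n+\nIIII\n"), end="")
```

A script like this is fine for small files. The point of SeqPig is that the same kind of per-record work is spread over a Hadoop cluster.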
code example
-- load the filter definitions used below
run scripts/filter_defs.pig
-- read the BAM file
A = load 'input.bam' using BamLoader('yes');
-- keep only mapped, non-duplicate reads
B = FILTER A BY not ReadUnmapped(flags) and not IsDuplicate(flags);
-- expand each read into per-base tuples (refbase, basepos, readbase, ...)
C = FOREACH B GENERATE ReadSplit(name,start,read,cigar,basequal,flags,mapqual,refindex,refname,attributes#'MD');
D = FOREACH C GENERATE FLATTEN($0);
base_stats_data = FOREACH D GENERATE refbase, basepos, UPPER(readbase) AS readbase;
-- count each (reference base, position, read base) combination
base_stats_grouped = GROUP base_stats_data BY (refbase, basepos, readbase);
base_stats_grouped_count = FOREACH base_stats_grouped GENERATE group.$0 AS refbase, group.$1 AS basepos, group.$2 AS readbase, COUNT($1) AS bcount;
-- regroup by (reference base, position) and sort read bases by descending count
base_stats_grouped = GROUP base_stats_grouped_count BY (refbase, basepos);
base_stats = FOREACH base_stats_grouped {
    TMP1 = FOREACH base_stats_grouped_count GENERATE readbase, bcount;
    TMP2 = ORDER TMP1 BY bcount desc;
    GENERATE group.$0, group.$1, TMP2;
}
STORE base_stats into 'outputfile_readstats.txt';
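For readers new to Pig, the two GROUP steps of the script can be mirrored in a few lines of Python. This is only a sketch of the aggregation logic on in-memory data. It assumes the per-base tuples (refbase, basepos, readbase) have already been produced, and it does not reproduce `BamLoader` or `ReadSplit`. The function name `base_stats` and the sample rows are invented for the example.

```python
from collections import Counter, defaultdict


def base_stats(rows):
    """rows: iterable of (refbase, basepos, readbase) tuples."""
    # First GROUP ... COUNT: occurrences of each (refbase, basepos, readbase).
    counts = Counter((ref, pos, read.upper()) for ref, pos, read in rows)

    # Second GROUP: collect (readbase, count) pairs per (refbase, basepos).
    grouped = defaultdict(list)
    for (ref, pos, read), n in counts.items():
        grouped[(ref, pos)].append((read, n))

    # ORDER ... BY bcount desc. Ties are broken by base letter here,
    # which is an extra choice of this sketch, not something Pig promises.
    return {
        key: sorted(bag, key=lambda pair: (-pair[1], pair[0]))
        for key, bag in sorted(grouped.items())
    }


if __name__ == "__main__":
    rows = [("A", 0, "A")] * 3 + [("A", 0, "g")] + [("C", 1, "C")] * 2
    for (ref, pos), bag in base_stats(rows).items():
        print(ref, pos, bag)
```

The difference in practice is scale: the Python version holds everything in one process, while Pig compiles the same grouping into MapReduce jobs that run across the cluster.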
results
A 0 {(A,19),(G,2)}
A 1 {(A,10)}
A 2 {(A,18)}
A 3 {(A,16)}
A 4 {(A,14)}
A 5 {(A,15)}
A 6 {(A,16),(G,2)}
...
A 98 {(A,7)}
A 99 {(A,14)}
C 0 {(C,6)}
C 1 {(C,11)}
C 2 {(C,9)}
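Each result line holds a reference base, a position within the read, and a bag of (read base, count) pairs. A small Python sketch of how such lines might be post-processed, for example to get the fraction of bases that disagree with the reference at each position, is given here. The helper names `parse_line` and `mismatch_fraction` are hypothetical, and the parser assumes whitespace-separated fields as printed on the slide.

```python
import re

# One output line: reference base, position, then a bag like {(A,19),(G,2)}.
LINE_RE = re.compile(r"^(\S+)\s+(\d+)\s+\{(.*)\}$")
PAIR_RE = re.compile(r"\(([^,()]+),(\d+)\)")


def parse_line(line):
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognised result line: %r" % line)
    refbase, pos, bag = match.group(1), int(match.group(2)), match.group(3)
    counts = {base: int(n) for base, n in PAIR_RE.findall(bag)}
    return refbase, pos, counts


def mismatch_fraction(refbase, counts):
    # Share of observed read bases that differ from the reference base.
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (total - counts.get(refbase, 0)) / total


if __name__ == "__main__":
    ref, pos, counts = parse_line("A 0 {(A,19),(G,2)}")
    print(ref, pos, counts, round(mismatch_fraction(ref, counts), 4))
```

For the first line of the results, 2 of the 21 observed bases are G instead of A, a mismatch fraction of about 9.5%.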
results plotted
scalability test
● 61 GB dataset
● running some FastQC stats
* speed in minutes
related work
Biodoop: Bioinformatics on Hadoop
http://dl.acm.org/citation.cfm?id=1679817
BioPig: A Hadoop-based Analytic Toolkit for Large-Scale
Sequence Data, Oxford Journals
http://bioinformatics.oxfordjournals.org/content/early/2013/09/10/bioinformatics.btt528
some cloud computing solutions
Amazon AWS, general purpose
http://aws.amazon.com/
Mortar Data, focused on data science
http://www.mortardata.com/
CloudGene, focused on bioinformatics users
http://cloudgene.uibk.ac.at/
cloudgene, mapreduce for bioinformatics
conclusions
Bioinformatics has produced innovative algorithms and
solutions that are sometimes adopted in other fields of
computer science.
Neural networks in artificial intelligence and machine
learning are one example.
Now, scalable large-scale approaches from data mining are
helping bioinformatics move forward, faster and
cheaper.
thank you
hi@arianpasquali.com
