SeqPig: a scripting language for large bioinformatics datasets
1. SeqPig
A simple and scalable scripting language for
large sequencing data sets in Hadoop
arian pasquali
june 6, 2014
2. /me
Arian Pasquali
Master's student in Data Mining
Data engineer at Semasio
background
- engineering - cloud computing
- data mining on big data - social networks
3. case study
SeqPig: simple and scalable scripting for large
sequencing data sets in Hadoop.
Schumacher A, Pireddu L, Niemenmaa M, Kallio A, Korpelainen E,
Zanetti G, Heljanko K.
Bioinformatics. 2014 Jan 1;30(1):119-20.
doi: 10.1093/bioinformatics/btt601. Epub 2013 Oct 22.
http://www.ncbi.nlm.nih.gov/pubmed/24149054
4. but first, some background
● Real world bioinformatics datasets are huge
● Gigabytes/Petabytes are hard to handle on a
single computer
● in order to handle big data sets we have to
master parallel programming models
5. Parallel programming models
some high-performance
programming models
- Serial (doesn’t scale)
- MPI (expensive)
- MapReduce
- Hadoop
(cheap and scalable)
6. hadoop
Hadoop is an open source framework that implements
the MapReduce model and lets you run MapReduce programs.
It is designed to process huge volumes of data,
terabytes or petabytes, which fits many
bioinformatics scenarios well.
http://hadoop.apache.org/
7. how mapreduce works on hadoop
Hadoop provides a framework for
MapReduce, a fault-tolerant
parallel programming model
- easier to write programs
than other paradigms
- easier means cheaper
- runs on clusters with
commodity hardware
- scales horizontally
- need more power?
just add more nodes
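The map, shuffle and reduce steps can be sketched in plain Python. This is only a single-process illustration of the programming model, not Hadoop's API: the function names and the toy reads are assumptions made for the example, and on a real cluster the framework would run the map and reduce calls on different nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(read):
    """Map: emit a (base, 1) pair for every base in one read."""
    return [(base.upper(), 1) for base in read]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def run_job(reads):
    """Run map, shuffle and reduce in sequence on an iterable of reads."""
    mapped = chain.from_iterable(map_phase(read) for read in reads)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Toy input (made up for illustration): each string stands for one read.
reads = ["ACGT", "AACC", "ggta"]
base_counts = run_job(reads)
print(base_counts)
```

Because each map call looks at one read only and each reduce call looks at one key only, the work can be split across as many nodes as there are reads or keys, which is what makes the model scale horizontally.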
10. Apache Pig tries to solve that
Apache Pig solves that.
Under the hood it applies the MapReduce
paradigm.
It hides the pitfalls of writing
MapReduce code.
12. Apache Pig in Bioinformatics
It is a platform for analyzing large data sets that consists of
a high-level language for expressing data analysis
programs.
It can be easier
14. SeqPig
● a script language,
● a library,
● and a collection of tools to manipulate,
analyze and query sequencing datasets in a
scalable and simple manner
http://seqpig.sourceforge.net/
15. SeqPig and data format support
Currently it supports
- BAM, SAM, FastQ and Qseq (input and output)
- FASTA (input only)
16. possible use cases
● converting data formats
● filtering regions of a chromosome
● computing base frequencies
● alignments
● collecting read mapping quality statistics
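To make the first use case concrete, here is a minimal sketch of a format conversion (FASTQ to FASTA) in plain Python. It is not SeqPig code: the function name `fastq_to_fasta` and the sample records are hypothetical, and it assumes the simple four-line FASTQ layout with no wrapped sequence lines.

```python
def fastq_to_fasta(fastq_text):
    """Convert four-line FASTQ records to FASTA, dropping the qualities."""
    lines = [line.strip() for line in fastq_text.strip().splitlines()]
    if len(lines) % 4 != 0:
        raise ValueError("expected 4 lines per FASTQ record")
    out = []
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        out.append(">" + header[1:])  # FASTA headers start with '>'
        out.append(sequence)
    return "".join(line + "\n" for line in out)


# Two made-up records, for illustration only.
sample_fastq = """@read1
ACGTAC
+
IIIIII
@read2
GGTTAA
+
FFFFFF
"""
fasta_text = fastq_to_fasta(sample_fastq)
print(fasta_text, end="")
```

Each record is converted independently of the others, which is why this kind of task parallelizes well when the input is split across a cluster.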
17. code example
-- load the filter definitions used below
run scripts/filter_defs.pig
-- 'yes' asks BamLoader to also load the read attributes (the MD tag is used below)
A = load 'input.bam' using BamLoader('yes');
-- keep only mapped, non-duplicate reads
B = FILTER A BY not ReadUnmapped(flags) and not IsDuplicate(flags);
-- split each read into per-base records and flatten them into one relation
C = FOREACH B GENERATE ReadSplit(name,start,read,cigar,basequal,flags,mapqual,refindex,refname,
    attributes#'MD');
D = FOREACH C GENERATE FLATTEN($0);
base_stats_data = FOREACH D GENERATE refbase, basepos, UPPER(readbase) AS readbase;
-- count every (reference base, base position, read base) combination
base_stats_grouped = GROUP base_stats_data BY (refbase, basepos, readbase);
base_stats_grouped_count = FOREACH base_stats_grouped GENERATE group.$0 AS refbase, group.$1 AS
    basepos, group.$2 as readbase, COUNT($1) AS bcount;
-- regroup by (reference base, base position) and sort the read bases by count
base_stats_grouped = GROUP base_stats_grouped_count by (refbase, basepos);
base_stats = FOREACH base_stats_grouped {
    TMP1 = FOREACH base_stats_grouped_count GENERATE readbase, bcount;
    TMP2 = ORDER TMP1 BY bcount desc;
    GENERATE group.$0, group.$1, TMP2;
}
STORE base_stats into 'outputfile_readstats.txt';
18. results
A 0 {(A,19),(G,2)}
A 1 {(A,10)}
A 2 {(A,18)}
A 3 {(A,16)}
A 4 {(A,14)}
A 5 {(A,15)}
A 6 {(A,16),(G,2)}
...
A 98 {(A,7)}
A 99 {(A,14)}
C 0 {(C,6)}
C 1 {(C,11)}
C 2 {(C,9)}
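The two GROUP steps and the final ORDER in the script can be mimicked on toy data in plain Python. This is a sketch of the logic only, not of SeqPig itself: the `observations` list is made up and stands in for the per-base records that the script gets from ReadSplit and FLATTEN.

```python
from collections import Counter, defaultdict

# Toy (refbase, basepos, readbase) triples, made up for illustration.
observations = [
    ("A", 0, "a"), ("A", 0, "A"), ("A", 0, "G"),
    ("A", 1, "A"),
    ("C", 0, "C"), ("C", 0, "c"),
]

# First GROUP + COUNT: count each (refbase, basepos, readbase) combination,
# upper-casing the read base as the script does with UPPER().
triple_counts = Counter(
    (refbase, basepos, readbase.upper())
    for refbase, basepos, readbase in observations
)

# Second GROUP: collect the (readbase, count) pairs per (refbase, basepos).
per_position = defaultdict(list)
for (refbase, basepos, readbase), count in triple_counts.items():
    per_position[(refbase, basepos)].append((readbase, count))

# ORDER ... DESC: most frequent read base first within each position.
base_stats = {
    key: sorted(pairs, key=lambda pair: pair[1], reverse=True)
    for key, pairs in sorted(per_position.items())
}

for (refbase, basepos), pairs in base_stats.items():
    print(refbase, basepos, pairs)
```

The printed rows have the same shape as the results above: a reference base, a base position, and the observed read bases with their counts, most frequent first.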
21. related work
Biodoop: Bioinformatics on Hadoop
http://dl.acm.org/citation.cfm?id=1679817
BioPig: A Hadoop-based Analytic Toolkit for Large-Scale
Sequence Data, Oxford Journals
http://bioinformatics.oxfordjournals.org/content/early/2013/09/10/bioinformatics.btt528
22. some cloud computing solutions
Amazon AWS, general purpose
http://aws.amazon.com/
Mortar Data, focused on data science
http://www.mortardata.com/
CloudGene, focused on bioinformatics users
http://cloudgene.uibk.ac.at/
24. conclusions
Bioinformatics has produced innovative algorithms
and solutions that are sometimes adopted in other fields
of computer science.
Neural networks in artificial intelligence and machine
learning are one example.
Now, scalable approaches from data mining are
helping bioinformatics move forward, faster and
cheaper.