SlideShare ist ein Scribd-Unternehmen logo
1 von 63
© ppt by OH22
TUGA IT 2016
LISBON, PORTUGAL
Bioinformatics on Azure
Tillmann Eitelberg, Ameli Kirse
© ppt by OH22
THANK YOU TO OUR SPONSORS
© ppt by OH22
THANK YOU TO OUR TEAM
ANDRÉ BATISTA ANDRÉ MELANCIA ANDRÉ VALA ANTÓNIO LOURENÇO BRUNO LOPES CLÁUDIO SILVA
NIKO NEUGEBAUER
RUI REISRICARDO CABRAL
NUNO CANCELO PAULO MATOS PEDRO SIMÕES
SANDRA MORGADO SANDRO PEREIRARUI BASTOS
NUNO ÁRIAS SILVA
© ppt by OH22
Bioinformatics on Azure
Insights to Biodiversity with the
Microsoft Data Platform
© ppt by OH22
About Us
Tillmann Eitelberg
 CEO oh22information services
GmbH
 PASS Regional Mentor
Germany
 Vice-president PASS Germany
 Chapter Leader Cologne/Bonn,
Germany
 Microsoft Data Platform MVP
 www.ssis-components.net
 www.sqlpodcast.de
Ameli Kirse
 Master of Science in
Organismic, Evolutionary and
Palaeobiology
 Working with Next Generation
Sequencing Data
 Focused on molecular
biodiversity research
 Currently PhD position on HTP
sequencing and analysis
© ppt by OH22
Biodiversity is important
Ecosystem engineers
are actively creating,
maintaining and
modifying habitats
spillmybeans.com
© ppt by OH22
Biodiversity is important
Changes in
food webs can
alter
community
structures with
unforeseeably
consequences
weedscience.weebly.com
© ppt by OH22
Biodiversity is important
Humans benefit from
a high biodiversity
 conservation sites
Ecosystem services
www.hamburg.deutschland-summt.de
www.lavendel-labyrinth.de
© ppt by OH22
Biodiversity is important
Osmoderma eremita
(hermit beetle)
Stuttgart 21
(Underground
train station)
© ppt by OH22
Examples for active biodiversity
protection
World Youth Day 2005
Epidalea calamita vs. Pope
© ppt by OH22
Species identification yesterday
1. Collection of species
(traps, canopy fogging, etc.)
2. Identification on the basis
of expertise knowledge and
Identification keys
©Mark Moffett
© ppt by OH22
Species identification yesterday
Phenotypic
plasticity
Cryptic
species
© ppt by OH22
Molecular Biology
© ppt by OH22
Barcoding
Sample collection
Sequencing
PCR
DNA extraction
Photo documentation
Metadata
© ppt by OH22
Barcoding
Time consuming collection of species
Everything found?
Unculturable bacteria
©Tasha Sturm Cabrillo College
© ppt by OH22
Next Generation Sequencing
Illumina Miseq
 Bridge Amplification
 Up to 15M reads per run
 15GB
 Paired-end reads
(300bp)
© ppt by OH22
Illumina MiSeq - NGS
page: 17http://www.biorigami.com
© ppt by OH22
Chimeras
- Incomplete extension leading to a foreign DNA strand,
which is copied to completion in the upcoming PCR cycles
page: 18
Primer
Primer
Primer
serves as a
primer
© ppt by OH22
Environmental Metabarcoding
Every organism is
leaving its DNA
unconsciously behind
blogs.scientificamerican.com
DNA Extraction
Analyzed via
bioinformatic tools
© Tetra GmbH
© ppt by OH22
Rarefaction Curve
How do we know if we have taken enough samples?
0
1
2
3
4
5
6
7
8
0 2 4 6 8 10
NumberofOTUs
x10000
Samples
Jackknife mean Bootstrap mean Coleman mean Chao1 mean
© ppt by OH22
Typical Analysis workflow
 Depending on NGS approach used
 Illumina: merge paired reads
 Demultiplexing
 Quality filtering
 Clustering (OTU detection)
 Chimera removal
 Comparisson to database / Alignment
 Identification of species
© ppt by OH22
Pipelines: QIIME
Interface
- Comprises several
programs and algorithms
- Written in Python
Pros
Cons
1) Installation
2) Error messages are
leaving room for
interpretations
3) Analysis steps are
often time consuming
4) No information if
program is still running
or how long it will take
5) Each step is generating
a temporary file
1) Broad range of implemented
programs + algorithms
2) Several scripts can be run
on multiple processors
3) Biom files and R-Packages
© ppt by OH22
Pipelines: QIIME
 Primarily based on python
 Can be used on Mac via MacQIIME
 For VirtualBox as well as for Amazon EC2
exist preconfigured images
 Original statement from QIIME website:
QIIME has a lot of dependencies and can
(but doesn’t have to) be very challenging
to install.
© ppt by OH22
Pipelines: Mothur
Pros
Cons
1) Expects the user to perform
the whole analysis
2) Each step is generating a
temporary file
1) Broad range of
implemented algorithms
and programs
2) Comparatively short
runtime
3) Several statistical and
graphical tools are
available
4) Very active online
community
Program
- Several programs and
algorithms are
implemented
- Written in C++
© ppt by OH22
Pipelines: Mothur
 Available for Linux, Mac and Windows
 Based on C++
 Installation is just decompressing the
provided ZIP file
 Active development / release cycles
 Source Code hosted on GitHub
© ppt by OH22
Pipeline: MG-Rast
Pros
Cons
1) Usage via a graphical user
interface
2) Very user friendly
3) Krona plots
4) Biom files
Web tool
- No installation necessary
1) Limited set of possible
settings
2) Misleading
Information of runtime
3) Assigns priorities 
Data privacy?
4) Download problems
5) OTUs?
6)Results?
© ppt by OH22
Pipeline: MG-Rast
 RESTful API
 Uses JSON as its data format
 Hosted on AWS
 MG-Rast is OpenSource and hosted on
GitHub
 Local Installation is possible but not
supported
(We have not been funded to create a
readily installable version)
© ppt by OH22
Pipelines - Cons
 Time consuming!
 Waiting for free cores
 Complex environment which is mostly
configured by an administrator
 Information about runtime are lacking
 Limited data storage
 Command line
 Basic knowledge about programming
languages is required
 Questionable results
© ppt by OH22
Further analysis
 All pipelines provide tools for the analysis
of the dataset
(alpha-diversity, beta-diversity etc.)
 Biologist love extravagant/individualized
plots
 R  Very powerful (ggplot2, vegan,
BiodiversityR)
 EstimateS, STAMP, etc.
© ppt by OH22
Why Big Data – Metasystematics databases
 Some online tools are available
500MB 2000 Sequences
© ppt by OH22
Current Data Storage Requirements
page: 31
0.5 PB
Twitter’s storage needs
today are estimated at
0.5 petabytes per year
1 EB
YouTube currently
requires from 100
Petabytes to 1 Exabyte
for storage
© ppt by OH22
Current Data Storage Requirements
page: 32
100 PB
The National
Astronomical
Observatory of Japan
devotes ~100 petabytes
to storage
100 PB
For genomics more
than 100 petabytes of
storage are currently
used by only 20 of the
largest institutions
1 EB
Square Kilometre Array
(SKA) project is
expected to lead to a
storage demand of 1
exabyte per year
© ppt by OH22
Future Data Storage Requirements
page: 33
2 EB
YouTube require
between 1 and 2
Exabyte's additional
storage per year by
2025
40 EB
40 Exabyte's of storage
capacity will be needed
by 2025 just for the
human genomes.
© ppt by OH22
Why Big Data?
 40% of all sequences derived from a
marine sediment sample could not be
assigned to Family level
 We have almost no idea about bacterial
diversity (uncultured species)
 Future goal: The genome of all species
should be stored in online available
databases
(generation gap)
© ppt by OH22
Big Data - 3 Vs of Big Data
page: 35
MB
GB
TB
PB
batch
periodic
real time
table
database
unstructured
web
EB
© ppt by OH22
Microsoft Azure “Pipeline”
 Use of new techniques that are optimized
for large data sets
 Independent Administration
 Virtually no limitation of space and / or
Performance
 Optimize Workflows
 Improved presentation of results
 Simple interactive use of the results
 Using established bioinformatics
algorithms and libraries
page: 36
© ppt by OH22
Azure Data Lake
page: 37
Batch, real-time, and interactive analytics made easy
A Platform for data scientists to store
and process data of any size and shape.
© ppt by OH22
Azure Data Lake Store
 HDFS - Apache
Hadoop Distributed
File System
 Integrated in Azure
Data Lake Analytics
and HDInsight
 Future integrations in
 Revolution R
 Hortonworks,
Cloudera, MapR
 Spark, Storm, Flume,
Sqoop, Kafka
 …
page: 38
© ppt by OH22
Azure Data Lake Store
 No fixed limits on file size
 Unstructured and structured
data in their native format
 Massive throughput to increase
analytic performance
 High durability, availability, and
reliability
 Azure Active Directory access
control
page: 39
Brontobytes
© ppt by OH22
Azure Data Lake Analytics
page: 40
 Using U-SQL for analysis
 Process unstructured and structured data
 Extend U-SQL with C#/.NET
 C# expressions
 User-defined operators (UDO)
 User-defined function (UDF)
 User-defined aggregates (UDAGG)
 Declarative nature of SQL and
custom imperative power of C#
© ppt by OH22
.NET BIO
 Bioinformatics library for .NET from
Microsoft Research
 includes parsers for common
bioinformatics file formats
 algorithms for manipulating DNA, RNA,
and protein sequences
 connectors to biological web services
 Open Source Project from Outercurve
 Hosted on GitHub
 Apache 2 License
© ppt by OH22
Power BI
page: 42
© ppt by OH22
Power BI
 Comprehensive toolset for reporting
 Power BI Service
 Power BI Desktop
 Power BI Mobile Apps
 Power BI Developer
 Allows the generation of different charts
 Additional charts can be generated via R
or D3.js
page: 43
© ppt by OH22
Microsoft Azure “Pipeline"
Import
Quality
Filtering
OTU
Clustering
Chimera
Filtering
Alignement
Taxonomy
Assignment
Export
 Import Sequences generated by
Illumina Miseq
 FASTQ is a text based format
 One record goes over 4 lines
 Record includes sequences as well as
quality information's

 We also import the SILVA
database to avoid using web
services
page: 44
@MSQ-M01442:174:000000000-AEW84:1:1101:10036:8807 1:N:0:CTCTCTACTAGATCGC
TAGCGTATATTAAAGTTGTTGAGGTTAAAAAGCTCGTAGTTGAACCTTGGGCCTGGCTGGCCGGTCCGCCTCA ...
+
EFGGGFFEGGGGGDGGGGGGGGGGGGGGGFEFGGFCEGGEGFFGGGGG<FGGGDGGGGGGGGGGGGGGGGCCC
© ppt by OH22
Microsoft Azure “Pipeline"
Import
Quality
Filtering
OTU
Clustering
Chimera
Filtering
Alignement
Taxonomy
Assignment
Export
 Custom Extractor for U-SQL
 C# + dotnetbio Library
 Preparations:
 Libraries require strong names for
Azure Data Lake
 Some dotenetbio Code has to be
adjusted due to some limitations of
the original library
(memory limitations)
page: 45
© ppt by OH22
Microsoft Azure “Pipeline"
Import
Quality
Filtering
OTU
Clustering
Chimera
Filtering
Alignement
Taxonomy
Assignment
Export
 Filter data which doesn’t reach our
quality criteria
 Remove Primer
 Remove Sequences shorter than 100
nucleotides
 Remove non reverse complemented
sequences in dataset
 ACGTACCGAT
 TGCATGGCTA
 Ilumnia Quality Score
 Overall < 19%
 Per Nucleotid < 15%
page: 46
© ppt by OH22
Microsoft Azure “Pipeline"
Import
Quality
Filtering
OTU
Clustering
Chimera
Filtering
Alignement
Taxonomy
Assignment
Export
page: 47
public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
{
if (input.Length > 0)
{
var fqParser = new FastQParser
{
FormatType = FastQFormatType.Illumina_v1_8,
Alphabet = Alphabets.RNA
};
IList<QualitativeSequence> seqsOriginal = new FastQParser()
.Parse(input.BaseStream)
.Cast<QualitativeSequence>()
.ToList();
foreach (var item in seqsOriginal)
{
SequenceToRow(item, output);
yield return output.AsReadOnly();
}
}
else
{
throw new Exception("input file not found");
}
}
DEMO
© ppt by OH22
Microsoft Azure “Pipeline"
Import
Quality
Filtering
OTU
Clustering
Chimera
Filtering
Alignement
Taxonomy
Assignment
Export
page: 48
@q = SELECT SequenceId,
Sequence,
Counter,
ReversedSequence,
ComplementedSequence,
ReverseComplementedSequence,
IndexOfNonGap,
LastIndexOfNonGap,
AlphabetName,
HasAmbiguity,
HasGaps,
HasTerminations,
IsComplementSupported,
LowNucleotides,
QualityScore
FROM @e
WHERE LengthCounter > 100
AND IsComplementSupported == true
AND LowNucleotides <= 5
AND QualityScore > 19
AND HasAmbiguity == true;
© ppt by OH22
Microsoft Azure “Pipeline"
Import
Quality
Filtering
OTU
Clustering
Chimera
Filtering
Alignement
Taxonomy
Assignment
Export
page: 49
© ppt by OH22
Microsoft Azure “Pipeline"
Import
Quality
Filtering
OTU
Clustering
Chimera
Filtering
Alignement
Taxonomy
Assignment
Export
 Cluster sequences into OTU
(Operational Taxonomic Unit)
 Sequences with a similarity score
higher than 97% belong to the same
OTU
 MUMmer or NUCmer (NUCleotide
MUMmer) Algortihm is used by
dotnetbio
 Current disadvantage of our pipeline:
many common algorithms like
SortMeRNA, Mothur, TRIE, UCLUST,
USEARCH, BLAST, USEARCH61,
SUMACLUST, SWARM, CD-Hit can
not be used
page: 50
© ppt by OH22
Microsoft Azure “Pipeline"
Import
Quality
Filtering
OTU
Clustering
Chimera
Filtering
Alignement
Taxonomy
Assignment
Export
 Chimeras are a product of the
unwanted combination of two or
more sequences
 dotnetbio offers a comprehensive
range of functions for the Chimera
detection
page: 51
© ppt by OH22
Microsoft Azure “Pipeline"
Import
Quality
Filtering
OTU
Clustering
Chimera
Filtering
Alignement
Taxonomy
Assignment
Export
 Alignment or mapping is the
process of searching data in
reference databases like BLAST,
SILVA or GreenGenes
 SILVA is currently the most
comprehensive RNA database
 4 million 16S/18S
 400.000 23S/28S
page: 52
© ppt by OH22
Microsoft Azure “Pipeline"
Import
Quality
Filtering
OTU
Clustering
Chimera
Filtering
Alignement
Taxonomy
Assignment
Export
 From the perspective of a
(relational) database developer,
the system is very crude
 Data is stored in text files
 Sequences are stored in ARB files
 Taxonomy data is stored in CSV files
 Keys between both tables look like
AB002522.1.1416
 To search within the database the
Needleman-Wunsch algorithm is
used
page: 53
© ppt by OH22
Microsoft Azure “Pipeline"
Import
Quality
Filtering
OTU
Clustering
Chimera
Filtering
Alignement
Taxonomy
Assignment
Export
 Needleman-Wunsch algorithm is
used to compare biological
sequences
 the similarity between two
sequences is calculated
 ACTCCTTAA
 A-TCC--AA
page: 54
© ppt by OH22
Microsoft Azure “Pipeline"
Import
Quality
Filtering
OTU
Clustering
Chimera
Filtering
Alignement
Taxonomy
Assignment
Export
page: 55
public static int AlignmentScore(string sequenceA, string sequenceB,
int gapPenalty, int matchScore, int mismatchScore)
{
var align = Align(sequenceA, sequenceB, gapPenalty,
matchScore, mismatchScore)[1];
var c = align.Count(x => x == '-');
return Convert.ToInt32(100 / align.Length * c);
}
DEMO
© ppt by OH22
Microsoft Azure “Pipeline"
Import
Quality
Filtering
OTU
Clustering
Chimera
Filtering
Alignement
Taxonomy
Assignment
Export
 Taxonomy Assignment is usually
one step in the described workflow
 Size doesn‘t matter – simply join
the data
 With U-SQL we already joined the
data together in the alignment
process
page: 56
© ppt by OH22
Microsoft Azure “Pipeline"
Import
Quality
Filtering
OTU
Clustering
Chimera
Filtering
Alignement
Taxonomy
Assignment
Export
 Data is exported to an Azure SQL
Database
 Disadvantage:
 You need to create credentials via
PowerShell to use an external
database
 You will need to allow an IP range in
the Azure SQL Database firewall
(25.66.0.0 to 25.66.255.255).
 After export the data it can easily
consumed by external
applications like Power BI
page: 57
© ppt by OH22
Microsoft Azure “Pipeline"
 Visualize data with Microsoft Power BI
 Connect to our Azure SQL Database
 Make shure you have allowed windows
azure services for your server
 It’s also possible to access data directly
from ADLS
 Create Reports and Dashboards
 Ask questions about your data
page: 58
© ppt by OH22
Microsoft Azure “Pipeline"
page: 59
DEMO
© ppt by OH22
Microsoft Azure “Pipeline"
 Strange Error
Messages (mixed
english and local
language)
 Problems on installing
external packages
 no reasonable Editor
for R Scripts
page: 60
© ppt by OH22
Microsoft Azure “Pipeline"
 Future Insights:
 With Microsoft Power BI and SQL Server
2016 we can now use R
 This provides more opportunities for the
processing of data with existing R scripts
and libraries
 Also this offers the possibility to visualize
results with Power BI and SSRS
page: 61
© ppt by OH22
Pipelines: Microsoft Azure “Pipeline”
Pros
Cons
1) Limitations by the dotnetbio
Library
2) "Commandline" - Yes, you still
need development skills
3) Storage and compute power is
not for free
4) Almost none existing community
5) Microsoft is hardly recognized by
universities
1) Allows independent analysis
even of large amounts of
data
2) No limitations by memory or
compute power
3) Easy extensibility through U-
SQL and C #
4) Can be automated
5) Powerful and interactive
visualizations through Power
BI
Program
- Azure Data Lake
- Azure SQL Database
- Power BI
- U-SQL, C#, dotnetbio
© ppt by OH22
THANK YOU TO OUR SPONSORS

Weitere ähnliche Inhalte

Was ist angesagt?

EDF2012 Peter Boncz - LOD benchmarking SRbench
EDF2012   Peter Boncz - LOD benchmarking SRbenchEDF2012   Peter Boncz - LOD benchmarking SRbench
EDF2012 Peter Boncz - LOD benchmarking SRbench
European Data Forum
 
TermProject_cp33252_alw278_aa44757
TermProject_cp33252_alw278_aa44757TermProject_cp33252_alw278_aa44757
TermProject_cp33252_alw278_aa44757
Abe Arredondo
 
OpenPackProcessingAccelearation
OpenPackProcessingAccelearationOpenPackProcessingAccelearation
OpenPackProcessingAccelearation
Craig Nuzzo
 

Was ist angesagt? (8)

EDF2012 Peter Boncz - LOD benchmarking SRbench
EDF2012   Peter Boncz - LOD benchmarking SRbenchEDF2012   Peter Boncz - LOD benchmarking SRbench
EDF2012 Peter Boncz - LOD benchmarking SRbench
 
Advances at the Argonne Leadership Computing Center
Advances at the Argonne Leadership Computing CenterAdvances at the Argonne Leadership Computing Center
Advances at the Argonne Leadership Computing Center
 
Software tools for calculating materials properties in high-throughput (pymat...
Software tools for calculating materials properties in high-throughput (pymat...Software tools for calculating materials properties in high-throughput (pymat...
Software tools for calculating materials properties in high-throughput (pymat...
 
CloudLightning and the OPM-based Use Case
CloudLightning and the OPM-based Use CaseCloudLightning and the OPM-based Use Case
CloudLightning and the OPM-based Use Case
 
TermProject_cp33252_alw278_aa44757
TermProject_cp33252_alw278_aa44757TermProject_cp33252_alw278_aa44757
TermProject_cp33252_alw278_aa44757
 
Progress Towards Leveraging Natural Language Processing for Collecting Experi...
Progress Towards Leveraging Natural Language Processing for Collecting Experi...Progress Towards Leveraging Natural Language Processing for Collecting Experi...
Progress Towards Leveraging Natural Language Processing for Collecting Experi...
 
OpenPOWER Academia and Research team's webinar - Presentations from Oak Ridg...
OpenPOWER Academia and Research team's webinar  - Presentations from Oak Ridg...OpenPOWER Academia and Research team's webinar  - Presentations from Oak Ridg...
OpenPOWER Academia and Research team's webinar - Presentations from Oak Ridg...
 
OpenPackProcessingAccelearation
OpenPackProcessingAccelearationOpenPackProcessingAccelearation
OpenPackProcessingAccelearation
 

Andere mochten auch

Spain portugal 15 days
Spain portugal 15 daysSpain portugal 15 days
Spain portugal 15 days
ENTOURS
 
Portuguese food cc
Portuguese food ccPortuguese food cc
Portuguese food cc
carlagaspar
 
португалия
португалияпортугалия
португалия
IroK_N1
 

Andere mochten auch (17)

Power BI - The self service BI Lifecycle in the cloud
Power BI - The self service BI Lifecycle in the cloudPower BI - The self service BI Lifecycle in the cloud
Power BI - The self service BI Lifecycle in the cloud
 
[2015 Oracle Cloud Summit] 6. BI Cloud Service -엔터프라이즈급 분석 플랫폼과 서비스를 클라우드에서 구현
[2015 Oracle Cloud Summit] 6. BI Cloud Service -엔터프라이즈급 분석 플랫폼과 서비스를 클라우드에서 구현[2015 Oracle Cloud Summit] 6. BI Cloud Service -엔터프라이즈급 분석 플랫폼과 서비스를 클라우드에서 구현
[2015 Oracle Cloud Summit] 6. BI Cloud Service -엔터프라이즈급 분석 플랫폼과 서비스를 클라우드에서 구현
 
비즈니스 인텔리전스 솔루션 사이센스 퀵스타트 프로그램
비즈니스 인텔리전스 솔루션 사이센스 퀵스타트 프로그램비즈니스 인텔리전스 솔루션 사이센스 퀵스타트 프로그램
비즈니스 인텔리전스 솔루션 사이센스 퀵스타트 프로그램
 
Wise bi 사업소개
Wise bi 사업소개Wise bi 사업소개
Wise bi 사업소개
 
Data discovery qlikview
Data discovery   qlikviewData discovery   qlikview
Data discovery qlikview
 
Spain portugal 15 days
Spain portugal 15 daysSpain portugal 15 days
Spain portugal 15 days
 
Exercise your local Options ppt Portugal Nov 16
Exercise your local Options ppt Portugal Nov 16Exercise your local Options ppt Portugal Nov 16
Exercise your local Options ppt Portugal Nov 16
 
Denmark and fjord magic
Denmark and fjord magicDenmark and fjord magic
Denmark and fjord magic
 
AWS Quicksight에서의 비즈니스 인텔리전스 - 김기완 :: 2015 리인벤트 리캡 게이밍
AWS Quicksight에서의 비즈니스 인텔리전스 - 김기완 :: 2015 리인벤트 리캡 게이밍AWS Quicksight에서의 비즈니스 인텔리전스 - 김기완 :: 2015 리인벤트 리캡 게이밍
AWS Quicksight에서의 비즈니스 인텔리전스 - 김기완 :: 2015 리인벤트 리캡 게이밍
 
QlikView ppt
QlikView pptQlikView ppt
QlikView ppt
 
Portuguese food cc
Portuguese food ccPortuguese food cc
Portuguese food cc
 
#askSAP Analytics Innovation Community Call: Self-Service BI and SAP Lumira
#askSAP Analytics Innovation Community Call: Self-Service BI and SAP Lumira#askSAP Analytics Innovation Community Call: Self-Service BI and SAP Lumira
#askSAP Analytics Innovation Community Call: Self-Service BI and SAP Lumira
 
데이터 시각화의 글로벌 동향 20140819 - 고영혁
데이터 시각화의 글로벌 동향   20140819 - 고영혁데이터 시각화의 글로벌 동향   20140819 - 고영혁
데이터 시각화의 글로벌 동향 20140819 - 고영혁
 
AWS 신규 데이터 분석 서비스 - QuickSight, Kinesis Firehose 등 (양승도) :: re:Invent re:Cap ...
AWS 신규 데이터 분석 서비스 - QuickSight, Kinesis Firehose 등 (양승도) :: re:Invent re:Cap ...AWS 신규 데이터 분석 서비스 - QuickSight, Kinesis Firehose 등 (양승도) :: re:Invent re:Cap ...
AWS 신규 데이터 분석 서비스 - QuickSight, Kinesis Firehose 등 (양승도) :: re:Invent re:Cap ...
 
태블로 소프트웨어(Tableau Software) 소개
태블로 소프트웨어(Tableau Software) 소개태블로 소프트웨어(Tableau Software) 소개
태블로 소프트웨어(Tableau Software) 소개
 
португалия
португалияпортугалия
португалия
 
Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...
 Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ... Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...
Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...
 

Ähnlich wie Bioinformatics on Azure

Ähnlich wie Bioinformatics on Azure (20)

From construction to deployment of LifeWatchGreece the potentail role of EGI-...
From construction to deployment of LifeWatchGreece the potentail role of EGI-...From construction to deployment of LifeWatchGreece the potentail role of EGI-...
From construction to deployment of LifeWatchGreece the potentail role of EGI-...
 
OpenACC and Open Hackathons Monthly Highlights: July 2022.pptx
OpenACC and Open Hackathons Monthly Highlights: July 2022.pptxOpenACC and Open Hackathons Monthly Highlights: July 2022.pptx
OpenACC and Open Hackathons Monthly Highlights: July 2022.pptx
 
第1回バイオインフォマティクスデータ可視化セミナー@Riken
第1回バイオインフォマティクスデータ可視化セミナー@Riken第1回バイオインフォマティクスデータ可視化セミナー@Riken
第1回バイオインフォマティクスデータ可視化セミナー@Riken
 
iMicrobe_ASLO_2015
iMicrobe_ASLO_2015iMicrobe_ASLO_2015
iMicrobe_ASLO_2015
 
Design phase kick-off event and Ceremony
Design phase kick-off event and CeremonyDesign phase kick-off event and Ceremony
Design phase kick-off event and Ceremony
 
OpenACC and Hackathons Monthly Highlights: April 2023
OpenACC and Hackathons Monthly Highlights: April  2023OpenACC and Hackathons Monthly Highlights: April  2023
OpenACC and Hackathons Monthly Highlights: April 2023
 
Scientific Workflows: what do we have, what do we miss?
Scientific Workflows: what do we have, what do we miss?Scientific Workflows: what do we have, what do we miss?
Scientific Workflows: what do we have, what do we miss?
 
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
 
Ogf27 Ligo
Ogf27 LigoOgf27 Ligo
Ogf27 Ligo
 
Introduction to Biological Network Analysis and Visualization with Cytoscape ...
Introduction to Biological Network Analysis and Visualization with Cytoscape ...Introduction to Biological Network Analysis and Visualization with Cytoscape ...
Introduction to Biological Network Analysis and Visualization with Cytoscape ...
 
Grid computing
Grid computingGrid computing
Grid computing
 
IDB-Cloud Providing Bioinformatics Services on Cloud
IDB-Cloud Providing Bioinformatics Services on CloudIDB-Cloud Providing Bioinformatics Services on Cloud
IDB-Cloud Providing Bioinformatics Services on Cloud
 
2016 05 sanger
2016 05 sanger2016 05 sanger
2016 05 sanger
 
Easygenomics ISCB Cloud section 2012
Easygenomics ISCB Cloud section 2012Easygenomics ISCB Cloud section 2012
Easygenomics ISCB Cloud section 2012
 
Cytoscape: Now and Future
Cytoscape: Now and FutureCytoscape: Now and Future
Cytoscape: Now and Future
 
Jprofessionals co create the future of your city
Jprofessionals co create the future of your cityJprofessionals co create the future of your city
Jprofessionals co create the future of your city
 
Introduction to Big Data Analytics: Batch, Real-Time, and the Best of Both Wo...
Introduction to Big Data Analytics: Batch, Real-Time, and the Best of Both Wo...Introduction to Big Data Analytics: Batch, Real-Time, and the Best of Both Wo...
Introduction to Big Data Analytics: Batch, Real-Time, and the Best of Both Wo...
 
Introduction to Data Models & Cisco's NextGen Device Level APIs: an overview
Introduction to Data Models & Cisco's NextGen Device Level APIs: an overviewIntroduction to Data Models & Cisco's NextGen Device Level APIs: an overview
Introduction to Data Models & Cisco's NextGen Device Level APIs: an overview
 
Panel: NRP Science Impacts​
Panel: NRP Science Impacts​Panel: NRP Science Impacts​
Panel: NRP Science Impacts​
 
WSO2 Big Data Platform and Applications
WSO2 Big Data Platform and ApplicationsWSO2 Big Data Platform and Applications
WSO2 Big Data Platform and Applications
 

Mehr von Tillmann Eitelberg

SQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsightSQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsight
Tillmann Eitelberg
 

Mehr von Tillmann Eitelberg (7)

Data lake analytics for the admin
Data lake analytics for the adminData lake analytics for the admin
Data lake analytics for the admin
 
Embrace and extend first-class activity and 3rd party ecosystem for ssis in adf
Embrace and extend first-class activity and 3rd party ecosystem for ssis in adfEmbrace and extend first-class activity and 3rd party ecosystem for ssis in adf
Embrace and extend first-class activity and 3rd party ecosystem for ssis in adf
 
Industry 4.0 in a box
Industry 4.0 in a boxIndustry 4.0 in a box
Industry 4.0 in a box
 
Webanalytics with Microsoft BI
Webanalytics with Microsoft BIWebanalytics with Microsoft BI
Webanalytics with Microsoft BI
 
SQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsightSQL Server Konferenz 2014 - SSIS & HDInsight
SQL Server Konferenz 2014 - SSIS & HDInsight
 
Advanced DQS Integration
Advanced DQS IntegrationAdvanced DQS Integration
Advanced DQS Integration
 
SQLSaturday #188 - Enterprise Information Management
SQLSaturday #188  - Enterprise Information ManagementSQLSaturday #188  - Enterprise Information Management
SQLSaturday #188 - Enterprise Information Management
 

Kürzlich hochgeladen

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 

Kürzlich hochgeladen (20)

Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 

Bioinformatics on Azure

  • 1. © ppt by OH22 TUGA IT 2016 LISBON, PORTUGAL Bioinformatics on Azure Tillmann Eitelberg, Ameli Kirse
  • 2. © ppt by OH22 THANK YOU TO OUR SPONSORS
  • 3. © ppt by OH22 THANK YOU TO OUR TEAM ANDRÉ BATISTA ANDRÉ MELANCIA ANDRÉ VALA ANTÓNIO LOURENÇO BRUNO LOPES CLÁUDIO SILVA NIKO NEUGEBAUER RUI REISRICARDO CABRAL NUNO CANCELO PAULO MATOS PEDRO SIMÕES SANDRA MORGADO SANDRO PEREIRARUI BASTOS NUNO ÁRIAS SILVA
  • 4. © ppt by OH22 Bioinformatics on Azure Insights to Biodiversity with the Microsoft Data Platform
  • 5. © ppt by OH22 About Us Tillmann Eitelberg  CEO oh22information services GmbH  PASS Regional Mentor Germany  Vice-president PASS Germany  Chapter Leader Cologne/Bonn, Germany  Microsoft Data Platform MVP  www.ssis-components.net  www.sqlpodcast.de Ameli Kirse  Master of Science in Organismic, Evolutionary and Palaeobiology  Working with Next Generation Sequencing Data  Focused on molecular biodiversity research  Currently PhD position on HTP sequencing and analysis
  • 6. © ppt by OH22 Biodiversity is important Ecosystem engineers are actively creating, maintaining and modifying habitats spillmybeans.com
  • 7. © ppt by OH22 Biodiversity is important Changes in food webs can alter community structures with unforeseeably consequences weedscience.weebly.com
  • 8. © ppt by OH22 Biodiversity is important Humans benefit from a high biodiversity  conservation sites Ecosystem services www.hamburg.deutschland-summt.de www.lavendel-labyrinth.de
  • 9. © ppt by OH22 Biodiversity is important Osmoderma eremita (hermit beetle) Stuttgart 21 (Underground train station)
  • 10. © ppt by OH22 Examples for active biodiversity protection World Youth Day 2005 Epidalea calamita vs. Pope
  • 11. © ppt by OH22 Species identification yesterday 1. Collection of species (traps, canopy fogging, etc.) 2. Identification on the basis of expertise knowledge and Identification keys ©Mark Moffett
  • 12. © ppt by OH22 Species identification yesterday Phenotypic plasticity Cryptic species
  • 13. © ppt by OH22 Molecular Biology
  • 14. © ppt by OH22 Barcoding Sample collection Sequencing PCR DNA extraction Photo documentation Metadata
  • 15. © ppt by OH22 Barcoding Time consuming collection of species Everything found? Unculturable bacteria ©Tasha Sturm Cabrillo College
  • 16. © ppt by OH22 Next Generation Sequencing Illumina Miseq  Bridge Amplification  Up to 15M reads per run  15GB  Paired-end reads (300bp)
  • 17. © ppt by OH22 Illumina MiSeq - NGS page: 17http://www.biorigami.com
  • 18. © ppt by OH22 Chimeras - Incomplete extension leading to a foreign DNA strand, which is copied to completion in the upcoming PCR cycles page: 18 Primer Primer Primer serves as a primer
  • 19. © ppt by OH22 Environmental Metabarcoding Every organism is leaving its DNA unconsciously behind blogs.scientificamerican.com DNA Extraction Analyzed via bioinformatic tools © Tetra GmbH
  • 20. © ppt by OH22 Rarefaction Curve How do we know if we have taken enough samples? 0 1 2 3 4 5 6 7 8 0 2 4 6 8 10 NumberofOTUs x10000 Samples Jackknife mean Bootstrap mean Coleman mean Chao1 mean
  • 21. © ppt by OH22 Typical Analysis workflow  Depending on NGS approach used  Illumina: merge paired reads  Demultiplexing  Quality filtering  Clustering (OTU detection)  Chimera removal  Comparisson to database / Alignment  Identification of species
  • 22. © ppt by OH22 Pipelines: QIIME Interface - Comprises several programs and algorithms - Written in Python Pros Cons 1) Installation 2) Error messages are leaving room for interpretations 3) Analysis steps are often time consuming 4) No information if program is still running or how long it will take 5) Each step is generating a temporary file 1) Broad range of implemented programs + algorithms 2) Several scripts can be run on multiple processors 3) Biom files and R-Packages
  • 23. © ppt by OH22 Pipelines: QIIME  Primarily based on python  Can be used on Mac via MacQIIME  For VirtualBox as well as for Amazon EC2 exist preconfigured images  Original statement from QIIME website: QIIME has a lot of dependencies and can (but doesn’t have to) be very challenging to install.
  • 24. © ppt by OH22 Pipelines: Mothur Pros Cons 1) Expects the user to perform the whole analysis 2) Each step is generating a temporary file 1) Broad range of implemented algorithms and programs 2) Comparatively short runtime 3) Several statistical and graphical tools are available 4) Very active online community Program - Several programs and algorithms are implemented - Written in C++
  • 25. © ppt by OH22 Pipelines: Mothur  Available for Linux, Mac and Windows  Based on C++  Installation is just decompressing the provided ZIP file  Active development / release cycles  Source Code hosted on GitHub
  • 26. © ppt by OH22 Pipeline: MG-Rast Pros Cons 1) Usage via a graphical user interface 2) Very user friendly 3) Krona plots 4) Biom files Web tool - No installation necessary 1) Limited set of possible settings 2) Misleading Information of runtime 3) Assigns priorities  Data privacy? 4) Download problems 5) OTUs? 6)Results?
  • 27. © ppt by OH22 Pipeline: MG-Rast  RESTful API  Uses JSON as its data format  Hosted on AWS  MG-Rast is OpenSource and hosted on GitHub  Local Installation is possible but not supported (We have not been funded to create a readily installable version)
  • 28. © ppt by OH22 Pipelines - Cons  Time consuming!  Waiting for free cores  Complex environment which is mostly configured by an administrator  Information about runtime are lacking  Limited data storage  Command line  Basic knowledge about programming languages is required  Questionable results
  • 29. © ppt by OH22 Further analysis  All pipelines provide tools for the analysis of the dataset (alpha-diversity, beta-diversity etc.)  Biologist love extravagant/individualized plots  R  Very powerful (ggplot2, vegan, BiodiversityR)  EstimateS, STAMP, etc.
  • 30. © ppt by OH22 Why Big Data – Metasystematics databases  Some online tools are available 500MB 2000 Sequences
  • 31. © ppt by OH22 Current Data Storage Requirements page: 31 0.5 PB Twitter’s storage needs today are estimated at 0.5 petabytes per year 1 EB YouTube currently requires from 100 Petabytes to 1 Exabyte for storage
  • 32. © ppt by OH22 Current Data Storage Requirements page: 32 100 PB The National Astronomical Observatory of Japan devotes ~100 petabytes to storage 100 PB For genomics more than 100 petabytes of storage are currently used by only 20 of the largest institutions 1 EB Square Kilometre Array (SKA) project is expected to lead to a storage demand of 1 exabyte per year
  • 33. © ppt by OH22 Future Data Storage Requirements page: 33 2 EB YouTube require between 1 and 2 Exabyte's additional storage per year by 2025 40 EB 40 Exabyte's of storage capacity will be needed by 2025 just for the human genomes.
  • 34. © ppt by OH22 Why Big Data?  40% of all sequences derived from a marine sediment sample could not be assigned to Family level  We have almost no idea about bacterial diversity (uncultured species)  Future goal: The genome of all species should be stored in online available databases (generation gap)
  • 35. © ppt by OH22 Big Data - 3 Vs of Big Data page: 35 MB GB TB PB batch periodic real time table database unstructured web EB
  • 36. © ppt by OH22 Microsoft Azure “Pipeline”  Use of new techniques that are optimized for large data sets  Independent Administration  Virtually no limitation of space and / or Performance  Optimize Workflows  Improved presentation of results  Simple interactive use of the results  Using established bioinformatics algorithms and libraries page: 36
  • 37. © ppt by OH22 Azure Data Lake page: 37 Batch, real-time, and interactive analytics made easy A Platform for data scientists to store and process data of any size and shape.
  • 38. © ppt by OH22 Azure Data Lake Store  HDFS - Apache Hadoop Distributed File System  Integrated in Azure Data Lake Analytics and HDInsight  Future integrations in  Revolution R  Hortonworks, Cloudera, MapR  Spark, Storm, Flume, Sqoop, Kafka  … page: 38
  • 39. © ppt by OH22 Azure Data Lake Store  No fixed limits on file size  Unstructured and structured data in their native format  Massive throughput to increase analytic performance  High durability, availability, and reliability  Azure Active Directory access control page: 39 Brontobytes
  • 40. © ppt by OH22 Azure Data Lake Analytics page: 40  Using U-SQL for analysis  Process unstructured and structured data  Extend U-SQL with C#/.NET  C# expressions  User-defined operators (UDO)  User-defined function (UDF)  User-defined aggregates (UDAGG)  Declarative nature of SQL and custom imperative power of C#
  • 41. © ppt by OH22 .NET BIO  Bioinformatics library for .NET from Microsoft Research  includes parsers for common bioinformatics file formats  algorithms for manipulating DNA, RNA, and protein sequences  connectors to biological web services  Open Source Project from Outercurve  Hosted on GitHub  Apache 2 License
  • 42. © ppt by OH22 Power BI page: 42
  • 43. © ppt by OH22 Power BI  Comprehensive toolset for reporting  Power BI Service  Power BI Desktop  Power BI Mobile Apps  Power BI Developer  Allows the generation of different charts  Additional charts can be generated via R or D3.js page: 43
  • 44. © ppt by OH22 Microsoft Azure “Pipeline" Import Quality Filtering OTU Clustering Chimera Filtering Alignement Taxonomy Assignment Export  Import Sequences generated by Illumina Miseq  FASTQ is a text based format  One record goes over 4 lines  Record includes sequences as well as quality information's   We also import the SILVA database to avoid using web services page: 44 @MSQ-M01442:174:000000000-AEW84:1:1101:10036:8807 1:N:0:CTCTCTACTAGATCGC TAGCGTATATTAAAGTTGTTGAGGTTAAAAAGCTCGTAGTTGAACCTTGGGCCTGGCTGGCCGGTCCGCCTCA ... + EFGGGFFEGGGGGDGGGGGGGGGGGGGGGFEFGGFCEGGEGFFGGGGG<FGGGDGGGGGGGGGGGGGGGGCCC
  • 45. © ppt by OH22 Microsoft Azure “Pipeline" Import Quality Filtering OTU Clustering Chimera Filtering Alignement Taxonomy Assignment Export  Custom Extractor for U-SQL  C# + dotnetbio Library  Preparations:  Libraries require strong names for Azure Data Lake  Some dotenetbio Code has to be adjusted due to some limitations of the original library (memory limitations) page: 45
  • 46. © ppt by OH22 Microsoft Azure “Pipeline" Import Quality Filtering OTU Clustering Chimera Filtering Alignement Taxonomy Assignment Export  Filter data which doesn’t reach our quality criteria  Remove Primer  Remove Sequences shorter than 100 nucleotides  Remove non reverse complemented sequences in dataset  ACGTACCGAT  TGCATGGCTA  Ilumnia Quality Score  Overall < 19%  Per Nucleotid < 15% page: 46
  • 47. © ppt by OH22 Microsoft Azure “Pipeline" Import Quality Filtering OTU Clustering Chimera Filtering Alignement Taxonomy Assignment Export page: 47 public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output) { if (input.Length > 0) { var fqParser = new FastQParser { FormatType = FastQFormatType.Illumina_v1_8, Alphabet = Alphabets.RNA }; IList<QualitativeSequence> seqsOriginal = new FastQParser() .Parse(input.BaseStream) .Cast<QualitativeSequence>() .ToList(); foreach (var item in seqsOriginal) { SequenceToRow(item, output); yield return output.AsReadOnly(); } } else { throw new Exception("input file not found"); } } DEMO
  • 48. © ppt by OH22 Microsoft Azure “Pipeline" Import Quality Filtering OTU Clustering Chimera Filtering Alignement Taxonomy Assignment Export page: 48 @q = SELECT SequenceId, Sequence, Counter, ReversedSequence, ComplementedSequence, ReverseComplementedSequence, IndexOfNonGap, LastIndexOfNonGap, AlphabetName, HasAmbiguity, HasGaps, HasTerminations, IsComplementSupported, LowNucleotides, QualityScore FROM @e WHERE LengthCounter > 100 AND IsComplementSupported == true AND LowNucleotides <= 5 AND QualityScore > 19 AND HasAmbiguity == true;
  • 49. © ppt by OH22 Microsoft Azure “Pipeline" Import Quality Filtering OTU Clustering Chimera Filtering Alignement Taxonomy Assignment Export page: 49
  • 50. © ppt by OH22 Microsoft Azure “Pipeline" Import Quality Filtering OTU Clustering Chimera Filtering Alignement Taxonomy Assignment Export  Cluster sequences into OTU (Operational Taxonomic Unit)  Sequences with a similarity score higher than 97% belong to the same OTU  MUMmer or NUCmer (NUCleotide MUMmer) Algortihm is used by dotnetbio  Current disadvantage of our pipeline: many common algorithms like SortMeRNA, Mothur, TRIE, UCLUST, USEARCH, BLAST, USEARCH61, SUMACLUST, SWARM, CD-Hit can not be used page: 50
  • 51. © ppt by OH22 Microsoft Azure “Pipeline" Import Quality Filtering OTU Clustering Chimera Filtering Alignement Taxonomy Assignment Export  Chimeras are a product of the unwanted combination of two or more sequences  dotnetbio offers a comprehensive range of functions for the Chimera detection page: 51
  • 52. © ppt by OH22 Microsoft Azure “Pipeline" Import Quality Filtering OTU Clustering Chimera Filtering Alignement Taxonomy Assignment Export  Alignment or mapping is the process of searching data in reference databases like BLAST, SILVA or GreenGenes  SILVA is currently the most comprehensive RNA database  4 million 16S/18S  400.000 23S/28S page: 52
  • 53. © ppt by OH22 Microsoft Azure “Pipeline" Import Quality Filtering OTU Clustering Chimera Filtering Alignement Taxonomy Assignment Export  From the perspective of a (relational) database developer, the system is very crude  Data is stored in text files  Sequences are stored in ARB files  Taxonomy data is stored in CSV files  Keys between both tables look like AB002522.1.1416  To search within the database the Needleman-Wunsch algorithm is used page: 53
  • 54. © ppt by OH22 Microsoft Azure “Pipeline" Import Quality Filtering OTU Clustering Chimera Filtering Alignement Taxonomy Assignment Export  Needleman-Wunsch algorithm is used to compare biological sequences  the similarity between two sequences is calculated  ACTCCTTAA  A-TCC--AA page: 54
  • 55. © ppt by OH22 Microsoft Azure “Pipeline" Import Quality Filtering OTU Clustering Chimera Filtering Alignement Taxonomy Assignment Export page: 55 public static int AlignmentScore(string sequenceA, string sequenceB, int gapPenalty, int matchScore, int mismatchScore) { var align = Align(sequenceA, sequenceB, gapPenalty, matchScore, mismatchScore)[1]; var c = align.Count(x => x == '-'); return Convert.ToInt32(100 / align.Length * c); } DEMO
  • 56. © ppt by OH22 Microsoft Azure “Pipeline" Import Quality Filtering OTU Clustering Chimera Filtering Alignement Taxonomy Assignment Export  Taxonomy Assignment is usually one step in the described workflow  Size doesn‘t matter – simply join the data  With U-SQL we already joined the data together in the alignment process page: 56
  • 57. © ppt by OH22 Microsoft Azure “Pipeline" Import Quality Filtering OTU Clustering Chimera Filtering Alignement Taxonomy Assignment Export  Data is exported to an Azure SQL Database  Disadvantage:  You need to create credentials via PowerShell to use an external database  You will need to allow an IP range in the Azure SQL Database firewall (25.66.0.0 to 25.66.255.255).  After export the data it can easily consumed by external applications like Power BI page: 57
  • 58. © ppt by OH22 Microsoft Azure “Pipeline"  Visualize data with Microsoft Power BI  Connect to our Azure SQL Database  Make shure you have allowed windows azure services for your server  It’s also possible to access data directly from ADLS  Create Reports and Dashboards  Ask questions about your data page: 58
  • 59. © ppt by OH22 Microsoft Azure “Pipeline" page: 59 DEMO
  • 60. © ppt by OH22 Microsoft Azure “Pipeline"  Strange Error Messages (mixed english and local language)  Problems on installing external packages  no reasonable Editor for R Scripts page: 60
  • 61. © ppt by OH22 Microsoft Azure “Pipeline"  Future Insights:  With Microsoft Power BI and SQL Server 2016 we can now use R  This provides more opportunities for the processing of data with existing R scripts and libraries  Also this offers the possibility to visualize results with Power BI and SSRS page: 61
  • 62. © ppt by OH22 Pipelines: Microsoft Azure “Pipeline” Pros Cons 1) Limitations by the dotnetbio Library 2) "Commandline" - Yes, you still need development skills 3) Storage and compute power is not for free 4) Almost none existing community 5) Microsoft is hardly recognized by universities 1) Allows independent analysis even of large amounts of data 2) No limitations by memory or compute power 3) Easy extensibility through U- SQL and C # 4) Can be automated 5) Powerful and interactive visualizations through Power BI Program - Azure Data Lake - Azure SQL Database - Power BI - U-SQL, C#, dotnetbio
  • 63. © ppt by OH22 THANK YOU TO OUR SPONSORS