Genome_annotation@BioDec:
Python all over the place.
Ivan Rossi
ivan@biodec.com
@rouge2507
Hello
● BioDec has been doing bioinformatics since 2002
● Bioinformatics software development
● Bioinformation management system, BioDecoders
● Bioinformatics Consulting
● Development, engineering and integration of custom solutions
● Annotated databases of biosequences (e.g. genomes)
● Our Forte
● Protein-sequence analysis
● Trans-membrane proteins
● Machine-learning
● Python is everywhere
The Challenge:
from Sequence to Function
>BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus.
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG
DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE
SKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYH
WPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDE
YSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGI
KSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITR
GNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVS
LAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPY
YLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNT
KRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH
Protein Function
Gene Sequence
Protein Sequence (~10^7)
Protein Structure (10^5)
Problems in Sequence Analysis
Information Overflow:
very large sets of data available
High Throughput:
New data must be processed at high speed
(volume of data, time constraints)
Open Problems:
difficult to provide a simple first-principle or a
model-based solution
Alignments
OmpA APKDNTWYTGAKLGWS QYHDTGLINNNGPTHEN KLGAGAFGGYQV NPYVGFEMGYDWLGR
OEP21 IDTNTFFQVRGGLD TKT---------------GQPS SGSALIRHF YPNFSATLGVGVRYD
OmpA MPYKGSVENGA YKAQGVQLTAKLGYP ITDDLDIYTRLGGMVWRADT YSNVYGKN HDTGVS
OEP21 KQDSVGVRYAKND KLRYTVLAKKT FPVTNDGLVNFKIK GGCDVDQD-------FKE WKSR
OmpA PVFAGGVEYA I-TPEIATRLEYQW TNNIGDAHTIGTRPDNG MLSLGVSYRF G-----
OEP21 GGAEFSWNVF NFQKDQDVRLRIGYE AFEQV-PYLQIRE NNWTFNADYKGRWNVRYD L
Alignments of some kind are the main tool for
sequence comparison and database search
OmpA: PDB 1BXW, SwissProt OMPA_ECOLI
OEP21: Transmembrane Domain (24-177)
Tools from machine learning
[Diagram labels: Known sequences (DB subsets); Known structures; Known mapping; ANN, HMM, SVM; General Rules; New sequence; Prediction]
Example sequence: TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN
● Artificial Neural Networks (ANNs)
● Hidden Markov Models (HMMs)
● Support Vector Machines (SVMs)
A 0 0 0 0 0 0 0 0 0 0 0 10 0 0 0 0
C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
D 0 0 70 0 0 0 0 60 0 0 0 0 20 0 0 0
E 0 0 0 0 0 0 0 0 0 0 0 0 70 0 0 0
F 0 0 0 10 0 33 0 0 0 0 0 0 0 0 0 0
G 10 0 30 0 30 0 100 0 0 0 0 50 0 0 0 0
H 0 0 0 0 10 0 0 10 30 0 0 0 0 0 0 0
K 0 40 0 0 0 0 0 0 10 100 70 0 0 0 0 100
I 0 0 0 0 0 0 0 0 30 0 0 0 0 0 0 0
L 0 0 0 0 0 0 0 30 0 0 0 0 0 0 0 0
M 0 0 0 0 0 0 0 0 0 0 0 0 0 60 0 0
N 0 0 0 0 10 0 0 0 0 0 30 10 0 0 0 0
P 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Q 0 0 0 0 40 0 0 0 30 0 0 0 0 0 0 0
R 0 50 0 0 0 0 0 0 0 0 0 0 0 0 0 0
S 0 0 0 0 0 33 0 0 0 0 0 0 10 10 0 0
T 20 0 0 0 0 33 0 0 0 0 0 30 0 30 100 0
V 0 0 0 0 10 0 0 0 0 0 0 0 0 0 0 0
W 0 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Y 70 0 0 90 0 0 0 0 0 0 0 0 0 0 0 0
Evolutionary Information
1 Y K D Y H S - D K K K G E L - -
2 Y R D Y Q T - D Q K K G D L - -
3 Y R D Y Q S - D H K K G E L - -
4 Y R D Y V S - D H K K G E L - -
5 Y R D Y Q F - D Q K K G S L - -
6 Y K D Y N T - H Q K K N E S - -
7 Y R D Y Q T - D H K K A D L - -
8 G Y G F G - - L I K N T E T T K
9 T K G Y G F G L I K N T E T T K
10 T K G Y G F G L I K N T E T T K
Sequence position
MSA
Seq. Profile
Sequence profile
Given a Multiple Sequence Alignment (MSA) of similar sequences,
associate with each position a 20-valued vector containing the
relative amino acid composition of the aligned sequences.
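The definition above can be sketched in a few lines of Python. This is a minimal illustration, not BioDec's code: the function name `sequence_profile` is hypothetical, and the choice to leave gaps out of the denominator is an assumption (it is consistent with the 33/33/33 column in the table). Only the residues present in a column are stored; the remaining entries of the 20-valued vector are implicitly zero.

```python
from collections import Counter

# The ten aligned sequences shown on the slide ('-' marks a gap).
MSA = [
    "YKDYHS-DKKKGEL--",
    "YRDYQT-DQKKGDL--",
    "YRDYQS-DHKKGEL--",
    "YRDYVS-DHKKGEL--",
    "YRDYQF-DQKKGSL--",
    "YKDYNT-HQKKNES--",
    "YRDYQT-DHKKADL--",
    "GYGFG--LIKNTETTK",
    "TKGYGFGLIKNTETTK",
    "TKGYGFGLIKNTETTK",
]


def sequence_profile(msa, gap="-"):
    """Return one {residue: percentage} dict per alignment column.

    Assumption: gaps are ignored, so percentages are relative to the
    number of non-gap residues in the column.
    """
    profile = []
    for column in zip(*msa):
        counts = Counter(res for res in column if res != gap)
        total = sum(counts.values())
        if total:
            profile.append({res: 100.0 * n / total for res, n in counts.items()})
        else:
            profile.append({})  # column made only of gaps
    return profile


profile = sequence_profile(MSA)
print(profile[0])  # composition of the first alignment column
```

In a real pipeline the alignment would come from a database search (for example a PsiBlast run) rather than from a hard-coded list.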
Why Python? (2.1.x, in 2002)
● Common ground, easy to pick up
● Expressive: productive, fast prototyping
● Maintainable: readable after months
● Useful tools and libs (e.g. BioPython)
● Retrospective:
We were f...ing RIGHT!
Hidden Markov Models
Very powerful tools when:
● The system can be modeled in probabilistic terms.
● There is a ‘grammar of the problem’
● There is a “limited sequential dependency” that can
model the problem (at least to a rough approx)
[Diagram: two states, N and T, with transition probabilities labelled 0.99 and 0.01]
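To make the "limited sequential dependency" idea concrete, here is a minimal Viterbi decoder for a two-state model with states N and T. Everything numeric except 0.99 and 0.01 is an assumption made for illustration: the reading that 0.99 is the probability of staying in a state and 0.01 of switching, the toy two-letter alphabet (`h` for hydrophobic, `p` for polar), the emission probabilities and the uniform start probabilities are all hypothetical.

```python
import math

STATES = ("N", "T")
# Assumed reading of the diagram: 0.99 to stay in a state, 0.01 to switch.
TRANS = {"N": {"N": 0.99, "T": 0.01}, "T": {"N": 0.01, "T": 0.99}}
# Hypothetical emission probabilities over a toy alphabet:
# 'h' = hydrophobic residue, 'p' = polar residue.
EMIT = {"N": {"h": 0.3, "p": 0.7}, "T": {"h": 0.9, "p": 0.1}}
# Hypothetical uniform start probabilities.
START = {"N": 0.5, "T": 0.5}


def viterbi(obs):
    """Return the most probable state path for a non-empty observation string."""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    back = []  # one {state: best previous state} dict per step
    for symbol in obs[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][symbol]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi("p" * 10 + "h" * 12 + "p" * 10))
```

Because switching state is expensive (0.01), the decoder only labels a region T when the hydrophobic stretch is long enough to pay for two transitions, which is the kind of "grammar" the slide refers to.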
99HMMers
[Diagram: model states: Start, Signal Peptide, TM1 to TM7, Insertion loop, Inside loop, Outside loop, End]
Profile-HMM, based on:
http://www.biocomp.unibo.it/piero/PHMM
BioPython
BioPython (http://biopython.org) is a community-developed (O|B|F)
set of Python libraries and tools for bioinformatics.
● The parsers for file formats and applications (vital)
● The Sequence objects
● Bio.SeqIO, Bio.AlignIO, Bio.PDB
● Specialized External-application wrappers
● BioSQL interface
BioSQL
BioSQL (http://www.biosql.org) is a generic
relational model (a schema) covering
sequences, features, sequence and feature
annotation, a reference taxonomy, and
ontologies.
● Works with all O|B|F Bio* projects
● We extended it to suit our special needs
Ruffus
Ruffus (http://www.ruffus.org.uk/) is a
Computation Pipeline library for Python, designed
to allow easy analysis automation.
● Acts like a pythonic Make on steroids
● Write your Python functions and decorate them
– @originate, @transform, @merge and more
● Pipeline handling
– Run pipelines make-style (pipeline_run)
– Schedule pipelines on SGE compute clusters (run_job)
Angler pipeline
Angler annotates and classifies protein sequences
● Proteome → Generate profiles → Predictions → Classify → Proteome Atlas (a DB)
● Predictions:
– Signal peptides
– Betabarrels
– Alpha-helical TMP
– Fold recognition
– Coiled coils
– Disordered regions
– Sub-cellular localization
ZenDock
Analyzes the protein solvent-exposed surface for putative “interactor”
residues, returning a “fuzzy” (probabilistic) answer.
Interactors are correlated and grouped into patches.
Results are mapped on the protein 3D structure and made available
through a web interface.
[Figure: contact-shell profile, Int vs. non-Int]
If you can't outrun them...
The Problem
● Full profile building is the slow step
– It takes 30 seconds to 5 minutes for a 3-pass PsiBlast run (uniref90)
– Repeat for ~10^5 sequences … CPU weeks per genome.
● Major genomes are updated every 3 months
● Micro-SME: limited resources
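A quick back-of-the-envelope check of the "CPU weeks" figure, using only the numbers on the slide (about 10^5 sequences, 30 seconds to 5 minutes per run). The helper name `cpu_weeks` is hypothetical.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600


def cpu_weeks(n_sequences, seconds_per_run):
    """Total single-CPU time, in weeks, to build one profile per sequence."""
    return n_sequences * seconds_per_run / SECONDS_PER_WEEK


low = cpu_weeks(10**5, 30)       # fastest case: 30 seconds per run
high = cpu_weeks(10**5, 5 * 60)  # slowest case: 5 minutes per run
print(f"{low:.1f} to {high:.1f} CPU-weeks per genome")
```

That is roughly 5 to 50 CPU-weeks for a single genome, to be repeated at every quarterly update.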
… try to outsmart them.
● Sequence space is redundant
– Both intra-genome and inter-genome
● Profiles are built incrementally
– PsiBlast is an iterative algorithm
● PsiBlast is deterministic
– Given the same sequence, database, and number
of iterations you get the same profile
Our accelerator: the PyBlastCache
1) Hash the sequence
2) Version the reference protein database
3) Store computed profiles in a key-value store
– Key as a combination of seq. hash and DB version
4) Compute
● If full_key_match: skip_and_copy()
● If seq_key_match: update_profile(seq, itn=1)
● If no_key: create_profile(seq, itn=3)
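The caching logic above can be sketched as follows. This is not the actual PyBlastCache code: the class `ProfileCache`, the use of SHA-256, the in-memory dicts standing in for the key-value store, and the DB version strings are all assumptions. `create_profile` and `update_profile` are hypothetical stand-ins for PsiBlast runs, and `update_profile` takes the cached profile as an extra argument that the slide does not show.

```python
import hashlib


def sequence_key(seq):
    """Hash a protein sequence, ignoring case and whitespace."""
    normalized = "".join(seq.split()).upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


def create_profile(seq, itn=3):
    # Hypothetical stand-in for a full PsiBlast run with `itn` iterations.
    return {"iterations": itn}


def update_profile(seq, old_profile, itn=1):
    # Hypothetical stand-in: restart from the cached profile and run
    # `itn` further iterations against the new database version.
    return {"iterations": old_profile["iterations"] + itn}


class ProfileCache:
    """In-memory sketch of a profile cache keyed by (sequence hash, DB version)."""

    def __init__(self):
        self._profiles = {}  # (sequence hash, DB version) -> profile
        self._latest = {}    # sequence hash -> DB version of the newest profile

    def get_profile(self, seq, db_version):
        seq_key = sequence_key(seq)
        full_key = (seq_key, db_version)
        if full_key in self._profiles:
            # Full key match: reuse the stored profile, no computation.
            return self._profiles[full_key], "hit"
        if seq_key in self._latest:
            # Same sequence, older DB version: cheap one-iteration update.
            old = self._profiles[(seq_key, self._latest[seq_key])]
            profile, status = update_profile(seq, old, itn=1), "updated"
        else:
            # Unknown sequence: full three-iteration run.
            profile, status = create_profile(seq, itn=3), "created"
        self._profiles[full_key] = profile
        self._latest[seq_key] = db_version
        return profile, status


cache = ProfileCache()
print(cache.get_profile("MYSFPNSFRFGW", "2014_01")[1])  # first time seen
print(cache.get_profile("MYSFPNSFRFGW", "2014_01")[1])  # same sequence, same DB
print(cache.get_profile("MYSFPNSFRFGW", "2014_02")[1])  # same sequence, new DB
```

The scheme relies on the two properties stated earlier: sequence space is redundant, so hits are frequent, and PsiBlast is deterministic, so a cached profile is as good as a recomputed one.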
The (Python) front-ends
● Plone: a CMS
– https://plone.org
● Web2py: an MVC framework
– http://www.web2py.com
● Galaxy: web interface + workflow engine
– Focus on reproducible research
– https://wiki.galaxyproject.org/
– SaaS: https://usegalaxy.org
Plone4Bio
● A BioSQL browser, based on Plone, to search and
display data and metadata (annotations) from
biosequence databases. Could integrate predictors;
● We publicly released the base version as open-source
software at http://plone4bio.org;
● Used to be the base for some commercial software
we sold to clients.
Plone4Bio screenshots
Bologna, 21/1/2010
LIMS features
Galaxy
Galaxy is an open, web-based platform for accessible,
reproducible, and transparent computational biomedical
research.
– Users without programming experience can easily specify
parameters and run tools and workflows.
– Galaxy captures information in order to allow complete repeats
of a computational analysis.
– Users share and publish analyses via the web and create
Pages, interactive, web-based documents that describe a
complete analysis.
● Accepted as material by peer-reviewed journals
Galaxy highlights
Galaxy is useful to both end users and bioinformatics devs.
● Get data directly from online DBs (UCSC, Biomart,...)
● Handling of data from lab instrumentation (e.g. NGS seqs)
● Map calculated data on online viewers (e.g. genome viewer)
● Easily extensible: wrapping a foreign tool is as simple as
writing an XML file.
● Data sharing (workflows, libraries, tools...)
● The community!
Snapshots
From https://usegalaxy.org
Visual programming
Thou Shalt Care For The DATA
● So much junk in the literature!!
– Both for features and data sets
● Use training, testing and validation sets
● The sets should always be disjoint
– Below 25% seq ID
● Redundancy is THE ENEMY
● Avoid feature bloat, use feature selection
● Always compare results with a nearest-neighbor method
– Good ones are really hard to beat
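A minimal sketch of the "disjoint sets below 25% seq ID" rule: keep a candidate test sequence only if its identity to every training sequence is below the threshold. The function names are hypothetical, and the identity measure is deliberately simplified. It assumes the two sequences are already aligned to the same length, which in practice requires a real alignment step first.

```python
def percent_identity(a, b, gap="-"):
    """Percent identity of two pre-aligned, equal-length sequences.

    Simplifying assumption: positions with a gap in either sequence
    are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def non_redundant(candidates, reference, max_identity=25.0):
    """Keep only candidates below max_identity to every reference sequence."""
    return [
        cand for cand in candidates
        if all(percent_identity(cand, ref) < max_identity for ref in reference)
    ]


training = ["ACDEFGHIKL"]
candidates = ["ACDEFGHIKM", "MNPQRSTVWY"]
print(non_redundant(candidates, training))
```

The first candidate differs from the training sequence by one residue and is rejected; keeping it in the test set would inflate the measured accuracy.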
No Free Lunch
● There is no killer method
– Choose the method that best models your domain
(e.g. sequences → HMMs)
– Data curation is always more important
● Be Humble, be Honest!
Meditation hint: http://www.no-free-lunch.org/
The community is your friend.
Give back to the community.
