How to be a bioinformatician
Christian Frech, PhD
St. Anna Children’s Cancer Research Institute, Vienna, Austria
Talk at University of Applied Sciences, Hagenberg, Austria
April 23rd, 2014
What is a bioinformatician?
[Figure: diagram with the labels Informatician, Statistician, Biologist, and Data scientist]
Modified from http://blog.fejes.ca/?p=2418
Bioinformatician vs. computational biologist
Bioinformatician (stronger grasp of computational subjects, weaker grasp of biological subjects):
• IT savvy
• Builds & maintains biological databases & Web sites
• Designs & implements clever algorithms
• C/C++, Java, Python
Computational biologist (weaker grasp of computational subjects, stronger grasp of biological subjects):
• Asks biological questions
• Analyzes & interprets biological data
• Runs existing programs
• Ad hoc scripting
• Perl, R, Python
... or vice versa
Why do we need bioinformaticians?
• The amount of biological data being generated requires sophisticated computing for data management and analysis
• Programmers lack biological knowledge
• Biologists don't program
• The two don't understand each other
Biologist talks to statistician: http://www.youtube.com/watch?v=Hz1fyhVOjr4
The latest Illumina sequencing chemistry, shipped last week (HiSeq v4 reagent kit), outputs 1 terabase (Tb) of data in 6 days! [1]
[1] http://www.illumina.com/products/hiseq-sbs-kit-v4.ilmn
What are bioinformaticians doing?
Word cloud from manuscript titles published in Bioinformatics from Jan 2013 to April 2014
Challenges as a bioinformatician
• Biology is complex, not black and white
  • As many exceptions as rules (e.g., define “gene”)
  • No single optimal solution to a problem
  • Results interpretable in many ways (storytelling, cherry-picking)
  • Understanding the biological question
• Field is moving incredibly fast
  • Lack of standards, immature/abandoned software
  • Standard of today obsolete tomorrow
  • Much time spent on collecting/cleaning up data and troubleshooting errors
  • Stay flexible, don't overinvest in a single platform/technology
• Hundreds of software tools and databases out there
  • Easy to get lost
  • Important to understand their strengths and weaknesses
Which tools should I use?
179 tools (heard of: 65%; used: 30%)
http://omictools.com/
Things to have in your bioinformatics toolbox
Must haves:
• Linux command line
• Scripting language with associated Bio* library (BioPerl, BioPython, R/Bioconductor, …)
• Basic statistical tests, regression, p-values, maximum likelihood, multiple testing correction
• Sequence alignment (FASTA & BLAST)
• Biological databases
• Regular expressions
• Sequencing technologies
• Web technologies (HTML, XML, …)
Highly recommended:
• Advanced R skills
• Parallel/distributed computing
• DBMS, SQL
• (Semi-)compiled language (C/C++, Java)
• Dimensionality reduction (e.g. PCA)
• Cluster analysis
• Support Vector Machines
• Hidden Markov models
• Web framework (e.g. Django)
• Version control system (e.g. Git)
• Advanced text editor (Emacs, vim)
• IDE (e.g. Eclipse, NetBeans)
What programming language should I learn?
[Table: each requirement below was paired with a recommended language; the language column is not preserved in this text]
• Speed matters, low-level programming
• Rich-client enterprise application development
• Text file processing (regex)
• Statistical analysis, fancy plots
• Rapid prototyping, readable & maintainable scripts
• Workflow automation
Be a jack of all trades, master of ONE!
Perl on decline, R and Python gaining popularity
http://computationalproteomic.blogspot.co.uk/2013/10/which-are-best-programming-languages.html
http://openwetware.org/wiki/Image:Most_Popular_Bioinformatics_Programming_Languages.png
Perl most popular bioinformatics
programming language in 2008
R and Python take the lead in 2014
Top 10 most common and/or annoying mistakes in bioinformatics
Inspired by “What Are The Most Common Stupid Mistakes In Bioinformatics?” (https://www.biostars.org/p/7126/)
#10 - Using genome coordinates with the wrong genome version
(for example, using gene coordinates from human genome version hg18 but the reference sequence from version hg19)
#9 - Forgetting to process the second strand of the DNA sequence
#8 - Processing the second strand of the DNA sequence, but taking the reverse instead of the reverse complement sequence
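To make mistake #8 concrete, here is a minimal Python sketch contrasting a plain reversal with a proper reverse complement. The function name `reverse_complement` and the example sequence are illustrative assumptions, not taken from the talk; in practice a Bio* library offers the same operation.

```python
# Map each base to its Watson-Crick partner; N (unknown base) stays N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


forward = "ATGCCN"                   # hypothetical example sequence
print(forward[::-1])                 # reverse only: the common mistake
print(reverse_complement(forward))   # what the second strand actually reads
```

Applying `reverse_complement` twice should return the original sequence, which makes a handy sanity check.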
#7 - Not accounting for different human chromosome names between UCSC and Ensembl
Example: UCSC uses “chr1”, Ensembl uses “1”
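One way to guard against mistake #7 is to normalize chromosome names before joining data from both sources. The sketch below is a minimal Python illustration; the helper names `to_ucsc` and `to_ensembl` are hypothetical, and the mitochondrial special case ("MT" in Ensembl, "chrM" in UCSC) is an addition beyond the slide's example. It only covers the primary chromosomes, not unplaced scaffolds or alternate contigs.

```python
def to_ucsc(chrom: str) -> str:
    """Convert an Ensembl-style name ("1", "X", "MT") to UCSC style ("chr1", "chrX", "chrM")."""
    if chrom.startswith("chr"):
        return chrom  # already UCSC style
    return "chrM" if chrom == "MT" else "chr" + chrom


def to_ensembl(chrom: str) -> str:
    """Convert a UCSC-style name ("chr1", "chrM") to Ensembl style ("1", "MT")."""
    if not chrom.startswith("chr"):
        return chrom  # already Ensembl style
    return "MT" if chrom == "chrM" else chrom[3:]


print(to_ucsc("1"), to_ensembl("chrX"))
```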
#6 - Assuming the alphabetical order of chromosome names is “chr1”, “chr2”, “chr3”, … when in fact it is “chr1”, “chr10”, “chr11”, …
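The ordering problem in mistake #6 can be sidestepped with an explicit sort key. This is a minimal Python sketch; `chrom_sort_key` is a hypothetical helper that puts numbered chromosomes first in numeric order and leaves the remaining names (X, Y, M, ...) in alphabetical order.

```python
def chrom_sort_key(name: str):
    """Sort key: numbered chromosomes numerically, all other names alphabetically after them."""
    label = name[3:] if name.startswith("chr") else name
    if label.isdigit():
        return (0, int(label), "")
    return (1, 0, label)


names = ["chr10", "chr2", "chrX", "chr1", "chr11"]
print(sorted(names))                      # plain string sort: chr10 and chr11 come before chr2
print(sorted(names, key=chrom_sort_key))  # numeric-aware order
```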
#5 - Assuming a 'tab' field separator when in fact it is 'blank' (or vice versa)
(the two look almost identical in a text editor)
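A cheap defence against mistake #5 is to check the number of fields after splitting, so a wrong separator fails loudly instead of silently. The following Python sketch assumes a hypothetical helper `split_fields` and made-up example lines.

```python
def split_fields(line: str, expected: int, sep: str = "\t") -> list:
    """Split a line on an explicit separator and verify the field count."""
    fields = line.rstrip("\r\n").split(sep)
    if len(fields) != expected:
        raise ValueError(
            f"expected {expected} fields separated by {sep!r}, got {len(fields)}: {line!r}"
        )
    return fields


tab_line = "chr1\t100\t200\tgene A"
print(split_fields(tab_line, expected=4))

# A blank-separated line contains no tab, so splitting on tab yields a single field.
print(len("chr1 100 200 gene A".split("\t")))
```

Splitting on arbitrary whitespace with `line.split()` is not a safe fallback either, because a value such as "gene A" would then be broken into two fields.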
#4 - Assuming a DNA sequence consists of only four letters (A, T, C, G) while in fact there is a fifth: 'N' for a missing base ('X' for a missing amino acid)
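As an illustration of mistake #4, the Python sketch below computes GC content while treating 'N' as a legitimate but uninformative character. The function names are hypothetical, and rejecting every letter other than A, C, G, T, and N is a simplifying assumption: real files may also contain other IUPAC ambiguity codes.

```python
from collections import Counter


def base_composition(seq: str) -> Counter:
    """Count bases, accepting N for a missing base and rejecting anything else."""
    counts = Counter(seq.upper())
    unexpected = set(counts) - set("ACGTN")
    if unexpected:
        raise ValueError(f"unexpected characters in sequence: {sorted(unexpected)}")
    return counts


def gc_content(seq: str) -> float:
    """GC fraction over called bases only, so runs of N do not dilute the result."""
    counts = base_composition(seq)
    called = counts["A"] + counts["C"] + counts["G"] + counts["T"]
    return (counts["G"] + counts["C"]) / called if called else 0.0


print(gc_content("GCNNAT"))
```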
#3 - Forgetting to use dos2unix on a Windows text file before processing it under Linux
plus spending 1 hour debugging the problem
plus being tricked by this multiple times
Text file line breaks differ between platforms: Linux (LF); Windows (CR+LF); classic Mac (CR).
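The symptom behind mistake #3 is a stray carriage return glued to the last field of every line. The Python sketch below shows the effect on raw bytes and a rough stand-in for the conversion; `has_crlf` and `to_unix` are hypothetical helpers, not part of dos2unix, and `to_unix` additionally converts lone CR (classic Mac) line breaks.

```python
def has_crlf(path: str, sample_size: int = 1 << 16) -> bool:
    """Report whether the first bytes of a file contain a Windows line break."""
    with open(path, "rb") as handle:  # binary mode: no newline translation
        return b"\r\n" in handle.read(sample_size)


def to_unix(data: bytes) -> bytes:
    """Convert CR+LF and lone CR line breaks to LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


windows_line = b"DEC1\t42\r\n"
last_field = windows_line.rstrip(b"\n").split(b"\t")[-1]
print(last_field)             # the carriage return is still attached to the value
print(to_unix(windows_line))
```

Note that Python's default text mode already translates line breaks on reading, so the problem typically shows up with shell tools, awk, or Perl, or when files are read in binary mode.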
#2 - When importing data into MS Excel, letting it auto-convert HUGO gene names into dates and forgetting about it
(e.g., tumor suppressor gene “DEC1” will be converted to “1-DEC” on import; ~30 genes affected in total)
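A gene list can be screened for mistake #2 before or after it passes through a spreadsheet. The Python sketch below is an illustration under assumptions: the month-name patterns and the helper names `at_risk` and `already_damaged` are mine, and the symbols SEPT2 and MARCH1 in the usage lines are additional examples beyond the slide's DEC1.

```python
import re

# Symbols that look like "<month><day>", e.g. DEC1, and may be read as a date.
_AT_RISK = re.compile(
    r"^(JAN|FEB|MAR|MARCH|APR|MAY|JUN|JUL|AUG|SEP|SEPT|OCT|NOV|DEC)-?\d{1,2}$",
    re.IGNORECASE,
)
# Values that look like an already converted date, e.g. 1-DEC.
_DAMAGED = re.compile(
    r"^\d{1,2}-(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)$",
    re.IGNORECASE,
)


def at_risk(symbol: str) -> bool:
    """True if a gene symbol resembles a date and may be auto-converted."""
    return bool(_AT_RISK.match(symbol))


def already_damaged(value: str) -> bool:
    """True if a value looks like a gene symbol that was turned into a date."""
    return bool(_DAMAGED.match(value))


symbols = ["DEC1", "TP53", "SEPT2", "MARCH1", "1-DEC"]
print([s for s in symbols if at_risk(s)])
print([s for s in symbols if already_damaged(s)])
```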
#1 - Off-by-one error
There are only two common problems in bioinformatics: (1) lack of standards, (2) ID conversion, and (3) off-by-one errors
http://en.wikipedia.org/wiki/Off-by-one_error
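A typical source of off-by-one errors is mixing coordinate conventions. The Python sketch below uses two formats as an example that the talk itself does not name: BED intervals are 0-based and half-open, while GFF intervals are 1-based and inclusive. The helper names and the toy sequence are hypothetical.

```python
def bed_to_gff(start: int, end: int) -> tuple:
    """0-based half-open [start, end) -> 1-based inclusive [start, end]."""
    return start + 1, end


def gff_to_bed(start: int, end: int) -> tuple:
    """1-based inclusive [start, end] -> 0-based half-open [start, end)."""
    return start - 1, end


def bed_length(start: int, end: int) -> int:
    return end - start


def gff_length(start: int, end: int) -> int:
    return end - start + 1  # forgetting the +1 is the classic off-by-one


seq = "ACGTACGT"
bed_start, bed_end = 2, 5
gff_start, gff_end = bed_to_gff(bed_start, bed_end)

# Both coordinate pairs describe the same three bases.
print(seq[bed_start:bed_end], seq[gff_start - 1:gff_end])
print(bed_length(bed_start, bed_end), gff_length(gff_start, gff_end))
```

Python slices happen to follow the 0-based half-open convention, which is why the BED coordinates can be used directly while the GFF start needs adjusting.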
Ten personal recommendations for your future work as a bioinformatician
#1 - Learn Linux!
• Most bioinformatics tools are not available on Windows
• Linux file systems are better for many and/or very large files
• Command line interface (CLI) has advantages over graphical user interface (GUI)
  • Recorded command history (reproducibility)
  • One keystroke to re-run an analysis, instead of repeating 100 mouse clicks
• Linux CLI (shell) is much more powerful than the Windows CLI
#2 - Embrace the “Unix tools philosophy”
• Small programs (“tools”) instead of monolithic applications
  • Designed for simple, specific tasks that are performed well (awk, cat, grep, wc, etc.)
  • Many well-documented parameters
  • Combined with Unix pipes (read from STDIN, write to STDOUT)
  • Example: cut -f 3 myfile.txt | sort | uniq
• Advantages
  • Great flexibility, easy re-use of existing tools
  • Intermediate output can be stored and inspected for troubleshooting
  • Complex tasks can be performed quickly with shell 'one-liners'
• This paradigm fits bioinformatics well, where often many heterogeneous data files need to be processed in many different ways
http://www.linuxdevcenter.com/lpt/a/302
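For readers more at home in a scripting language, the pipe `cut -f 3 myfile.txt | sort | uniq` can be mirrored by chaining small generators, each doing one job. This Python sketch is only an approximation: the function names `cut` and `uniq` and the example rows are assumptions, a line with fewer than three fields raises an error here (unlike `cut`), and the sort order of the shell command may differ with locale settings.

```python
from itertools import groupby


def cut(lines, field: int, sep: str = "\t"):
    """Yield the given 1-based field of each line, like `cut -f`."""
    for line in lines:
        yield line.rstrip("\r\n").split(sep)[field - 1]


def uniq(sorted_values):
    """Collapse runs of identical adjacent values, like `uniq`."""
    for value, _group in groupby(sorted_values):
        yield value


rows = ["s1\tliver\tBRCA1\n", "s2\tlung\tTP53\n", "s3\tliver\tBRCA1\n"]
print(list(uniq(sorted(cut(rows, 3)))))
```

Each stage consumes an iterable and produces one, so stages can be recombined freely, which is the same property that makes the shell tools composable.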
Example NGS use case demonstrating the power of the Unix tools philosophy
samtools mpileup -uf ref.fa aln.bam |
bcftools view -cg - |
vcfutils.pl vcf2fq > cns.fq
• Explanation
  • 'samtools mpileup' piles up short reads from the input BAM file for each position in the reference genome
  • 'bcftools view' calls the variants
  • 'vcfutils.pl vcf2fq' computes the consensus sequence
  • The resulting FASTQ sequence is redirected to the output file cns.fq
• By knowing the available tools and their parameters, bioinformatics 'wizards' can get complex stuff done in almost no time
http://samtools.sourceforge.net/mpileup.shtml
#3 - Don't reinvent the wheel
• Coding is fun, but look around before you hack into your keyboard
• Don't write the 29th FASTA file parser if proven solutions are available
  • BioPerl
  • BioPython
  • Bioconductor
#4 - If you happen to invent a wheel, …
• Document source and parameters well
• Use a version control system (git, svn)
• Deposit code in a public repository
  • sourceforge.net
  • github.com
• Write test cases
#5 - Automate pipelines with GNU Make
• Developed in the 1970s to build executables from source files
• Incredibly useful for data-driven workflows as well
  • Automatic error checking
  • Parallelization (utilize multiple cores)
  • Incremental builds (re-start your pipeline from point of failure)
  • Bug-free
• Get started at http://www.bioinformaticszen.com/post/decomplected-workflows-makefiles/
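The idea behind incremental builds is simple: a target is rebuilt only if it is missing or older than one of its prerequisites. The Python sketch below illustrates that rule with file modification times; `needs_rebuild` is a hypothetical helper meant to explain the concept, not a substitute for Make, which also handles dependency graphs, parallel jobs, and error checking.

```python
import os


def needs_rebuild(target: str, prerequisites: list) -> bool:
    """True if the target is missing or older than any of its prerequisites."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(path) > target_mtime for path in prerequisites)
```

A pipeline step would call `needs_rebuild("counts.txt", ["reads.txt"])` (file names made up here) and skip the work when it returns False, which is how a failed run can resume from the point of failure.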
#6 - Value your time
• Architecture vs. accomplishment
  • “Perfect is the enemy of the good” (Voltaire)
  • OO design and normalized databases are nice, but can be overkill if requirements change from analysis to analysis
• Automate what can be automated
  • Reproducibility
  • Easy to repeat an analysis with slightly changed parameters
• BUT: Don't spend two days automating a one-time analysis that can be done manually in 10 minutes
#7 - Make use of free online resources to learn about specialized topics
• www.coursera.org
  • Bioinformatics Algorithms (https://www.coursera.org/course/bioinformatics)
  • Computing for Data Analysis (https://www.coursera.org/course/compdata)
  • R Programming (https://www.coursera.org/course/rprog)
• https://www.edx.org/
  • Data Analysis for Genomics (https://www.edx.org/course/harvardx/harvardx-ph525x-data-analysis-genomics-1401#.U1TUbXV52R8)
  • Introduction to Biology (https://www.edx.org/course/mitx/mitx-7-00x-introduction-biology-secret-1768#.U1TVL3V52R8)
• http://rosalind.info/problems/locations/
#8 - Become an expert
• Identify an area of interest and get really good at it
• Work at places where you can learn from the best
• Spend time abroad
  • Great experience
  • Labs/companies will hire you not only for what you know, but also for who you know
#9 - Decide early on if you want to stay in academia or go into industry
Academia:
• PhD highly recommended
• Take your time to find a compatible supervisor
+ Freedom to pursue own ideas
+ Very flexible working hours
+ Work independently
- Steep & competitive career ladder (postdoc >> PI/prof)
- Lower pay
- Publish or perish
Industry:
• PhD beneficial (to get in), but not necessarily required for daily work (e.g. build/maintain databases)
+ More frequent (positive) feedback
+ Higher pay
+ Job security
- More (external) deadlines
- Higher pressure to get things done
See also “Ten Simple Rules for Choosing between Industry and Academia” (David B. Searls, 2009)
#10 - Stay informed & get connected
• Follow literature and blogs
  • http://en.wikipedia.org/wiki/List_of_bioinformatics_journals
  • http://www.homolog.us/blogs/blog/2012/07/27/how-to-stay-current-in-bioinformaticsgenomics/
• Subscribe via RSS feeds
  • http://feedly.com or others
  • Platform independent (e.g. read on your phone)
• Bioinformatics Q&A forums
  • http://www.biostars.org (highly recommended)
  • http://seqanswers.com/ (focus on NGS)
  • http://www.reddit.com/r/bioinformatics/ (student-oriented)
• Other
  • http://bioinformatics.org (fosters collaboration in bioinformatics)
  • http://www.researchgate.net (“Facebook” for researchers)
  • German bioinformatics group on XING (https://www.xing.com/net/pri485482x/bin)
Conclusion
• As a bioinformatician, you will be at the forefront of one of the greatest scientific enterprises of our time
  • Biologists are overwhelmed with massive data sets
  • YOU will get to see exciting results first
• Requires integration of knowledge from many domains
  • IT, biology, medicine, statistics, math, …
• Knowing your informatics toolbox AND understanding the biological question is what makes you very valuable
Thank you!
Christian Frech
frech.christian@gmail.com
Further Reading
• “So you want to be a computational biologist?”
  http://www.nature.com/nbt/journal/v31/n11/full/nbt.2740.html
• “What It Takes to Be a Bioinformatician”
  http://nav4bioinfo.wordpress.com/2013/03/19/what-it-takes-to-be-a-bioinformatician/
• “The alternative ‘what it takes to be a bioinformatician’”
  https://biomickwatson.wordpress.com/2013/03/18/the-alternative-what-it-takes-to-be-a-bioinformatician/
• “So You Want To Be a Computational Biologist, Or A Bioinformatician?”
  http://www.checkmatescientist.net/2013/11/so-you-want-to-be-computational.html
• “Being a bioinformatician is hard”
  http://www.bioinformaticszen.com/post/being-a-bioinformatician-is-hard/
• “How not to be a bioinformatician”
  http://www.scfbm.org/content/7/1/3
• “Ten Simple Rules for Reproducible Computational Research”
  http://www.ploscollections.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003285
• “Ten Simple Rules for Getting Ahead as a Computational Biologist in Academia”
  http://www.ploscollections.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1002001;jsessionid=6D5D844E0E2E21C9E565378C7F714D76
• “A Quick Guide for Developing Effective Bioinformatics Programming Skills”
  http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000589
• “What Is Really the Salary of a Bioinformatician/Computational Biologist?”
  http://www.homolog.us/blogs/blog/2014/04/02/what-is-really-the-salary-of-a-bioinformaticiancomputational-biologist/

How to be a bioinformatician

  • 1. 1 How to be a bioinformatician Christian Frech, PhD St. Anna Children’s Cancer Research Institute, Vienna, Austria Talk at University of Applied Sciences, Hagenberg, Austria April 23rd, 2014
  • 2. What is a bioinformatician? 2 Informatician Statistician Biologist Data scientist Modified from http://blog.fejes.ca/?p=2418
  • 3. Bioinformatician vs. computational biologist  Asks biological questions  Analyzes & interprets biological data  Runs existing programs  Ad hoc scripting  Perl, R, Python 3  IT savvy  Builds & maintains biological databases & Web sites  Designs & implements clever algorithms  C/C++, Java, Python Bioinformatician Computational biologist Grasp of computational subjectsmore less Grasp of biological subjectsless more or vice versa
  • 4. Why do we need bioinformaticians?  Amount of generated biological data requires sophisticated computing for data management and analysis  Programmers lack biological knowledge  Biologists don‟t program  The two don‟t understand each other 4 http://www.youtube.com/watch?v=Hz1fyhVOjr4 Latest Illumina sequencer shipped last week (HiSeq v4 reagent kit) outputs 1 terabase (TB) of data in 6 days1! Biologists talks to statistician 1 http://www.illumina.com/products/hiseq-sbs-kit-v4.ilmn
  • 6. 6 What are bioinformaticians doing? Word cloud from manuscript titles published in Bioinformatics from Jan 2013 to April 2014
  • 7. Challenges as bioinformatician  Biology is complex, not black and white  As many exceptions as rules (e.g.: define “gene”)  No single optimal solution to a problem  Results interpretable in many ways (story telling, cherry picking)  Understanding the biological question  Field is moving incredibly fast  Lack of standards, immature/abandoned software  Standard of today obsolete tomorrow  Much time spent on collecting/cleaning-up data, troubleshooting errors  Stay flexible, don‟t overinvest in single platform/technology  Hundreds of software tools and databases out there  Easy to get lost  Important to understand their strengths and weaknesses 8
  • 8. Which tools should I use? 9 179 tools Heard of: 65% Used: 30%
  • 10. Things to have in your bioinformatics toolbox  Linux command line  Scripting language with associated Bio* library (BioPerl, BioPython, R/Bioconductor, …)  Basic statistical tests, regression, p-values, maximum likelihood, multiple testing correction  Sequence alignment (FASTA & BLAST)  Biological databases  Regular expressions  Sequencing technologies  Web technologies (HTML, XML, …) 11  Advanced R skills  Parallel/distributed computing  DBMS, SQL  (Semi-)compiled language (C/C++, Java)  Dimensionality reduction (e.g. PCA)  Cluster analysis  Support Vector Machines  Hidden Markov models  Web framework (e.g. Django)  Version control system (e.g. Git)  Advanced text editor (Emacs, vim)  IDE (e.g. Eclipse, NetBeans) Must haves Highly recommended
  • 11. Requirement Recommended Language Speed matters, low-level programming Rich-client enterprise application development Text file processing (regex) Statistical analysis, fancy plots Rapid prototyping, readable & maintainable scripts Workflow automation What programming language should I learn? 12Be a jack of all trades, master of ONE!
  • 12. Perl on decline, R and Python gaining popularity 13 http://computationalproteomic.blogspot.co.uk/2013/10/which-are-best-programming- languages.html http://openwetware.org/wiki/Image:Most_Popular_Bioinformatics_Programming_Languages.png Perl most popular bioinformatics programming language in 2008 R and Python take the lead in 2014
  • 13. Top 10 most common and/or annoying mistakes in bioinformatics 14 Inspired by “What Are The Most Common Stupid Mistakes In Bioinformatics?” (https://www.biostars.org/p/7126/)
  • 14. Top-10 most common/annoying mistakes in bioinformatics # 10 Using genome coordinates with wrong genome version (for example, using gene coordinates from human genome version hg18 but reference sequence from version hg19) 15
  • 15. Top 10 most common/annoying mistakes in bioinformatics, #9: Forgetting to process the second strand of a DNA sequence.
  • 16. Top 10 most common/annoying mistakes in bioinformatics, #8: Processing the second strand of a DNA sequence, but taking the reverse instead of the reverse complement sequence.
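The difference between mistake #8 and the correct operation can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk; the helper names `reverse_only` and `reverse_complement` are made up for the example, and only the bases A, C, G, T and N are handled.

```python
# Hypothetical helpers for illustration; only A, C, G, T and N are handled.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_only(seq: str) -> str:
    """The mistake: reverse the sequence without complementing it."""
    return seq[::-1]


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the result."""
    return seq.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    seq = "ATGCN"
    print(reverse_only(seq))        # reversed only, still the same strand
    print(reverse_complement(seq))  # the actual second strand, read 5' to 3'
```

In real work the Bio* libraries mentioned in the talk already provide this operation, so a hand-written version like this is mainly useful for understanding what they do.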
  • 17. Top 10 most common/annoying mistakes in bioinformatics, #7: Not accounting for the different human chromosome names used by UCSC and Ensembl. Example: UCSC: “chr1”; Ensembl: “1”.
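A small conversion layer is one way to guard against this mismatch. The sketch below is an assumption-laden illustration, not part of the talk: the function names are invented, it only covers the primary chromosomes, and it special-cases the mitochondrial genome (“chrM” in UCSC, “MT” in Ensembl). Unplaced scaffolds and alternative haplotypes follow other naming rules and are not handled.

```python
# Hypothetical helpers; primary chromosomes and the mitochondrial genome only.

def ucsc_to_ensembl(name: str) -> str:
    """Convert a UCSC-style chromosome name ("chr1") to Ensembl style ("1")."""
    if name == "chrM":
        return "MT"
    return name[3:] if name.startswith("chr") else name


def ensembl_to_ucsc(name: str) -> str:
    """Convert an Ensembl-style chromosome name ("1") to UCSC style ("chr1")."""
    if name == "MT":
        return "chrM"
    return name if name.startswith("chr") else "chr" + name


if __name__ == "__main__":
    for ucsc in ["chr1", "chrX", "chrM"]:
        print(ucsc, "->", ucsc_to_ensembl(ucsc))
```

Normalizing names once, right after reading a file, tends to be less error-prone than converting at every comparison.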
  • 18. Top 10 most common/annoying mistakes in bioinformatics, #6: Assuming the alphabetical order of chromosome names is “chr1”, “chr2”, “chr3”, … when in fact it is “chr1”, “chr10”, “chr11”, …
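The ordering problem is easy to reproduce, and a custom sort key is one common way around it. The following Python sketch is illustrative only; `chrom_sort_key` is a made-up name, and the choice to place non-numeric chromosomes (X, Y, M) after the numbered ones is an assumption rather than a standard.

```python
def chrom_sort_key(name: str):
    """Sort numbered chromosomes numerically, then all others alphabetically."""
    label = name[3:] if name.startswith("chr") else name
    if label.isdigit():
        return (0, int(label), "")
    return (1, 0, label)


if __name__ == "__main__":
    names = ["chr10", "chr2", "chrX", "chr1", "chr11", "chrM"]
    print(sorted(names))                      # plain string sort: chr10 before chr2
    print(sorted(names, key=chrom_sort_key))  # numeric-aware order
```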
  • 19. Top 10 most common/annoying mistakes in bioinformatics, #5: Assuming a 'tab' field separator when in fact it is a 'blank' (or vice versa); the two look almost identical in a text editor.
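One defensive habit is to split on the expected separator explicitly and check the number of fields, so that a wrong assumption fails loudly instead of silently producing shifted columns. A minimal sketch, assuming a tab-separated file with a known column count; `split_fields` is a hypothetical helper, not an existing library function.

```python
def split_fields(line: str, expected_fields: int) -> list:
    """Split a line on tabs and raise if the field count is not as expected."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != expected_fields:
        raise ValueError(
            f"expected {expected_fields} tab-separated fields, "
            f"got {len(fields)}: {line!r}"
        )
    return fields


if __name__ == "__main__":
    tab_line = "chr1\t100\t200\tgene A\n"
    # repr() makes the otherwise invisible separators visible as \t
    print(repr(tab_line))
    print(split_fields(tab_line, 4))
```

Printing a line with `repr()` (or viewing the file with `cat -A` on Linux) is a quick way to see which separator is actually present.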
  • 20. Top 10 most common/annoying mistakes in bioinformatics, #4: Assuming a DNA sequence consists of only four letters (A, T, C, G) while in fact there is a fifth: 'N' for a missing base ('X' for a missing amino acid).
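As an example of where the fifth letter matters, consider GC content: dividing by the full sequence length silently counts every 'N' as a non-GC base. The sketch below is one possible way to handle this, under the assumption that unknown bases should simply be excluded from the denominator; the function name is invented for the example.

```python
from collections import Counter


def gc_content(seq: str) -> float:
    """GC fraction over called bases (A, C, G, T) only; N and others are ignored."""
    counts = Counter(seq.upper())
    called = sum(counts[base] for base in "ACGT")
    if called == 0:
        return 0.0  # no called bases at all, e.g. a run of N
    return (counts["G"] + counts["C"]) / called


if __name__ == "__main__":
    print(gc_content("ACGTNN"))  # N is excluded from the denominator
    print(gc_content("NNNN"))
```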
  • 21. Top 10 most common/annoying mistakes in bioinformatics, #3: Forgetting to use dos2unix on a Windows text file before processing it under Linux, plus spending 1 hour to debug the problem, plus being tricked by this multiple times. Text file line breaks differ between platforms: Linux (LF); Windows (CR+LF); classic Mac (CR).
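When dos2unix is not at hand, the same normalization can be approximated in a script. This is a rough sketch rather than a replacement for the tool; `normalize_newlines` is a made-up helper that maps both CR+LF and lone CR to LF.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert Windows (CR+LF) and classic Mac (CR) line breaks to Unix (LF)."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


if __name__ == "__main__":
    windows_text = b"geneA\t1\r\ngeneB\t2\r\n"
    # Typical symptom: the last field of each line carries an invisible "\r"
    print(windows_text.split(b"\n")[0])
    print(normalize_newlines(windows_text).split(b"\n")[0])
```

The typical symptom of a forgotten conversion is that the last column of every line fails to match anything, because each value ends in an invisible carriage return.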
  • 22. Top 10 most common/annoying mistakes in bioinformatics, #2: When importing data into MS Excel, letting it auto-convert HUGO gene names into dates and forgetting about it (e.g., the tumor suppressor gene “DEC1” will be converted to “1-DEC” on import). About 30 genes are affected in total.
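A gene list that has passed through a spreadsheet can be screened for this damage. The check below is a heuristic sketch, not an established tool: it flags symbols shaped like “1-DEC” or “2-Sep”, assumes English month abbreviations, and cannot recover the original gene name.

```python
import re

# Matches day-month strings such as "1-DEC" or "2-Sep" (English month names assumed).
_EXCEL_DATE = re.compile(
    r"^\d{1,2}-(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)$",
    re.IGNORECASE,
)


def looks_like_excel_date(symbol: str) -> bool:
    """Return True if a gene symbol looks like a spreadsheet-converted date."""
    return bool(_EXCEL_DATE.match(symbol.strip()))


if __name__ == "__main__":
    symbols = ["TP53", "1-DEC", "DEC1", "2-Sep"]
    print([s for s in symbols if looks_like_excel_date(s)])
```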
  • 23. Top 10 most common/annoying mistakes in bioinformatics, #1: Off-by-one error (http://en.wikipedia.org/wiki/Off-by-one_error). There are only two common problems in bioinformatics: (1) lack of standards, (2) ID conversion, and (3) off-by-one errors.
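A frequent source of off-by-one errors is mixing coordinate conventions. The example below is not from the talk; it uses the well-known case of BED files (0-based, end-exclusive intervals) versus GFF files (1-based, end-inclusive intervals). The function names are invented for the illustration.

```python
def bed_to_one_based(start: int, end: int) -> tuple:
    """0-based, end-exclusive (BED) -> 1-based, end-inclusive (GFF-style)."""
    return start + 1, end


def one_based_to_bed(start: int, end: int) -> tuple:
    """1-based, end-inclusive (GFF-style) -> 0-based, end-exclusive (BED)."""
    return start - 1, end


def bed_length(start: int, end: int) -> int:
    """Interval length in BED coordinates."""
    return end - start


def one_based_length(start: int, end: int) -> int:
    """Interval length in 1-based inclusive coordinates (note the +1)."""
    return end - start + 1


if __name__ == "__main__":
    bed = (0, 100)  # first 100 bases of a chromosome in BED
    print(bed_to_one_based(*bed))
    print(bed_length(*bed), one_based_length(*bed_to_one_based(*bed)))
```

Only the start coordinate shifts between the two conventions, which is exactly why the error is easy to miss: the end coordinate looks the same in both files.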
  • 24. Ten personal recommendations for your future work as a bioinformatician
  • 25. #1 - Learn Linux! Most bioinformatics tools are not available on Windows. Linux file systems are better for many and/or very large files. The command line interface (CLI) has advantages over a graphical user interface (GUI): recorded command history (reproducibility); one keystroke to re-run an analysis, instead of repeating 100 mouse clicks. The Linux CLI (shell) is much more powerful than the Windows CLI.
  • 26. #2 - Embrace the “Unix tools philosophy” (http://www.linuxdevcenter.com/lpt/a/302). Small programs (“tools”) instead of monolithic applications: designed for simple, specific tasks that are performed well (awk, cat, grep, wc, etc.); many and well-documented parameters; combined with Unix pipes (read from STDIN, write to STDOUT), e.g. cut -f 3 myfile.txt | sort | uniq. Advantages: great flexibility, easy re-use of existing tools; intermediate output can be stored and inspected for troubleshooting; complex tasks can be performed quickly with shell 'one-liners'. This paradigm fits bioinformatics well, where many heterogeneous data files often need to be processed in many different ways.
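To show how much the pipe on this slide does in a few characters, here is a rough Python equivalent of `cut -f 3 myfile.txt | sort | uniq`. It is a sketch under stated assumptions: the helper name is invented, lines with fewer than three fields are skipped (the real `cut` behaves differently for such lines), and Python's sort order may differ from `sort` under some locales.

```python
def unique_column_values(lines, column=3):
    """Return the sorted, de-duplicated values of a 1-based tab-separated column."""
    values = set()
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) >= column:  # assumption: short lines are skipped
            values.add(fields[column - 1])
    return sorted(values)


if __name__ == "__main__":
    example = [
        "g1\tchr1\texon\n",
        "g2\tchr1\tintron\n",
        "g3\tchr2\texon\n",
    ]
    print(unique_column_values(example))
    # With a real file: unique_column_values(open("myfile.txt"))
```

The shell version needs no function, no loop and no file handling, which is the point the slide makes about one-liners.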
  • 27. Example NGS use case demonstrating the power of the Unix tools philosophy: samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq (http://samtools.sourceforge.net/mpileup.shtml). Explanation: 'samtools mpileup' piles up short reads from the input BAM file for each position in the reference genome; 'bcftools view' calls the variants; 'vcfutils.pl vcf2fq' computes the consensus sequence; the resulting FASTQ sequence is redirected to the output file cns.fq. By knowing the available tools and their parameters, bioinformatics 'wizards' can get complex stuff done in almost no time.
  • 28. #3 - Don't reinvent the wheel. Coding is fun, but look around before you hack into your keyboard. Don't write the 29th FASTA file parser if proven solutions are available: BioPerl, BioPython, Bioconductor.
  • 29. #4 - If you happen to invent a wheel, … Document source and parameters well; use a version control system (git, svn); deposit the code in a public repository (sourceforge.net, github.com); write test cases.
  • 30. #5 - Automate pipelines with GNU Make. Make was developed in the 1970s to build executables from source files, and it is incredibly useful for data-driven workflows as well: automatic error checking; parallelization (utilize multiple cores); incremental builds (re-start your pipeline from the point of failure); mature and thoroughly tested. Get started at http://www.bioinformaticszen.com/post/decomplected-workflows-makefiles/
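The idea behind incremental builds is simple: a target is rebuilt only if it is missing or older than one of its inputs. The Python sketch below illustrates that rule only; it is not how Make is implemented, and `needs_rebuild` is a hypothetical helper. For real pipelines, a Makefile gives the same behaviour plus dependency chains and parallel execution.

```python
import os


def needs_rebuild(target: str, dependencies: list) -> bool:
    """Return True if target is missing or older than any of its dependencies."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)


if __name__ == "__main__":
    # File names are placeholders for a typical alignment step.
    if needs_rebuild("aln.bam", ["reads.fq", "ref.fa"]):
        print("aln.bam is out of date: re-run the alignment step")
    else:
        print("aln.bam is up to date: skip the alignment step")
```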
  • 31. #6 - Value your time. Architecture vs. accomplishment: “Perfect is the enemy of the good” (Voltaire); OO design and normalized databases are nice, but can be overkill if requirements change from analysis to analysis. Automate what can be automated: reproducibility; easy to repeat an analysis with slightly changed parameters. BUT: don't spend two days automating a one-time analysis that can be done manually in 10 minutes.
  • 32. #7 - Make use of free online resources to learn about specialized topics. www.coursera.org: Bioinformatics Algorithms (https://www.coursera.org/course/bioinformatics); Computing for Data Analysis (https://www.coursera.org/course/compdata); R Programming (https://www.coursera.org/course/rprog). https://www.edx.org/: Data Analysis for Genomics (https://www.edx.org/course/harvardx/harvardx-ph525x-data-analysis-genomics-1401#.U1TUbXV52R8); Introduction to Biology (https://www.edx.org/course/mitx/mitx-7-00x-introduction-biology-secret-1768#.U1TVL3V52R8). http://rosalind.info/problems/locations/
  • 33. #8 - Become an expert. Identify an area of interest and get really good at it. Work at places where you can learn from the best. Spend time abroad: great experience; labs/companies will hire you not only for what you know, but also for who you know.
  • 34. #9 - Decide early on if you want to stay in academia or go into industry. Academia: PhD highly recommended; take your time to find a compatible supervisor. Pros: freedom to pursue own ideas; very flexible working hours; work independently. Cons: steep & competitive career ladder (postdoc >> PI/prof); lower pay; publish or perish. Industry: PhD beneficial (to get in), but not necessarily required for daily work (e.g. build/maintain databases). Pros: more frequent (positive) feedback; higher pay; job security. Cons: more (external) deadlines; higher pressure to get things done. See also “Ten Simple Rules for Choosing between Industry and Academia” (David B. Searls, 2009).
  • 35. #10 - Stay informed & get connected. Follow literature and blogs: http://en.wikipedia.org/wiki/List_of_bioinformatics_journals; http://www.homolog.us/blogs/blog/2012/07/27/how-to-stay-current-in-bioinformaticsgenomics/. Subscribe via RSS feeds: http://feedly.com or others; platform independent (e.g. read on your phone). Bioinformatics Q&A forums: http://www.biostars.org (highly recommended); http://seqanswers.com/ (focus on NGS); http://www.reddit.com/r/bioinformatics/ (student-oriented). Other: http://bioinformatics.org (fosters collaboration in bioinformatics); http://www.researchgate.net (“Facebook” for researchers); German bioinformatics group on XING (https://www.xing.com/net/pri485482x/bin).
  • 36. Conclusion. As a bioinformatician, you will be at the forefront of one of the greatest scientific enterprises of our time: biologists are overwhelmed with massive data sets; YOU will get to see exciting results first. This requires integration of knowledge from many domains: IT, biology, medicine, statistics, math, … Knowing your informatics toolbox AND understanding the biological question is what makes you very valuable.
  • 38. Further Reading. “So you want to be a computational biologist?” http://www.nature.com/nbt/journal/v31/n11/full/nbt.2740.html; “What It Takes to Be a Bioinformatician” http://nav4bioinfo.wordpress.com/2013/03/19/what-it-takes-to-be-a-bioinformatician/; “The alternative 'what it takes to be a bioinformatician'” https://biomickwatson.wordpress.com/2013/03/18/the-alternative-what-it-takes-to-be-a-bioinformatician/; “So You Want To Be a Computational Biologist, Or A Bioinformatician?” http://www.checkmatescientist.net/2013/11/so-you-want-to-be-computational.html; “Being a bioinformatician is hard” http://www.bioinformaticszen.com/post/being-a-bioinformatician-is-hard/; “How not to be a bioinformatician” http://www.scfbm.org/content/7/1/3; “Ten Simple Rules for Reproducible Computational Research” http://www.ploscollections.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003285; “Ten Simple Rules for Getting Ahead as a Computational Biologist in Academia” http://www.ploscollections.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1002001;jsessionid=6D5D844E0E2E21C9E565378C7F714D76; “A Quick Guide for Developing Effective Bioinformatics Programming Skills” http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000589; “What Is Really the Salary of a Bioinformatician/Computational Biologist?” http://www.homolog.us/blogs/blog/2014/04/02/what-is-really-the-salary-of-a-bioinformaticiancomputational-biologist/

Editor's notes

  1. Version 5
  2. Funny rant about bioinformatics, not to be taken literally: http://madhadron.com/posts/2012-03-26-a-farewell-to-bioinformatics.html