How to be a bioinformatician
Christian Frech, PhD
St. Anna Children’s Cancer Research Institute, Vienna, Austria
Talk at University of Applied Sciences, Hagenberg, Austria
April 23rd, 2014
What is a bioinformatician?
[Diagram with the labels: Informatician, Statistician, Biologist, Data scientist]
Modified from http://blog.fejes.ca/?p=2418
Bioinformatician vs. computational biologist
• Asks biological questions
• Analyzes & interprets biological data
• Runs existing programs
• Ad hoc scripting
• Perl, R, Python

• IT savvy
• Builds & maintains biological databases & Web sites
• Designs & implements clever algorithms
• C/C++, Java, Python

Bioinformatician ↔ Computational biologist (or vice versa)
Grasp of computational subjects: more → less
Grasp of biological subjects: less → more
Why do we need bioinformaticians?
• Amount of generated biological data requires sophisticated computing for data management and analysis
  • Latest Illumina sequencer shipped last week (HiSeq v4 reagent kit) outputs 1 terabase (Tb) of data in 6 days! [1]
• Programmers lack biological knowledge
• Biologists don't program
• The two don't understand each other
  • Video “Biologist talks to statistician”: http://www.youtube.com/watch?v=Hz1fyhVOjr4
[1] http://www.illumina.com/products/hiseq-sbs-kit-v4.ilmn
What are bioinformaticians doing?
Word cloud from manuscript titles published in Bioinformatics from Jan 2013 to April 2014
Challenges as bioinformatician
• Biology is complex, not black and white
  • As many exceptions as rules (e.g.: define “gene”)
  • No single optimal solution to a problem
  • Results interpretable in many ways (story telling, cherry picking)
  • Understanding the biological question
• Field is moving incredibly fast
  • Lack of standards, immature/abandoned software
  • Standard of today obsolete tomorrow
  • Much time spent on collecting/cleaning up data, troubleshooting errors
  • Stay flexible, don't overinvest in a single platform/technology
• Hundreds of software tools and databases out there
  • Easy to get lost
  • Important to understand their strengths and weaknesses
Which tools should I use?
[Figure: 179 tools. Heard of: 65%; used: 30%]
http://omictools.com/
Things to have in your bioinformatics toolbox

Must haves
• Linux command line
• Scripting language with associated Bio* library (BioPerl, BioPython, R/Bioconductor, …)
• Basic statistical tests, regression, p-values, maximum likelihood, multiple testing correction
• Sequence alignment (FASTA & BLAST)
• Biological databases
• Regular expressions
• Sequencing technologies
• Web technologies (HTML, XML, …)

Highly recommended
• Advanced R skills
• Parallel/distributed computing
• DBMS, SQL
• (Semi-)compiled language (C/C++, Java)
• Dimensionality reduction (e.g. PCA)
• Cluster analysis
• Support Vector Machines
• Hidden Markov models
• Web framework (e.g. Django)
• Version control system (e.g. Git)
• Advanced text editor (Emacs, vim)
• IDE (e.g. Eclipse, NetBeans)
What programming language should I learn?
Requirement and recommended language (the recommended languages are not preserved in this transcript):
• Speed matters, low-level programming
• Rich-client enterprise application development
• Text file processing (regex)
• Statistical analysis, fancy plots
• Rapid prototyping, readable & maintainable scripts
• Workflow automation
Be a jack of all trades, master of ONE!
Perl on decline, R and Python gaining popularity
• Perl most popular bioinformatics programming language in 2008
• R and Python take the lead in 2014
Sources:
http://computationalproteomic.blogspot.co.uk/2013/10/which-are-best-programming-languages.html
http://openwetware.org/wiki/Image:Most_Popular_Bioinformatics_Programming_Languages.png
Top 10 most common and/or annoying mistakes in bioinformatics
Inspired by “What Are The Most Common Stupid Mistakes In Bioinformatics?” (https://www.biostars.org/p/7126/)
#10 - Using genome coordinates with the wrong genome version (for example, gene coordinates from human genome version hg18 combined with the reference sequence from version hg19)
#9 - Forgetting to process the second strand of a DNA sequence
#8 - Processing the second strand of a DNA sequence, but taking the reverse instead of the reverse complement sequence
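The difference between the reverse and the reverse complement can be made concrete with a minimal Python sketch. The function names and the choice to map only A, C, G, T and N are assumptions made for illustration; they are not from the talk.

```python
# Complement table for DNA in both cases; 'N' (unknown base) maps to itself.
# Other IUPAC ambiguity codes are left untouched in this sketch (assumption).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def reverse_only(seq: str) -> str:
    """The mistake from the slide: reversing without complementing."""
    return seq[::-1]


print(reverse_complement("ATGCN"))  # NGCAT
print(reverse_only("ATGCN"))        # NCGTA
```

Applying `reverse_complement` twice returns the original sequence, which makes a cheap sanity check in a test suite.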
#7 - Not accounting for different human chromosome names between UCSC and Ensembl
Example: UCSC: “chr1”; Ensembl: “1”
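A small Python sketch of converting between the two naming styles. The helper names are hypothetical, and the sketch only covers the primary chromosomes plus the mitochondrial genome ("chrM" in UCSC, "MT" in Ensembl); unplaced scaffolds and alternate contigs follow other rules and would need a lookup table.

```python
def ucsc_to_ensembl(name: str) -> str:
    """Convert a UCSC-style chromosome name ('chr1') to Ensembl style ('1')."""
    if name == "chrM":          # mitochondrial genome is a special case
        return "MT"
    return name[3:] if name.startswith("chr") else name


def ensembl_to_ucsc(name: str) -> str:
    """Convert an Ensembl-style chromosome name ('1') to UCSC style ('chr1')."""
    if name == "MT":
        return "chrM"
    return name if name.startswith("chr") else "chr" + name


print(ucsc_to_ensembl("chr1"))   # 1
print(ensembl_to_ucsc("X"))      # chrX
```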
#6 - Assuming the alphabetical order of chromosome names is “chr1”, “chr2”, “chr3”, … when in fact it is “chr1”, “chr10”, “chr11”, …
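The effect is easy to reproduce in Python, along with one possible fix. The sort key below is a hypothetical helper that orders numbered chromosomes numerically and puts everything else (X, Y, M, …) afterwards in alphabetical order; that ordering is an assumption, and some tools expect a different one.

```python
def chrom_sort_key(name: str):
    """Sort key: numbered chromosomes by number, all others alphabetically after."""
    core = name[3:] if name.startswith("chr") else name
    if core.isdigit():
        return (0, int(core), "")
    return (1, 0, core)


chroms = ["chr10", "chr2", "chrX", "chr1", "chr11", "chrM"]

# Plain string sorting compares character by character.
print(sorted(chroms))
# ['chr1', 'chr10', 'chr11', 'chr2', 'chrM', 'chrX']

# Sorting with the numeric-aware key.
print(sorted(chroms, key=chrom_sort_key))
# ['chr1', 'chr2', 'chr10', 'chr11', 'chrM', 'chrX']
```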
#5 - Assuming 'tab' field separator when in fact it is 'blank' (or vice versa); the two look almost identical in a text editor
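A short Python sketch of how the mix-up shows up in practice: splitting on the wrong separator silently yields a single field. The `detect_separator` helper is hypothetical and deliberately simple (it only looks at one line).

```python
def detect_separator(line: str) -> str:
    """Report which separator a line appears to use: 'tab', 'blank' or 'none'."""
    if "\t" in line:
        return "tab"
    if " " in line:
        return "blank"
    return "none"


line_tab = "chr1\t100\t200"
line_blank = "chr1 100 200"

print(line_tab.split("\t"))    # ['chr1', '100', '200']
print(line_blank.split("\t"))  # ['chr1 100 200']  (one field, no error raised)
print(detect_separator(line_tab), detect_separator(line_blank))  # tab blank
```

Printing a line with `repr()` (or running `cat -A` on the file) makes tabs visible as `\t` before any parsing is attempted.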
#4 - Assuming a DNA sequence consists of only four letters (A, T, C, G) while in fact there is a fifth: 'N' for a missing base ('X' for a missing amino acid)
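As an illustration, here is a minimal Python sketch of a GC-content calculation that accounts for 'N'. The function names are hypothetical, and treating any symbol other than A, C, G, T and N as an error is an assumption made for brevity (real data may contain further IUPAC ambiguity codes).

```python
from collections import Counter


def base_composition(seq: str) -> dict:
    """Count A, C, G, T and N; raise on any other symbol."""
    counts = Counter(seq.upper())
    unexpected = set(counts) - set("ACGTN")
    if unexpected:
        raise ValueError(f"unexpected symbols: {sorted(unexpected)}")
    return {base: counts.get(base, 0) for base in "ACGTN"}


def gc_content(seq: str) -> float:
    """GC fraction over called bases only, so runs of N do not dilute the result."""
    comp = base_composition(seq)
    called = comp["A"] + comp["C"] + comp["G"] + comp["T"]
    if called == 0:
        return 0.0
    return (comp["G"] + comp["C"]) / called


print(base_composition("GCNNAT"))  # {'A': 1, 'C': 1, 'G': 1, 'T': 1, 'N': 2}
print(gc_content("GCNNAT"))        # 0.5
```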
#3 - Forgetting to use dos2unix on a Windows text file before processing it under Linux, plus spending 1 hour to debug the problem, plus being tricked by this multiple times
Text file line breaks differ between platforms: Linux (LF); Windows (CR+LF); classic Mac (CR).
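The symptom can be reproduced with a small Python sketch: splitting Windows text on LF alone leaves an invisible carriage return glued to the last field of every line. The function names are hypothetical; `str.splitlines()` is used in the robust variant because it recognizes LF, CR+LF and CR.

```python
def parse_naive(text: str) -> list:
    """Split on LF only; a CR from Windows line breaks stays in the last field."""
    return [line.split("\t") for line in text.split("\n") if line]


def parse_robust(text: str) -> list:
    """Split with splitlines(), which handles LF, CR+LF and CR line breaks."""
    return [line.split("\t") for line in text.splitlines() if line]


windows_text = "GENE1\t5\r\nGENE2\t7\r\n"

print(parse_naive(windows_text))   # [['GENE1', '5\r'], ['GENE2', '7\r']]
print(parse_robust(windows_text))  # [['GENE1', '5'], ['GENE2', '7']]
```

The stray `\r` typically surfaces later, for example when the last column is used as a dictionary key or compared against an ID list.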
#2 - When importing data into MS Excel, letting it auto-convert HUGO gene names into dates and forgetting about it (e.g., tumor suppressor gene “DEC1” will be converted to “1-DEC” on import); ~30 genes affected in total
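One way to catch this after the fact is to scan a gene list for date-like entries. The Python sketch below is only a heuristic: the helper name is hypothetical, and the "day-month" pattern is modeled on the “1-DEC” example above, whereas the exact format Excel produces depends on its locale and cell formatting.

```python
import re

# Matches values such as '1-DEC' or '2-Sep' (day, hyphen, three-letter month).
_DATE_LIKE = re.compile(
    r"^\d{1,2}-(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)$",
    re.IGNORECASE,
)


def looks_date_converted(symbol: str) -> bool:
    """Return True if a 'gene symbol' looks like a date produced by a spreadsheet."""
    return bool(_DATE_LIKE.match(symbol.strip()))


symbols = ["TP53", "1-DEC", "BRCA1", "2-Sep"]
suspicious = [s for s in symbols if looks_date_converted(s)]
print(suspicious)  # ['1-DEC', '2-Sep']
```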
#1 - Off-by-one error
There are only two common problems in bioinformatics: (1) lack of standards, (2) ID conversion, and (3) off-by-one errors
http://en.wikipedia.org/wiki/Off-by-one_error
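A frequent source of off-by-one errors is mixing coordinate conventions: BED intervals are 0-based and half-open, while GFF/GTF intervals are 1-based and closed. The Python sketch below converts between the two; the function names are hypothetical and chosen for illustration.

```python
def bed_to_gff(start: int, end: int) -> tuple:
    """Convert a 0-based, half-open interval to a 1-based, closed interval."""
    return start + 1, end


def gff_to_bed(start: int, end: int) -> tuple:
    """Convert a 1-based, closed interval to a 0-based, half-open interval."""
    return start - 1, end


def bed_length(start: int, end: int) -> int:
    """Length of a 0-based, half-open interval."""
    return end - start


def gff_length(start: int, end: int) -> int:
    """Length of a 1-based, closed interval (note the +1)."""
    return end - start + 1


# The first 100 bases of a chromosome in both conventions.
print(bed_to_gff(0, 100))   # (1, 100)
print(bed_length(0, 100))   # 100
print(gff_length(1, 100))   # 100
```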
Ten personal recommendations for your future work as bioinformatician
#1 - Learn Linux!
• Most bioinformatics tools not available on Windows
• Linux file systems better for many and/or very large files
• Command line interface (CLI) has advantages over graphical user interface (GUI)
  • Recorded command history (reproducibility)
  • Key stroke to re-run analysis, instead of repeating 100 mouse clicks
• Linux CLI (Shell) much more powerful than Windows CLI
#2 - Embrace the “Unix tools philosophy”
• Small programs (“tools”) instead of monolithic applications
  • Designed for simple, specific tasks that are performed well (awk, cat, grep, wc, etc.)
  • Many and well documented parameters
• Combined with Unix pipes (read from STDIN, write to STDOUT)
  • cut -f 3 myfile.txt | sort | uniq
• Advantages
  • Great flexibility, easy re-use of existing tools
  • Intermediate output can be stored and inspected for troubleshooting
  • Complex tasks can be performed quickly with shell 'one-liners'
• This paradigm fits bioinformatics well, where often many heterogeneous data files need to be processed in many different ways
http://www.linuxdevcenter.com/lpt/a/302
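For comparison, the one-liner `cut -f 3 myfile.txt | sort | uniq` can be approximated in Python as a small filter. This is a rough sketch with a hypothetical function name: lines with fewer fields are skipped (unlike `cut`), and the sort order follows Python's code-point ordering rather than the locale rules of `sort`.

```python
import io


def unique_sorted_column(lines, column: int = 3, sep: str = "\t") -> list:
    """Return the sorted unique values of a 1-based column from an iterable of lines."""
    values = set()
    for line in lines:
        fields = line.rstrip("\r\n").split(sep)
        if len(fields) >= column:       # skip lines that lack the column
            values.add(fields[column - 1])
    return sorted(values)


# Any iterable of lines works: an open file, sys.stdin, or an in-memory buffer.
demo = io.StringIO("a\t1\texon\nb\t2\tintron\nc\t3\texon\n")
print(unique_sorted_column(demo))  # ['exon', 'intron']
```

Because the function accepts any iterable of lines, passing `sys.stdin` and printing each value lets the script sit in a pipe like any other Unix tool.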
Example NGS use case demonstrating the power of the Unix tools philosophy

    samtools mpileup -uf ref.fa aln.bam |
    bcftools view -cg - |
    vcfutils.pl vcf2fq > cns.fq

• Explanation
  • 'samtools mpileup' piles up short reads from the input BAM file for each position in the reference genome
  • 'bcftools view' calls the variants
  • 'vcfutils.pl vcf2fq' computes the consensus sequence
  • The resulting FASTQ sequence is redirected to the output file cns.fq
• By knowing available tools and their parameters, bioinformatics 'wizards' can get complex stuff done in almost no time
http://samtools.sourceforge.net/mpileup.shtml
#3 - Don't reinvent the wheel
• Coding is fun, but look around before you hack into your keyboard
• Don't write the 29th FASTA file parser if proven solutions are available
  • BioPerl
  • BioPython
  • Bioconductor
#4 - If you happen to invent a wheel, …
• Document source and parameters well
• Use version control system (git, svn)
• Deposit code in public repository
  • sourceforge.net
  • github.com
• Write test cases
#5 - Automate pipelines with GNU Make
• Developed in 1970s to build executables from source files
• Incredibly useful for data-driven workflows as well
  • Automatic error checking
  • Parallelization (utilize multiple cores)
  • Incremental builds (re-start your pipeline from point of failure)
• Bug-free
• Get started at http://www.bioinformaticszen.com/post/decomplected-workflows-makefiles/
#6 - Value your time
• Architecture vs. accomplishment
  • “Perfect is the enemy of the good” (Voltaire)
  • OO design and normalized databases are nice, but can be overkill if requirements change from analysis to analysis
• Automate what can be automated
  • Reproducibility
  • Easy to repeat analysis with slightly changed parameters
• BUT: Don't spend two days automating a one-time analysis that can be done manually in 10 minutes
#7 - Make use of free online resources to learn about specialized topics
• www.coursera.org
  • Bioinformatics Algorithms (https://www.coursera.org/course/bioinformatics)
  • Computing for Data Analysis (https://www.coursera.org/course/compdata)
  • R Programming (https://www.coursera.org/course/rprog)
• https://www.edx.org/
  • Data Analysis for Genomics (https://www.edx.org/course/harvardx/harvardx-ph525x-data-analysis-genomics-1401#.U1TUbXV52R8)
  • Introduction to Biology (https://www.edx.org/course/mitx/mitx-7-00x-introduction-biology-secret-1768#.U1TVL3V52R8)
• http://rosalind.info/problems/locations/
#8 - Become an expert
• Identify an area of interest and get really good at it
• Work at places where you can learn from the best
• Spend time abroad
  • Great experience
  • Labs/companies will hire you not only for what you know, but also for who you know
#9 - Decide early on if you want to stay in academia or go into industry

Academia
• PhD highly recommended
• Take your time to find compatible supervisor
+ Freedom to pursue own ideas
+ Very flexible working hours
+ Work independently
- Steep & competitive career ladder (postdoc >> PI/prof)
- Lower pay
- Publish or perish

Industry
• PhD beneficial (to get in), but not necessarily required for daily work (e.g. build/maintain databases)
+ More frequent (positive) feedback
+ Higher pay
+ Job security
- More (external) deadlines
- Higher pressure to get things done

See also “Ten Simple Rules for Choosing between Industry and Academia” (David B. Searls, 2009)
#10 - Stay informed & get connected
• Follow literature and blogs
  • http://en.wikipedia.org/wiki/List_of_bioinformatics_journals
  • http://www.homolog.us/blogs/blog/2012/07/27/how-to-stay-current-in-bioinformaticsgenomics/
• Subscribe via RSS feeds
  • http://feedly.com or others
  • Platform independent (e.g. read on your phone)
• Bioinformatics Q&A forums
  • http://www.biostars.org (highly recommended)
  • http://seqanswers.com/ (focus on NGS)
  • http://www.reddit.com/r/bioinformatics/ (student-oriented)
• Other
  • http://bioinformatics.org: fosters collaboration in bioinformatics
  • http://www.researchgate.net: “Facebook” for researchers
  • German bioinformatics group on XING (https://www.xing.com/net/pri485482x/bin)
Conclusion
• As bioinformatician, you will be at the forefront of one of the greatest scientific enterprises of our time
  • Biologists overwhelmed with massive data sets
  • YOU will get to see exciting results first
• Requires integration of knowledge from many domains
  • IT, biology, medicine, statistics, math, …
• Knowing your informatics toolbox AND understanding the biological question is what makes you very valuable
Thank you!
Christian Frech
frech.christian@gmail.com
Further Reading
• “So you want to be a computational biologist?”
  http://www.nature.com/nbt/journal/v31/n11/full/nbt.2740.html
• “What It Takes to Be a Bioinformatician”
  http://nav4bioinfo.wordpress.com/2013/03/19/what-it-takes-to-be-a-bioinformatician/
• “The alternative 'what it takes to be a bioinformatician'”
  https://biomickwatson.wordpress.com/2013/03/18/the-alternative-what-it-takes-to-be-a-bioinformatician/
• “So You Want To Be a Computational Biologist, Or A Bioinformatician?”
  http://www.checkmatescientist.net/2013/11/so-you-want-to-be-computational.html
• “Being a bioinformatician is hard”
  http://www.bioinformaticszen.com/post/being-a-bioinformatician-is-hard/
• “How not to be a bioinformatician”
  http://www.scfbm.org/content/7/1/3
• “Ten Simple Rules for Reproducible Computational Research”
  http://www.ploscollections.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003285
• “Ten Simple Rules for Getting Ahead as a Computational Biologist in Academia”
  http://www.ploscollections.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1002001;jsessionid=6D5D844E0E2E21C9E565378C7F714D76
• “A Quick Guide for Developing Effective Bioinformatics Programming Skills”
  http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000589
• “What Is Really the Salary of a Bioinformatician/Computational Biologist?”
  http://www.homolog.us/blogs/blog/2014/04/02/what-is-really-the-salary-of-a-bioinformaticiancomputational-biologist/

Weitere ähnliche Inhalte

Was ist angesagt?

Whole genome sequencing of bacteria & analysis
Whole genome sequencing of bacteria & analysisWhole genome sequencing of bacteria & analysis
Whole genome sequencing of bacteria & analysis
drelamuruganvet
 

Was ist angesagt? (20)

Whole genome sequence.
Whole genome sequence.Whole genome sequence.
Whole genome sequence.
 
Pub med
Pub medPub med
Pub med
 
Ion Torrent Sequencing
Ion Torrent SequencingIon Torrent Sequencing
Ion Torrent Sequencing
 
Primer design
Primer designPrimer design
Primer design
 
Next Generation Sequencing of DNA
Next Generation Sequencing of DNANext Generation Sequencing of DNA
Next Generation Sequencing of DNA
 
Gene Ontology Project
Gene Ontology ProjectGene Ontology Project
Gene Ontology Project
 
Dna microarray
Dna microarrayDna microarray
Dna microarray
 
GENOMICS AND BIOINFORMATICS
GENOMICS AND BIOINFORMATICSGENOMICS AND BIOINFORMATICS
GENOMICS AND BIOINFORMATICS
 
Whole genome sequence
Whole genome sequenceWhole genome sequence
Whole genome sequence
 
Dna quantification
Dna quantificationDna quantification
Dna quantification
 
Next Generation Sequencing
Next Generation SequencingNext Generation Sequencing
Next Generation Sequencing
 
Tools and database of NCBI
Tools and database of NCBITools and database of NCBI
Tools and database of NCBI
 
Next generation sequencing
Next generation sequencingNext generation sequencing
Next generation sequencing
 
Whole genome sequencing
Whole genome sequencingWhole genome sequencing
Whole genome sequencing
 
Whole genome sequencing of bacteria & analysis
Whole genome sequencing of bacteria & analysisWhole genome sequencing of bacteria & analysis
Whole genome sequencing of bacteria & analysis
 
DNA & RNA isolation
DNA & RNA isolationDNA & RNA isolation
DNA & RNA isolation
 
Multiple Sequence Alignment-just glims of viewes on bioinformatics.
 Multiple Sequence Alignment-just glims of viewes on bioinformatics. Multiple Sequence Alignment-just glims of viewes on bioinformatics.
Multiple Sequence Alignment-just glims of viewes on bioinformatics.
 
Fasta
FastaFasta
Fasta
 
Intro to illumina sequencing
Intro to illumina sequencingIntro to illumina sequencing
Intro to illumina sequencing
 
Intro bioinfo
Intro bioinfoIntro bioinfo
Intro bioinfo
 

Andere mochten auch

Project report-on-bio-informatics
Project report-on-bio-informaticsProject report-on-bio-informatics
Project report-on-bio-informatics
Daniela Rotariu
 
Basics of bioinformatics
Basics of bioinformaticsBasics of bioinformatics
Basics of bioinformatics
Abhishek Vatsa
 

Andere mochten auch (13)

The Marketer's Guide To Customer Interviews
The Marketer's Guide To Customer InterviewsThe Marketer's Guide To Customer Interviews
The Marketer's Guide To Customer Interviews
 
Bioinformatics A Biased Overview
Bioinformatics A Biased OverviewBioinformatics A Biased Overview
Bioinformatics A Biased Overview
 
Flow Cytometry Training : Introduction day 1 session 1
Flow Cytometry Training : Introduction day 1 session 1Flow Cytometry Training : Introduction day 1 session 1
Flow Cytometry Training : Introduction day 1 session 1
 
Ap Chapter 21
Ap Chapter 21Ap Chapter 21
Ap Chapter 21
 
Project report-on-bio-informatics
Project report-on-bio-informaticsProject report-on-bio-informatics
Project report-on-bio-informatics
 
De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au ...
De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au ...De novo genome assembly  - T.Seemann - IMB winter school 2016 - brisbane, au ...
De novo genome assembly - T.Seemann - IMB winter school 2016 - brisbane, au ...
 
DNA Markers Techniques for Plant Varietal Identification
DNA Markers Techniques for Plant Varietal Identification DNA Markers Techniques for Plant Varietal Identification
DNA Markers Techniques for Plant Varietal Identification
 
Molecular Markers: Major Applications in Insects
Molecular Markers: Major Applications in InsectsMolecular Markers: Major Applications in Insects
Molecular Markers: Major Applications in Insects
 
Mapping Genotype to Phenotype using Attribute Grammar, Laura Adam
Mapping Genotype to Phenotype using Attribute Grammar, Laura AdamMapping Genotype to Phenotype using Attribute Grammar, Laura Adam
Mapping Genotype to Phenotype using Attribute Grammar, Laura Adam
 
Bioinformatics in the Era of Open Science and Big Data
Bioinformatics in the Era of Open Science and Big DataBioinformatics in the Era of Open Science and Big Data
Bioinformatics in the Era of Open Science and Big Data
 
Formal languages to map Genotype to Phenotype in Natural Genomes
Formal languages to map Genotype to Phenotype in Natural GenomesFormal languages to map Genotype to Phenotype in Natural Genomes
Formal languages to map Genotype to Phenotype in Natural Genomes
 
Gene concept
Gene conceptGene concept
Gene concept
 
Basics of bioinformatics
Basics of bioinformaticsBasics of bioinformatics
Basics of bioinformatics
 

Ähnlich wie How to be a bioinformatician

2014 manchester-reproducibility
2014 manchester-reproducibility2014 manchester-reproducibility
2014 manchester-reproducibility
c.titus.brown
 

Ähnlich wie How to be a bioinformatician (20)

Reproducible bioinformatics pipelines with Docker and Anduril
Reproducible bioinformatics pipelines with Docker and AndurilReproducible bioinformatics pipelines with Docker and Anduril
Reproducible bioinformatics pipelines with Docker and Anduril
 
Reproducibility: 10 Simple Rules
Reproducibility: 10 Simple RulesReproducibility: 10 Simple Rules
Reproducibility: 10 Simple Rules
 
Standarization in Proteomics: From raw data to metadata files
Standarization in Proteomics: From raw data to metadata filesStandarization in Proteomics: From raw data to metadata files
Standarization in Proteomics: From raw data to metadata files
 
Reproducible Science and Deep Software Variability
Reproducible Science and Deep Software VariabilityReproducible Science and Deep Software Variability
Reproducible Science and Deep Software Variability
 
Computational Resources In Infectious Disease
Computational Resources In Infectious DiseaseComputational Resources In Infectious Disease
Computational Resources In Infectious Disease
 
Software Pipelines: The Good, The Bad and The Ugly
Software Pipelines: The Good, The Bad and The UglySoftware Pipelines: The Good, The Bad and The Ugly
Software Pipelines: The Good, The Bad and The Ugly
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
2014 manchester-reproducibility
2014 manchester-reproducibility2014 manchester-reproducibility
2014 manchester-reproducibility
 
Systems Support for Many Task Computing
Systems Support for Many Task ComputingSystems Support for Many Task Computing
Systems Support for Many Task Computing
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and Knowledge
 
Libra Library OS
Libra Library OSLibra Library OS
Libra Library OS
 
Making Use of NGS Data: From Reads to Trees and Annotations
Making Use of NGS Data: From Reads to Trees and AnnotationsMaking Use of NGS Data: From Reads to Trees and Annotations
Making Use of NGS Data: From Reads to Trees and Annotations
 
Open64 compiler
Open64 compilerOpen64 compiler
Open64 compiler
 
Software Mining and Software Datasets
Software Mining and Software DatasetsSoftware Mining and Software Datasets
Software Mining and Software Datasets
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibility
 
"Data Provenance: Principles and Why it matters for BioMedical Applications"
"Data Provenance: Principles and Why it matters for BioMedical Applications""Data Provenance: Principles and Why it matters for BioMedical Applications"
"Data Provenance: Principles and Why it matters for BioMedical Applications"
 
Parallel_and_Cluster_Computing.ppt
Parallel_and_Cluster_Computing.pptParallel_and_Cluster_Computing.ppt
Parallel_and_Cluster_Computing.ppt
 
2015 genome-center
2015 genome-center2015 genome-center
2015 genome-center
 
Datasets and tools_from_ncbi_and_elsewhere_for_microbiome_research_v_62817
Datasets and tools_from_ncbi_and_elsewhere_for_microbiome_research_v_62817Datasets and tools_from_ncbi_and_elsewhere_for_microbiome_research_v_62817
Datasets and tools_from_ncbi_and_elsewhere_for_microbiome_research_v_62817
 
HPC For Bioinformatics
HPC For BioinformaticsHPC For Bioinformatics
HPC For Bioinformatics
 

Kürzlich hochgeladen

Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptx
MohamedFarag457087
 
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
Scintica Instrumentation
 
Human genetics..........................pptx
Human genetics..........................pptxHuman genetics..........................pptx
Human genetics..........................pptx
Silpa
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
levieagacer
 
POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.
Silpa
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
NazaninKarimi6
 

Kürzlich hochgeladen (20)

Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptx
 
COMPUTING ANTI-DERIVATIVES (Integration by SUBSTITUTION)
COMPUTING ANTI-DERIVATIVES(Integration by SUBSTITUTION)COMPUTING ANTI-DERIVATIVES(Integration by SUBSTITUTION)
COMPUTING ANTI-DERIVATIVES (Integration by SUBSTITUTION)
 
Use of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptxUse of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptx
 
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptxClimate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
 
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
 
Stages in the normal growth curve
Stages in the normal growth curveStages in the normal growth curve
Stages in the normal growth curve
 
Human genetics..........................pptx
Human genetics..........................pptxHuman genetics..........................pptx
Human genetics..........................pptx
 
Dr. E. Muralinath_ Blood indices_clinical aspects
Dr. E. Muralinath_ Blood indices_clinical  aspectsDr. E. Muralinath_ Blood indices_clinical  aspects
Dr. E. Muralinath_ Blood indices_clinical aspects
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
 
POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.POGONATUM : morphology, anatomy, reproduction etc.
POGONATUM : morphology, anatomy, reproduction etc.
 
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxPSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
development of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virusdevelopment of diagnostic enzyme assay to detect leuser virus
development of diagnostic enzyme assay to detect leuser virus
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdf
 
Grade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsGrade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its Functions
 
Exploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdfExploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdf
 
300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx
 
Chemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdfChemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdf
 

How to be a bioinformatician

  • 1. 1 How to be a bioinformatician Christian Frech, PhD St. Anna Children’s Cancer Research Institute, Vienna, Austria Talk at University of Applied Sciences, Hagenberg, Austria April 23rd, 2014
  • 2. What is a bioinformatician? 2 Informatician Statistician Biologist Data scientist Modified from http://blog.fejes.ca/?p=2418
  • 3. Bioinformatician vs. computational biologist  Asks biological questions  Analyzes & interprets biological data  Runs existing programs  Ad hoc scripting  Perl, R, Python 3  IT savvy  Builds & maintains biological databases & Web sites  Designs & implements clever algorithms  C/C++, Java, Python Bioinformatician Computational biologist Grasp of computational subjectsmore less Grasp of biological subjectsless more or vice versa
  • 4. Why do we need bioinformaticians?  Amount of generated biological data requires sophisticated computing for data management and analysis  Programmers lack biological knowledge  Biologists don‟t program  The two don‟t understand each other 4 http://www.youtube.com/watch?v=Hz1fyhVOjr4 Latest Illumina sequencer shipped last week (HiSeq v4 reagent kit) outputs 1 terabase (TB) of data in 6 days1! Biologists talks to statistician 1 http://www.illumina.com/products/hiseq-sbs-kit-v4.ilmn
  • 6. 6 What are bioinformaticians doing? Word cloud from manuscript titles published in Bioinformatics from Jan 2013 to April 2014
  • 7. Challenges as bioinformatician  Biology is complex, not black and white  As many exceptions as rules (e.g.: define “gene”)  No single optimal solution to a problem  Results interpretable in many ways (story telling, cherry picking)  Understanding the biological question  Field is moving incredibly fast  Lack of standards, immature/abandoned software  Standard of today obsolete tomorrow  Much time spent on collecting/cleaning-up data, troubleshooting errors  Stay flexible, don‟t overinvest in single platform/technology  Hundreds of software tools and databases out there  Easy to get lost  Important to understand their strengths and weaknesses 8
  • 8. Which tools should I use? 9 179 tools Heard of: 65% Used: 30%
  • 10. Things to have in your bioinformatics toolbox  Linux command line  Scripting language with associated Bio* library (BioPerl, BioPython, R/Bioconductor, …)  Basic statistical tests, regression, p-values, maximum likelihood, multiple testing correction  Sequence alignment (FASTA & BLAST)  Biological databases  Regular expressions  Sequencing technologies  Web technologies (HTML, XML, …) 11  Advanced R skills  Parallel/distributed computing  DBMS, SQL  (Semi-)compiled language (C/C++, Java)  Dimensionality reduction (e.g. PCA)  Cluster analysis  Support Vector Machines  Hidden Markov models  Web framework (e.g. Django)  Version control system (e.g. Git)  Advanced text editor (Emacs, vim)  IDE (e.g. Eclipse, NetBeans) Must haves Highly recommended
  • 11. Requirement Recommended Language Speed matters, low-level programming Rich-client enterprise application development Text file processing (regex) Statistical analysis, fancy plots Rapid prototyping, readable & maintainable scripts Workflow automation What programming language should I learn? 12Be a jack of all trades, master of ONE!
  • 12. Perl on decline, R and Python gaining popularity 13 http://computationalproteomic.blogspot.co.uk/2013/10/which-are-best-programming- languages.html http://openwetware.org/wiki/Image:Most_Popular_Bioinformatics_Programming_Languages.png Perl most popular bioinformatics programming language in 2008 R and Python take the lead in 2014
  • 13. Top 10 most common and/or annoying mistakes in bioinformatics 14 Inspired by “What Are The Most Common Stupid Mistakes In Bioinformatics?” (https://www.biostars.org/p/7126/)
  • 14. Top-10 most common/annoying mistakes in bioinformatics # 10 Using genome coordinates with wrong genome version (for example, using gene coordinates from human genome version hg18 but reference sequence from version hg19) 15
  • 15. Top-10 most common/annoying mistakes in bioinformatics # 9 Forgetting to process the second strand of DNA sequence 16
  • 16. Top-10 most common/annoying mistakes in bioinformatics, #8: Processing the second strand of a DNA sequence, but taking the reverse instead of the reverse complement sequence
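The difference between reversing and reverse-complementing a sequence can be illustrated with a minimal Python sketch. The function name `reverse_complement` and the example sequence are illustrative assumptions, not part of the talk; in real work an established library such as BioPython would normally be preferred.

```python
# Complement table for the four standard bases plus N (unknown base).
# Any other character (e.g. other IUPAC ambiguity codes) is left unchanged,
# which is a simplification of this sketch.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    forward = "ATGCCN"
    print(forward[::-1])                # NCCGTA  (reverse only: the mistake)
    print(reverse_complement(forward))  # NGGCAT  (what the second strand reads)
```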
  • 17. Top-10 most common/annoying mistakes in bioinformatics, #7: Not accounting for the different human chromosome names used by UCSC and Ensembl. Example: UCSC uses “chr1”, Ensembl uses “1”
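A naming mismatch like this is usually handled by normalizing names before comparing files. The sketch below is a simplification: the function names are hypothetical, and it only covers the primary chromosomes plus the mitochondrial genome (“chrM” in UCSC, “MT” in Ensembl). Unplaced scaffolds and alternate contigs follow other conventions and are not handled.

```python
def ucsc_to_ensembl(name: str) -> str:
    """Convert a UCSC-style chromosome name (e.g. 'chr1') to Ensembl style ('1')."""
    if name == "chrM":
        return "MT"
    return name[3:] if name.startswith("chr") else name


def ensembl_to_ucsc(name: str) -> str:
    """Convert an Ensembl-style chromosome name (e.g. '1') to UCSC style ('chr1')."""
    if name == "MT":
        return "chrM"
    return name if name.startswith("chr") else "chr" + name


if __name__ == "__main__":
    for chrom in ["chr1", "chrX", "chrM"]:
        print(chrom, "->", ucsc_to_ensembl(chrom))
```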
  • 18. Top-10 most common/annoying mistakes in bioinformatics, #6: Assuming the alphabetical order of chromosome names is “chr1”, “chr2”, “chr3”, … when in fact it is “chr1”, “chr10”, “chr11”, …
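The ordering problem is easy to reproduce, and one possible fix is a custom sort key. In this sketch the name `chrom_sort_key` is an assumption for illustration; it places numbered chromosomes first in numeric order and everything else (X, Y, M, …) afterwards in alphabetical order.

```python
def chrom_sort_key(name: str):
    """Sort key: numbered chromosomes in numeric order, then all others by name."""
    label = name[3:] if name.startswith("chr") else name
    if label.isdigit():
        return (0, int(label), "")
    return (1, 0, label)


if __name__ == "__main__":
    chroms = ["chr10", "chr2", "chrX", "chr1", "chr11", "chrM"]
    print(sorted(chroms))
    # ['chr1', 'chr10', 'chr11', 'chr2', 'chrM', 'chrX']
    print(sorted(chroms, key=chrom_sort_key))
    # ['chr1', 'chr2', 'chr10', 'chr11', 'chrM', 'chrX']
```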
  • 19. Top-10 most common/annoying mistakes in bioinformatics, #5: Assuming a tab as field separator when in fact it is a blank (or vice versa); the two look almost identical in a text editor
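One way to see which separator a file really uses is to print the raw line with `repr()` and to split on an explicit separator. The helper name `split_fields` and the example record are illustrative assumptions.

```python
def split_fields(line: str, sep: str = "\t") -> list:
    """Split one record on an explicit separator, dropping the line ending."""
    return line.rstrip("\r\n").split(sep)


if __name__ == "__main__":
    record = "chr1\t100\t200\tmy gene\n"
    print(repr(record))          # makes tabs visible as \t
    print(split_fields(record))  # ['chr1', '100', '200', 'my gene']
    print(record.split())        # ['chr1', '100', '200', 'my', 'gene']
```

Calling `str.split()` without an argument splits on any run of whitespace, so a field that contains a blank is silently broken in two.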
  • 20. Top-10 most common/annoying mistakes in bioinformatics, #4: Assuming a DNA sequence consists of only four letters (A, T, C, G) while in fact there is a fifth: 'N' for a missing base ('X' for a missing amino acid)
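A quick sanity check before processing a sequence is to look at which letters it actually contains. The function names below are hypothetical, and the sketch treats anything other than A, C, G and T as unexpected.

```python
from collections import Counter


def base_composition(seq: str) -> Counter:
    """Count how often each letter occurs in the sequence (case-insensitive)."""
    return Counter(seq.upper())


def non_acgt_positions(seq: str) -> list:
    """Return the 0-based positions of all letters other than A, C, G or T."""
    return [i for i, base in enumerate(seq.upper()) if base not in "ACGT"]


if __name__ == "__main__":
    seq = "ACGTNNAC"
    print(base_composition(seq))
    print(non_acgt_positions(seq))  # [4, 5]
```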
  • 21. Top-10 most common/annoying mistakes in bioinformatics, #3: Forgetting to use dos2unix on a Windows text file before processing it under Linux, plus spending 1 hour debugging the problem, plus being tricked by this multiple times
      Text file line breaks differ between platforms: Linux (LF); Windows (CR+LF); classic Mac (CR).
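The line-ending conversion can also be done inside a script. The following is a small Python sketch of the idea, not a reimplementation of the dos2unix tool; the function name `normalize_newlines` is an assumption.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert Windows (CR+LF) and classic Mac (CR) line breaks to Linux (LF)."""
    # Replace CR+LF first so the remaining lone CRs are classic Mac line breaks.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


if __name__ == "__main__":
    windows_text = b"gene\tvalue\r\nTP53\t1.5\r\n"
    print(normalize_newlines(windows_text))
```

Working on bytes rather than text avoids any automatic newline translation when the file is read.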
  • 22. Top-10 most common/annoying mistakes in bioinformatics, #2: When importing data into MS Excel, letting it auto-convert HUGO gene names into dates and forgetting about it (e.g., tumor suppressor gene “DEC1” will be converted to “1-DEC” on import); ~30 genes are affected in total
  • 23. Top-10 most common/annoying mistakes in bioinformatics, #1: Off-by-one error
      “There are only two common problems in bioinformatics: (1) lack of standards, (2) ID conversion, and (3) off-by-one errors”
      http://en.wikipedia.org/wiki/Off-by-one_error
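A typical source of off-by-one errors, chosen here as an illustration and not taken from the talk, is mixing coordinate conventions: BED files use 0-based, half-open intervals, while GFF files use 1-based, inclusive intervals. The function names in this sketch are hypothetical.

```python
def bed_to_one_based(start: int, end: int) -> tuple:
    """0-based half-open (BED) -> 1-based inclusive (GFF-style)."""
    return start + 1, end


def one_based_to_bed(start: int, end: int) -> tuple:
    """1-based inclusive (GFF-style) -> 0-based half-open (BED)."""
    return start - 1, end


def bed_length(start: int, end: int) -> int:
    """Interval length for 0-based half-open coordinates."""
    return end - start


def one_based_length(start: int, end: int) -> int:
    """Interval length for 1-based inclusive coordinates."""
    return end - start + 1


if __name__ == "__main__":
    # The first 100 bases of a chromosome in both conventions.
    print(bed_to_one_based(0, 100))   # (1, 100)
    print(bed_length(0, 100))         # 100
    print(one_based_length(1, 100))   # 100
```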
  • 24. Ten personal recommendations for your future work as a bioinformatician
  • 25. #1: Learn Linux!
        • Most bioinformatics tools are not available on Windows
        • Linux file systems are better for many and/or very large files
        • The command line interface (CLI) has advantages over a graphical user interface (GUI)
            • Recorded command history (reproducibility)
            • One key stroke to re-run an analysis, instead of repeating 100 mouse clicks
        • The Linux CLI (shell) is much more powerful than the Windows CLI
  • 26. #2: Embrace the “Unix tools philosophy”
        • Small programs (“tools”) instead of monolithic applications
            • Designed for simple, specific tasks that are performed well (awk, cat, grep, wc, etc.)
            • Many and well-documented parameters
            • Combined with Unix pipes (read from STDIN, write to STDOUT):
              cut -f 3 myfile.txt | sort | uniq
        • Advantages
            • Great flexibility, easy re-use of existing tools
            • Intermediate output can be stored and inspected for troubleshooting
            • Complex tasks can be performed quickly with shell 'one-liners'
        • This paradigm fits bioinformatics well, where often many heterogeneous data files need to be processed in many different ways
      http://www.linuxdevcenter.com/lpt/a/302
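To make clear what the one-liner `cut -f 3 myfile.txt | sort | uniq` computes, here is a rough Python equivalent for well-formed tab-separated input. The function name `unique_sorted_column` is an assumption, and details differ from the shell tools: lines with too few fields are skipped, and the sort order is by code point, not by locale. The comparison also shows how much the shell pipeline saves in typing.

```python
def unique_sorted_column(lines, column: int = 3, sep: str = "\t") -> list:
    """Return the sorted, de-duplicated values of one (1-based) column."""
    values = set()
    for line in lines:
        fields = line.rstrip("\r\n").split(sep)
        if len(fields) >= column:  # skip lines that lack the requested column
            values.add(fields[column - 1])
    return sorted(values)


if __name__ == "__main__":
    rows = ["g1\tchr1\texon\n", "g2\tchr1\tintron\n", "g3\tchr2\texon\n"]
    print(unique_sorted_column(rows))  # ['exon', 'intron']
```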
  • 27. Example NGS use case demonstrating the power of the Unix tools philosophy
      samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq
        • Explanation
            • 'samtools mpileup' piles up short reads from the input BAM file for each position in the reference genome
            • 'bcftools view' calls the variants
            • 'vcfutils.pl vcf2fq' computes the consensus sequence
            • The resulting FASTQ sequence is redirected to the output file cns.fq
        • By knowing the available tools and their parameters, bioinformatics 'wizards' can get complex stuff done in almost no time
      http://samtools.sourceforge.net/mpileup.shtml
  • 28. #3: Don't reinvent the wheel
        • Coding is fun, but look around before you hack into your keyboard
        • Don't write the 29th FASTA file parser if proven solutions are available
            • BioPerl
            • BioPython
            • Bioconductor
  • 29. #4: If you happen to invent a wheel, …
        • Document source and parameters well
        • Use a version control system (git, svn)
        • Deposit code in a public repository
            • sourceforge.net
            • github.com
        • Write test cases
  • 30. #5: Automate pipelines with GNU Make
        • Make was developed in the 1970s to build executables from source files
        • Incredibly useful for data-driven workflows as well
            • Automatic error checking
            • Parallelization (utilize multiple cores)
            • Incremental builds (re-start your pipeline from the point of failure)
            • Bug-free
        • Get started at http://www.bioinformaticszen.com/post/decomplected-workflows-makefiles/
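The core rule behind incremental builds can be sketched in a few lines of Python: a target is rebuilt only if it is missing or older than one of the files it depends on. This is a simplified illustration of the idea, not Make's actual implementation; the function name `needs_rebuild` and the file names in the usage note are hypothetical.

```python
import os


def needs_rebuild(target: str, dependencies: list) -> bool:
    """Return True if the target is missing or older than any dependency."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)
```

A pipeline step would then run only when, for example, `needs_rebuild("variants.vcf", ["aln.bam", "ref.fa"])` is true, so a pipeline restarted after a failure skips every step whose output is already up to date.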
  • 31. #6: Value your time
        • Architecture vs. accomplishment
            • “Perfect is the enemy of the good” (Voltaire)
            • OO design and normalized databases are nice, but can be overkill if requirements change from analysis to analysis
        • Automate what can be automated
            • Reproducibility
            • Easy to repeat an analysis with slightly changed parameters
            • BUT: Don't spend two days automating a one-time analysis that can be done manually in 10 minutes
  • 32. #7: Make use of free online resources to learn about specialized topics
        • www.coursera.org
            • Bioinformatics Algorithms (https://www.coursera.org/course/bioinformatics)
            • Computing for Data Analysis (https://www.coursera.org/course/compdata)
            • R Programming (https://www.coursera.org/course/rprog)
        • https://www.edx.org/
            • Data Analysis for Genomics (https://www.edx.org/course/harvardx/harvardx-ph525x-data-analysis-genomics-1401#.U1TUbXV52R8)
            • Introduction to Biology (https://www.edx.org/course/mitx/mitx-7-00x-introduction-biology-secret-1768#.U1TVL3V52R8)
        • http://rosalind.info/problems/locations/
  • 33. #8: Become an expert
        • Identify an area of interest and get really good at it
        • Work at places where you can learn from the best
        • Spend time abroad
            • Great experience
            • Labs/companies will hire you not only for what you know, but also for who you know
  • 34. #9: Decide early on if you want to stay in academia or go into industry
      Academia:
        • PhD highly recommended
        • Take your time to find a compatible supervisor
        + Freedom to pursue own ideas
        + Very flexible working hours
        + Work independently
        - Steep & competitive career ladder (postdoc >> PI/prof)
        - Lower pay
        - Publish or perish
      Industry:
        • PhD beneficial (to get in), but not necessarily required for daily work (e.g. build/maintain databases)
        + More frequent (positive) feedback
        + Higher pay
        + Job security
        - More (external) deadlines
        - Higher pressure to get things done
      See also “Ten Simple Rules for Choosing between Industry and Academia” (David B. Searls, 2009)
  • 35. #10: Stay informed & get connected
        • Follow literature and blogs
            • http://en.wikipedia.org/wiki/List_of_bioinformatics_journals
            • http://www.homolog.us/blogs/blog/2012/07/27/how-to-stay-current-in-bioinformaticsgenomics/
        • Subscribe via RSS feeds
            • http://feedly.com or others
            • Platform independent (e.g. read on your phone)
        • Bioinformatics Q&A forums
            • http://www.biostars.org (highly recommended)
            • http://seqanswers.com/ (focus on NGS)
            • http://www.reddit.com/r/bioinformatics/ (student-oriented)
        • Other
            • http://bioinformatics.org: fosters collaboration in bioinformatics
            • http://www.researchgate.net: “Facebook” for researchers
            • German bioinformatics group on XING (https://www.xing.com/net/pri485482x/bin)
  • 36. Conclusion
        • As a bioinformatician, you will be at the forefront of one of the greatest scientific enterprises of our time
            • Biologists are overwhelmed with massive data sets
            • YOU will get to see exciting results first
        • Requires integration of knowledge from many domains
            • IT, biology, medicine, statistics, math, …
        • Knowing your informatics toolbox AND understanding the biological question is what makes you very valuable
  • 38. Further Reading
        • “So you want to be a computational biologist?” http://www.nature.com/nbt/journal/v31/n11/full/nbt.2740.html
        • “What It Takes to Be a Bioinformatician” http://nav4bioinfo.wordpress.com/2013/03/19/what-it-takes-to-be-a-bioinformatician/
        • “The alternative ‘what it takes to be a bioinformatician’” https://biomickwatson.wordpress.com/2013/03/18/the-alternative-what-it-takes-to-be-a-bioinformatician/
        • “So You Want To Be a Computational Biologist, Or A Bioinformatician?” http://www.checkmatescientist.net/2013/11/so-you-want-to-be-computational.html
        • “Being a bioinformatician is hard” http://www.bioinformaticszen.com/post/being-a-bioinformatician-is-hard/
        • “How not to be a bioinformatician” http://www.scfbm.org/content/7/1/3
        • “Ten Simple Rules for Reproducible Computational Research” http://www.ploscollections.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003285
        • “Ten Simple Rules for Getting Ahead as a Computational Biologist in Academia” http://www.ploscollections.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1002001;jsessionid=6D5D844E0E2E21C9E565378C7F714D76
        • “A Quick Guide for Developing Effective Bioinformatics Programming Skills” http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000589
        • “What Is Really the Salary of a Bioinformatician/Computational Biologist?” http://www.homolog.us/blogs/blog/2014/04/02/what-is-really-the-salary-of-a-bioinformaticiancomputational-biologist/

Editor's notes

  1. Version 5
  2. Funny rant about bioinformatics, not to be taken literally: http://madhadron.com/posts/2012-03-26-a-farewell-to-bioinformatics.html