Math, Stats and CS in Public Health and Medical Research
1. Jessica Minnier, OHSU,
Lewis & Clark College Mathematics Colloquium, 3.19.14
2. or, “What is Biostatistics?”
“Biostatistics (a portmanteau of biology and
statistics; sometimes referred to as biometry or
biometrics) is the application of statistics to a
wide range of topics in biology.” – Wikipedia
“Bioinformatics is an interdisciplinary scientific field
that develops methods for storing, retrieving,
organizing and analyzing biological data” – Wikipedia
“Computational biology involves the development
and application of data-analytical and theoretical
methods, mathematical modeling and computational
simulation techniques to the study of biological,
behavioral, and social systems.” – Wikipedia
3. Sample (n = 1)
¨ L&C mathematics major (2007), CS minor
¨ PhD in Biostatistics (2007-2012)
¤ “Inference and Prediction for High Dimensional Data via
Penalized Regression and Kernel Machine Methods”
¨ Postdoc (2012-2013)
¤ Cancer risk prediction with gene-environment
interactions
¨ Assistant Professor (2013-now)
v Division of Biostatistics
v Department of Public Health & Preventive Medicine
v School of Medicine (soon to be School of Public Health)
v Oregon Health & Science University
4. Outline
¨ Biostatistics and Bioinformatics/
Computational Biology
¤ More interesting definitions, research examples,
case studies
¤ Types of careers
¨ My trajectory
¤ LC math to grad school to jobs
¨ Resources and advice
5. Biostatistics, in the news.
Comics from Jim Borgman; XKCD; also fun:
http://stats.stackexchange.com/questions/423/what-is-your-favorite-data-analysis-cartoon
In summary:
A poor understanding of statistics makes everyone look bad.
8. Applied math?
¨ Applied mathematics often studies deterministic
models (engineering and mechanics, population
models, cryptography)
¨ Some questions can’t be solved by deterministic
models, but a partial answer can be given with
statistics
¤ Does smoking cause lung cancer? (inference from
observational studies)
¤ Is it going to rain tomorrow? (stochastic model)
¤ Do statins lower cholesterol? (randomized trial)
Rafa Irizarry’s math major talk: https://www.youtube.com/watch?v=gXeWdvHKTQQ
9. Example data
¨ Collection of measurements from a sampled
population
¨ Measurements of a lab experiment
¨ Medical images of subjects’ brains over time
¨ Results of a clinical trial
¨ Gene expression from different types of cultured
tissue
¨ Simulated data modeling HIV progression
¨ Values from electronic medical records sampled
retrospectively
¨ 3 million genetic mutations from 20,000 subjects
Brian Caffo’s MOOC: Biostatistics Bootcamp I, lecture 1
10. Inform medical decisions
¨ A large clinical trial in 2002 by the Women’s
Health Initiative was stopped early due to
preliminary data showing that hormone
replacement therapy had a negative health
impact.
¨ These data contradicted prior evidence on the
efficacy of HRT for postmenopausal women.
¨ Statistical decision to end the trial, prevent
further harm
Brian Caffo’s MOOC: Biostatistics Bootcamp I, lecture 1; JAMA 2002;288(3):321-333
11. Inform medical decisions
¨ Guidelines for mammogram screening
based on probabilities of false positives and
negatives, cost-benefit analyses, survival
analysis
¨ Analysis of adverse effects in a clinical trial
determines drug safety, dosage,
subpopulations
¨ Even the general public must weigh risk
when making their own medical
decisions
¨ Experts cannot make decisions without data
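A minimal sketch of how false positives enter a screening decision, using Bayes' theorem to compute the probability of disease given a positive test. The prevalence, sensitivity, and specificity values are hypothetical and chosen only for illustration; they are not taken from any screening guideline.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(disease | positive test), by Bayes' theorem."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)


# Hypothetical inputs: 1% prevalence, 90% sensitivity, 91% specificity.
ppv = positive_predictive_value(prevalence=0.01, sensitivity=0.90, specificity=0.91)
print(f"P(disease | positive screen) = {ppv:.3f}")
```

With these made-up inputs, fewer than one in ten positive screens corresponds to true disease, which is why false positive rates weigh so heavily in screening guidelines.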
12. Bioinformatics & Computational biology
¨ Sequencing the human genome (aligning,
matching, searching)
¨ Algorithms for turning massive information from
electronic medical records into useful predictors
of disease progression
¨ Machine learning algorithms for risk prediction
models with large and complex data (imaging,
genetic)
¨ Analysis of networks (protein interactions,
genetic pathways, social behavior influencing
health outcomes)
¨ Simulation of complex data (methylation
patterns in the genome)
13. Biomathematics
¨ Mathematical models to study infectious
disease progression (in a population or in a
body’s cells)
¨ Steady-state simulations of cancer cell
growth
¨ Usually in joint biostatistics/biomathematics
or applied mathematics departments, some
epidemiology
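One classic example of such a model for disease spread in a population is the SIR (susceptible, infected, recovered) compartment model. The sketch below integrates it with a simple Euler step. The choice of SIR, the function name, and all parameter values are assumptions made for illustration and do not come from the talk.

```python
def simulate_sir(beta, gamma, s0, i0, r0, days, dt=0.1):
    """Euler integration of the SIR model.

    beta: transmission rate per day, gamma: recovery rate per day.
    Returns a list of (time, susceptible, infected, recovered) tuples.
    """
    s, i, r = float(s0), float(i0), float(r0)
    n = s + i + r  # total population, constant in this model
    history = [(0.0, s, i, r)]
    steps = int(round(days / dt))
    for k in range(1, steps + 1):
        new_infections = beta * s * i / n * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((k * dt, s, i, r))
    return history


# Hypothetical outbreak: 1000 people, 10 initially infected.
history = simulate_sir(beta=0.3, gamma=0.1, s0=990, i0=10, r0=0, days=160)
peak_time, _, peak_infected, _ = max(history, key=lambda row: row[2])
print(f"Peak of about {peak_infected:.0f} infected around day {peak_time:.0f}")
```

This deterministic version ignores chance; stochastic variants of the same model are where the statistics comes back in.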
14. Where do we work?
(non-random sample = my classmates)
¨ Assistant professors: OHSU School of Medicine, UNC School of Medicine, UIUC
Statistics Dept, University of New Mexico School of Medicine
¨ Consultant/Manager, Analysis Group
¨ Assistant Member, RAND Corporation
¤ Nonprofit global policy think tank
¨ Computational Biologist, Genentech
¨ Instructors: UPenn School of Medicine, Harvard School of Public Health
¨ Research Associate, Dana Farber Cancer Institute
¨ Statistician, Partners Health Care
¨ Other possibilities:
¤ Government: National Institutes of Health, Food & Drug Administration, Centers for
Disease Control and Prevention, WHO, health departments in foreign countries
¤ Google, Intel, etc.
¤ Liberal arts colleges or smaller universities focused on teaching
¤ Pharma, Consulting, Labs, Hospitals, Hospital Research Centers, Research Institutes,
Universities
16. Case study 1: RNA-Seq Data
¨ RNA sequencing uses
Next Generation
Sequencing (NGS) to
detect which RNA is present
and measure its quantity in
a biological sample at a
moment in time
¨ Studies the dynamic
transcriptome of a cell
¨ The problem: How does
gene expression differ
between heart and brain
tissue? Which genes are
turned off in heart and on
in brain?
17. Case study 1: RNA-Seq Data
¨ Step 1: Biologists collect samples, send to lab
for sequencing
¨ Step 2: Genetic material is transformed into
millions of ‘reads’
¤ AACTAGACCTGG
¨ Step 3: The reads are mapped to the genome,
transformed into counts for each gene
¨ Step 4: The distribution of gene counts for
different tissues is compared
18. RNA-seq: Step 3
¨ Step 3: The reads are mapped to the genome,
transformed into counts for each gene
¨ Computational biologists developed fast
searching algorithms to map a short read
(likely containing errors) to a genome with
millions of base pairs, much repetition, some
variability (SNPs)
19. RNA-seq: Step 3
¨ Bowtie (Langmead 2009
Genome Biology)
incorporated Burrows-Wheeler
indexing to shorten the
mapping to less than a day
(it used to take days if not
months)
http://www.cs.jhu.edu/~langmea/resources/lecture_notes/bwt_and_fm_index.pdf
¨ TopHat (Trapnell 2009
Bioinformatics) can detect
splicing junctions where
certain genes code for
multiple proteins via
alternatively spliced mRNA
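To give a feel for the Burrows-Wheeler idea mentioned above, here is a naive textbook sketch of the transform and its inverse, applied to the example read from the earlier slide. It is only an illustration: it sorts all rotations of the string, which would be far too slow for a genome, and it is not how Bowtie itself is implemented. The function names and the "$" terminator are assumptions of this sketch.

```python
def bwt(text, terminator="$"):
    """Burrows-Wheeler transform via sorted rotations (naive version)."""
    if terminator in text:
        raise ValueError("text must not contain the terminator")
    text += terminator
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    # The transform is the last column of the sorted rotation table.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column, terminator="$"):
    """Rebuild the original text by repeatedly prepending and sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(c + row for c, row in zip(last_column, table))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


read = "AACTAGACCTGG"
transformed = bwt(read)
print(transformed)
print(inverse_bwt(transformed) == read)  # True
```

The transform tends to group identical letters together and is reversible, which is what makes it useful as the basis for a compact, searchable index of a genome.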
20. RNA-seq: Step 4
¨ Step 4: The distribution of gene counts for
different tissues is compared
¨ Bioinformaticians and biostatisticians clean the
data, normalize the data, and conduct statistical
tests to determine if certain genes are
expressed in one tissue differently than another
¨ Tests based on models: negative binomial
distribution of counts, likelihood ratio tests
¨ Clustering algorithms
¨ Study genetic pathway enrichment, up- or down-
regulated genes
¨ Biologists then study these genes more closely
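The normalize-then-compare idea in Step 4 can be sketched in a heavily simplified form. The code below scales raw counts to counts per million and computes a log2 fold change between two tissues. It deliberately stops short of the negative binomial models and likelihood ratio tests named above, which need replicate samples and a proper statistical package. The gene names, counts, and pseudocount are hypothetical.

```python
import math


def counts_per_million(counts):
    """Scale raw gene counts so each sample sums to one million."""
    total = sum(counts.values())
    return {gene: 1e6 * count / total for gene, count in counts.items()}


def log2_fold_change(cpm_a, cpm_b, pseudocount=1.0):
    """log2 ratio of sample B to sample A; the pseudocount avoids log(0)."""
    return {
        gene: math.log2((cpm_b[gene] + pseudocount) / (cpm_a[gene] + pseudocount))
        for gene in cpm_a
    }


# Hypothetical raw counts for three genes in two tissues.
heart = {"geneA": 500, "geneB": 0, "geneC": 1500}
brain = {"geneA": 1000, "geneB": 800, "geneC": 2200}

lfc = log2_fold_change(counts_per_million(heart), counts_per_million(brain))
for gene, value in sorted(lfc.items()):
    print(f"{gene}: log2 fold change (brain vs. heart) = {value:+.2f}")
```

Note that geneA has twice the raw count in brain but no change after normalization, because the brain sample was sequenced twice as deeply. This is why normalization comes before any test.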
23. Case study 2: Electronic Medical Records
¨ Medical and health records are
becoming increasingly digitized
¨ EMR can contain records of health
measurements (blood pressure),
diagnoses (depression), treatments
prescribed (statins), family history
information, and even detailed
descriptions of doctor visits (clinician
notes)
¨ Each of thousands of patients can
have dozens of records; some have
just two
¨ Question: How to select subjects with
bipolar disorder from a large pool of
patients?
24. Case study 2: Electronic Medical Records
¨ Step 1: All the records must be collected, stored, put
in a database, managed, tracked
¨ Step 2: A small subset must be read by a team of
clinicians and scored as “case” versus “control”
¨ Step 3: Transform codes and paragraphs of words
into predictors of disease
¨ Step 4: Determine important predictors of disease
and build a prediction model with these variables
¨ Step 5: Validate the model, assess its performance
¨ Step 6: Implement the model in larger pool of
subjects to select the bipolar cases for a future
genetic study
25. EMR: Step 1
¨ Step 1: All the records must be collected,
stored, put in a database, managed, tracked
¨ Computer scientists and bioinformaticians
must perform these steps (SQL, anyone?
MUMPS? Python, Perl…)
¨ Efficiency in this setting is no small task
26. EMR: Step 3
¨ Step 3: Transform codes and paragraphs of
words into predictors of disease
¨ Natural language processing (NLP) is used
by bioinformaticians to mine the paragraphs
of data for terms that occur often in cases and
less often in controls
¨ Certain words in a doctor’s note become
possible predictors of disease
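A toy sketch of the term-mining idea above: compute the fraction of notes containing each word in cases and in controls, then keep the words that are much more common in cases. Real clinical NLP is far more involved (negation, abbreviations, misspellings). The notes, function names, and threshold here are hypothetical.

```python
import re
from collections import Counter


def document_frequency(notes):
    """Fraction of notes in which each lowercase word appears at least once."""
    counts = Counter()
    for note in notes:
        counts.update(set(re.findall(r"[a-z]+", note.lower())))
    return {term: n / len(notes) for term, n in counts.items()}


def candidate_terms(case_notes, control_notes, min_difference=0.5):
    """Terms whose document frequency is higher in cases than in controls."""
    case_df = document_frequency(case_notes)
    control_df = document_frequency(control_notes)
    diffs = {term: freq - control_df.get(term, 0.0) for term, freq in case_df.items()}
    selected = [term for term, diff in diffs.items() if diff >= min_difference]
    return sorted(selected, key=lambda term: (-diffs[term], term))


# Hypothetical clinician notes, invented for illustration.
case_notes = [
    "Patient reports manic episode; lithium started.",
    "History of manic episodes, continues lithium.",
]
control_notes = [
    "Patient reports knee pain after running.",
    "Routine visit, blood pressure normal.",
]

print(candidate_terms(case_notes, control_notes, min_difference=0.75))
```

Each selected term can then become a yes/no predictor ("does this word appear in the patient's notes?") for the modeling steps that follow.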
27. EMR: Steps 4-6
¨ Steps 4-6: Determine important predictors of
disease, build a prediction model with these
variables, assess/validate performance,
implement model
¨ Biostatisticians develop
¤ high dimensional regression methods or
machine learning methods
¤ to select important predictors and build models
¤ to predict outcomes based on a large number of
variables (e.g., LASSO, support vector machines)
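To show how a penalized regression method such as the LASSO selects predictors, here is a minimal sketch using coordinate descent with soft-thresholding. It assumes the predictors and outcome are already centered (so no intercept is fit) and minimizes (1/(2n))·||y − Xβ||² + λ·||β||₁. The toy data and function names are hypothetical; real analyses would use an established, well-tested package.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; exactly zero inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iterations=200):
    """LASSO coefficients for centered X (list of rows) and centered y."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta while beta is all zeros
    for _ in range(n_iterations):
        for j in range(p):
            column = [row[j] for row in X]
            norm_sq = sum(x * x for x in column) / n
            if norm_sq == 0.0:
                continue
            # Correlation of column j with the residual that excludes column j.
            rho = sum(x * r for x, r in zip(column, residual)) / n + norm_sq * beta[j]
            new_beta = soft_threshold(rho, lam) / norm_sq
            delta = new_beta - beta[j]
            if delta != 0.0:
                residual = [r - delta * x for r, x in zip(residual, column)]
                beta[j] = new_beta
    return beta


# Hypothetical centered data: y depends (noisily) on the first predictor only.
X = [[-2, 1, 1], [-1, -1, -2], [0, 0, 2], [1, -1, -2], [2, 1, 1]]
y = [-6.1, -2.9, 0.0, 3.1, 5.9]

print(lasso_coordinate_descent(X, y, lam=0.5))
```

On this toy data the penalty shrinks the first coefficient somewhat and sets the two irrelevant coefficients exactly to zero, which is the variable-selection behavior that makes the method attractive when there are many candidate predictors.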
29. Back to me.
¨ Began with Yung-Pin’s research project on
CpG islands (related to new field of
epigenetics)
¨ Enjoyed journal clubs/biostatistics meetings
at OHSU
¨ Pure math vs. applied math vs. something
else
¨ Did you want to be a doctor? Do you want to
help people?
¨ Ended up in grad school, what did I learn?
30. Biostatistics grad school
¨ Statistics ≠ pure math!
¨ A master’s would have helped with intuition,
but not usually funded
¨ Research universities ≠ Lewis & Clark!
¨ Depend on self-teaching, your classmates,
and especially the TAs to get by (when
interviewing, meet the students!)
¨ Light teaching load, (hopefully) heavy
collaborative/consulting load
¨ Lots of women in public health (like LC)!
¨ Grad school is always hard.
31. Bioinformatics grad school
¨ So far mostly the same
¨ More focused on biology
¨ Incorporating more biology training, wet
labs
¨ Software/Bioconductor/R package
development
¨ Diverging from traditional biostat?
32. Helpful classes
¨ Statistics and probability (obviously)
¨ All the computer science classes, ever (Python,
more C!)
¨ Linear algebra
¨ Genetics (molecular biology would have been
nice, though no biology required for biostat)
¨ Advanced calculus/real analysis (for theoretical
classes such as Prob II and Inference II and
writing my thesis, not always required)
¨ Discrete math
¨ Abstract Algebra (don’t worry, not required
either)
¨ Liberal arts education in general
33. Helpful skills
¨ LaTeX
¨ R
¨ Python or Perl
¨ Unix, cluster/cloud computing
¨ Teaching/tutoring
¨ Research experience!
¨ Programming, software development
¨ C, Fortran
¨ GitHub
¨ You must enjoy talking to people, collaborating,
explaining math/stat/cs to non-mathematical
people!
34. Pros & Cons
Pros
¨ Interesting & meaningful research problems
¨ Always in demand, more so every day
¨ Collaborate with clinicians, biologists,
researchers of all kinds
¨ Salary isn’t too shabby
Cons
¨ Soft money :(
¨ Grants, grants, always grants (but not
necessarily our own)
35. Last thoughts
¨ Consider Epidemiology
¨ Applied vs. theoretical research
¨ My day: mostly programming (cleaning data +
analysis, simulations), lots of meetings, a bit of
pen & paper research and thinking of new grants,
reading articles, reading clinical trial protocols,
sample size and power calculations
¨ This will vary depending on where you work
¨ Master’s vs. PhD
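As an illustration of the sample size and power calculations mentioned above, here is a minimal sketch for comparing two group means, using the standard normal-approximation formula n = 2·((z₁₋α/₂ + z_power)·σ/δ)² per group. The function name and the example effect size are assumptions; a real trial would use the design-specific calculation called for in its protocol.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect, sd, alpha=0.05, power=0.80):
    """Subjects per group for a two-sided, two-sample z-test of means.

    effect: smallest difference in means worth detecting
    sd: assumed common standard deviation in each group
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_power) * sd / effect) ** 2
    return math.ceil(n)


# Hypothetical target: detect a difference of half a standard deviation.
print(sample_size_per_group(effect=0.5, sd=1.0))  # 63
```

Halving the detectable effect roughly quadruples the required sample size, which is usually the first thing collaborators need to hear.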
36. More talks like this
¨ Excellent overview of bioinformatics & computational biology fields and
careers in medicine by Dr. Shannon McWeeney (http://www.biodevlab.org/) at OHSU
https://ohsu.adobeconnect.com/_a46054336/p61byw86754/?launcher=false&fcsContent=true&pbMode=normal
¨ Rafa Irizarry’s (at HSPH http://rafalab.dfci.harvard.edu/) math major talk:
https://www.youtube.com/watch?v=gXeWdvHKTQQ
¨ Plenty of interesting talks at JSM, the big statistics conference. It is in Boston
this year (http://www.amstat.org/meetings/jsm/2014/index.cfm) and will be nearby
in Seattle in August 2015; see http://www.amstat.org/meetings/jsm.cfm
37. Learning resources
¨ Summer Institute for Training in Biostatistics (for undergrads)
http://www.nhlbi.nih.gov/funding/training/redbook/sibsweb.htm
¤ U Wisc at Madison, Columbia, Emory, Boston U, NC State, U of Iowa, U of Minnesota, U
of Pittsburgh (All of the websites have “What is Biostatistics?” pages)
¨ MOOCs (Massive Open Online Courses)
¤ Learn R
http://www.flaviobarros.net/2014/03/14/online-multimedia-resources-learn-r
¤ Learn biostats https://www.coursera.org/course/biostats
¤ Learn statistical learning
https://class.stanford.edu/courses/HumanitiesScience/StatLearning/Winter2014/about
¤ Learn bioinformatics http://www.langmead-lab.org/teaching-materials/ and
http://rosalind.info/problems/list-view/
¨ UW’s Summer Institutes (scholarships for students)
¤ Statistical Genetics; Statistics and Modeling in Infectious Diseases; Statistics for
Clinical Research
¨ Comprehensive list of job postings for statistics/biostatistics/bioinformatics:
http://www.stat.ufl.edu/jobs/
38. The internet
¨ YouTube
¤ Rafa Irizarry’s YouTube channel (especially
http://youtu.be/gXeWdvHKTQQ)
¨ Simply Statistics blog (http://simplystatistics.org/)
¨ R-bloggers
¨ Getting Genetics Done blog
(http://gettinggeneticsdone.blogspot.com/)
¨ FiveThirtyEight (http://fivethirtyeight.com/)
¨ Neat summary measure of types of research
done in various departments (biased toward
east coast) https://muschellij2.shinyapps.io/ENAR_Over_Time/