Math, Stats and CS in Public Health and Medical Research

Jessica Minnier, OHSU,
Lewis & Clark College Mathematics Colloquium, 3.19.14

“Biostatistics (a portmanteau of biology and
statistics; sometimes referred to as biometry or
biometrics) is the application of statistics to a
wide range of topics in biology.” – Wikipedia
or, “What is Biostatistics?”
“Bioinformatics is an interdisciplinary scientific field
that develops methods for storing, retrieving,
organizing and analyzing biological data” – Wikipedia
“Computational biology involves the development
and application of data-analytical and theoretical
methods, mathematical modeling and computational
simulation techniques to the study of biological,
behavioral, and social systems.” – Wikipedia

Sample (n = 1)
¨  L&C mathematics major (2007), CS minor
¨  PhD in Biostatistics (2007-2012)
¤  “Inference and Prediction for High Dimensional Data via
Penalized Regression and Kernel Machine Methods”
¨  Postdoc (2012-2013)
¤  Cancer risk prediction with gene-environment
interactions
¨  Assistant Professor (2013-now)
v  Division of Biostatistics
v  Department of Public Health & Preventive Medicine
v  School of Medicine (soon to be School of Public Health)
v  Oregon Health & Science University

Outline
¨  Biostatistics and Bioinformatics/
Computational Biology
¤  More interesting definitions, research examples,
case studies
¤  Types of careers
¨  My trajectory
¤  LC math to grad school to jobs
¨  Resources and advice

Biostatistics, in the news.
Comics from Jim Borgman; XKCD; also fun:
http://stats.stackexchange.com/questions/423/what-is-your-favorite-data-analysis-cartoon
In summary:
A poor understanding of statistics makes everyone look bad.

Biostatistics, in the news
Forbes

Applied math?
¨  Applied mathematics often studies deterministic
models (engineering and mechanics, population
models, cryptography)
¨  Some questions can’t be solved by deterministic
models, but a partial answer can be given with
statistics
¤  Does smoking cause lung cancer? (inference from
observational studies)
¤  Is it going to rain tomorrow? (stochastic model)
¤  Do statins lower cholesterol? (randomized trial)
Rafa Irizarry’s math major talk: https://www.youtube.com/watch?v=gXeWdvHKTQQ

Example data
¨  Collection of measurements from a sampled
population
¨  Measurements of a lab experiment
¨  Medical images of subjects’ brains over time
¨  Results of a clinical trial
¨  Gene expression from different types of cultured
tissue
¨  Simulated data modeling HIV progression
¨  Values from electronic medical records sampled
retrospectively
¨  3 million genetic mutations from 20,000 subjects
Brian Caffo’s MOOC: Biostatistics Bootcamp I, lecture 1

Inform medical decisions
¨  A large clinical trial in 2002 by the Women’s
Health Initiative was stopped early due to
preliminary data showing that hormone
replacement therapy had a negative health
impact.
¨  These data contradicted prior evidence on the
efficacy of HRT for postmenopausal women.
¨  Statistical decision to end the trial, prevent
further harm
Brian Caffo’s MOOC: Biostatistics Bootcamp I, lecture 1; JAMA 2002;288(3):321-333

Inform medical decisions
¨  Guidelines for mammogram screening
based on probabilities of false positives and
negatives, cost-benefit analyses, survival
analysis
¨  Analysis of adverse effects in a clinical trial
determines drug safety, dosage,
subpopulations
¨  Even the general public must weigh
risk when making their own medical
decisions
¨  Experts cannot make decisions without data
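
To make the role of false positives in screening guidelines concrete, here is a minimal sketch using Bayes' rule. The prevalence, sensitivity, and specificity values are illustrative assumptions only; they are not taken from the talk or from any mammography study.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(disease | positive test), computed with Bayes' rule."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)


def negative_predictive_value(prevalence, sensitivity, specificity):
    """P(no disease | negative test), computed with Bayes' rule."""
    true_neg = (1 - prevalence) * specificity
    false_neg = prevalence * (1 - sensitivity)
    return true_neg / (true_neg + false_neg)


# Hypothetical inputs chosen only to illustrate the calculation.
ppv = positive_predictive_value(prevalence=0.01, sensitivity=0.90, specificity=0.90)
npv = negative_predictive_value(prevalence=0.01, sensitivity=0.90, specificity=0.90)
print(f"PPV = {ppv:.3f}, NPV = {npv:.4f}")
```

Under these assumed inputs only about 8% of positive results correspond to true disease, which is why the rate of false positives weighs so heavily when a rare condition is screened for.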

Bioinformatics & Computational
biology
¨  Sequencing the human genome (aligning,
matching, searching)
¨  Algorithms for turning massive information from
electronic medical records into useful predictors
of disease progression
¨  Machine learning algorithms for risk prediction
models with large and complex data (imaging,
genetic)
¨  Analysis of networks (protein interactions,
genetic pathways, social behavior influencing
health outcomes)
¨  Simulation of complex data (methylation
patterns in the genome)

Biomathematics
¨  Mathematical models to study infectious
disease progression (in a population or in a
body’s cells)
¨  Steady-state simulations of cancer cell
growth
¨  Usually in joint biostatistics/biomathematics
or applied mathematics departments, some
epidemiology
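
As one possible illustration of a mathematical model of infectious disease in a population, the sketch below integrates the classic SIR (susceptible, infected, recovered) equations with a simple Euler scheme. The transmission rate, recovery rate, and population size are made-up values, and real models are usually richer than this.

```python
def simulate_sir(beta, gamma, s0, i0, r0, days, dt=0.1):
    """Euler integration of the SIR model:
    dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I, dR/dt = gamma*I."""
    s, i, r = float(s0), float(i0), float(r0)
    n = s + i + r
    steps_per_day = round(1 / dt)
    history = [(0, s, i, r)]
    for day in range(1, days + 1):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / n * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        history.append((day, s, i, r))
    return history


# Hypothetical parameters: beta/gamma = 3 gives a growing epidemic.
history = simulate_sir(beta=0.3, gamma=0.1, s0=990, i0=10, r0=0, days=160)
peak_day, _, peak_infected, _ = max(history, key=lambda row: row[2])
print(f"Infections peak around day {peak_day} with about {peak_infected:.0f} people infected")
```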

Where do we work?
(non-random sample = my classmates)
¨  Assistant professors: OHSU School of Medicine, UNC School of Medicine, UIUC
Statistics Dept, University of New Mexico School of Medicine
¨  Consultant/Manager, Analysis Group
¨  Assistant Member, RAND Corporation
¤  Nonprofit global policy think tank
¨  Computational Biologist, Genentech
¨  Instructors: UPenn School of Medicine, Harvard School of Public Health
¨  Research Associate, Dana Farber Cancer Institute
¨  Statistician, Partners Health Care
¨  Other possibilities:
¤  Government: National Institutes of Health, Food and Drug Administration, Centers for
Disease Control and Prevention, WHO, health departments in foreign countries
¤  Google, Intel, etc.
¤  Liberal arts colleges or smaller universities focused on teaching
¤  Pharma, Consulting, Labs, Hospitals, Hospital Research Centers, Research Institutes,
Universities

Real data, please?
¨  Two examples…

Case study 1: RNA-Seq Data
¨  RNA sequencing uses
Next Generation
Sequencing (NGS) to
detect and quantify the
RNA present in a
sample at a moment in
time
¨  Studies the dynamic
transcriptome of a cell
¨  The problem: Compare
expressions of genes in
heart vs. brain tissues?
Which genes are turned
off in heart and on in
brain?

Case study 1: RNA-Seq Data
¨  Step 1: Biologists collect samples, send to lab
for sequencing
¨  Step 2: Genetic material is transformed into
millions of ‘reads’
¤  AACTAGACCTGG
¨  Step 3: The reads are mapped to the genome,
transformed into counts for each gene
¨  Step 4: The distribution of gene counts for
different tissues is compared

RNA-seq: Step 3
¨  Step 3: The reads are mapped to the genome,
transformed into counts for each gene
¨  Computational biologists developed fast
searching algorithms to map a short read
(likely containing errors) to a genome with
millions of base pairs, much repetition, some
variability (SNPs)
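
To show why fast search algorithms are needed, here is a deliberately naive mapper that slides a read along a toy genome and accepts a small number of mismatches. The genome string is invented around the read shown in Step 2; a scan like this costs time proportional to genome length times read length for every read, which is far too slow for millions of reads.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def naive_map(read, genome, max_mismatches=1):
    """Return every 0-based offset where the read aligns with at most
    max_mismatches substitutions (no insertions or deletions)."""
    hits = []
    for start in range(len(genome) - len(read) + 1):
        window = genome[start:start + len(read)]
        if hamming(read, window) <= max_mismatches:
            hits.append(start)
    return hits


read = "AACTAGACCTGG"  # the read shown in Step 2
# Toy genome: one exact copy of the read and one copy with a single substitution.
genome = "TTGACC" + read + "ATCG" + "AACTAGTCCTGG" + "A"

exact_hits = naive_map(read, genome, max_mismatches=0)
hits = naive_map(read, genome, max_mismatches=1)
print("exact:", exact_hits, "with one mismatch allowed:", hits)
```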

RNA-seq: Step 3
¨  Bowtie (Langmead 2009
Genome Biology)
incorporated Burrows-Wheeler
indexing to shorten the
mapping to less than a day
(it used to take days if not
months)
http://www.cs.jhu.edu/~langmea/resources/lecture_notes/bwt_and_fm_index.pdf
¨  TopHat (Trapnell 2009
Bioinformatics) can detect
splicing junctions where
certain genes code for
multiple proteins via
alternatively spliced mRNA
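
The idea behind Burrows-Wheeler indexing can be sketched in a few lines. This toy version builds the suffix array by plain sorting and stores full occurrence tables, so it is only suitable for tiny strings and handles exact matches only; it is not how Bowtie itself is implemented, and the example genome is made up.

```python
def build_fm_index(text):
    """Build a toy FM index (suffix array, first-column offsets, occurrence table)."""
    text += "$"  # sentinel that sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # character preceding each sorted suffix
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    first, total = {}, 0
    for ch in sorted(counts):  # first[ch] = number of characters smaller than ch
        first[ch] = total
        total += counts[ch]
    occ = {ch: [0] for ch in counts}  # occ[ch][k] = occurrences of ch in bwt[:k]
    for b in bwt:
        for ch in occ:
            occ[ch].append(occ[ch][-1] + (b == ch))
    return sa, first, occ


def find(pattern, index):
    """Backward search: one step per pattern character, independent of genome length."""
    sa, first, occ = index
    lo, hi = 0, len(sa)
    for ch in reversed(pattern):
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


genome = "GATTACAGATTACA"  # hypothetical toy genome
index = build_fm_index(genome)
positions = find("ATTA", index)
print(positions)
```

Once the index is built, each lookup takes a number of steps proportional to the read length rather than the genome length, which is the property that makes this family of indexes attractive for read mapping.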

RNA-seq: Step 4
¨  Step 4: The distribution of gene counts for
different tissues is compared
¨  Bioinformaticians and biostatisticians clean the
data, normalize the data, and conduct statistical
tests to determine if certain genes are
expressed in one tissue differently than another
¨  Tests based on models: negative binomial
distribution of counts, likelihood ratio tests
¨  Clustering algorithms
¨  Study genetic pathway enrichment, up- or down-
regulated genes
¨  Biologists then study these genes more closely
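
A rough sketch of a likelihood ratio test on negative binomial counts for a single gene follows. The counts are invented, the dispersion (`size`) is assumed known and fixed, and library-size normalization is skipped; real analyses estimate dispersion from the data and normalize first.

```python
import math


def nb_loglik(counts, mean, size):
    """Log-likelihood under a negative binomial with the given mean and
    size parameter (variance = mean + mean**2 / size)."""
    if mean == 0:
        return 0.0 if all(c == 0 for c in counts) else float("-inf")
    p = size / (size + mean)
    return sum(
        math.lgamma(c + size) - math.lgamma(size) - math.lgamma(c + 1)
        + size * math.log(p) + c * math.log1p(-p)
        for c in counts
    )


def lrt_two_groups(group_a, group_b, size):
    """Test 'one common mean' against 'a separate mean per group'.
    With size fixed, the maximum likelihood estimate of each mean is the sample mean."""
    def mean(xs):
        return sum(xs) / len(xs)

    pooled = group_a + group_b
    ll_null = nb_loglik(pooled, mean(pooled), size)
    ll_alt = nb_loglik(group_a, mean(group_a), size) + nb_loglik(group_b, mean(group_b), size)
    stat = max(0.0, 2 * (ll_alt - ll_null))
    p_value = math.erfc(math.sqrt(stat / 2))  # chi-square survival function, 1 df
    return stat, p_value


# Hypothetical read counts for one gene in four heart and four brain samples.
heart = [12, 9, 15, 11]
brain = [95, 120, 88, 102]
stat, p_value = lrt_two_groups(heart, brain, size=10.0)
print(f"LRT statistic = {stat:.2f}, p = {p_value:.2g}")
```

In practice this test is repeated for every gene, so the resulting p-values also need a multiple-testing adjustment.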

Heatmap and
dendrogram from
cluster algorithm
comparing genes
in cultured mouse
heart and brain
tissues
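
The dendrogram in a figure like this comes from hierarchical clustering. Below is a small sketch of agglomerative clustering with average linkage on made-up expression profiles; the gene names and values are hypothetical, and the actual figure may have used a different distance or linkage.

```python
import math
from itertools import combinations


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def average_linkage(profiles):
    """Repeatedly merge the two closest clusters; the list of merges,
    read bottom-up, is the dendrogram."""
    clusters = {name: [name] for name in profiles}
    merges = []

    def distance(a, b):
        pairs = [(x, y) for x in clusters[a] for y in clusters[b]]
        return sum(euclidean(profiles[x], profiles[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda ab: distance(*ab))
        merges.append((a, b, round(distance(a, b), 3)))
        merged = clusters.pop(a) + clusters.pop(b)
        clusters[f"({a},{b})"] = merged
    return merges


# Hypothetical log-expression values in (heart1, heart2, brain1, brain2).
profiles = {
    "geneA": [8.1, 7.9, 1.2, 1.0],
    "geneB": [7.5, 8.0, 0.8, 1.1],
    "geneC": [1.0, 1.3, 6.9, 7.2],
    "geneD": [0.9, 1.1, 7.4, 7.0],
}
merges = average_linkage(profiles)
for a, b, height in merges:
    print(f"merge {a} + {b} at height {height}")
```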

Case study 2:
Electronic Medical Records
¨  Medical and health records are
becoming increasingly digitized
¨  EMR can contain records of health
measurements (blood pressure),
diagnoses (depression), treatments
prescribed (statins), family history
information, and even detailed
descriptions of doctor visits (clinician
notes)
¨  Thousands of patients can have
dozens of records, while some have just
two
¨  Question: How to select subjects with
bipolar disorder from a large pool of
patients?

Case study 2:
Electronic Medical Records
¨  Step 1: All the records must be collected, stored, put
in a database, managed, tracked
¨  Step 2: A small subset must be read by a team of
clinicians and scored as “case” versus “control”
¨  Step 3: Transform codes and paragraphs of words
into predictors of disease
¨  Step 4: Determine important predictors of disease
and build a prediction model with these variables
¨  Step 5: Validate the model, assess its performance
¨  Step 6: Implement the model in larger pool of
subjects to select the bipolar cases for a future
genetic study

EMR: Step 1
¨  Step 1: All the records must be collected,
stored, put in a database, managed, tracked
¨  Computer scientists and bioinformaticians
must perform these steps (SQL, anyone?
MUMPS? Python, Perl…)
¨  Efficiency in this setting is no small task
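
As a small taste of the database side, here is a sketch using Python's built-in `sqlite3` module. The table layout, the diagnosis labels (`BP`, `HTN`, `DEP`), and the records themselves are all hypothetical; a real EMR system is far larger and more complicated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE visits (patient_id INTEGER, visit_date TEXT, dx_code TEXT, note TEXT)"
)
rows = [
    (1, "2013-01-05", "BP", "manic episode, started lithium"),
    (1, "2013-03-12", "BP", "mood stable on lithium"),
    (1, "2013-06-30", "HTN", "blood pressure check"),
    (2, "2013-02-11", "HTN", "blood pressure check"),
    (2, "2013-04-02", "DEP", "low mood, poor sleep"),
    (3, "2013-05-20", "BP", "possible manic symptoms"),
]
conn.executemany("INSERT INTO visits VALUES (?, ?, ?, ?)", rows)

# Patients with at least two visits carrying the (made-up) bipolar code.
query = """
    SELECT patient_id, COUNT(*) AS n_visits
    FROM visits
    GROUP BY patient_id
    HAVING SUM(CASE WHEN dx_code = 'BP' THEN 1 ELSE 0 END) >= 2
    ORDER BY patient_id
"""
candidates = conn.execute(query).fetchall()
print(candidates)
```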

EMR: Step 3
¨  Step 3: Transform codes and paragraphs of
words into predictors of disease
¨  Natural language processing (NLP) is used
by bioinformaticians to mine the paragraphs
of data for terms that occur often in cases and
less often in controls
¨  Certain words in a doctor’s note become
possible predictors of disease
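
A very simple version of this term mining can be sketched with the standard library: count how many notes mention each word among cases and among controls, then rank words by a smoothed log ratio. The notes below are invented, and real NLP pipelines also handle negation, misspellings, and medical vocabularies.

```python
import math
import re
from collections import Counter


def document_frequency(notes):
    """Number of notes in which each lowercase word appears at least once."""
    df = Counter()
    for note in notes:
        df.update(set(re.findall(r"[a-z]+", note.lower())))
    return df


def enriched_terms(case_notes, control_notes, min_log_ratio=1.0):
    """Words mentioned in a larger share of case notes than control notes."""
    case_df = document_frequency(case_notes)
    control_df = document_frequency(control_notes)
    scored = []
    for term in set(case_df) | set(control_df):
        # Add-one smoothing keeps the ratio finite when a word is absent from one group.
        case_rate = (case_df[term] + 1) / (len(case_notes) + 2)
        control_rate = (control_df[term] + 1) / (len(control_notes) + 2)
        log_ratio = math.log(case_rate / control_rate)
        if log_ratio >= min_log_ratio:
            scored.append((term, log_ratio))
    return sorted(scored, key=lambda ts: (-ts[1], ts[0]))


# Hypothetical clinician notes.
case_notes = [
    "Patient reports manic episode and reduced sleep",
    "History of mania; started lithium",
    "Manic symptoms improved on lithium",
]
control_notes = [
    "Patient reports knee pain after running",
    "Follow up for hypertension; sleep is normal",
    "Knee pain improved with rest",
]
top_terms = enriched_terms(case_notes, control_notes)
print(top_terms)
```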

EMR: Step 4-6
¨  Step 4-6: Determine important predictors of
disease, build a prediction model with these
variables, assess/validate performance,
implement model
¨  Biostatisticians develop
¤  high dimensional regression methods or
machine learning methods
¤  to select important predictors and build models
¤  to predict outcomes based on a large number of
variables (e.g., LASSO, support vector
machines)

Regularized logistic regression with NLP predictors
Solution path for coefficients of predictors
based on adaptive LASSO
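
To give a feel for what a solution path is, the sketch below fits L1-penalized logistic regression at several penalty values using proximal gradient descent. It implements the plain LASSO rather than the adaptive LASSO named on the slide, and the binary "term present in note" predictors and outcomes are invented for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(x, t):
    """Shrink x toward zero by t; values within [-t, t] become exactly 0."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def lasso_logistic(X, y, lam, step=0.5, n_iter=2000):
    """Proximal gradient descent on mean logistic loss + lam * sum(|beta_j|).
    The intercept is not penalized."""
    n, p = len(X), len(X[0])
    intercept, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        resid = [sigmoid(intercept + sum(b * x for b, x in zip(beta, row))) - yi
                 for row, yi in zip(X, y)]
        grad = [sum(r * row[j] for r, row in zip(resid, X)) / n for j in range(p)]
        intercept -= step * sum(resid) / n
        beta = [soft_threshold(b - step * g, step * lam) for b, g in zip(beta, grad)]
    return intercept, beta


# Hypothetical data: does each note contain the term? (1 = yes)
terms = ["manic", "lithium", "knee"]
X = [
    [1, 1, 0], [1, 1, 1], [1, 0, 0], [1, 1, 0], [1, 0, 1], [0, 1, 0],  # cases
    [0, 0, 1], [0, 0, 0], [0, 1, 1], [0, 0, 0], [1, 0, 1], [0, 0, 0],  # controls
]
y = [1] * 6 + [0] * 6

# Coefficients along a decreasing sequence of penalties (the "solution path").
path = {lam: lasso_logistic(X, y, lam)[1] for lam in (0.2, 0.15, 0.05, 0.01)}
for lam, beta in path.items():
    print(lam, dict(zip(terms, (round(b, 3) for b in beta))))
```

With a large penalty every coefficient is exactly zero; as the penalty shrinks, the most predictive terms enter the model first, which is what a solution path plot displays.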

Back to me.
¨  Began with Yung-Pin’s research project on
CpG islands (related to new field of
epigenetics)
¨  Enjoyed journal clubs/biostatistics meetings
at OHSU
¨  Pure math vs. applied math vs. something
else
¨  Did you want to be a doctor? Do you want to
help people?
¨  Ended up in grad school, what did I learn?

Biostatistics grad school
¨  Statistics ≠ pure math!
¨  A master’s would have helped with intuition,
but is not usually funded
¨  Research universities ≠ Lewis & Clark!
¨  Depend on self-teaching, your classmates,
and especially the TAs to get by (when
interviewing, meet the students!)
¨  Light teaching load, (hopefully) heavy
collaborative/consulting load
¨  Lots of women in public health (like LC)!
¨  Grad school is always hard.

Bioinformatics grad school
¨  So far mostly the same
¨  More focused on biology
¨  Incorporating more biology training, wet
labs
¨  Software/Bioconductor/R package
development
¨  Diverging from traditional biostat?

Helpful classes
¨  Statistics and probability (obviously)
¨  All the computer science classes, ever (Python,
more C!)
¨  Linear algebra
¨  Genetics (molecular biology would have been
nice, though no biology required for biostat)
¨  Advanced calculus/real analysis (for theoretical
classes such as Prob II and Inference II and
writing my thesis, not always required)
¨  Discrete math
¨  Abstract Algebra (don’t worry, not required
either)
¨  Liberal arts education in general

Helpful skills
¨  LaTeX
¨  R
¨  Python or Perl
¨  Unix, cluster/cloud computing
¨  Teaching/tutoring
¨  Research experience!
¨  Programming, software development
¨  C, Fortran
¨  GitHub
¨  You must enjoy talking to people, collaborating,
explaining math/stat/cs to non-mathematical
people!

Pros & Cons
Pros
¨  Interesting & meaningful research problems
¨  Always in demand, more so every day
¨  Collaborate with clinicians, biologists,
researchers of all kinds
¨  Salary isn’t too shabby
Cons
¨  Soft money ☹
¨  Grants, grants, always grants (but not
necessarily our own)

Last thoughts
¨  Consider Epidemiology
¨  Applied vs. theoretical research
¨  My day: mostly programming and writing
code (cleaning data + analysis, simulations),
lots of meetings, a bit of pen & paper
research and thinking of new grants, reading
articles, reading clinical trial protocols,
sample size and power calculations
¨  This will vary depending on where you work
¨  Master’s vs. PhD
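
As an example of the sample size and power calculations mentioned above, here is a sketch for comparing two group means with a two-sided z-test (normal approximation, equal group sizes, known common standard deviation). The effect size and standard deviation are hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Subjects needed per group to detect a mean difference delta."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)


def power_two_sample(n, delta, sigma, alpha=0.05):
    """Approximate power with n subjects per group
    (ignores the negligible chance of rejecting in the wrong direction)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(delta) / (sigma * math.sqrt(2 / n)) - z_alpha)


# Hypothetical design: detect a 5-unit difference when the standard deviation is 10.
n = n_per_group(delta=5, sigma=10)
print(n, round(power_two_sample(n, delta=5, sigma=10), 3))
```

Under these assumptions the calculation asks for 63 subjects per group; a t-test based calculation would ask for slightly more.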

More talks like this
¨  Excellent overview of bioinformatics & computational biology fields and
careers in medicine by Dr. Shannon McWeeney (http://www.biodevlab.org/) at OHSU:
https://ohsu.adobeconnect.com/_a46054336/p61byw86754/?launcher=false&fcsContent=true&pbMode=normal
¨  Rafa Irizarry’s (at HSPH http://rafalab.dfci.harvard.edu/) math major talk:
https://www.youtube.com/watch?v=gXeWdvHKTQQ
¨  Plenty of interesting talks at JSM, the big statistical meeting/conference
(http://www.amstat.org/meetings/jsm.cfm). It is in Boston this year
(http://www.amstat.org/meetings/jsm/2014/index.cfm) and will be nearby in
Seattle in August of 2015

Learning resources
¨  Summer Institute for Training in Biostatistics (for undergrads)
http://www.nhlbi.nih.gov/funding/training/redbook/sibsweb.htm
¤  U Wisc at Madison, Columbia, Emory, Boston U, NC State, U of Iowa, U of Minnesota, U
of Pittsburgh (All of the websites have “What is Biostatistics?” pages)
¨  MOOCs (Massive Open Online Courses)
¤  Learn R
http://www.flaviobarros.net/2014/03/14/online-multimedia-resources-learn-r
¤  Learn biostats https://www.coursera.org/course/biostats
¤  Learn statistical learning
https://class.stanford.edu/courses/HumanitiesScience/StatLearning/Winter2014/about
¤  Learn bioinformatics http://www.langmead-lab.org/teaching-materials/ and
http://rosalind.info/problems/list-view/
¨  UW’s Summer Institutes (scholarships for students)
¤  Statistical Genetics; Statistics and Modeling in Infectious Diseases; Statistics for
Clinical Research
¨  Comprehensive list of job postings for statistics/biostatistics/bioinformatics:
http://www.stat.ufl.edu/jobs/

The internet
¨  YouTube
¤  Rafa Irizarry’s YouTube channel (especially
http://youtu.be/gXeWdvHKTQQ)
¨  Simply Statistics blog (http://simplystatistics.org/)
¨  R-bloggers
¨  Getting Genetics Done blog
(http://gettinggeneticsdone.blogspot.com/ )
¨  FiveThirtyEight (http://fivethirtyeight.com/)
¨  Neat summary measure of types of research
done in various departments (biased toward
east coast) https://muschellij2.shinyapps.io/ENAR_Over_Time/

Questions?
¨  minnier@ohsu.edu
