Data Science 101
Robert Hoyt MD FACP
January 12, 2017

Disclaimer
• I have no conflicts of interest to report
• The opinions presented are those of the author
and do not necessarily reflect those of the
University of West Florida

Learning Objectives
Upon completion of the presentation participants
should be able to:
• Summarize the characteristics of data science
• Summarize the skill sets for data scientists
• Compare and contrast predictive analytics using
statistics vs. machine learning
• Enumerate features of IBM Watson Analytics
(IBMWA)
• Enumerate features of WEKA machine learning
• List the challenges facing data science

Definitions
• Data science is “the scientific study of the
creation, validation and transformation of data to
create meaning.” 1 Because data science is
relatively new, definitions are still evolving. Data
science is a good “umbrella” term.
• Analytics is “the discovery and communication
of meaningful patterns in data.” While some
would argue for separating data analytics from
data mining and knowledge discovery from data
(KDD), we will use the terms interchangeably. 2

Venn diagram of Data Science (figure)

Critical need for data scientists with:
• Domain expertise (example: healthcare)
• In-depth statistical knowledge
• Computer science expertise
• Machine learning expertise
• Programming expertise: R, SQL and Python
languages
• Relational database system (RDBS) knowledge
• Comfort level with “Big Data”

Historical Background
• While all industries (including sports) are
incorporating analytics and data science, the
business world was first.
• Businesses benefitted from knowing which
customers were likely to unsubscribe (churn) and
whether a customer who purchased item A would
also purchase item B (market basket analysis).
• As far back as the 1960s a small group of
statisticians suggested their field should be
broadened to handle more volume and variety of
data.
• In the 1990s computer scientists developed and
promoted machine learning software.

Historical background
• There is evidence that many healthcare workers
lack training in statistics and machine learning. 3
• There is also evidence that statistics is not easy to
teach to non-statisticians and difficult to retain. 4
• Statisticians recommend knowledge of calculus
and linear algebra, which healthcare workers do
not routinely study. They also often prefer that
statistical formulas be calculated longhand.

Stats: Logistic Regression
Would you prefer approach A or B? (figure comparing approaches A and B)

Historical Background
• As a result, many workers are not comfortable
with statistics and data analytics
• This observation flies in the face of a health data
explosion and a shortage of data scientists
• The data explosion is fueled by genomic, EHR-
related, wearable technology and social media
data.
• About 75% of healthcare data is unstructured
(free text) and therefore difficult to analyze
• Enter the Big Data era to further confuse matters

Big Data Definition
• #1 Too much data to analyze on one computer
• #2 The Five V’s
  • Volume: massive amounts of data are being generated
  • Velocity: data is being generated so rapidly that it needs
    to be analyzed without placing it in a database
  • Variety: roughly 80% of data in existence is unstructured
    so it won’t fit into a database or spreadsheet.
  • Veracity: current data can be “messy” with missing data
    and other challenges.
  • Value: data scientists now have the capability to turn
    large volumes of unstructured data into something
    meaningful. 5

Data Science is part of the federal
vision of a healthcare system
• Learning health system: “an ecosystem where all
stakeholders can securely, effectively and efficiently
contribute, share and analyze data” 6 (the PDCA
cycle)
• Precision medicine: “identifying which approaches
will be effective for which patients based on genetic,
environmental, and lifestyle factors.” Clearly this
initiative requires a big data approach to integrate
these data. 7
• Population health requires data analytics
• Value-based care requires data

Types of analytics (Gartner)
Predictive analytics is characterized by four
attributes:
1. An emphasis on prediction
2. Rapid analysis measured in hours or
days
3. An emphasis on the business
relevance of the resulting insights (no
ivory tower analyses)
4. An emphasis on ease of use, thus
making the tools accessible to business
users. 8

Predictive Analytics
• It could be argued that predictive analytics is the
most important aspect of data science, where an
outcome of importance is predicted based on
multiple factors influencing the outcome. This is
the area I will focus on
• Use cases will be discussed in the next slide
• I will not cover:
  • Text mining with natural language processing (NLP),
    which is very important for mining unstructured data
  • Data visualization software, such as Tableau and
    QlikSense, used for descriptive analytics
  • Deep learning based on artificial intelligence (AI)

Predictive Analytics Use Cases
• Predict poor patient outcomes (morbidity)
  • Sepsis prediction 9
  • Impending renal failure 10
• Predict death (mortality)
• Predict readmissions: in fiscal 2016 only 24%
(799/3400) of reporting hospitals will not receive a
penalty (0.1%-3% range) for too many
readmissions. 11-12
• Predict high cost patients for population health
care management: 5% of Medicare/Medicaid
patients use 50% of resources. 13-14

Predictive Modeling Approaches
1. Modeling with statistics
2. Modeling with Machine Learning
3. Modeling using the R or Python programming
languages (not covered)

Predictive analytics (diagram)
Design the model with statistics (statistical modeling), machine learning or programming languages
Model types shown: association, regression, classification, clustering

Data Science/Analytics Process

Predictive Analytics
• The most common approach is classification,
where you predict a categorical outcome
(dependent variable), e.g. lived or died, from
multiple predictors (independent variables). For
example, you have a data set of pregnant women
with Zika virus. Some have children with
microcephaly and others don’t. You run a
classification model to see if factors such as age,
trimester of infection, fever, symptoms, etc.
predict microcephaly
• If the outcome is numerical then you would use
linear regression
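To make the classification idea concrete, here is a minimal sketch of logistic regression fitted by gradient descent in plain Python. It is only an illustration: the function names, the learning rate and the eight data rows are all invented for this example and carry no clinical meaning. It is not how IBMWA or WEKA fit their models.

```python
import math

def sigmoid(z):
    """Squash any real number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(weights, bias, x):
    """Predicted probability that the outcome is present for predictors x."""
    return sigmoid(bias + sum(w * xj for w, xj in zip(weights, x)))

def classify(weights, bias, x, threshold=0.5):
    """Turn the probability into a categorical prediction (1 or 0)."""
    return 1 if predict_probability(weights, bias, x) >= threshold else 0

def train_logistic(rows, labels, learning_rate=0.1, epochs=5000):
    """Fit logistic regression weights with plain batch gradient descent."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = predict_probability(weights, bias, x) - y
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias

# Invented toy data (assumption, no clinical meaning):
# each row is [trimester of infection, fever (0 = no, 1 = yes)]
rows = [[1, 1], [1, 0], [1, 1], [2, 1], [2, 0], [3, 0], [3, 1], [3, 0]]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = outcome present, 0 = outcome absent

weights, bias = train_logistic(rows, labels)
print("weights:", weights, "bias:", bias)
print("first trimester with fever ->", classify(weights, bias, [1, 1]))
print("third trimester, no fever  ->", classify(weights, bias, [3, 0]))
```

With a numerical outcome, the same predictors would instead feed a linear regression, which predicts a number rather than a category.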

Need for better data analytical tools
• We would benefit from more user friendly tools and
some degree of automation
• MS Excel with the Analysis ToolPak add-in is a
possibility, but it assumes you know which statistical tests to use
• There are also multiple statistical packages, such as
SPSS and SAS, but they come with a steep learning
curve

Need for better data analytical tools
• Tool #1: IBM Watson Analytics: automated
predictive, descriptive and visualization analytics
• Tool #2: WEKA: open source machine learning
platform

IBM Watson Analytics
• New program offered in 2015 that is not related to Watson
Health (cognitive computing). Business oriented
• Program is built on SPSS statistical tests. Covers
regression, classification, decision trees, chi-square,
t-tests, etc.
• Program can automatically convert nominal data to
numerical and vice versa
• Versions
  • Free
  • Professional (Academic)

IBM Watson Analytics Academic
program
• Free for universities to use for teaching (non-commercial)
purposes. Includes 100 students/professor/year
• University of West Florida has used the program for about
12 months in a Health Informatics graduate course and a
Data Mining (computer science) course
• IBM did an on-site visit for training
• Multiple videos on YouTube
• PDF user guide available

IBM Watson analytics features
• IBMWA is completely online
• Accepts Excel and CSV input, as well as feeds from
most relational database systems (RDBSs)
• 100 GB storage
• Limits: 500 columns and 10 million rows
• Twitter Feed analysis

Our 2016 Review Article
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5080525/

IBMWA versions
• About the time we submitted our detailed analysis of Watson
Analytics, IBM had created Watson-2 that combined
“Explore” and “Predict” into “Discover.” Watson-1 will be
retired shortly.
• Watson-2 includes statistical details about prediction when
the target/outcome is numerical. They are working on
adding the statistical details for categorical
targets/outcomes.
• Watson-2 includes a “data quality” score but doesn’t point
out missing or skewed data and outliers.

Breast-Cancer-spreadsheet
• 286 patients
• One outcome and 9 Attributes or predictors

Predictive Analytics: What Drives Breast
Cancer Recurrence + predictive model

Predict breast cancer recurrence

Degree of malignancy and prediction
of recurrence and no recurrence

Confusion matrix is created but not explained
(degree of malignancy and no recurrence)
                         Predicted
                         No recurrence   Recurrence   Total
Actual   No recurrence   TN = 161        FP = 40      201
         Recurrence      FN = 40         TP = 45      85
         Total           201             85           286
Accuracy = (TP + TN)/Total = 72%
Sensitivity (recall) = TP/(TP + FN) = 53%
Specificity = TN/(TN + FP) = 80%
Precision = TP/(TP + FP) = 53%
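The four measures can be checked directly from the counts in the confusion matrix. The short sketch below uses the slide's own numbers; the function name `confusion_metrics` is just a label chosen for this illustration.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Compute the four measures on the slide from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,      # all correct predictions / all cases
        "sensitivity": tp / (tp + fn),      # recall: recurrences correctly found
        "specificity": tn / (tn + fp),      # non-recurrences correctly found
        "precision": tp / (tp + fp),        # predicted recurrences that were real
    }

# Counts taken from the breast cancer confusion matrix above
metrics = confusion_metrics(tp=45, tn=161, fp=40, fn=40)
for name, value in metrics.items():
    print(f"{name}: {value:.0%}")
```

Note that the parentheses matter: accuracy is (TP + TN) divided by the total, not TP plus TN/Total.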

Create your Display
• Display can be shared by email, hyperlink, Tweet or downloaded
• Display is interactive

IBMWA limitations
• Business oriented, so not aligned perfectly with healthcare
data analytics. Predictive strength is good, but we are used
to sensitivity, specificity, PPV, NPV, ROC curves, etc.
• No choice of statistical tests
• IBMWA does not perform unsupervised learning
• This approach (results first, stats second) may not appeal to
purists
• Sample dataset I used was of excellent quality, therefore not
typical of many datasets

IBMWA Application Process and Questions
• Process to apply for the academic program is easy. Apply at:
https://www.ibm.com/blogs/watson-analytics/calling-all-academics-have-we-got-a-watson-analytics-for-you/
• IBM contact information: Randy Messina at
randymessina@us.ibm.com
• My contact information: rhoyt@uwf.edu

Machine Learning (ML)
• Machine learning was developed by computer
scientists and is largely based on mathematics,
like statistics
• While some ML algorithms are difficult to
understand (e.g. neural networks), others are
easier, such as decision trees and regression
• Modeling is like baking: you decide what you
want to bake and then select the best recipe
(algorithm) to accomplish it. Optimally, you select
multiple recipes and compare the results!
Example: you want a model to decide what is
spam email. You test many algorithms for best
results and determine the best combination of
predictors

Algorithm Types
• Supervised learning
  • Classification for categorical data (spam vs. not spam)
  • Regression for numerical data ($, mortality rate)
• Unsupervised learning
  • Association rules: an example would be market basket
    analysis
  • Clustering: when you don’t know the data categories
    and you are looking for patterns in large data sets.
    Used extensively with genetic data sets
• What ML has in common with the statistical approach
  • Both will perform linear regression, logistic regression
    and decision trees
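As a small illustration of association rules, the sketch below computes two standard measures for a market basket rule: support (how often items appear together) and confidence (how often B is bought when A is bought). The baskets are invented for this example and the function names are only illustrative.

```python
def count_containing(baskets, items):
    """Count the baskets that contain every item in `items`."""
    wanted = set(items)
    return sum(1 for basket in baskets if wanted <= set(basket))

def support(baskets, items):
    """Fraction of all baskets that contain every item in `items`."""
    return count_containing(baskets, items) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """Of the baskets containing the antecedent, the fraction that also
    contain the consequent (the rule 'if A then B')."""
    both = set(antecedent) | set(consequent)
    return count_containing(baskets, both) / count_containing(baskets, antecedent)

# Invented shopping baskets (assumption, for illustration only)
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
    {"bread", "butter", "jam"},
]

print("support(bread and butter):", support(baskets, {"bread", "butter"}))
print("confidence(bread -> butter):", confidence(baskets, {"bread"}, {"butter"}))
```

Here bread and butter appear together in 3 of 5 baskets, and 3 of the 4 baskets with bread also contain butter.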

Open Source Free ML Programs
• WEKA 15
• Pentaho Community 16
• RapidMiner Community 17
• KNIME 18
• Orange 19

Machine Learning with Orange
University of Ljubljana, Slovenia

WEKA
• Named after a bird in New Zealand and stands
for Waikato Environment for Knowledge
Analysis
• Free software is associated with a free ML
course and a low cost textbook
• Software works on all operating systems
• WEKA is the only ML program mentioned that
does not require moving around widgets or
operators

Predicting type 2 diabetes (WEKA)

Results using logistic regression

Outcome Measurements
Accuracy is hitting the bulls-eye every time.
Precision is hitting the same place each time, even if it is not the place you aimed for.

Receiver operating characteristic
(ROC) curve (c-statistic = AUC)
Figure axes: TP (true positive rate) vs. FP (false positive rate)
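One way to read the c-statistic: it is the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. The sketch below computes it that way by comparing every positive/negative pair. The scores and labels are invented for illustration, and this brute-force approach is only practical for small data sets.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the
    positive case has the higher score; ties count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Invented predicted risks and true outcomes (1 = event, 0 = no event)
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]

print("c-statistic:", auc(scores, labels))
```

A value of 0.5 means the model ranks cases no better than chance, and 1.0 means every positive case outranks every negative one.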

Decision Tree for Contact Lenses

Clustering algorithm
(3 groups identified)
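The slide does not say which clustering algorithm produced the three groups. As one possible illustration, the sketch below uses k-means (an assumption) on invented one-dimensional values, with hand-picked starting centroids so the result is repeatable.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Very small k-means for one-dimensional data.

    Repeats two steps: assign each value to its nearest centroid,
    then move each centroid to the mean of its assigned values.
    """
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Invented measurements with no labels (assumption, for illustration only)
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
centroids, clusters = kmeans_1d(values, centroids=[0.0, 5.0, 10.0])

for center, members in zip(centroids, clusters):
    print(f"group around {center:.1f}: {members}")
```

No outcome labels are supplied; the algorithm finds the three groups from the values alone, which is what makes it unsupervised.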

Predictive Analytics Report Card
• Many risk prediction models yield mediocre
results at this point (C-statistic 0.56-0.80), but we
are early in the game.
• Models need to work in real-time ideally
• Some risk models used in healthcare
organizations might not fit your patient
demographic, for example at safety-net hospitals.
• It is helpful to identify patients at risk for
morbidity and mortality but you still have to have
an intervention team, ready to apply additional
resources to high risk patients

Data Science Education Stats
• Certificate (82); Bachelor (24); Masters (259);
Doctorate (14)
• 37% of courses are offered online
• 101 programs were related to business schools,
40 related to mathematics and statistics
departments, 39 related to computer science
departments and 9 related to new data science
departments. The remainder were from a wide
variety of college and university departments.

Data Science Centers
• Multiple universities and medical centers have
created “data centers” to create the right
environment for data analysis and research
• They tend to be multi-disciplinary and not just
relegated to the computer science department
• Every industry seems to have interest in data
science and analytics, hence the need to create
a central hub

Data Science Resources
• My web site www.informaticseducation.org has a
resource center with Chapter 23: Data Science
resources:
  • Data sets: health and non-health related
  • Free data science courses
  • Free statistics resources
  • Free visualization software
  • Free programming tutorials
  • Other helpful stuff
• Chapter 23 is available thru Lulu.com for $2.99

ONC Sponsored Free Courses
• Healthcare Data Analytics
  • Bellevue College (limited to Veterans Administration
    staff only)
  • Columbia University
  • Normandale College
  • Oregon Health & Science University
  • University of Alabama at Birmingham
  • University of Texas Health Science Center at Houston

Machine Learning Resources
• I would recommend beginning with Jason
Brownlee’s eBooks:
  • Machine Learning Algorithms $27 (163 pages)
  • Machine Learning Mastery with WEKA $27 (248 pages)
  • www.machinelearningmastery.com

Data Science Challenges
• Not enough data scientists; it is estimated that we
will need 140,000 by 2018 20
• Not enough data science training programs
• Expensive to build big data and data science
centers
• Privacy and security concerns
• Hype. Adverse unintended consequences (AUCs)
• Medical data is heterogeneous and complex,
compared to other industries 21
• Correlation does not equal causation
• 80% of the time spent with data analysis is spent
preparing the data for analysis 22

Data Science Challenges
• Difficult to find patient-level data
• It has been stated that clinical medicine accounts
for only 20% of population health; 80% is due to
psycho-social-environmental-behavioral-
economic factors that are beyond the control of
the healthcare system. Therefore, interventions
based on good data can result in no impact 23
• Just because you have technology and
voluminous data doesn’t mean it changes patient
outcomes. Example: fitness devices affecting
behavior 24

Make data science part of patient care
Not everyone will be able to afford a robust
analytics platform overlaid on a clinical data
warehouse and the ability to handle Big Data.
But we can start the educational process to learn
more about data science

Anticipate the onslaught of
data analytical vendors

Conclusions
• Data science is a new information science that
serves as an umbrella for data creation,
manipulation, analysis and research
• Data scientists are in high demand and it will
take years before we can educate enough
scientists to meet the demand
• Data science is a team sport; it will require teams
with individual skill sets to accomplish robust
data science
• New tools such as Watson and WEKA likely
represent the beginning of analysis automation

Conclusions
• I encourage everyone to increase their
knowledge in data science areas, such as
predictive analytics
• There are a myriad of free and affordable
courses now available online (mentioned in my
blog and on the resource page)
• I encourage academic centers and HIT vendors
to expand their data science offerings at multiple
levels

Questions?
Slides available as Data Science 101
on www.slideshare.net
