2. Disclaimer
• I have no conflicts of interest to report
• The opinions presented are those of the author
and do not necessarily reflect those of the
University of West Florida
3. Learning Objectives
Upon completion of the presentation participants
should be able to:
• Summarize the characteristics of data science
• Summarize the skill sets for data scientists
• Compare and contrast predictive analytics using
statistics vs. machine learning
• Enumerate features of IBM Watson Analytics
(IBMWA)
• Enumerate features of WEKA machine learning
• List the challenges facing data science
6. Definitions
• Data science is “the scientific study of the
creation, validation and transformation of data to
create meaning.” 1 Because data science is
relatively new, definitions are still evolving. Data
science is a good “umbrella” term.
• Analytics is “the discovery and communication
of meaningful patterns in data.” While some
would argue for separating data analytics from
data mining and knowledge discovery from data
(KDD), we will use the terms interchangeably. 2
8. Critical need for data scientists with:
• Domain expertise (example: healthcare)
• In-depth statistical knowledge
• Computer science expertise
• Machine learning expertise
• Programming expertise: R, SQL and Python
languages
• Relational database system (RDBS) knowledge
• Comfort level with “Big Data”
9. Historical Background
• While all industries (including sports) are
incorporating analytics and data science, the
business world was first.
• Businesses benefitted from knowing which
customers were likely to unsubscribe (churn) and
if you purchased item A, would you purchase item
B (market basket analysis).
• As far back as the 1960s a small group of
statisticians suggested their field should be
broadened to handle more volume and variety of
data.
• In the 1990s computer scientists developed and
promoted machine learning software.
10. Historical background
• There is evidence that many healthcare workers
lack training in statistics and machine learning. 3
• There is also evidence that statistics is not easy to
teach to non-statisticians and difficult to retain. 4
• Statisticians recommend knowledge of calculus
and linear algebra, which healthcare workers do
not routinely study. They also often prefer that
statistical formulas be calculated
longhand.
12. Historical Background
• As a result, many workers are not comfortable
with statistics and data analytics
• This observation flies in the face of a health data
explosion and a shortage of data scientists
• The data explosion is fueled by genomic, EHR-
related, wearable technology and social media
data.
• About 75% of healthcare data is unstructured
(free text), so difficult to analyze
• Enter the Big Data era to further confuse matters
13. Big Data Definition
• #1 Too much data to analyze on one computer
• #2 The Five V’s
• Volume: massive amounts of data are being generated
• Velocity: data is being generated so rapidly that it needs
to be analyzed without placing it in a database
• Variety: roughly 80% of data in existence is unstructured
so it won’t fit into a database or spreadsheet.
• Veracity: current data can be “messy” with missing data
and other challenges.
• Value: data scientists now have the capability to turn
large volumes of unstructured data into something
meaningful.5
14. Data Science is part of the federal
vision of a healthcare system
• Learning health system: “an ecosystem where all
stakeholders can securely, effectively and efficiently
contribute, share and analyze data” 6 (the PDCA
cycle)
• Precision medicine: “identifying which approaches
will be effective for which patients based on genetic,
environmental, and lifestyle factors.” Clearly this
initiative requires a big data approach to integrate
these data.7
• Population health requires data analytics
• Value based care requires data
15. Types of analytics (Gartner)
Gartner characterizes predictive analytics by four
attributes:
1. An emphasis on prediction
2. Rapid analysis measured in hours or
days
3. An emphasis on the business
relevance of the resulting insights (no
ivory tower analyses)
4. An emphasis on ease of use, thus
making the tools accessible to business
users.8
16. Predictive Analytics
• It could be argued that predictive analytics is the
most important aspect of data science, where an
outcome of importance is predicted based on
multiple factors influencing the outcome. This is
the area I will focus on
• Use cases will be discussed in the next slide
• I will not cover:
• Text mining with natural language processing (NLP),
which is very important for mining unstructured data
• Data visualization software, such as Tableau and
QlikSense, used for descriptive analytics
• Deep Learning based on artificial intelligence (AI)
17. Predictive Analytics Use Cases
• Predict poor patient outcomes (morbidity)
• Sepsis prediction 9
• Impending renal failure 10
• Predict death (mortality)
• Predict readmissions: in fiscal 2016 only 24%
(799/3400) of reporting hospitals will not receive a
penalty (0.1%-3% range) for too many
readmissions. 11-12
• Predict high cost patients for population health
care management: 5% of Medicare/Medicaid
patients use 50% of resources. 13-14
18. Predictive Modeling Approaches
1. Modeling with statistics
2. Modeling with Machine Learning
3. Modeling using the R or Python programming
languages (not covered)
21. Predictive Analytics
• The most common approach is to use
classification where you predict an outcome
(dependent variable) that is categorical data (e.g.
lived, died) with multiple predictors (independent
variables). For example, you have a data set of
pregnant women with Zika virus. Some have
children with microcephaly and others
don’t. You run a classification model to see if
factors such as age, trimester of infection, fever,
symptoms, etc. predict microcephaly
• If the outcome is numerical then you would use
linear regression
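The two cases above can be sketched as follows. This is a minimal illustration, assuming a single predictor and small synthetic toy data (not taken from the presentation). The function names `fit_linear` and `fit_logistic` are hypothetical, and plain gradient descent stands in for whatever solver a statistics package would actually use.

```python
import math

def fit_linear(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def fit_logistic(xs, labels, lr=0.1, epochs=2000):
    """Logistic regression for one predictor and a 0/1 outcome.

    Fitted by gradient descent on the centered predictor (an assumption
    made here for numerical stability). Returns a function that maps x
    to the predicted probability of class 1.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, labels):
            p = 1.0 / (1.0 + math.exp(-(w * (x - mean_x) + b)))
            grad_w += (p - y) * (x - mean_x)
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return lambda x: 1.0 / (1.0 + math.exp(-(w * (x - mean_x) + b)))

# Numerical outcome -> regression (toy data lying on y = 2x + 1)
slope, intercept = fit_linear([1, 2, 3, 4], [3, 5, 7, 9])

# Categorical outcome -> classification (toy 0/1 labels)
predict = fit_logistic([1, 2, 3, 4, 5, 6, 7, 8], [0, 0, 0, 0, 1, 1, 1, 1])
```

In practice a real data set would have many predictors, and a package such as R, WEKA or a Python library would do the fitting; the sketch only shows that the type of outcome decides the type of model.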
22. Need for better data analytical tools
• We would benefit from more user friendly tools and
some degree of automation
• MS Excel with the Analysis ToolPak add-in is a
possibility but assumes you know which statistical tests to use
• There are also multiple statistical packages, such as
SPSS and SAS, but they come with a steep learning
curve
23. Need for better data analytical tools
• Tool #1: IBM Watson Analytics: automated
predictive, descriptive and visualization analytics
• Tool #2 WEKA: open source machine learning
platform
24. IBM Watson Analytics
• New program offered in 2015 that is not related to Watson
Health (cognitive computing). Business oriented
• Program is built on SPSS statistical tests. Covers
regression, classification, decision trees, chi-square,
t-tests, etc.
• Program can automatically convert nominal data to
numerical and vice versa
• Versions
• Free
• Professional (Academic)
25. IBM Watson Analytics Academic
program
• Free for universities to use for teaching (non-commercial)
purposes. Includes 100 students/professor/year
• University of West Florida has used the program for about
12 months in a Health Informatics graduate course and a
Data Mining (computer science) course
• IBM did an on-site visit for training
• Multiple videos on YouTube
• PDF user guide available
26. IBM Watson analytics features
• IBMWA is completely online
• Accepts Excel and CSV input, as well as feeds from
most relational database systems (RDBSs)
• 100 GB storage
• Limits: 500 columns and 10 million rows
• Twitter Feed analysis
28. IBMWA versions
• About the time we submitted our detailed analysis of Watson
Analytics, IBM had created Watson-2 that combined
“Explore” and “Predict” into “Discover.” Watson-1 will be
retired shortly.
• Watson-2 includes statistical details about prediction when
the target/outcome is numerical. They are working on
adding the statistical details for categorical
targets/outcomes.
• Watson-2 includes a “data quality” score but doesn’t point
out missing or skewed data and outliers.
38. Confusion matrix is created but not explained
(degree of malignancy and no recurrence)
                        Predicted        Predicted
                        No recurrence    Recurrence    Total
Actual No recurrence    TN = 161         FP = 40       201
Actual Recurrence       FN = 40          TP = 45       85
Total                   201              85            286

Accuracy = (TP + TN)/Total = 72%
Sensitivity (recall) = TP/(TP + FN) = 53%
Specificity = TN/(TN + FP) = 80%
Precision = TP/(TP + FP) = 53%
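The four measures can be recomputed from the counts on this slide (TP = 45, FP = 40, FN = 40, TN = 161). The helper name `confusion_metrics` is hypothetical, and negative predictive value (NPV) is added here as an assumption because it is mentioned later among the measures healthcare users expect.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Return common classification measures from the four cell counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),    # positive predictive value (PPV)
        "npv": tn / (tn + fn),          # negative predictive value
    }

# Counts taken from the confusion matrix on this slide
metrics = confusion_metrics(tp=45, fp=40, fn=40, tn=161)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Rounded to whole percentages, the results agree with the slide: 72% accuracy, 53% sensitivity, 80% specificity and 53% precision.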
39. Create your Display
• Display can be shared by email, hyperlink, Tweet or downloaded
• Display is interactive
40. IBMWA limitations
• Business oriented, so not aligned perfectly with healthcare
data analytics. Predictive strength is good, but we are used
to sensitivity, specificity, PPV, NPV, ROC curves, etc.
• No choice of statistical tests
• IBMWA does not perform unsupervised learning
• This approach (results first, stats second) may not appeal to
purists
• Sample dataset I used was of excellent quality, therefore not
typical of many datasets
41. questions
• Process to apply for the academic program is easy. Apply at:
https://www.ibm.com/blogs/watson-analytics/calling-all-academics-have-we-got-a-watson-analytics-for-you/
• IBM Contact information: Randy Messina at
randymessina@us.ibm.com
• My contact information: rhoyt@uwf.edu
IBMWA Application Process
42. Machine Learning (ML)
• Machine learning was developed by computer
scientists and is largely based on mathematics,
like statistics
• While some ML algorithms are difficult to
understand (e.g. neural networks), others are
easier, such as decision trees and regression
• Modeling is like baking: you decide what you
want to bake and then select the best recipe
(algorithm) to accomplish it. Optimally, you select
multiple recipes and compare the results!
Example: you want a model to decide what is
spam email. You test many algorithms for best
results and determine the best combination of
predictors
43. Algorithm Types
• Supervised learning
• Classification for categorical data (spam vs. no spam)
• Regression for numerical data ($, mortality rate)
• Unsupervised learning
• Association rules: an example would be market basket
analysis
• Clustering: when you don’t know the data categories
and you are looking for patterns in large data sets.
Used extensively with genetic data sets
• What ML has in common with the statistical approach
• Both will perform linear regression, logistic regression
and decision trees
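As a small illustration of association rules (market basket analysis), the sketch below computes the two standard measures, support and confidence, for the rule "bread implies milk". The baskets are made-up toy data and the function names are hypothetical; real tools such as WEKA search over many candidate rules rather than scoring a single one.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Among transactions containing `antecedent`, the fraction that
    also contain `consequent`."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= t]
    both = sum(consequent <= t for t in with_antecedent)
    return both / len(with_antecedent)

# Toy shopping baskets (synthetic, for illustration only)
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

rule_support = support(baskets, {"bread", "milk"})          # 3 of 5 baskets
rule_confidence = confidence(baskets, {"bread"}, {"milk"})  # 3 of 4 bread baskets
```

No outcome column is needed here, which is what makes the approach unsupervised: the rule is scored purely from how often the items occur together.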
44. Open Source Free ML Programs
• WEKA 15
• Pentaho Community 16
• RapidMiner Community 17
• KNIME 18
• Orange 19
46. WEKA
• Named after a bird in New Zealand and stands
for Waikato Environment for Knowledge
Analysis
• Free software is associated with a free ML
course and a low cost textbook
• Software works on all operating systems
• WEKA is the only ML program mentioned that
does not require moving around widgets or
operators
49. Outcome Measurements
Accuracy is hitting the bull’s-eye every time.
Precision is hitting the same place each time, even if it is not the place you aimed for.
53. Predictive Analytics Report Card
• Many risk prediction models yield mediocre
results at this point (C-statistic 0.56-0.80), but we
are early in the game.
• Models need to work in real-time ideally
• Some risk models used by healthcare
organizations might not fit your patient
demographic (for example, in safety-net hospitals)
• It is helpful to identify patients at risk for
morbidity and mortality but you still have to have
an intervention team, ready to apply additional
resources to high risk patients
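For readers unfamiliar with the C-statistic quoted above, the sketch below shows one common way to compute it: the share of (patient with outcome, patient without outcome) pairs in which the model gives the higher risk score to the patient with the outcome, counting ties as half. The function name `c_statistic` and the toy scores are assumptions for illustration, not data from any cited study.

```python
def c_statistic(scores, outcomes):
    """Concordance (C) statistic for risk scores and 0/1 outcomes.

    Equivalent to the area under the ROC curve for a binary outcome.
    """
    positives = [s for s, y in zip(scores, outcomes) if y == 1]
    negatives = [s for s, y in zip(scores, outcomes) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one case with and one without the outcome")
    concordant = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                concordant += 1.0   # model ranks the pair correctly
            elif p == n:
                concordant += 0.5   # tie counts as half
    return concordant / (len(positives) * len(negatives))

# Toy risk scores and observed outcomes (synthetic)
risk_scores = [0.1, 0.4, 0.35, 0.8]
had_outcome = [0, 0, 1, 1]
c = c_statistic(risk_scores, had_outcome)
```

A value of 0.5 means the model ranks patients no better than chance and 1.0 means it ranks every pair correctly, which is why a range of 0.56 to 0.80 is described as mediocre.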
54. Data Science Education Stats
• Certificate (82); Bachelor (24); Masters (259);
Doctorate (14)
• 37% of courses are offered online
• 101 programs were related to business schools,
40 related to mathematics and statistics
departments, 39 related to computer science
departments and 9 related to new data science
departments. The remainder were from a wide
variety of college and university departments.
55. Data Science Centers
• Multiple universities and medical centers have
created “data centers” to create the right
environment for data analysis and research
• They tend to be multi-disciplinary and not just
relegated to the computer science department
• Every industry seems to have interest in data
science and analytics, hence the need to create
a central hub
56. Data Science Resources
• My web site www.informaticseducation.org has a
resource center with Chapter 23: Data Science
resources:
• Data sets: health and non-health related
• Free data science courses
• Free statistics resources
• Free visualization software
• Free programming tutorials
• Other helpful stuff
• Chapter 23 is available through Lulu.com for $2.99
57. ONC Sponsored Free Courses
• Healthcare Data Analytics
• Bellevue College (limited to Veterans Administration
Staff Only)
• Columbia University
• Normandale College
• Oregon Health & Science University
• University of Alabama at Birmingham
• University of Texas Health Science Center at Houston
58. Machine Learning Resources
• I would recommend beginning with Jason
Brownlee’s eBooks:
• Machine Learning Algorithms $27 (163 pages)
• Machine Learning Mastery with WEKA $27 (248 pages)
• www.machinelearningmastery.com
60. Data Science Challenges
• Not enough data scientists; it is estimated that we
will need 140,000 by 2018 20
• Not enough data science training programs
• Expensive to build big data and data science
centers
• Privacy and security concerns
• Hype. Adverse unintended consequences (AUCs)
• Medical data is heterogeneous and complex,
compared to other industries 21
• Correlation does not equal causation
• 80% of the time spent with data analysis is spent
preparing the data for analysis 22
61. Data Science Challenges
• Difficult to find patient-level data
• It has been stated that clinical medicine accounts
for only 20% of population health; 80% is due to
psycho-social-environmental-behavioral-
economic factors that are beyond the control of
the healthcare system. Therefore, interventions
based on good data can result in no impact 23
• Just because you have technology and
voluminous data doesn’t mean it changes patient
outcomes. Example: fitness devices affecting
behavior 24
62. Make data science part of patient care
Not everyone will be able to afford a robust
analytics platform overlaid on a clinical data
warehouse and the ability to handle Big Data.
But we can start the educational process to learn
more about data science
64. Conclusions
• Data science is a new information science that
serves as an umbrella for data creation,
manipulation, analysis and research
• Data scientists are in high demand and it will
take years before we can educate enough
scientists to meet the demand
• Data science is a team sport; it will require teams
with individual skill sets to accomplish robust
data science
• New tools such as Watson and WEKA likely
represent the beginning of analysis automation
65. Conclusions
• I encourage everyone to increase their
knowledge in data science areas, such as
predictive analytics
• There are a myriad of free and affordable
courses now available online (mentioned in my
blog and on the resource page)
• I encourage academic centers and HIT vendors
to expand their data science offerings at multiple
levels
1 Data Science Association. www.datascienceassn.org Accessed September 12, 2016
2 Analytics. Wikipedia. www.wikipedia.org Accessed January 16, 2016
3 Wegwarth O, Schwartz LM, Woloshin S et al. Do physicians understand cancer screening statistics? A national survey of primary care physicians in the United States. Ann Intern Med 2012;156(5):340-9
4 Manrai AK, Bhatia G, Strymish J et al. Medicine’s uncomfortable relationship with math. Research Letter. June 2014. JAMA Intern Med 2014;174(6):991-993
5 IBM Big Data and Analytics Hub http://www.ibmbigdatahub.com/blog/why-only-one-5-vs-big-data-really-matters
6 ONC definition of learning health system. Connecting Health and Care for the Nation. A Shared Nationwide Interoperability Roadmap. October 2015
7 National Research Council. Towards Precision Medicine: Building a Network for Biomedical Research and a new Taxonomy of Disease. National Academies Press. 2011
8 Gartner IT Glossary: http://www.gartner.com/it-glossary/predictive-analytics/
9 Desautels T, Calvert J, Hoffman J et al. Prediction of sepsis in the ICU with minimal EHR data: a machine learning approach. JMIR Medical Informatics 2016;4(3):e28
10 Echouffo-Tcheugui JB, Kengne AP. Risk models to predict chronic kidney disease and its progression: a systematic review. PLOS Medicine. November 20, 2012. Journals.plos.org
11 Most hospitals face 30 day readmissions penalty in fiscal 2016. August 3, 2015 www.modernhealthcare.com
12 Amarasingham R, Patel P, Tolo K et al. Allocating scarce resources in real-time to reduce heart failure readmissions: a prospective controlled trial. BMJ Quality Safety. July 31 2013
13 Stanton M. The high concentration of US Health Care expenditures. Research in Action. Issue 19. 2002. AHRQ Archive. https://archive.ahrq.gov
14 Chechulin Y, Nazerian A, Rais S. Predicting patients with high risk of becoming high cost healthcare users in Ontario. Healthcare Policy. 2014;9(3):68-79
From Doing Data Science by O’Neil and Schutt. O’Reilly Media. 2014
15 WEKA: http://www.cs.waikato.ac.nz/ml/weka/
16 Pentaho Community www.community.pentaho.com
17 RapidMiner Community www.community.rapidminer.com
18 KNIME www.knime.org
19 Orange data mining www.orange.biolab.si
C-statistic (used to compare logistic regression models): The probability that predicting the outcome is better than chance. Used to compare the goodness of fit of logistic regression models, values for this measure range from 0.5 to 1.0. A value of 0.5 indicates that the model is no better than chance at making a prediction of membership in a group and a value of 1.0 indicates that the model perfectly identifies those within a group and those not. Models are typically considered reasonable when the C-statistic is higher than 0.7 and strong when C exceeds 0.8 (Hosmer & Lemeshow, 2000; Hosmer & Lemeshow, 1989). http://mchp-appserv.cpe.umanitoba.ca/viewDefinition.php?definitionID=104234
Area under the curve: based on prediction rules with true positives plotted against false positives (1-specificity). The closer to 1, the better. 0.5 is essentially worthless. http://gim.unmc.edu/dxtests/roc3.htm
20 Manyika J, Chui M, Brown B et al. Big Data: The Next Frontier for Innovation, Competition and Productivity. http://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/big-data-the-next-frontier-for-innovation
21 Cios KJ, Moore GW. Uniqueness of medical data mining. Artif Intell Med 2002;26:1-24
22 Press G. Cleaning big data: most time consuming, least enjoyable data science task, survey says. Forbes. March 23 2016 www.forbes.com
23 Jacobsen RM, Isham GJ, Rutten LJF. Population Health as a means for health care organizations to deliver value. Mayo Clinic Proceedings November 2015;90(11):1465-1470
24 Jakicic JM, Davis KK, Rogers RJ et al. Effect of wearable technology combined with a lifestyle intervention on long-term weight loss. The IDEA RCT. JAMA 2016;316(11):1161-1171
Parikh RB, Obermeyer Z, Bates DW. Making predictive analytics a routine part of patient care. Harvard Business Review. April 21, 2016. https://hbr.org