Data Science 101
Robert Hoyt MD FACP
January 12, 2017
Disclaimer
• I have no conflicts of interest to report
• The opinions presented are those of the author
and do not necessarily reflect those of the
University of West Florida
Learning Objectives
Upon completion of the presentation participants
should be able to:
• Summarize the characteristics of data science
• Summarize the skill sets for data scientists
• Compare and contrast predictive analytics using
statistics vs. machine learning
• Enumerate features of IBM Watson Analytics
(IBMWA)
• Enumerate features of WEKA machine learning
• List the challenges facing data science
Look Familiar?
AHIMA Supports Data Analytics
Definitions
• Data science is “the scientific study of the
creation, validation and transformation of data to
create meaning.” 1 Because data science is
relatively new, definitions are still evolving. Data
science is a good “umbrella” term.
• Analytics is “the discovery and communication
of meaningful patterns in data.” While some
would argue for separating data analytics from
data mining and knowledge discovery from data
(KDD), we will use the terms interchangeably. 2
Venn diagram of Data Science
Data
Science
Critical need for data scientists with:
• Domain expertise (example: healthcare)
• In depth statistical knowledge
• Computer science expertise
• Machine learning expertise
• Programming expertise: R, SQL and Python
languages
• Relational database system (RDBS) knowledge
• Comfort level with “Big Data”
Historical Background
• While all industries (including sports) are
incorporating analytics and data science, the
business world was first.
• Businesses benefitted from knowing which
customers were likely to unsubscribe (churn) and
if you purchased item A, would you purchase item
B (market basket analysis).
• As far back as the 1960s a small group of
statisticians suggested their field should be
broadened to handle more volume and variety of
data.
• In the 1990s computer scientists developed and
promoted machine learning software.
Historical background
• There is evidence that many healthcare workers
lack training in statistics and machine learning. 3
• There is also evidence that statistics is not easy to
teach to non-statisticians and difficult to retain. 4
• Statisticians recommend knowledge of calculus
and linear algebra, which healthcare workers do
not routinely study. They also often prefer that
statistical formulas be calculated longhand.
Stats: Logistic Regression
Would you prefer approach A or B?
[Figure: approaches A and B]
Historical Background
• As a result, many workers are not comfortable
with statistics and data analytics
• This observation flies in the face of a health data
explosion and a shortage of data scientists
• The data explosion is fueled by genomic, EHR-
related, wearable technology and social media
data.
• About 75% of healthcare data is unstructured
(free text), so difficult to analyze
• Enter the Big Data era to further confuse matters
Big Data Definition
• #1 Too much data to analyze on one computer
• #2 The Five V’s
• Volume: massive amounts of data are being generated
• Velocity: data is being generated so rapidly that it needs
to be analyzed without placing it in a database
• Variety: roughly 80% of data in existence is unstructured
so it won’t fit into a database or spreadsheet.
• Veracity: current data can be “messy” with missing data
and other challenges.
• Value: data scientists now have the capability to turn
large volumes of unstructured data into something
meaningful.5
Data Science is part of the federal
vision of a healthcare system
• Learning health system: “an ecosystem where all
stakeholders can securely, effectively and efficiently
contribute, share and analyze data” 6 (the PDCA
cycle)
• Precision medicine: “identifying which approaches
will be effective for which patients based on genetic,
environmental, and lifestyle factors.” Clearly this
initiative requires a big data approach to integrate
these data.7
• Population health requires data analytics
• Value based care requires data
Types of analytics (Gartner)
Predictive analytics has four
attributes:
1. An emphasis on prediction
2. Rapid analysis measured in hours or
days
3. An emphasis on the business
relevance of the resulting insights (no
ivory tower analyses)
4. An emphasis on ease of use, thus
making the tools accessible to business
users.8
Predictive Analytics
• It could be argued that predictive analytics is the
most important aspect of data science, where an
outcome of importance is predicted based on
multiple factors influencing the outcome. This is
the area I will focus on
• Use cases will be discussed in the next slide
• I will not cover:
• Text mining with natural language processing (NLP),
which is very important for mining unstructured data
• Data visualization software, such as Tableau and
QlikSense, used for descriptive analytics
• Deep learning based on artificial intelligence (AI)
Predictive Analytics Use Cases
• Predict poor patient outcomes (morbidity)
• Sepsis prediction 9
• Impending renal failure 10
• Predict death (mortality)
• Predict readmissions: in fiscal 2016 only 24%
(799/3400) of reporting hospitals will not receive a
penalty (0.1%-3% range) for too many
readmissions. 11-12
• Predict high cost patients for population health
care management: 5% of Medicare/Medicaid
patients use 50% of resources. 13-14
Predictive Modeling Approaches
1. Modeling with statistics
2. Modeling with Machine Learning
3. Modeling using the R or Python programming
languages (not covered)
Predictive analytics: design the model
[Diagram labels: Statistical Modeling, Statistics, Machine learning, Association, Regression, Classification, Clustering, Programming Languages]
Data Science/Analytics Process
Predictive Analytics
• The most common approach is to use
classification where you predict an outcome
(dependent variable) that is categorical data (e.g.
lived, died) with multiple predictors (independent
variables). For example, you have a data set of
pregnant women with Zika virus. Some have
children with microcephaly and others
don’t. You run a classification model to see if
factors such as age, trimester of infection, fever,
symptoms, etc. predict microcephaly
• If the outcome is numerical then you would use
linear regression
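The choice between classification and regression can be sketched in a few lines of plain Python. This is a minimal illustration only: the data, the variable names (predictor, died, cost) and the helper functions are hypothetical and invented for this example, and the logistic model is fit with simple gradient descent rather than with any particular statistics package.

```python
import math

def fit_linear(xs, ys):
    """Least-squares line y = intercept + slope * x for a numerical outcome."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Logistic regression p(y=1) = sigmoid(b0 + b1 * x), fit by gradient descent."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        b0 -= learning_rate * sum(errors) / n
        b1 -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
    return b0, b1

# Hypothetical toy data, invented for illustration only.
predictor = [1, 2, 3, 4, 5, 6, 7, 8]
died = [0, 0, 0, 1, 0, 1, 1, 1]                        # categorical outcome: classification
cost = [3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0, 17.0]    # numerical outcome: regression

b0, b1 = fit_logistic(predictor, died)
predicted = [1 if sigmoid(b0 + b1 * x) >= 0.5 else 0 for x in predictor]
intercept, slope = fit_linear(predictor, cost)

print("predicted classes:", predicted)
print(f"cost = {intercept:.1f} + {slope:.1f} * predictor")
```

The same predictor is used twice: once to predict a yes/no outcome and once to predict a number, which is the only difference between the two modeling tasks here.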
Need for better data analytical tools
• We would benefit from more user friendly tools and
some degree of automation
• MS Excel with the Analysis ToolPak add-in is a
possibility but implies you know which stats tests to use
• There are also multiple statistical packages, such as
SPSS and SAS, also associated with a steep learning
curve
Need for better data analytical tools
• Tool #1: IBM Watson Analytics: automated
predictive, descriptive and visualization analytics
• Tool #2 WEKA: open source machine learning
platform
IBM Watson Analytics
• New program offered in 2015 that is not related to Watson
Health (cognitive computing). Business oriented
• Program is based on SPSS statistical tests. Covers
regression, classification, decision trees, chi-square,
t-tests, etc.
• Program can automatically convert nominal data to
numerical and vice versa
• Versions
• Free
• Professional (Academic)
IBM Watson Analytics Academic
program
• Free for universities to use for teaching (non-commercial)
purposes. Includes 100 students/professor/year
• University of West Florida has used the program for about
12 months in a Health Informatics graduate course and a
Data Mining (computer science) course
• IBM did an on-site visit for training
• Multiple videos on YouTube
• PDF user guide available
IBM Watson analytics features
• IBMWA is completely online
• Accepts Excel and CSV input, as well as feeds from
most relational database systems (RDBSs)
• 100 GB storage
• Limits: 500 columns and 10 million rows
• Twitter Feed analysis
Our 2016 Review Article
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5080525/
IBMWA versions
• About the time we submitted our detailed analysis of Watson
Analytics, IBM had created Watson-2 that combined
“Explore” and “Predict” into “Discover.” Watson-1 will be
retired shortly.
• Watson-2 includes statistical details about prediction when
the target/outcome is numerical. They are working on
adding the statistical details for categorical
targets/outcomes.
• Watson-2 includes a “data quality” score but doesn’t point
out missing or skewed data and outliers.
Breast-Cancer-spreadsheet
• 286 patients
• One outcome and 9 Attributes or predictors
Step 1: upload data
Step 2 Create a new Discovery
“Ask a question” function
Select a Visualization
Predictive Analytics: What Drives Breast
Cancer Recurrence + predictive model
Predict breast cancer recurrence
Degree of malignancy and prediction
of recurrence and no recurrence
Decision Tree Results
Confusion matrix is created but not explained
(degree of malignancy and no recurrence)
                          Predicted
                          No recurrence   Recurrence   Total
Actual   No recurrence    TN = 161        FP = 40      201
         Recurrence       FN = 40         TP = 45      85
         Total            201             85           286
Accuracy = (TP + TN) / Total = 72%
Sensitivity (recall) = TP / (TP + FN) = 53%
Specificity = TN / (TN + FP) = 80%
Precision = TP / (TP + FP) = 53%
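The percentages on this slide can be checked with a short Python sketch. The four counts are taken from the confusion matrix above. The function name confusion_metrics is an assumption made for this example, and negative predictive value (NPV) is added because it follows from the same four counts, although the slide does not report it.

```python
# Counts copied from the confusion matrix on the slide.
TP, TN, FP, FN = 45, 161, 40, 40

def confusion_metrics(tp, tn, fp, fn):
    """Return common classification metrics as fractions between 0 and 1."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # also called recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),     # also called positive predictive value
        "npv": tn / (tn + fn),           # not on the slide; added for completeness
    }

metrics = confusion_metrics(TP, TN, FP, FN)
for name, value in metrics.items():
    print(f"{name}: {value:.0%}")
```

Note the parentheses: accuracy is (TP + TN) divided by the total, not TP plus TN/Total.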
Create your Display
• Display can be shared by email, hyperlink, Tweet or downloaded
• Display is interactive
IBMWA limitations
• Business oriented, so not aligned perfectly with healthcare
data analytics. Predictive strength is good, but we are used
to sensitivity, specificity, PPV, NPV, ROC curves, etc.
• No choice of statistical tests
• IBMWA does not perform unsupervised learning
• This approach (results first, stats second) may not appeal to
purists
• Sample dataset I used was of excellent quality, therefore not
typical of many datasets
questions
• Process to apply for the academic program is easy. Apply at:
https://www.ibm.com/blogs/watson-analytics/calling-all-academics-have-we-got-a-watson-analytics-for-you/
• IBM Contact information: Randy Messina at
randymessina@us.ibm.com
• My contact information: rhoyt@uwf.edu
IBMWA Application Process
Machine Learning (ML)
• Machine learning was developed by computer
scientists and is largely based on mathematics,
like statistics
• While some ML algorithms are difficult to
understand (e.g. neural networks), others are
easier, such as decision trees and regression
• Modeling is like baking: you decide what you
want to bake and then select the best recipe
(algorithm) to accomplish it. Optimally, you select
multiple recipes and compare the results!
Example: you want a model to decide what is
spam email. You test many algorithms for best
results and determine the best combination of
predictors
Algorithm Types
• Supervised learning
• Classification for categorical data (spam vs. no spam)
• Regression for numerical data ($, mortality rate)
• Unsupervised learning
• Association rules: an example would be market basket
analysis
• Clustering: when you don’t know the data categories
and you are looking for patterns in large data sets.
Used extensively with genetic data sets
• What ML has in common with the statistical approach
• Both will perform linear regression, logistic regression
and decision trees
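Market basket analysis, the association-rule example mentioned above, comes down to counting how often items appear together. The sketch below uses a handful of hypothetical shopping baskets invented for illustration. Support, confidence and lift are the standard association-rule measures, and the function names are assumptions made for this example.

```python
# Hypothetical shopping baskets, invented for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing antecedent, the fraction that also contain consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

def lift(antecedent, consequent, baskets):
    """Confidence relative to how common the consequent is overall (1.0 = no association)."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)

rule = ({"bread"}, {"milk"})
print(f"support:    {support(rule[0] | rule[1], baskets):.2f}")
print(f"confidence: {confidence(*rule, baskets):.2f}")
print(f"lift:       {lift(*rule, baskets):.2f}")
```

A lift below 1.0, as in this toy data, means buying bread makes milk slightly less likely than average, so the rule "if A then B" would not be worth acting on.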
Open Source Free ML Programs
• WEKA 15
• Pentaho Community 16
• RapidMiner Community 17
• KNIME 18
• Orange 19
Machine Learning with Orange
University of Ljubljana,Slovenia
WEKA
• Named after a bird in New Zealand and stands
for Waikato Environment for Knowledge
Analysis
• Free software is associated with a free ML
course and a low cost textbook
• Software works on all operating systems
• WEKA is the only ML program mentioned that
does not require moving around widgets or
operators
Predicting type 2 diabetes (WEKA)
Results using logistic regression
Outcome Measurements
Accuracy is hitting the bull's-eye every time.
Precision is hitting the same place each time, even if it is not the place you aimed for.
Receiver operating characteristic
(ROC) curve (c-statistic = AUC)
[Plot axes: true positive (TP) rate vs. false positive (FP) rate]
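The c-statistic can be read as the probability that a randomly chosen positive case receives a higher risk score than a randomly chosen negative case, which equals the area under the ROC curve (AUC). A minimal sketch of that pairwise calculation follows. The scores and labels are hypothetical, and the function name auc is an assumption made for this example.

```python
def auc(scores, labels):
    """c-statistic: share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    concordant = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                concordant += 1.0
            elif p == q:
                concordant += 0.5
    return concordant / (len(positives) * len(negatives))

# Hypothetical model scores and true outcomes, invented for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

print(f"c-statistic (AUC) = {auc(scores, labels):.4f}")
```

A value of 0.5 means the model ranks cases no better than chance and 1.0 means every positive case is ranked above every negative case.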
Decision Tree for Contact Lenses
Clustering algorithm
(3 groups identified)
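How a clustering algorithm arrives at three groups can be sketched with k-means on one-dimensional data. This is an illustration under stated assumptions, not the algorithm or data behind the figure on this slide: the values and starting centers are hypothetical, and kmeans_1d is a name invented for this example.

```python
def kmeans_1d(values, centers, iterations=20):
    """Basic k-means (Lloyd's algorithm) in one dimension with supplied starting centers."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each value in the group of its nearest center.
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical measurements with no category labels, invented for illustration only.
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.1, 8.8, 9.0]
centers, groups = kmeans_1d(values, centers=[0.0, 5.0, 10.0])

for center, group in zip(centers, groups):
    print(f"center {center:.2f}: {group}")
```

No outcome column is supplied, so this is unsupervised learning: the groups come only from how close the values are to each other.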
Predictive Analytics Report Card
• Many risk prediction models yield mediocre
results at this point (C-statistic 0.56 to 0.80), but we
are early in the game.
• Models need to work in real-time ideally
• Some risk models are used in healthcare
organizations that might not fit your patient
demographic, such as safety-net hospitals, etc.
• It is helpful to identify patients at risk for
morbidity and mortality but you still have to have
an intervention team, ready to apply additional
resources to high risk patients
Data Science Education Stats
• Certificate (82); Bachelor (24); Masters (259);
Doctorate (14)
• 37% of courses are offered online
• 101 programs were related to business schools,
40 related to mathematics and statistics
departments, 39 related to computer science
departments and 9 related to new data science
departments. The remainder were from a wide
variety of college and university departments.
Data Science Centers
• Multiple universities and medical centers have
created “data centers” to create the right
environment for data analysis and research
• They tend to be multi-disciplinary and not just
relegated to the computer science department
• Every industry seems to have interest in data
science and analytics, hence the need to create
a central hub
Data Science Resources
• My web site www.informaticseducation.org has a
resource center with Chapter 23: Data Science
resources:
• Data sets: health and non-health related
• Free data science courses
• Free statistics resources
• Free visualization software
• Free programming tutorials
• Other helpful stuff
• Chapter 23 is available through Lulu.com for $2.99
ONC Sponsored Free Courses
• Healthcare Data Analytics
• Bellevue College (limited to Veterans Administration
Staff Only)
• Columbia University
• Normandale College
• Oregon Health & Science University
• University of Alabama at Birmingham
• University of Texas Health Science Center at Houston
Machine Learning Resources
• I would recommend beginning with Jason
Brownlee’s eBooks:
• Machine Learning Algorithms $27 (163 pages)
• Machine Learning Mastery with WEKA $27 (248 pages)
• www.machinelearningmastery.com
Machine Learning Data Sets
Data Science Challenges
• Not enough data scientists; it is estimated that we
will need 140,000 by 2018 20
• Not enough data science training programs
• Expensive to build big data and data science
centers
• Privacy and security concerns
• Hype. Adverse unintended consequences (AUCs)
• Medical data is heterogeneous and complex,
compared to other industries 21
• Correlation does not equal causation
• 80% of the time spent with data analysis is spent
preparing the data for analysis 22
Data Science Challenges
• Difficult to find patient-level data
• It has been stated that clinical medicine accounts
for only 20% of population health; 80% is due to
psycho-social-environmental-behavioral-
economic factors that are beyond the control of
the healthcare system. Therefore, interventions
based on good data can result in no impact 23
• Just because you have technology and
voluminous data doesn’t mean it changes patient
outcomes. Example: fitness devices affecting
behavior 24
Make data science part of patient care
Not everyone will be able to afford a robust
analytics platform overlaid on a clinical data
warehouse and the ability to handle Big Data.
But we can start the educational process to learn
more about data science
Anticipate the onslaught of
data analytical vendors
Conclusions
• Data science is a new information science that
serves as an umbrella for data creation,
manipulation, analysis and research
• Data scientists are in high demand and it will
take years before we can educate enough
scientists to meet the demand
• Data science is a team sport; it will require teams
with individual skill sets to accomplish robust
data science
• New tools such as Watson and WEKA likely
represent the beginning of analysis automation
Conclusions
• I encourage everyone to increase their
knowledge in data science areas, such as
predictive analytics
• There are a myriad of free and affordable
courses now available online (mentioned in my
blog and on the resource page)
• I encourage academic centers and HIT vendors
to expand their data science offerings at multiple
levels
Questions?
Slides available as Data Science 101
on www.slideshare.net

 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 

Data science 101

  • 1. Data Science 101 Robert Hoyt MD FACP January 12, 2017
  • 2. Disclaimer • I have no conflicts of interest to report • The opinions presented are those of the author and do not necessarily reflect those of the University of West Florida
  • 3. Learning Objectives Upon completion of the presentation participants should be able to: • Summarize the characteristics of data science • Summarize the skill sets for data scientists • Compare and contrast predictive analytics using statistics vs. machine learning • Enumerate features of IBM Watson Analytics (IBMWA) • Enumerate features of WEKA machine learning • List the challenges facing data science
  • 6. Definitions • Data science is “the scientific study of the creation, validation and transformation of data to create meaning.” 1 Because data science is relatively new, definitions are still evolving. Data science is a good “umbrella” term. • Analytics is “the discovery and communication of meaningful patterns in data.” While some would argue for separating data analytics from data mining and knowledge discovery from data (KDD), we will use the terms interchangeably. 2
  • 7. Venn diagram of Data Science Data Science
  • 8. Critical need for data scientists with: • Domain expertise (example: healthcare) • In-depth statistical knowledge • Computer science expertise • Machine learning expertise • Programming expertise: R, SQL and Python languages • Relational database system (RDBS) knowledge • Comfort level with “Big Data”
  • 9. Historical Background • While all industries (including sports) are incorporating analytics and data science, the business world was first. • Businesses benefitted from knowing which customers were likely to unsubscribe (churn) and if you purchased item A, would you purchase item B (market basket analysis). • As far back as the 1960s a small group of statisticians suggested their field should be broadened to handle more volume and variety of data. • In the 1990s computer scientists developed and promoted machine learning software.
  • 10. Historical background • There is evidence that many healthcare workers lack training in statistics and machine learning. 3 • There is also evidence that statistics is not easy to teach to non-statisticians and is difficult to retain. 4 • Statisticians recommend knowledge of calculus and linear algebra, which healthcare workers do not routinely study. They often prefer that statistical formulas be calculated longhand.
  • 11. Stats: Logistic Regression • Would you prefer approach A or B? [Figure: approaches A and B]
  • 12. Historical Background • As a result, many workers are not comfortable with statistics and data analytics • This observation flies in the face of a health data explosion and a shortage of data scientists • The data explosion is fueled by genomic, EHR-related, wearable technology and social media data. • About 75% of healthcare data is unstructured (free text), so it is difficult to analyze • Enter the Big Data era to further confuse matters
  • 13. Big Data Definition • #1 Too much data to analyze on one computer • #2 The Five V’s • Volume: massive amounts of data are being generated • Velocity: data is being generated so rapidly that it needs to be analyzed without placing it in a database • Variety: roughly 80% of data in existence is unstructured so it won’t fit into a database or spreadsheet. • Veracity: current data can be “messy” with missing data and other challenges. • Value: data scientists now have the capability to turn large volumes of unstructured data into something meaningful. 5
  • 14. Data Science is part of the federal vision of a healthcare system • Learning health system: “an ecosystem where all stakeholders can securely, effectively and efficiently contribute, share and analyze data” 6 (the PDCA cycle) • Precision medicine: “identifying which approaches will be effective for which patients based on genetic, environmental, and lifestyle factors.” Clearly this initiative requires a big data approach to integrate these data. 7 • Population health requires data analytics • Value based care requires data
  • 15. Types of analytics (Gartner) Predictive analytics has four attributes: 1. An emphasis on prediction 2. Rapid analysis measured in hours or days 3. An emphasis on the business relevance of the resulting insights (no ivory tower analyses) 4. An emphasis on ease of use, thus making the tools accessible to business users. 8
  • 16. Predictive Analytics • It could be argued that predictive analytics is the most important aspect of data science, where an outcome of importance is predicted based on multiple factors influencing the outcome. This is the area I will focus on. • Use cases will be discussed on the next slide • I will not cover: • Text mining with natural language processing (NLP), which is very important for mining unstructured data • Data visualization software, such as Tableau and QlikSense, used for descriptive analytics • Deep Learning based on artificial intelligence (AI)
  • 17. Predictive Analytics Use Cases • Predict poor patient outcomes (morbidity) • Sepsis prediction 9 • Impending renal failure 10 • Predict death (mortality) • Predict readmissions: in fiscal 2016 only 24% (799/3400) of reporting hospitals will not receive a penalty (0.1%-3% range) for too many readmissions. 11-12 • Predict high cost patients for population health care management: 5% of Medicare/Medicaid patients use 50% of resources. 13-14
  • 18. Predictive Modeling Approaches 1. Modeling with statistics 2. Modeling with Machine Learning 3. Modeling using the R or Python programming languages (not covered)
  • 21. Predictive Analytics • The most common approach is to use classification, where you predict an outcome (dependent variable) that is categorical (e.g. lived, died) from multiple predictors (independent variables). For example, you have a data set of pregnant women with Zika virus. Some have children with microcephaly and others don’t. You run a classification model to see if factors such as age, trimester of infection, fever, symptoms, etc. predict microcephaly. • If the outcome is numerical then you would use linear regression
  • 22. Need for better data analytical tools • We would benefit from more user-friendly tools and some degree of automation • MS Excel with the Analysis ToolPak add-in is a possibility but implies you know which stats tests to use • There are also multiple statistical packages, such as SPSS and SAS, but they are associated with a steep learning curve
  • 23. Need for better data analytical tools • Tool #1: IBM Watson Analytics: automated predictive, descriptive and visualization analytics • Tool #2: WEKA: open source machine learning platform
  • 24. IBM Watson Analytics • New program offered in 2015 that is not related to Watson Health (cognitive computing). Business oriented • Program is based on SPSS-based statistical tests. Covers regression, classification, decision trees, chi-square, t-tests, etc. • Program can automatically convert nominal data to numerical and vice versa • Versions • Free • Professional (Academic)
  • 25. IBM Watson Analytics Academic program • Free for universities to use for teaching (non-commercial) purposes. Includes 100 students/professor/year • University of West Florida has used the program for about 12 months in a Health Informatics graduate course and a Data Mining (computer science) course • IBM did an on-site visit for training • Multiple videos on YouTube • PDF user guide available
  • 26. IBM Watson analytics features • IBMWA is completely online • Accepts Excel and CSV input, as well as feeds from most relational database systems (RDBSs) • 100 GB storage • Limits: 500 columns and 10 million rows • Twitter Feed analysis
  • 27. Our 2016 Review Article https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5080525/
  • 28. IBMWA versions • About the time we submitted our detailed analysis of Watson Analytics, IBM had created Watson-2 that combined “Explore” and “Predict” into “Discover.” Watson-1 will be retired shortly. • Watson-2 includes statistical details about prediction when the target/outcome is numerical. They are working on adding the statistical details for categorical targets/outcomes. • Watson-2 includes a “data quality” score but doesn’t point out missing or skewed data and outliers.
  • 29. Breast-Cancer-spreadsheet • 286 patients • One outcome and 9 Attributes or predictors
  • 31. Step 2 Create a new Discovery
  • 34. Predictive Analytics: What Drives Breast Cancer Recurrence + predictive model
  • 35. Predict breast cancer recurrence
  • 36. Degree of malignancy and prediction of recurrence and no recurrence
  • 38. Confusion matrix is created but not explained (degree of malignancy and no recurrence)

                              Predicted no recurrence   Predicted recurrence   Total
      Actual no recurrence    TN = 161                  FP = 40                201
      Actual recurrence       FN = 40                   TP = 45                85
      Total                   201                       85                     286

      Accuracy = (TP + TN)/Total = 72%
      Sensitivity (recall) = TP/(FN + TP) = 53%
      Specificity = TN/(TN + FP) = 80%
      Precision = TP/(TP + FP) = 53%
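The measures on slide 38 can be recomputed directly from the four cell counts. The short Python sketch below is an illustration only and is not part of the original slides or of IBMWA. The counts come from the slide; the variable names are assumptions, and the negative predictive value (NPV) is an extra measure derived from the same counts.

```python
# Cell counts taken from the slide 38 confusion matrix
# (positive class = recurrence).
tn, fp = 161, 40   # actual no recurrence
fn, tp = 40, 45    # actual recurrence
total = tn + fp + fn + tp

accuracy = (tp + tn) / total      # share of all cases predicted correctly
sensitivity = tp / (tp + fn)      # recall: recurrences that were caught
specificity = tn / (tn + fp)      # non-recurrences correctly identified
precision = tp / (tp + fp)        # positive predictive value (PPV)
npv = tn / (tn + fn)              # negative predictive value (not on the slide)

for name, value in [
    ("Accuracy", accuracy),
    ("Sensitivity", sensitivity),
    ("Specificity", specificity),
    ("Precision (PPV)", precision),
    ("NPV", npv),
]:
    print(f"{name}: {value:.0%}")
```

Note that the numerators and denominators need parentheses: TP + TN/Total evaluated literally would not give 72%.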
  • 39. Create your Display • Display can be shared by email, hyperlink, Tweet or downloaded • Display is interactive
  • 40. IBMWA limitations • Business oriented, so not aligned perfectly with healthcare data analytics. Predictive strength is good, but we are used to sensitivity, specificity, PPV, NPV, ROC curves, etc. • No choice of statistical tests • IBMWA does not perform unsupervised learning • This approach (results first, stats second) may not appeal to purists • Sample dataset I used was of excellent quality, therefore not typical of many datasets
  • 41. IBMWA Application Process • Process to apply for the academic program is easy. Apply at: https://www.ibm.com/blogs/watson-analytics/calling-all-academics-have-we-got-a-watson-analytics-for-you/ • IBM contact information: Randy Messina at randymessina@us.ibm.com • My contact information: rhoyt@uwf.edu • Questions
  • 42. Machine Learning (ML) • Machine learning was developed by computer scientists and is largely based on mathematics, like statistics • While some ML algorithms are difficult to understand (e.g. neural networks), others are easier, such as decision trees and regression • Modeling is like baking: you decide what you want to bake and then select the best recipe (algorithm) to accomplish it. Optimally, you select multiple recipes and compare the results! Example: you want a model to decide what is spam email. You test many algorithms for best results and determine the best combination of predictors
  • 43. Algorithm Types • Supervised learning • Classification for categorical data (spam vs. no spam) • Regression for numerical data ($, mortality rate) • Unsupervised learning • Association rules: an example would be market basket analysis • Clustering: when you don’t know the data categories and you are looking for patterns in large data sets. Used extensively with genetic data sets • What ML has in common with the statistical approach • Both will perform linear regression, logistic regression and decision trees
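To make the association-rule idea on slide 43 concrete, here is a minimal Python sketch of the two basic market basket measures, support and confidence. The baskets are hypothetical toy data invented for illustration, and the function names are assumptions. Tools such as WEKA use more complete algorithms (for example Apriori) rather than this brute-force counting.

```python
# Hypothetical toy baskets, invented purely for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]

def count_containing(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket)

def support(itemset, baskets):
    """Fraction of all baskets that contain the itemset."""
    return count_containing(itemset, baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Among baskets with the antecedent, the fraction that also hold the consequent."""
    both = count_containing(set(antecedent) | set(consequent), baskets)
    return both / count_containing(antecedent, baskets)

# "If you purchased item A (bread), would you purchase item B (milk)?"
print("support(bread, milk):", support({"bread", "milk"}, baskets))
print("confidence(bread -> milk):", confidence({"bread"}, {"milk"}, baskets))
```

A rule is usually considered interesting only when both measures are high: high support means the pattern is common, and high confidence means the consequent reliably follows the antecedent.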
  • 44. Open Source Free ML Programs • WEKA 15 • Pentaho Community 16 • RapidMiner Community 17 • KNIME 18 • Orange 19
  • 45. Machine Learning with Orange University of Ljubljana, Slovenia
  • 46. WEKA • Named after a bird in New Zealand and stands for Waikato Environment for Knowledge Analysis • Free software is associated with a free ML course and a low cost textbook • Software works on all operating systems • WEKA is the only ML program mentioned that does not require moving around widgets or operators
  • 47. Predicting type 2 diabetes (WEKA)
  • 49. Outcome Measurements Accuracy is hitting the bull's-eye every time. Precision is hitting the same place each time, even if it is not the place you aimed for.
  • 50. Receiver operating characteristic (ROC) curve (c-statistic = AUC) [Figure: ROC curve plotting true positives (TP) against false positives (FP)]
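One way to read the c-statistic (AUC) from slide 50 is as the probability that a randomly chosen positive case receives a higher predicted risk than a randomly chosen negative case. The Python sketch below computes it by comparing every positive/negative pair. The labels and scores are hypothetical, and the function name `auc` is an assumption; this is not how WEKA or IBMWA necessarily compute it internally.

```python
def auc(labels, scores):
    """C-statistic via pairwise comparison: the fraction of (positive, negative)
    pairs in which the positive case has the higher score; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical outcomes (1 = event occurred) and model risk scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
print(f"AUC = {auc(labels, scores):.2f}")
```

A value of 0.5 means the scores rank cases no better than chance, and 1.0 means every positive case is ranked above every negative case.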
  • 51. Decision Tree for Contact Lenses
  • 53. Predictive Analytics Report Card • Many risk prediction models yield mediocre results at this point (C-statistic 0.56-0.80), but we are early in the game. • Models need to work in real-time ideally • Some risk models are used in healthcare organizations that might not fit your patient demographic, such as safety-net hospitals, etc. • It is helpful to identify patients at risk for morbidity and mortality but you still have to have an intervention team, ready to apply additional resources to high risk patients
  • 54. Data Science Education Stats • Certificate (82); Bachelor (24); Masters (259); Doctorate (14) • 37% of courses are offered online • 101 programs were related to business schools, 40 related to mathematics and statistics departments, 39 related to computer science departments and 9 related to new data science departments. The remainder were from a wide variety of college and university departments.
  • 55. Data Science Centers • Multiple universities and medical centers have created “data centers” to create the right environment for data analysis and research • They tend to be multi-disciplinary and not just relegated to the computer science department • Every industry seems to have interest in data science and analytics, hence the need to create a central hub
  • 56. Data Science Resources • My web site www.informaticseducation.org has a resource center with Chapter 23: Data Science resources: • Data sets: health and non-health related • Free data science courses • Free statistics resources • Free visualization software • Free programming tutorials • Other helpful stuff • Chapter 23 is available thru Lulu.com for $2.99
  • 57. ONC Sponsored Free Courses • Healthcare Data Analytics • Bellevue College (limited to Veterans Administration Staff Only) • Columbia University • Normandale College • Oregon Health & Science University • University of Alabama at Birmingham • University of Texas Health Science Center at Houston
  • 58. Machine Learning Resources • I would recommend beginning with Jason Brownlee’s eBooks: • Machine Learning Algorithms $27 (163 pages) • Machine Learning Mastery with WEKA $27 (248 pages) • www.machinelearningmastery.com
  • 60. Data Science Challenges • Not enough data scientists; it is estimated that we will need 140,000 by 2018 20 • Not enough data science training programs • Expensive to build big data and data science centers • Privacy and security concerns • Hype. Adverse unintended consequences (AUCs) • Medical data is heterogeneous and complex, compared to other industries 21 • Correlation does not equal causation • 80% of the time spent with data analysis is spent preparing the data for analysis 22
  • 61. Data Science Challenges • Difficult to find patient-level data • It has been stated that clinical medicine accounts for only 20% of population health; 80% is due to psycho-social-environmental-behavioral- economic factors that are beyond the control of the healthcare system. Therefore, interventions based on good data can result in no impact 23 • Just because you have technology and voluminous data doesn’t mean it changes patient outcomes. Example: fitness devices affecting behavior 24
  • 62. Make data science part of patient care Not everyone will be able to afford a robust analytics platform overlaid on a clinical data warehouse and the ability to handle Big Data. But we can start the educational process to learn more about data science
  • 63. Anticipate the onslaught of data analytical vendors
  • 64. Conclusions • Data science is a new information science that serves as an umbrella for data creation, manipulation, analysis and research • Data scientists are in high demand and it will take years before we can educate enough scientists to meet the demand • Data science is a team sport; it will require teams with individual skill sets to accomplish robust data science • New tools such as Watson and WEKA likely represent the beginning of analysis automation
  • 65. Conclusions • I encourage everyone to increase their knowledge in data science areas, such as predictive analytics • There are a myriad of free and affordable courses now available online (mentioned in my blog and on the resource page) • I encourage academic centers and HIT vendors to expand their data science offerings at multiple levels
  • 66. Questions? Slides available as Data Science 101 on www.slideshare.net

Editor's notes

  1. 1 Data Science Association. www.datascienceassn.org Accessed September 12, 2016 2 Analytics. Wikipedia. www.wikipedia.org Accessed January 16, 2016
  2. 3 Wegwarth O, Schwartz LM, Woloshin S et al. Do physicians understand cancer screening statistics? A national survey of primary care physicians in the United States. Ann Intern Med 2012;156(5):340-9 4 Manrai AK, Bhatia G, Strymish J et al. Medicine’s uncomfortable relationship with math. Research Letter. June 2014. JAMA Intern Med 2014;174(6):991-993
  3. 5 IBM Big Data and Analytics Hub http://www.ibmbigdatahub.com/blog/why-only-one-5-vs-big-data-really-matters
  4. 6 ONC definition of learning health system. Connecting Health and Care for the Nation. A Shared Nationwide Interoperability Roadmap. October 2015 7 National Research Council. Towards Precision Medicine: Building a Network for Biomedical Research and a new Taxonomy of Disease. National Academies Press. 2011
  5. 8 Gartner IT Glossary: http://www.gartner.com/it-glossary/predictive-analytics/
  6. 9 Desautels T, Calvert J, Hoffman J et al. Prediction of sepsis in the ICU with minimal EHR data: a machine learning approach. JMIR Medical Informatics 2016;4(3):e28 10 Echouffo-Tcheugui JB, Kengne AP. Risk models to predict chronic kidney disease and its progression: a systematic review. PLOS Medicine. November 20, 2012. Journals.plos.org 11 Most hospitals face 30 day readmissions penalty in fiscal 2016. August 3, 2015 www.modernhealthcare.com 12 Amarasingham R, Patel P, Tolo K et al. Allocating scarce resources in real-time to reduce heart failure readmissions: a prospective controlled trial. BMJ Quality Safety. July 31 2013 13 Stanton M. The high concentration of US Health Care expenditures. Research in Action. Issue 19. 2002. AHRQ Archive. https://archive.ahrq.gov 14 Chechulin Y, Nazerian A, Rais S. Predicting patients with high risk of becoming high cost healthcare users in Ontario. Healthcare Policy. 2014;9(3):68-79
  7. From Doing Data Science by O’Neill and Schutt. O’Reilly Media. 2014
  8. 15 WEKA: http://www.cs.waikato.ac.nz/ml/weka/ 16 Pentaho Community www.community.pentaho.com 17 RapidMiner Community www.community.rapidminer.com 18 KNIME www.knime.org 19 Orange data mining www.orange.biolab.si
  9. C-statistic (used to compare logistic regression models): The probability that predicting the outcome is better than chance. Used to compare the goodness of fit of logistic regression models, values for this measure range from 0.5 to 1.0. A value of 0.5 indicates that the model is no better than chance at making a prediction of membership in a group and a value of 1.0 indicates that the model perfectly identifies those within a group and those not. Models are typically considered reasonable when the C-statistic is higher than 0.7 and strong when C exceeds 0.8 (Hosmer & Lemeshow, 2000; Hosmer & Lemeshow, 1989). http://mchp-appserv.cpe.umanitoba.ca/viewDefinition.php?definitionID=104234 Area under the curve: based on prediction rules with true positives plotted against false positives (1-specificity). The closer to 1, the better. 0.5 is essentially worthless. http://gim.unmc.edu/dxtests/roc3.htm
  10. DataScience Community. http://datascience.community/colleges
  11. 20 Manyika J, Chui M, Brown B et al. Big Data: The Next Frontier for Innovation, Competition and Productivity. http://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/big-data-the-next-frontier-for-innovation 21 Krzysztof JC, Moore GW. Uniqueness of medical data mining. Art Int Med 2002;26:1-24 22 Press G. Cleaning big data: most time consuming, least enjoyable data science task, survey says. Forbes. March 23 2016 www.forbes.com
  12. 23 Jacobsen RM, Isham GJ, Rutten LJF. Population Health as a means for health care organizations to deliver value. Mayo Clinic Proceedings November 2015;90(11):1465-1470 24 Jakicic JM, David KK, Rogers RJ et al. Effect of wearable technology combined with a lifestyle intervention on long-term weight loss. The IDEA RCT. JAMA 2016;316(11):1161-1171
  13. Parikh RB, Obermeyer Z, Bates DW. Making predictive analytics a routine part of patient care. Harvard Business Review. April 21, 2016. https://hbr.org
  14. CloudMEDX www.cloudmedxhealth.com