Data Science 101
David Gerster
Strategic Advisory Board
About me
• 10+ years of experience in data science at various consumer
web companies
• Worked on web search at Yahoo and Microsoft
• Led the Mobile data science team at Groupon
• Joined BigML as VP Data Science in July 2013
• Joined JLL Spark as VP Data in July 2017
• Advisor to High Fidelity Genetics
Finding meaningful patterns in data
• The famous “Iris” data set has measurements for 150 flowers
• Given a flower’s measurements, can we predict its species?
[Photos: Iris setosa, Iris versicolor, Iris virginica]
[Figure: scatter plot of petal width (cm) versus petal length (cm). Iris setosa is shown as red dots, Iris versicolor as green dots, and Iris virginica as blue dots.]
Congratulations! You just trained a model.
[Figure: petal width (cm) versus petal length (cm), with four new flowers scored by the model. Predictions: Iris setosa, Iris versicolor, Iris virginica, Iris virginica.]
Congratulations! You just scored
four new flowers using your model,
and made a prediction about the
species of each one.
Training versus Scoring
• This process had two steps: training and scoring
• When training on historical data, you’re using data gathered over
some length of time
• When scoring new data points, you want the answer immediately
(in “real time”)
[Figure: two rectangles drawn on the scatter plot]
• Predicts “blue” with high confidence: explains a large chunk of the data (high support)
• Predicts “blue” with low confidence: explains a small chunk of the data (low support)
Support and Confidence
• A rectangle with a large number of data points has high “support”
• A rectangle that is purely one color has high “confidence”
• If there is a small number of data points, confidence is low even if
it’s purely one color
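The deck does not say how confidence is computed. As a minimal sketch, the code below assumes the Wilson score lower bound, one common choice that behaves as described: a pure rectangle with few points still scores low. The function names and the 4-point rectangle are hypothetical; the 45-point rectangle and the total of 150 flowers come from the slides.

```python
import math

def support(points_in_rectangle, total_points):
    """Fraction of the whole data set that falls inside the rectangle."""
    return points_in_rectangle / total_points

def confidence(majority_count, points_in_rectangle, z=1.96):
    """Wilson score lower bound on the share of the majority color.

    Assumption: this is one way to define confidence, not necessarily
    the definition used in the deck.
    """
    n = points_in_rectangle
    if n == 0:
        return 0.0
    p = majority_count / n
    center = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - margin) / (1 + z * z / n)

TOTAL_FLOWERS = 150

# A pure rectangle holding 45 blue points (from the slides).
large_support = support(45, TOTAL_FLOWERS)
large_confidence = confidence(45, 45)

# A pure rectangle holding only 4 blue points (hypothetical).
small_support = support(4, TOTAL_FLOWERS)
small_confidence = confidence(4, 4)

print(large_support, large_confidence)
print(small_support, small_confidence)
```

Both rectangles are purely one color, yet the smaller one ends up with a much lower confidence because it rests on far fewer data points.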
[Figure: petal width (cm) versus petal length (cm)]
“Decision Tree”
All data: 50 red, 50 blue, 50 green
  Width <= 0.8? 50 red (leaf node)
  Width > 0.8? 50 blue, 50 green
    Width > 1.75? 45 blue (leaf node)
    Width <= 1.75? 5 blue, 50 green
      Length <= 5? 1 blue, 48 green (leaf node)
      Length > 5? 4 blue, 2 green (leaf node)
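The decision tree on this slide can be read as a set of nested if/else rules. The sketch below encodes the thresholds shown on the slide and returns the majority color of each leaf, mapped to species using the plot legend (red = setosa, green = versicolor, blue = virginica). The function name and the sample measurements are hypothetical, and treating "Width" and "Length" as petal width and petal length in cm is an assumption based on the plot axes.

```python
def predict_species(petal_length_cm, petal_width_cm):
    """Score one flower by walking the tree from the root to a leaf."""
    if petal_width_cm <= 0.8:
        return "Iris setosa"        # leaf: 50 red
    if petal_width_cm > 1.75:
        return "Iris virginica"     # leaf: 45 blue
    if petal_length_cm <= 5:
        return "Iris versicolor"    # leaf: 1 blue, 48 green
    return "Iris virginica"         # leaf: 4 blue, 2 green

# Hypothetical new flowers: (petal length, petal width) in cm.
new_flowers = [(1.4, 0.2), (4.5, 1.5), (6.0, 2.2), (5.5, 1.6)]
for length, width in new_flowers:
    print(length, width, predict_species(length, width))
```

Scoring is just a handful of comparisons, which is why a trained tree can answer immediately for each new data point.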
• Data is just a table of values
• Each row is an instance, an example
of the concept to be learned
• Each column is an attribute or
feature of the instance
• The column we want to predict is the
label or output
• Because we have a label, this is
supervised learning
[Figure: a data table with rows marked as instances and columns marked as features and label]
Demo: The General Social Survey
• Sociology survey given in the United States since 1972
• Data is 39,000 responses, almost 400 questions each
• Demographic data like income, race, gender, education, marital status
• Many questions about personal beliefs
• “Should an atheist be allowed to teach college, or not?”
• “Are we spending the right amount of money on education?”
• Can we predict income from these responses?
How good is our model?
• The model looks good, but how do we quantify this?
100% of data: 80% training set, 20% holdout set
1. Train a model using the 80% training set
2. Pretend the 20% holdout is new data, and feed it to the model
3. Check accuracy of predictions
Example: 3 out of 4 predictions are correct, so accuracy = 75%
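The three steps above can be sketched in a few lines. This is a minimal illustration, not the tool used in the demo: the rows are made up, and the "model" is a hypothetical stand-in that always predicts the most common label in the training set. All function names are assumptions.

```python
import random
from collections import Counter

def split_holdout(rows, holdout_fraction=0.2, seed=0):
    """Shuffle the rows and set aside a fraction as the holdout set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]

def train_majority_model(training_rows):
    """Stand-in model: always predicts the most common training label."""
    majority = Counter(label for _, label in training_rows).most_common(1)[0][0]
    return lambda features: majority

def accuracy(predictions, labels):
    """Share of predictions that match the known labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Hypothetical rows of (features, label).
rows = [((i,), "high" if i % 4 == 0 else "low") for i in range(20)]

training_set, holdout_set = split_holdout(rows)          # 80% / 20%
model = train_majority_model(training_set)               # step 1
predictions = [model(f) for f, _ in holdout_set]         # step 2
holdout_accuracy = accuracy(predictions,
                            [label for _, label in holdout_set])  # step 3
print(len(training_set), len(holdout_set), holdout_accuracy)
```

The key point is that the holdout rows never reach the training function, so the accuracy estimates how the model would do on data it has not seen.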
Predicting political views
• What happens if we predict political views instead of income?
• A different subset of variables becomes important!
Finding the important variables
The Value of Predictive Modeling
• Provides deep insight into your data
• Finds the small subset of important variables
• Extremely useful for business!
Demo: The StumbleUpon Dataset
• StumbleUpon is an app that recommends web pages
• Dataset of 7,400 web pages is provided, with each page labeled as
either “evergreen” or “ephemeral”
• We want to predict the page’s class using this historical data
While some pages we recommend, such as news
articles or seasonal recipes, are only relevant for a
short period of time, others maintain a timeless
quality and can be recommended to users long after
they are discovered. In other words, pages can
either be classified as "ephemeral" or "evergreen".
Training a model on StumbleUpon data
• Live demo: training a model on StumbleUpon data
• Key concepts:
• “Bag of words” text analysis
• Evaluating the model using a holdout set
• Combining multiple models to improve accuracy
• The “ensemble” of multiple models has better accuracy!
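As a rough sketch of the "bag of words" idea: each page is reduced to word counts, ignoring word order, and those counts become the features of the instance. The tokenizer, function names, and example sentences below are hypothetical and are not taken from the StumbleUpon data.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count how often each lowercase word appears; word order is discarded."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def vectorize(bag, vocabulary):
    """Turn a bag into one numeric feature per vocabulary word."""
    return [bag[word] for word in vocabulary]

# Hypothetical page texts.
pages = [
    "Easy chicken soup recipe: a classic soup for any season",
    "Breaking news: election results expected tonight",
]
bags = [bag_of_words(page) for page in pages]
vocabulary = sorted(set().union(*bags))
feature_rows = [vectorize(bag, vocabulary) for bag in bags]

print(vocabulary)
for row in feature_rows:
    print(row)
```

Each row now has one column per word, so text pages fit the same instance/feature table as any other data set.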
“Ensembles” of Models
• Training multiple models on random subsets of the data gave us a
better result!
• Why?
Bias and Variance
• We train a model with the goal of fitting it correctly to the data
• When a model isn’t flexible enough, it may underfit the data, and we
say it has high bias
• When a model is too flexible, it may overfit the data, and we say it
has high variance
For a formal definition of bias and variance, see
Thomas Dietterich’s paper on the subject
High Bias
High Variance
Decision trees have high variance
• Decision trees can represent complex functions
• But they are prone to overfitting; they have high variance
• If you draw enough lines, you can create a “model” that just
memorizes the dataset!
Decision trees have high variance
• We can reduce this problem by:
• Taking several random samples from the original data set
• Training a decision tree on each sample
• Having these trees vote on the class
• Goal: Get the expressiveness of a decision tree, with less overfitting
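The recipe above (random samples, one tree per sample, then a vote) can be sketched as follows. To stay short, each "tree" here is simplified to a single split on one feature, which is an assumption and not what a full decision tree learner does. The petal widths and all function names are hypothetical.

```python
import random
from collections import Counter

def train_stump(sample):
    """Simplified one-split tree: pick the threshold with the fewest errors."""
    xs = sorted({x for x, _ in sample})
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])] or [xs[0]]
    best = None
    for t in candidates:
        left = [y for x, y in sample if x <= t]
        right = [y for x, y in sample if x > t]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, t, left_label, right_label)
    _, t, left_label, right_label = best
    return lambda x: left_label if x <= t else right_label

def train_ensemble(data, n_trees=5, seed=0):
    """Train one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [train_stump(rng.choices(data, k=len(data))) for _ in range(n_trees)]

def vote(trees, x):
    """Each tree predicts a class; the most common prediction wins."""
    return Counter(tree(x) for tree in trees).most_common(1)[0][0]

# Hypothetical (petal width in cm, color) pairs.
data = [(1.0, "green"), (1.1, "green"), (1.3, "green"), (1.4, "green"),
        (1.5, "green"), (1.6, "green"), (1.8, "blue"), (1.9, "blue"),
        (2.0, "blue"), (2.1, "blue"), (2.3, "blue"), (2.5, "blue")]

trees = train_ensemble(data)
print(vote(trees, 0.5), vote(trees, 1.7), vote(trees, 3.0))
```

Because each bootstrap sample is slightly different, each stump may place its threshold in a slightly different spot; the vote smooths over those differences.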
[Figure: Single Tree: 100% of data feeds one tree, which makes the prediction]
[Figure: Ensemble of Trees: five bootstrap samples are drawn from 100% of the data, and the trees vote on the prediction]
[Figure: boundaries drawn by three trees, each with a blue side and a red side. Votes by region: 2-1 Blue, 2-1 Red, 2-1 Blue.]
Benefits of a Decision Tree Ensemble
• Voted boundary is more accurate than for a single tree
• “Best of both worlds”: Get most of the expressiveness of decision
trees with lower variance
• We’re actually taking advantage of the variance by feeding a different
random sample to each tree and seeing what happens!
Why draw straight lines in decision trees?
• Imagine you have 400 variables in your dataset
• You only need to examine 400 variables to draw
the “best” straight line between the dots
• If you want a diagonal line in two dimensions,
there are (400 choose 2) or 79,800
combinations of variables to examine
• Some biology datasets have 100,000 variables!
• (100,000 choose 2) = 4,999,950,000
combinations of 2 variables!
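The counts on this slide follow directly from the binomial coefficient and can be checked with the standard library. The helper name is hypothetical.

```python
import math

def candidate_variable_sets(n_variables, variables_per_line):
    """Number of ways to choose which variables a dividing line may use."""
    return math.comb(n_variables, variables_per_line)

print(candidate_variable_sets(400, 1))       # 400
print(candidate_variable_sets(400, 2))       # 79800
print(candidate_variable_sets(100_000, 2))   # 4999950000
```

Restricting each split to a single variable keeps the search linear in the number of variables, while allowing pairs makes it grow quadratically.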
Popular algorithms for supervised learning
• We got pretty deep into Decision Trees and ensembles of trees
• Other popular algorithms for supervised learning:
• Support Vector Machines
• Neural Nets (“Deep Learning”)
• Check out BigML’s automated deep learning!
Recap: Supervised Learning Topics
• Definition of supervised learning
• Training and scoring a model
• Support and confidence
• Model evaluation using a holdout set
• Bias and variance, underfitting and overfitting
• Using ensembles to improve models
• … And a whole lot about decision trees!
[Figure: petal width (cm) versus petal length (cm)]
What if we don’t have labels?
• Can we still get insight into our data if we don’t know the
colors of the dots?
• Since we don’t have labels, this is unsupervised learning
• Clustering: Find “clumps” of unlabeled data that might be interesting
• Anomaly detection: Find outliers in unlabeled data
• Topic Modeling: Identify topics in free text
Clustering
• Concept: Find “lumps” of data that exist in distinct clusters
• K-means clustering:
1. Choose the number of clusters k that you are looking for
2. Choose initial “centroids” for the clusters
3. Assign each data point to its closest centroid
4. Move each centroid to the actual center of the data points assigned to it
5. Repeat steps 3 and 4 until the k centroids stop moving
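The k-means steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions: points are tuples of numbers, distance is Euclidean, and the optional `initial_centroids` argument (a hypothetical parameter) lets the caller pick the starting centroids; otherwise k random data points are used. The sample points are made up.

```python
import math
import random

def kmeans(points, k, initial_centroids=None, seed=0, max_iterations=100):
    """Return (centroids, clusters) after the centroids stop moving."""
    rng = random.Random(seed)
    # Step 2: choose initial centroids.
    centroids = list(initial_centroids) if initial_centroids else rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iterations):
        # Step 3: assign each point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 4: move each centroid to the mean of its assigned points.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])  # empty cluster: leave it in place
        # Step 5: stop once no centroid moved.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical points forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2, initial_centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

With random starting centroids the result can depend on the seed, which is why practical implementations often run the procedure several times and keep the best clustering.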
Demo: The Whisky Dataset
• Data on the flavors of 86 single-malt Scotch whiskies
• No labels, just a bunch of taste information
• Can we get insight into this dataset?
Demo: Breast Cancer Dataset
• Train a predictive model using the 699 biopsies
• The “label” of benign or malignant is known for each one
• We can train a highly accurate predictive model with this
data
Demo: Breast Cancer Dataset
• What if we remove the labels of “benign” and “malignant”?
[Figure: two data points isolated by straight lines]
• 10 lines are needed to isolate this data point (not anomalous)
• Only 4 lines are needed to isolate this data point (highly anomalous)
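The idea in this figure, that an anomalous point needs fewer lines to isolate than a typical one, can be sketched by drawing random axis-parallel cuts until the point is alone and counting the cuts. This resembles the isolation forest technique; whether the demo's anomaly detector works exactly this way is an assumption. The points and function names are hypothetical.

```python
import random

def isolation_depth(target, points, rng):
    """Count random axis-parallel cuts until `target` is the only point left."""
    remaining = list(points)
    depth = 0
    while len(remaining) > 1:
        # Only cut along axes where the remaining points still differ.
        axes = [a for a in range(len(target))
                if min(p[a] for p in remaining) < max(p[a] for p in remaining)]
        if not axes:
            break  # only exact duplicates of the target remain
        axis = rng.choice(axes)
        low = min(p[axis] for p in remaining)
        high = max(p[axis] for p in remaining)
        cut = rng.uniform(low, high)
        # Keep the side of the cut that contains the target.
        if target[axis] < cut:
            remaining = [p for p in remaining if p[axis] < cut]
        else:
            remaining = [p for p in remaining if p[axis] >= cut]
        depth += 1
    return depth

def average_isolation_depth(target, points, n_trials=200, seed=0):
    """Average over many random trials; lower means more anomalous."""
    rng = random.Random(seed)
    return sum(isolation_depth(target, points, rng) for _ in range(n_trials)) / n_trials

# Hypothetical data: a 5 x 5 grid of typical points plus one far-away point.
typical = [(x / 4, y / 4) for x in range(5) for y in range(5)]
outlier = (10.0, 10.0)
points = typical + [outlier]

inlier_depth = average_isolation_depth((0.5, 0.5), points)
outlier_depth = average_isolation_depth(outlier, points)
print(inlier_depth, outlier_depth)
```

The far-away point is usually cut off by the first or second line, while a point in the middle of the grid needs several, so the average depth can serve as an anomaly score.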
Demo: Anomaly Detection
• Remove the labels of benign or malignant
• Train an anomaly detector on this unlabeled data
• Create a new dataset with the anomaly scores as “labels”
• Use these “labels” to train a predictive model!
Who Needs Labels?
Minority Report
• Anomaly detection works great on large unlabeled datasets,
especially if you expect to find an (adversarial) minority class
• Millions of credit card transactions, billions of network events …
• Doesn’t require you to know what you’re looking for!
Topic Modeling using LDA
• Uncovers groups of related words (“topics”) in documents
• Does not require an external corpus (e.g. training on Wikipedia)
• No semantic parsing of text
• Unsupervised
Topic modeling on
IMDB reviews
• 52,000 reviews
• 883 movies
[Figure: top 3 topics in Shrek reviews (n = 26), showing the topics and the topic distribution for this document]
Borrowed/stolen from Prof. David Blei, with apologies
…
The (assumed) generative process
• A distribution over topic distributions, fixed for this corpus
• A distribution over topics, specific to each document
• A distribution over word distributions, fixed for this corpus
• A topic, which is a distribution over words
• A word in a document (for example, “children”)
[Figure labels: Topic 1, Topic 2, Topic 3; Word 1, Word 2, Word 3]
What we observe
• A word in a document (for example, “children”)
[Figures: topic results for Shrek (n = 26), The Sum of All Fears (n = 31), and Love, Actually (n = 100)]
How do we get such “good” topics?
• Imagine that each document can only belong to one topic
• Does that make it easier or harder to find “good” clusters of words?
• LDA allows documents to belong to multiple topics
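The assumed generative process, in which each document mixes several topics, can be sketched as below. This is a simplified illustration: the two topics and their word probabilities are made up by hand (in full LDA the topics are themselves drawn from a corpus-wide distribution), and the function names and `alpha` values are hypothetical. The Dirichlet draw uses the standard trick of normalizing gamma samples.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw a probability vector by normalizing independent gamma samples."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, n_words, rng):
    """Generate one document as a mixture of topics.

    topics: list of dicts mapping word -> probability (one dict per topic).
    """
    # A distribution over topics, specific to this document.
    topic_mix = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # Pick a topic for this word, then pick a word from that topic.
        topic = rng.choices(topics, weights=topic_mix)[0]
        word = rng.choices(list(topic), weights=list(topic.values()))[0]
        words.append(word)
    return topic_mix, words

# Hypothetical topics, each a distribution over words.
topics = [
    {"children": 0.5, "family": 0.3, "funny": 0.2},
    {"spy": 0.4, "nuclear": 0.4, "thriller": 0.2},
]

rng = random.Random(0)
topic_mix, words = generate_document(topics, alpha=[0.5, 0.5], n_words=12, rng=rng)
print(topic_mix)
print(words)
```

Fitting LDA runs this story in reverse: only the words are observed, and the topics and per-document mixtures are inferred from them.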
Recap: Unsupervised Learning Topics
• Unsupervised learning uses unlabeled data
• Clustering: Finding clumps in unlabeled data
• Anomaly Detection: Finding “weird” instances in unlabeled data
• Topic Modeling: Extracting meaningful topics from free text
Final Thought
• Supervised learning has many different algorithms to solve one
problem (predicting the output)
• Unsupervised learning has many different algorithms to solve many
different problems
David Gerster
gerster@bigml.com
Backup Slides
Call Girls Service In Old Town Dubai ((0551707352)) Old Town Dubai Call Girl ...
 
Chandigarh Escorts Service 📞8868886958📞 Just📲 Call Nihal Chandigarh Call Girl...
Chandigarh Escorts Service 📞8868886958📞 Just📲 Call Nihal Chandigarh Call Girl...Chandigarh Escorts Service 📞8868886958📞 Just📲 Call Nihal Chandigarh Call Girl...
Chandigarh Escorts Service 📞8868886958📞 Just📲 Call Nihal Chandigarh Call Girl...
 
RSA Conference Exhibitor List 2024 - Exhibitors Data
RSA Conference Exhibitor List 2024 - Exhibitors DataRSA Conference Exhibitor List 2024 - Exhibitors Data
RSA Conference Exhibitor List 2024 - Exhibitors Data
 
Uneak White's Personal Brand Exploration Presentation
Uneak White's Personal Brand Exploration PresentationUneak White's Personal Brand Exploration Presentation
Uneak White's Personal Brand Exploration Presentation
 
Russian Call Girls In Gurgaon ❤️8448577510 ⊹Best Escorts Service In 24/7 Delh...
Russian Call Girls In Gurgaon ❤️8448577510 ⊹Best Escorts Service In 24/7 Delh...Russian Call Girls In Gurgaon ❤️8448577510 ⊹Best Escorts Service In 24/7 Delh...
Russian Call Girls In Gurgaon ❤️8448577510 ⊹Best Escorts Service In 24/7 Delh...
 

Data Science 101

  • 1. Data Science 101 David Gerster Strategic Advisory Board
  • 2. About me • 10+ years of experience in data science at various consumer web companies • Worked on web search at Yahoo and Microsoft • Led the Mobile data science team at Groupon • Joined BigML as VP Data Science in July 2013 • Joined JLL Spark as VP Data in July 2017 • Advisor to High Fidelity Genetics
  • 3. Finding meaningful patterns in data • The famous “Iris” data set has measurements for 150 flowers • Given a flower’s measurements, can we predict its species? • Species pictured: Iris setosa, Iris versicolor, Iris virginica
  • 4. Figure with axes petal width (cm) and petal length (cm) • Iris setosa, red dots • Iris versicolor, green dots • Iris virginica, blue dots
  • 6. Figure with axes petal width (cm) and petal length (cm), showing four new flowers • Prediction: Iris setosa • Prediction: Iris versicolor • Prediction: Iris virginica • Prediction: Iris virginica
  • 7. Same figure and predictions as the previous slide • Congratulations! You just scored four new flowers using your model, and made a prediction about the species of each one.
  • 8. Training versus Scoring • This process had two steps: training and scoring • When training on historical data, you’re using data gathered over some length of time • When scoring new data points, you want the answer immediately (in “real time”)
  • 9. Figure annotations • One rectangle predicts “blue” with high confidence and explains a large chunk of the data (high support) • Another rectangle predicts “blue” with low confidence and explains a small chunk of the data (low support)
  • 10. Support and Confidence • A rectangle with a large number of data points has high “support” • A rectangle that is purely one color has high “confidence” • If there is a small number of data points, confidence is low even if it’s purely one color
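The interplay of support and confidence can be sketched in a few lines. This is a minimal illustration rather than the method used in the talk: it assumes confidence is scored with a Wilson lower bound on the rectangle's purity (one common way to penalize small samples), and the leaf counts are invented for the example.

```python
import math

def support(leaf_counts):
    """Support: how many training points fall inside the rectangle."""
    return sum(leaf_counts.values())

def wilson_lower_bound(successes, n, z=1.96):
    """Lower end of the Wilson score interval (assumed confidence measure)."""
    if n == 0:
        return 0.0
    p = successes / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / (1 + z * z / n)

def confidence(leaf_counts):
    """Confidence in the majority color, discounted when support is small."""
    n = support(leaf_counts)
    majority = max(leaf_counts.values(), default=0)
    return wilson_lower_bound(majority, n)

# Hypothetical rectangles: both are purely blue, but one is much smaller.
big_pure_leaf = {"blue": 45}
small_pure_leaf = {"blue": 2}

print(support(big_pure_leaf), confidence(big_pure_leaf))
print(support(small_pure_leaf), confidence(small_pure_leaf))
```

Both rectangles are purely one color, yet the small one scores lower, which matches the third bullet on the slide.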
  • 11. Figure: the petal width (cm) and petal length (cm) plot redrawn as a “Decision Tree” • Splits: Width <= 0.8? / Width > 0.8?; Width > 1.75? / Width <= 1.75?; Length <= 5? / Length > 5? • Node counts: 50 red, 50 blue, 50 green; 50 blue, 50 green; 5 blue, 50 green • “Leaf Nodes”: 50 red; 45 blue; 1 blue, 48 green; 4 blue, 2 green
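The tree on this slide can be written as nested conditions. The sketch below is one reading of the flattened slide text: the thresholds come from the slide, while the order of the splits and the function name `predict_species` are assumptions. Colors are mapped to species as in the earlier scatter plot (red = setosa, green = versicolor, blue = virginica).

```python
def predict_species(petal_length_cm, petal_width_cm):
    """Score one flower with the hand-built tree (assumed split order)."""
    if petal_width_cm <= 0.8:
        return "Iris setosa"      # leaf: 50 red
    if petal_width_cm > 1.75:
        return "Iris virginica"   # leaf: 45 blue
    if petal_length_cm <= 5:
        return "Iris versicolor"  # leaf: 1 blue, 48 green
    return "Iris virginica"       # leaf: 4 blue, 2 green

# Scoring a few made-up flowers.
for length, width in [(1.4, 0.2), (4.5, 1.3), (5.5, 1.5), (5.5, 2.0)]:
    print(length, width, predict_species(length, width))
```

Each leaf returns its majority color, so the two mixed leaves will misclassify a few training flowers.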
  • 12. Data is just a table of values • Each row is an instance, an example of the concept to be learned • Each column is an attribute or feature of the instance • The column we want to predict is the label or output • Because we have a label, this is supervised learning • Figure labels: instance, feature, label
  • 13. Demo: The General Social Survey • Sociology survey given in the United States since 1972 • Data is 39,000 responses, almost 400 questions each • Demographic data like income, race, gender, education, marital status • Many questions about personal beliefs • “Should an atheist be allowed to teach college, or not?” • “Are we spending the right amount of money on education?” • Can we predict income from these responses?
  • 14. How good is our model? • The model looks good, but how do we quantify this?
  • 15. Start with 100% of the data, split into an 80% training set and a 20% holdout set • 1. Train a model using the 80% training set • 2. Pretend the 20% holdout is new data, and feed it to the model • 3. Check accuracy of predictions • Example: 3 out of 4 predictions are correct, so accuracy = 75%
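The holdout procedure can be sketched with the standard library alone. The helper names `train_holdout_split` and `accuracy` are hypothetical, and the labels in the usage lines are made up to reproduce the "3 out of 4" example.

```python
import random

def train_holdout_split(rows, holdout_fraction=0.2, seed=0):
    """Shuffle the rows, then set aside a fraction of them as the holdout set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(round(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]

def accuracy(predictions, actual):
    """Fraction of holdout predictions that match the known labels."""
    correct = sum(p == a for p, a in zip(predictions, actual))
    return correct / len(actual)

training, holdout = train_holdout_split(range(100))
print(len(training), len(holdout))

# Hypothetical predictions on four holdout rows: three of them are right.
print(accuracy(["high", "low", "high", "high"], ["high", "low", "high", "low"]))
```

The key point is that the model never sees the holdout rows during training, so the accuracy measured on them stands in for accuracy on genuinely new data.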
  • 16. Predicting political views • What happens if we predict political views instead of income? • A different subset of variables becomes important!
  • 18. Finding the important variables
  • 20. The Value of Predictive Modeling • Provides deep insight into your data • Finds the small subset of important variables • Extremely useful for business!
  • 21. Demo: The StumbleUpon Dataset • StumbleUpon is an app that recommends web pages • A dataset of 7,400 web pages is provided, with each page labeled as either “evergreen” or “ephemeral” • We want to predict the page’s class using this historical data • Quoted description: “While some pages we recommend, such as news articles or seasonal recipes, are only relevant for a short period of time, others maintain a timeless quality and can be recommended to users long after they are discovered. In other words, pages can either be classified as "ephemeral" or "evergreen".”
  • 22. Training a model on StumbleUpon data • Live demo: training a model on StumbleUpon data • Key concepts: • “Bag of words” text analysis • Evaluating the model using a holdout set • Combining multiple models to improve accuracy • The “ensemble” of multiple models has better accuracy!
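A "bag of words" reduces a page's text to word counts and ignores word order. The sketch below is a generic illustration, not the pipeline used in the demo; the tokenization rule (lowercase runs of letters and apostrophes) and the sample sentence are assumptions.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count how often each lowercase word appears, ignoring word order."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

bag = bag_of_words("Evergreen recipes stay evergreen; news fades.")
print(bag.most_common(3))
```

Each distinct word then becomes a numeric feature (a column) that a model such as a decision tree can split on.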
  • 23. “Ensembles” of Models • Training multiple models on random subsets of the data gave us a better result! • Why?
  • 24. Bias and Variance • We train a model with the goal of fitting it correctly to the data • When a model isn’t flexible enough, it may underfit the data, and we say it has high bias • When a model is too flexible, it may overfit the data, and we say it has high variance • For a formal definition of bias and variance, see Thomas Dietterich’s paper on the subject
  • 27. Decision trees have high variance • Decision trees can represent complex functions • But they are prone to overfitting; they have high variance • If you draw enough lines, you can create a “model” that just memorizes the dataset!
  • 29. Decision trees have high variance • We can reduce this problem by: • Taking several random samples from the original data set • Training a decision tree on each sample • Having these trees vote on the class • Goal: Get the expressiveness of a decision tree, with less overfitting
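The three steps (sample, train, vote) can be sketched as follows. To stay short, the sketch uses a one-split tree on a single feature (a "stump") as a stand-in for a full decision tree; the function names and the toy data are assumptions, not the demo's implementation.

```python
import random
from collections import Counter

def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]

def train_stump(sample):
    """Fit a one-split tree on (x, label) pairs by minimizing training errors."""
    xs = sorted({x for x, _ in sample})
    overall = majority([y for _, y in sample])
    best = (None, overall, overall)  # (threshold, left label, right label)
    best_errors = sum(y != overall for _, y in sample)
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in sample if x <= threshold]
        right = [y for x, y in sample if x > threshold]
        l_lab, r_lab = majority(left), majority(right)
        errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
        if errors < best_errors:
            best, best_errors = (threshold, l_lab, r_lab), errors
    return best

def stump_predict(stump, x):
    threshold, l_lab, r_lab = stump
    if threshold is None:
        return l_lab
    return l_lab if x <= threshold else r_lab

def train_ensemble(data, n_trees=3, seed=0):
    """Train each stump on a random sample drawn with replacement."""
    rng = random.Random(seed)
    return [train_stump([rng.choice(data) for _ in data]) for _ in range(n_trees)]

def ensemble_predict(stumps, x):
    """Let the stumps vote on the class."""
    return majority([stump_predict(s, x) for s in stumps])

# Toy data: small x values are red, large x values are blue.
data = [(x, "red") for x in range(10)] + [(x, "blue") for x in range(10, 20)]
stumps = train_ensemble(data)
print([s[0] for s in stumps])  # each stump may place its threshold differently
print(ensemble_predict(stumps, 0), ensemble_predict(stumps, 19))
```

Because every stump sees a different random sample, the thresholds differ slightly from stump to stump, and the vote averages out those differences.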
  • 36. Figure: votes on the boundary between the blue side and the red side • Vote: 2-1, Blue • Vote: 2-1, Red • Vote: 2-1, Blue
  • 37. Benefits of a Decision Tree Ensemble • Voted boundary is more accurate than for a single tree • “Best of both worlds”: Get most of the expressiveness of decision trees with lower variance • We’re actually taking advantage of the variance by feeding a different random sample to each tree and seeing what happens!
  • 38. Why draw straight lines in decision trees? • Imagine you have 400 variables in your dataset • You only need to examine 400 variables to draw the “best” straight line between the dots • If you want a diagonal line in two dimensions, there are (400 choose 2) or 79,800 combinations of variables to examine • Some biology datasets have 100,000 variables! • (100,000 choose 2) = 4,999,950,000 combinations of 2 variables!
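The counts on this slide are binomial coefficients and can be checked directly. The snippet assumes Python 3.8 or later, where `math.comb` is available.

```python
import math

# Axis-aligned splits: one candidate variable at a time.
print(400)

# Diagonal splits in two dimensions: every pair of variables is a candidate.
print(math.comb(400, 2))      # 400 * 399 / 2
print(math.comb(100_000, 2))  # 100,000 * 99,999 / 2
```

The number of pairs grows roughly with the square of the number of variables, which is why axis-aligned splits stay cheap while diagonal ones do not.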
  • 39. Popular algorithms for supervised learning • We got pretty deep into Decision Trees and ensembles of trees • Other popular algorithms for supervised learning: • Support Vector Machines • Neural Nets (“Deep Learning”) • Check out BigML’s automated deep learning!
  • 40. Recap: Supervised Learning Topics • Definition of supervised learning • Training and scoring a model • Support and confidence • Model evaluation using a holdout set • Bias and variance, underfitting and overfitting • Using ensembles to improve models • … And a whole lot about decision trees!
  • 42. What if we don’t have labels? • Can we still get insight into our data if we don’t know the colors of the dots? • Since we don’t have labels, this is unsupervised learning • Clustering: Find “clumps” of unlabeled data that might be interesting • Anomaly detection: Find outliers in unlabeled data • Topic Modeling: Identify topics in free text
  • 43. Clustering • Concept: Find “lumps” of data that exist in distinct clusters • K-means clustering: 1. Choose a number of clusters k that you are looking for 2. Choose initial “centroids” for the clusters 3. Compute which data points are closest to each centroid 4. Compute the actual center for each of the sets of data points 5. Repeat steps 3 and 4 until the k centroids stop moving
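The five k-means steps can be sketched in plain Python. This is a minimal version under stated assumptions: initial centroids are picked at random from the data, distance is squared Euclidean, and the toy points are invented for the example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise average of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 2: initial centroids
    for _ in range(max_iter):
        # Step 3: assign each point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 4: move each centroid to the center of its points.
        new_centroids = [mean(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        # Step 5: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated toy clumps; step 1 is choosing k = 2.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))
```

On less cleanly separated data the result can depend on the initial centroids, so it is common to run the procedure several times with different seeds.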
  • 57. Demo: The Whisky Dataset • Data on the flavors of 86 single-malt Scotch whiskies • No labels, just a bunch of taste information • Can we get insight into this dataset?
  • 58. Demo: Breast Cancer Dataset • Train a predictive model using the 699 biopsies • The “label” of benign or malignant is known for each one • We can train a highly accurate predictive model with this data
  • 59. Demo: Breast Cancer Dataset • What if we remove the labels of “benign” and “malignant”?
  • 60. 10 lines are needed to isolate this data point (not anomalous)
  • 61. Only 4 lines are needed to isolate this data point (highly anomalous)
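The idea of counting lines can be sketched as follows: keep drawing random axis-aligned lines, keep only the side that contains the point of interest, and count how many lines it takes until the point is alone. This is an illustration in the spirit of isolation-based anomaly detection, not the detector used in the demo; the function names and toy data are assumptions.

```python
import random

def isolation_depth(points, target, rng):
    """Number of random axis-aligned splits needed to isolate target.

    Assumes target is in points and all points are distinct.
    """
    current = list(points)
    depth = 0
    while len(current) > 1:
        # Only split on features where the remaining points still differ.
        features = [f for f in range(len(target))
                    if min(p[f] for p in current) < max(p[f] for p in current)]
        f = rng.choice(features)
        lo = min(p[f] for p in current)
        hi = max(p[f] for p in current)
        split = rng.uniform(lo, hi)
        # Keep the side of the line that contains the target.
        keep_low = target[f] < split
        current = [p for p in current if (p[f] < split) == keep_low]
        depth += 1
    return depth

def average_depth(points, target, trials=200, seed=0):
    """Average the split count over many random trials."""
    rng = random.Random(seed)
    return sum(isolation_depth(points, target, rng) for _ in range(trials)) / trials

# Toy data: a dense 7 x 7 grid plus one far-away point.
grid = [(x, y) for x in range(7) for y in range(7)]
outlier = (60, 60)
points = grid + [outlier]

print(average_depth(points, outlier))  # few lines needed: anomalous
print(average_depth(points, (3, 3)))   # many lines needed: not anomalous
```

A low average count plays the role of a high anomaly score.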
  • 62. Demo: Anomaly Detection • Remove the labels of benign or malignant • Train an anomaly detector on this unlabeled data • Create a new dataset with the anomaly scores as “labels” • Use these “labels” to train a predictive model!
  • 64. Minority Report • Anomaly detection works great on large unlabeled datasets, especially if you expect to find an (adversarial) minority class • Millions of credit card transactions, billions of network events … • Doesn’t require you to know what you’re looking for!
  • 65. Topic Modeling using LDA • Uncovers groups of related words (“topics”) in documents • Does not require an external corpus (e.g. training on Wikipedia) • No semantic parsing of text • Unsupervised
  • 66. Topic modeling on IMDB reviews • 52,000 reviews • 883 movies
  • 67. Top 3 Topics in Shrek Reviews (n=26)
  • 68. Figure labels: Topics • Topic distribution for this document • Borrowed/stolen from Prof. David Blei, with apologies …
  • 69. The (assumed) generative process • A distribution over word distributions, fixed for this corpus • A topic, which is a distribution over words • A distribution over topic distributions, fixed for this corpus • A distribution over topics, specific to each document • A word in a document (example: “children”) • Diagram labels: Topic 1, Topic 2, Topic 3; Word 1, Word 2, Word 3
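The generative story can be sketched by sampling one document. This is a simplified illustration: the topics are fixed by hand instead of being drawn from their own corpus-level distribution, and the topic contents, the `alpha` values and the function names are all made up for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw a probability vector from a Dirichlet via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, n_words, rng):
    """Sample a per-document topic mix, then a topic and a word for each position."""
    theta = sample_dirichlet(alpha, rng)  # distribution over topics for this document
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(topics)), weights=theta)[0]  # pick a topic
        vocab, probs = zip(*topics[z].items())
        words.append(rng.choices(vocab, weights=probs)[0])     # pick a word from it
    return theta, words

# Hypothetical topics: each one is a distribution over words.
topics = [
    {"children": 0.5, "family": 0.3, "funny": 0.2},
    {"war": 0.6, "bomb": 0.3, "spy": 0.1},
    {"love": 0.7, "wedding": 0.2, "christmas": 0.1},
]

rng = random.Random(0)
theta, words = generate_document(topics, alpha=[0.5, 0.5, 0.5], n_words=10, rng=rng)
print([round(t, 2) for t in theta])
print(words)
```

Fitting LDA runs this story in reverse: only the words are observed, and the topics and per-document mixes are inferred from them.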
  • 70. What we observe • A word in a document (example: “children”)
  • 73. n = 31 The Sum of All Fears
  • 74. n = 31 The Sum of All Fears
  • 75. n = 100 Love, Actually
  • 76. How do we get such “good” topics? • Imagine that each document can only belong to one topic • Does that make it easier or harder to find “good” clusters of words? • LDA allows documents to belong to multiple topics
  • 77. Recap: Unsupervised Learning Topics • Unsupervised learning uses unlabeled data • Clustering: Finding clumps in unlabeled data • Anomaly Detection: Finding “weird” instances in unlabeled data • Topic Modeling: Extracting meaningful topics from free text
  • 78. Final Thought • Supervised learning has many different algorithms to solve one problem (predicting the output) • Unsupervised learning has many different algorithms to solve many different problems • David Gerster, gerster@bigml.com