The Wild West of Data Wrangling
Sarah Guido
PyCon 2017
@sarah_guido
This talk:
•  A day in the life
•  Three examples of dealing with uncooperative data
•  Not ground truth!
Who am I?
•  Senior data scientist at Mashable
•  Mashable == internet culture media!
•  Data sciencing in Python
•  Twitter: @sarah_guido
Iris Dataset
Example 1: Predicting building sales
•  The problem: can we predict if a building will sell the following year?
•  The data: floors, location, square footage, price per sqft, etc.
•  The goal: provide valuable insight to platform users
Example 1: Predicting building sales
•  First thought: logistic regression using scikit-learn
•  Binary classification: sale/no sale
Problem…
Data: 95% no sale, 5% sale
Logistic regression: 95% accurate
DONE!
Problem: Class imbalance
Class imbalance
When the classes you are trying to predict are not equally represented
in the data, classification models can become biased toward the majority class.
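The 95% figure above can be reproduced without any model at all. The following is a minimal sketch, assuming a hypothetical set of 100 buildings with the 95/5 split from the slide; the helper names `accuracy` and `recall` are illustrative and not taken from the talk.

```python
# Hypothetical labels: 1 = sale, 0 = no sale, using the 95/5 split from the slide.
y_true = [0] * 95 + [1] * 5

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted as positive."""
    actual_pos = sum(t == positive for t in y_true)
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    return true_pos / actual_pos if actual_pos else 0.0


print(accuracy(y_true, y_pred))  # 0.95
print(recall(y_true, y_pred))    # 0.0
```

Accuracy looks excellent, yet not a single sale is identified, which is why accuracy alone is a misleading metric on imbalanced data.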
Solution: Gradient boosting
Gradient boosting
Produces a prediction model in the form of an ensemble of
weak prediction models, typically decision trees.
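To make the "ensemble of weak models" idea concrete, here is a toy sketch of gradient boosting with least-squares loss, where each weak model is a one-split decision stump fitted to the residuals of the ensemble so far. It is not the classifier used in the talk: the data, the function names (`fit_stump`, `fit_boosting`, `predict`) and the parameters (`n_rounds`, `learning_rate`) are all assumptions for illustration. In practice a library implementation such as scikit-learn's `GradientBoostingClassifier` would be the more likely choice.

```python
def fit_stump(xs, residuals):
    """Pick the threshold on a 1-D feature that best splits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosting(xs, ys, n_rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return {"base": base, "stumps": stumps, "learning_rate": learning_rate}


def predict(model, x):
    boost = sum(predict_stump(s, x) for s in model["stumps"])
    return model["base"] + model["learning_rate"] * boost


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Hypothetical data: one numeric feature, label 1 = sale (rare), 0 = no sale.
xs = list(range(1, 11))
ys = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]

model = fit_boosting(xs, ys)
baseline_mse = mse(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mse(ys, [predict(model, x) for x in xs])
print(baseline_mse, boosted_mse)
```

Each stump on its own is a poor predictor; the ensemble improves because every round corrects part of the error left by the previous rounds.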
Example 2: Clustering user interactions
The problem: how can we identify similar patterns based on click data?
The data: time, geolocation, cookie, browser user-agent string, referrer
The goal: understand how people interact with content over time
Why Scala?
Problem: Clustering user interactions
K-means clustering
An unsupervised learning method of grouping data together
based on a distance metric.
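As a rough illustration of the definition above, the sketch below implements the standard assign-then-update loop of k-means with Euclidean distance. The points and the explicit initial centroids are hypothetical, and passing the starting centroids by hand is a simplification; real implementations usually choose them randomly or with a smarter seeding scheme.

```python
import math


def kmeans(points, centroids, n_iters=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    centroids = list(centroids)
    k = len(centroids)
    for _ in range(n_iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean; an empty cluster keeps its old centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    labels = [min(range(k), key=lambda i: math.dist(p, centroids[i])) for p in points]
    return centroids, labels


# Hypothetical 2-D data with two obvious groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, labels = kmeans(points, centroids=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Because the algorithm only sees distances between fixed-length vectors, every item to be clustered has to be expressed with the same set of numeric features.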
Problem: Clustering the data
•  Only look at users with 5 or more interactions
•  Each user has a different number of interactions
•  Each of a user's data points can end up in a different cluster
Solution: Transform the data
date: 2017-04-09, 2017-04-13, 2017-04-30, 2017-05-01, 2017-05-12
Number of interactions: 5
Average time between interactions: ~8 days
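The transformation on this slide, turning a variable-length list of interaction dates into a fixed set of per-user features, can be sketched as follows. The function name `interaction_features` and the feature keys are assumptions for illustration; the dates are the ones from the slide.

```python
from datetime import date


def interaction_features(dates):
    """Summarize a user's interaction dates as a fixed-length feature set."""
    ordered = sorted(dates)
    # Days between each pair of consecutive interactions.
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    avg_gap = sum(gaps) / len(gaps) if gaps else 0.0
    return {"n_interactions": len(ordered), "avg_gap_days": avg_gap}


dates = [
    date(2017, 4, 9),
    date(2017, 4, 13),
    date(2017, 4, 30),
    date(2017, 5, 1),
    date(2017, 5, 12),
]
features = interaction_features(dates)
print(features)  # {'n_interactions': 5, 'avg_gap_days': 8.25}
```

The gaps are 4, 17, 1 and 11 days, which average to 8.25 and match the "~8 days" on the slide. Every user now yields the same two numbers regardless of how many interactions they had.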
Solution: Transform the data
referrer: facebook, twitter
One-hot encode and transform to matrix
•  Facebook: [1, 0]
•  Twitter: [0, 1]
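One-hot encoding as shown above can be written in a few lines. This is a minimal sketch with a hypothetical `one_hot` helper and a fixed category order; libraries such as pandas and scikit-learn offer their own encoders.

```python
def one_hot(values, categories):
    """Encode each value as a 0/1 row with a single 1 in its category's column."""
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return rows


categories = ["facebook", "twitter"]
matrix = one_hot(["facebook", "twitter", "facebook"], categories)
print(matrix)  # [[1, 0], [0, 1], [1, 0]]
```

The resulting rows are numeric, so they can sit in the same matrix as the count and time-gap features.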
Example 3: Understand audience composition
The problem: how can we effectively describe our audience?
The data: anonymized demographic and psychographic data
The goal: audience segmentation and channel analysis
Problem: insufficient data
•  Google Analytics data – only 1/3 of URLs
•  Finicky API
•  Semi-useless psychographic data
Solution: accept defeat
Solution: don't accept defeat, make it work!
Solution: make it work!
•  Theory of highly-performant links
•  Segmentation through archetypal analysis
•  Go get more data!
General strategy
•  What problem are you trying to solve?
•  What’s wrong with your data?
•  What do you need that you don’t have?
Keep in mind…
•  Data your company collects is complicated
•  What you do to your data will affect the model
•  Creativity is your friend
•  Lots of ways to solve the problem
Thank you!
@sarah_guido
