SlideShare ist ein Scribd-Unternehmen logo
1 von 30
Downloaden Sie, um offline zu lesen
©2018 dataiku, Inc.
Applied Data Science Online Course
3rd Class: Getting dirty; data
preparation and feature creation
©2018 dataiku, Inc.
● September 20th at 12PM ET: Learning the Basics, concepts, & your first ML model
● September 27th at 12PM ET: The data science workflow, building a predictive model flow
● October 4th at 12PM ET: Getting dirty; data preparation and feature creation
● October 11th at 12PM ET: Understanding your model - and communicating about it
Curriculum
Go from Small to Big Data in 4 weeks
©2018 dataiku, Inc.
Most Advanced version of the workflow
The Data Science Workflow
Determine
business
objectives
Assess situation
Determine data
science goals
Produce project
plan
Collect data Describe data Explore data
Assess qualitySelectCleanBuildReport Reformat
Select modeling
techniques
Generate test
design
Build model Assess model Evaluate results Review process
Determine next
steps
Check common
risks
Plan
deployment
Plan monitoring
& performance
Produce final
report
Review project
©2018 dataiku, Inc.
Data preparation
70% of the work
Select Clean Build ReportReformat
©2018 dataiku, Inc.
Different types of data
©2018 dataiku, Inc.
Different types of Data
Definitions
Structured Unstructured
Data stored with clearly defined data types whose
pattern makes them easily searchable and linkable
- most often tabular format
Data is not structured via pre-defined data
models or schema.
Examples:
• Database with columns for name, phone
number etc
Examples:
• Text files
• Websites and social media
©2018 dataiku, Inc.
Different types of Data
Definitions
©2018 dataiku, Inc.
Different types of Storage
Definitions
Relational Databases Unstructured data
storage
Data is stored in different tables that are
connected with unique identifiers
Systems that can store and process both
structured and unstructured data
Examples:
• Structured Query Language DB
Examples:
• Hadoop
• NOSQL
©2018 dataiku, Inc.
Different type of structured data
Data can be one of several
categories
Examples:
• Gender
• Nationality
• Hair color
Data is a number
Examples:
• Age
• Weight
• Salary
Data is free-form text
Examples:
• Tweets
• Documents
• Business name
+ semi-structured data: json
©2018 dataiku, Inc.
Different type of structured data
Examples
Eye Color (e.g. Blue)
Height (e.g. 170 cm)
Country of Birth (e.g. France)
Postal Code (e.g. 75001)
Date (e.g. Wednesday, 15 Jan 1976)
Address (e.g. 10 Rue Saint Martin, Paris)
Curriculum Vitae
©2018 dataiku, Inc.
Missing values
©2018 dataiku, Inc.
What to do with missing values?
©2018 dataiku, Inc.
Drop values
Delete all data from any participant with missing values
Few Warnings:
• Be sure your sample is large enough, then you likely can
drop data without substantial loss of statistical power.
• Be sure data is not missing at Random: There is a
pattern in the missing data that affect your primary
dependent variables.
For example, lower-income participants are less likely to
respond income column.
©2018 dataiku, Inc.
Imputation
Replacing missing values with substitute values.
14
Method #1: Common Value
For Number:
• Average
• Median
• Constant Value
For Category:
• Treat like the category
« Empty »
• Most frequent value
• A constant value
Method #2: Educated Guess
Infer a missing value:
• If Age is lower than 20, Income
is likely to be 0
• If living in a house in a rich city:
income is likely to be higher
than average
• Nb. of child is likely to not be 0
if age is high and situation not
married
Method #3: Sub-Model
• Create a specific model of
machine learning to
predict the missing
(Regression,
Classification)
©2018 dataiku, Inc.
Example
How to handle the missing data in this doc?
Not OK
• Drop Rows
• Average
Better :
• Educated guess from
SPCategory
• Sub-prediction Model
©2018 dataiku, Inc.
Grouping & Joining Data
©2018 dataiku, Inc.
Join operation
Similar to a VLOOKUP / INDEX MATCH
©2018 dataiku, Inc.
Group by operations
Definition
©2018 dataiku, Inc.
Aggregate data from an entity
How can you aggregate this dataset?
©2018 dataiku, Inc.
Aggregate data from an entity
How can you aggregate this dataset?
Group By Options:
For Number:
• Average
• Sum
• Minimum and Maximum
• Standard Deviation
For Category and Number:
• Count of Value
• Count of Distinct Value
• First and Last Value
• Most Frequent
©2018 dataiku, Inc.
Result
©2018 dataiku, Inc.
Dummification & Rescaling
©2018 dataiku, Inc.
● Some data is in a textual or numerical format but should be understood as a category by your model
● This also allows you to use non numerical data in a linear model
> Create a dummy variable that corresponds to these categories: 0-1 for linear model
● Your numerical data can be distributed in a way that will be misunderstood by your model
> Change the values to rearrange them on a scale
Dummification & Rescaling: the issue
The specificity of your data can create bias in your model
15
0
15
0
18
0
18
0
Y= a1
x1
+ a2
x2
+ a3
x3
+ a4
x4
+
a5
x5
Father’s Height
Your
Height
©2018 dataiku, Inc.
Dummification for linear models
For categorical variables
Then your formula will look like this: Y= abatteries
x1
+ aSoap
x2
+ aCandy
x3…
And X1,
X2,
X3…
are either 0 or 1
©2018 dataiku, Inc.
Rescaling for numerical variables
Without rescaling your formula will look like this:
Yrosalindrosenbaum
= aincome
*28114 + aChild
*1…
YBrettStamm
= aincome
*47901 + aChild
*3…
©2018 dataiku, Inc.
Feature rescaling for linear models
Feature scaling is a method used to standardize the range of independent variables
Example of Rescaling – Min-Max Rescaling
©2018 dataiku, Inc.
Qu s o s?
©2018 dataiku, Inc.
Hands-on
©2018 dataiku, Inc.
©2018 dataiku, Inc.
About Dataiku - Your Path to Enterprise AI

Weitere ähnliche Inhalte

Was ist angesagt?

Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data EngineeringDurga Gadiraju
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceNiko Vuokko
 
Needle in the Haystack—User Behavior Anomaly Detection for Information Securi...
Needle in the Haystack—User Behavior Anomaly Detection for Information Securi...Needle in the Haystack—User Behavior Anomaly Detection for Information Securi...
Needle in the Haystack—User Behavior Anomaly Detection for Information Securi...Databricks
 
Top Data Mining Techniques and Their Applications
Top Data Mining Techniques and Their ApplicationsTop Data Mining Techniques and Their Applications
Top Data Mining Techniques and Their ApplicationsPromptCloud
 
Big data and data science overview
Big data and data science overviewBig data and data science overview
Big data and data science overviewColleen Farrelly
 
Data Lake or Data Warehouse? Data Cleaning or Data Wrangling? How to Ensure t...
Data Lake or Data Warehouse? Data Cleaning or Data Wrangling? How to Ensure t...Data Lake or Data Warehouse? Data Cleaning or Data Wrangling? How to Ensure t...
Data Lake or Data Warehouse? Data Cleaning or Data Wrangling? How to Ensure t...Anastasija Nikiforova
 
Introduction To Data Science
Introduction To Data ScienceIntroduction To Data Science
Introduction To Data ScienceSpotle.ai
 
Exploratory data analysis with Python
Exploratory data analysis with PythonExploratory data analysis with Python
Exploratory data analysis with PythonDavis David
 
Data science applications and usecases
Data science applications and usecasesData science applications and usecases
Data science applications and usecasesSreenatha Reddy K R
 
Data quality metrics infographic
Data quality metrics infographicData quality metrics infographic
Data quality metrics infographicIntellspot
 
Data quality architecture
Data quality architectureData quality architecture
Data quality architectureanicewick
 
Azure Machine Learning
Azure Machine LearningAzure Machine Learning
Azure Machine LearningMostafa
 
Business Analytics-1.pdf
Business Analytics-1.pdfBusiness Analytics-1.pdf
Business Analytics-1.pdfNaramsettiVamsi
 
Introduction to Big Data Analytics and Data Science
Introduction to Big Data Analytics and Data ScienceIntroduction to Big Data Analytics and Data Science
Introduction to Big Data Analytics and Data ScienceData Science Thailand
 

Was ist angesagt? (20)

Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Needle in the Haystack—User Behavior Anomaly Detection for Information Securi...
Needle in the Haystack—User Behavior Anomaly Detection for Information Securi...Needle in the Haystack—User Behavior Anomaly Detection for Information Securi...
Needle in the Haystack—User Behavior Anomaly Detection for Information Securi...
 
Top Data Mining Techniques and Their Applications
Top Data Mining Techniques and Their ApplicationsTop Data Mining Techniques and Their Applications
Top Data Mining Techniques and Their Applications
 
Big data and data science overview
Big data and data science overviewBig data and data science overview
Big data and data science overview
 
Data Lake or Data Warehouse? Data Cleaning or Data Wrangling? How to Ensure t...
Data Lake or Data Warehouse? Data Cleaning or Data Wrangling? How to Ensure t...Data Lake or Data Warehouse? Data Cleaning or Data Wrangling? How to Ensure t...
Data Lake or Data Warehouse? Data Cleaning or Data Wrangling? How to Ensure t...
 
Data science
Data scienceData science
Data science
 
Introduction To Data Science
Introduction To Data ScienceIntroduction To Data Science
Introduction To Data Science
 
Exploratory data analysis with Python
Exploratory data analysis with PythonExploratory data analysis with Python
Exploratory data analysis with Python
 
Data science applications and usecases
Data science applications and usecasesData science applications and usecases
Data science applications and usecases
 
Data quality metrics infographic
Data quality metrics infographicData quality metrics infographic
Data quality metrics infographic
 
Data quality architecture
Data quality architectureData quality architecture
Data quality architecture
 
Azure Machine Learning
Azure Machine LearningAzure Machine Learning
Azure Machine Learning
 
Business Analytics-1.pdf
Business Analytics-1.pdfBusiness Analytics-1.pdf
Business Analytics-1.pdf
 
Data engineering
Data engineeringData engineering
Data engineering
 
data science
data sciencedata science
data science
 
BIG DATA and USE CASES
BIG DATA and USE CASESBIG DATA and USE CASES
BIG DATA and USE CASES
 
Data Preprocessing
Data PreprocessingData Preprocessing
Data Preprocessing
 
Introduction to Big Data Analytics and Data Science
Introduction to Big Data Analytics and Data ScienceIntroduction to Big Data Analytics and Data Science
Introduction to Big Data Analytics and Data Science
 
Predictive analytics
Predictive analytics Predictive analytics
Predictive analytics
 

Ähnlich wie Applied Data Science Part 3: Getting dirty; data preparation and feature creation

Webinar - The Science of Segmentation: What Questions You Should be Asking Yo...
Webinar - The Science of Segmentation: What Questions You Should be Asking Yo...Webinar - The Science of Segmentation: What Questions You Should be Asking Yo...
Webinar - The Science of Segmentation: What Questions You Should be Asking Yo...VMware Tanzu
 
Data as a Profit Driver – Emerging Techniques to Monetize Data as a Strategic...
Data as a Profit Driver – Emerging Techniques to Monetize Data as a Strategic...Data as a Profit Driver – Emerging Techniques to Monetize Data as a Strategic...
Data as a Profit Driver – Emerging Techniques to Monetize Data as a Strategic...DATAVERSITY
 
Washington DC DataOps Meetup -- Nov 2019
Washington DC DataOps Meetup   -- Nov 2019Washington DC DataOps Meetup   -- Nov 2019
Washington DC DataOps Meetup -- Nov 2019DataKitchen
 
Big Data Developer Career Path: Job & Interview Preparation
Big Data Developer Career Path: Job & Interview PreparationBig Data Developer Career Path: Job & Interview Preparation
Big Data Developer Career Path: Job & Interview PreparationIntellipaat
 
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016Caserta
 
Nonprofits + Data: Pathway to Innovation
Nonprofits + Data: Pathway to InnovationNonprofits + Data: Pathway to Innovation
Nonprofits + Data: Pathway to InnovationTim Sarrantonio
 
Big Data Testing Strategies
Big Data Testing StrategiesBig Data Testing Strategies
Big Data Testing StrategiesKnoldus Inc.
 
Leap to Next Generation Data Management with Denodo 7.0
Leap to Next Generation Data Management with Denodo 7.0Leap to Next Generation Data Management with Denodo 7.0
Leap to Next Generation Data Management with Denodo 7.0Denodo
 
Data Architecture Strategies: Building an Enterprise Data Strategy – Where to...
Data Architecture Strategies: Building an Enterprise Data Strategy – Where to...Data Architecture Strategies: Building an Enterprise Data Strategy – Where to...
Data Architecture Strategies: Building an Enterprise Data Strategy – Where to...DATAVERSITY
 
Remote but Still Resilient: Zynga and XactShare their Stories on People Analy...
Remote but Still Resilient: Zynga and XactShare their Stories on People Analy...Remote but Still Resilient: Zynga and XactShare their Stories on People Analy...
Remote but Still Resilient: Zynga and XactShare their Stories on People Analy...Workday, Inc.
 
Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics Caserta
 
WHAT IS BUSINESS ANALYTICS um hj mnjh nit 1 ppt only kjjn
WHAT IS BUSINESS ANALYTICS um hj mnjh nit 1 ppt only kjjnWHAT IS BUSINESS ANALYTICS um hj mnjh nit 1 ppt only kjjn
WHAT IS BUSINESS ANALYTICS um hj mnjh nit 1 ppt only kjjnRohitKumar639388
 
Data Analytics Business Intelligence
Data Analytics Business IntelligenceData Analytics Business Intelligence
Data Analytics Business IntelligenceRavikanth-BA
 
Data Mining Services in various types
Data Mining Services in various typesData Mining Services in various types
Data Mining Services in various typesloginworks software
 

Ähnlich wie Applied Data Science Part 3: Getting dirty; data preparation and feature creation (20)

Big data
Big dataBig data
Big data
 
Webinar - The Science of Segmentation: What Questions You Should be Asking Yo...
Webinar - The Science of Segmentation: What Questions You Should be Asking Yo...Webinar - The Science of Segmentation: What Questions You Should be Asking Yo...
Webinar - The Science of Segmentation: What Questions You Should be Asking Yo...
 
Data as a Profit Driver – Emerging Techniques to Monetize Data as a Strategic...
Data as a Profit Driver – Emerging Techniques to Monetize Data as a Strategic...Data as a Profit Driver – Emerging Techniques to Monetize Data as a Strategic...
Data as a Profit Driver – Emerging Techniques to Monetize Data as a Strategic...
 
Washington DC DataOps Meetup -- Nov 2019
Washington DC DataOps Meetup   -- Nov 2019Washington DC DataOps Meetup   -- Nov 2019
Washington DC DataOps Meetup -- Nov 2019
 
Big Data Developer Career Path: Job & Interview Preparation
Big Data Developer Career Path: Job & Interview PreparationBig Data Developer Career Path: Job & Interview Preparation
Big Data Developer Career Path: Job & Interview Preparation
 
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
 
Nonprofits + Data: Pathway to Innovation
Nonprofits + Data: Pathway to InnovationNonprofits + Data: Pathway to Innovation
Nonprofits + Data: Pathway to Innovation
 
Data Visualization: Sales forecasting
Data Visualization: Sales forecastingData Visualization: Sales forecasting
Data Visualization: Sales forecasting
 
Big Data Testing Strategies
Big Data Testing StrategiesBig Data Testing Strategies
Big Data Testing Strategies
 
Leap to Next Generation Data Management with Denodo 7.0
Leap to Next Generation Data Management with Denodo 7.0Leap to Next Generation Data Management with Denodo 7.0
Leap to Next Generation Data Management with Denodo 7.0
 
Data Architecture Strategies: Building an Enterprise Data Strategy – Where to...
Data Architecture Strategies: Building an Enterprise Data Strategy – Where to...Data Architecture Strategies: Building an Enterprise Data Strategy – Where to...
Data Architecture Strategies: Building an Enterprise Data Strategy – Where to...
 
23.pdf
23.pdf23.pdf
23.pdf
 
Big data careers
Big data careersBig data careers
Big data careers
 
Remote but Still Resilient: Zynga and XactShare their Stories on People Analy...
Remote but Still Resilient: Zynga and XactShare their Stories on People Analy...Remote but Still Resilient: Zynga and XactShare their Stories on People Analy...
Remote but Still Resilient: Zynga and XactShare their Stories on People Analy...
 
Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics
 
WHAT IS BUSINESS ANALYTICS um hj mnjh nit 1 ppt only kjjn
WHAT IS BUSINESS ANALYTICS um hj mnjh nit 1 ppt only kjjnWHAT IS BUSINESS ANALYTICS um hj mnjh nit 1 ppt only kjjn
WHAT IS BUSINESS ANALYTICS um hj mnjh nit 1 ppt only kjjn
 
Big_Data.pptx
Big_Data.pptxBig_Data.pptx
Big_Data.pptx
 
Data Analytics Business Intelligence
Data Analytics Business IntelligenceData Analytics Business Intelligence
Data Analytics Business Intelligence
 
Data Mining Services in various types
Data Mining Services in various typesData Mining Services in various types
Data Mining Services in various types
 
Big Data and Business Insight
Big Data and Business InsightBig Data and Business Insight
Big Data and Business Insight
 

Mehr von Dataiku

Applied Data Science Course Part 2: the data science workflow and basic model...
Applied Data Science Course Part 2: the data science workflow and basic model...Applied Data Science Course Part 2: the data science workflow and basic model...
Applied Data Science Course Part 2: the data science workflow and basic model...Dataiku
 
The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016 The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016 Dataiku
 
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...
Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...Dataiku
 
How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku) How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku) Dataiku
 
The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products Dataiku
 
The US Healthcare Industry
The US Healthcare IndustryThe US Healthcare Industry
The US Healthcare IndustryDataiku
 
How to Build Successful Data Team - Dataiku ?
How to Build Successful Data Team -  Dataiku ? How to Build Successful Data Team -  Dataiku ?
How to Build Successful Data Team - Dataiku ? Dataiku
 
Before Kaggle : from a business goal to a Machine Learning problem
Before Kaggle : from a business goal to a Machine Learning problem Before Kaggle : from a business goal to a Machine Learning problem
Before Kaggle : from a business goal to a Machine Learning problem Dataiku
 
04Juin2015_Symposium_Présentation_Coyote_Dataiku
04Juin2015_Symposium_Présentation_Coyote_Dataiku 04Juin2015_Symposium_Présentation_Coyote_Dataiku
04Juin2015_Symposium_Présentation_Coyote_Dataiku Dataiku
 
Dataiku productive application to production - pap is may 2015
Dataiku    productive application to production - pap is may 2015 Dataiku    productive application to production - pap is may 2015
Dataiku productive application to production - pap is may 2015 Dataiku
 
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015Dataiku
 
Dataiku - Big data paris 2015 - A Hybrid Platform, a Hybrid Team
Dataiku -  Big data paris 2015 - A Hybrid Platform, a Hybrid Team Dataiku -  Big data paris 2015 - A Hybrid Platform, a Hybrid Team
Dataiku - Big data paris 2015 - A Hybrid Platform, a Hybrid Team Dataiku
 
The paradox of big data - dataiku / oxalide APEROTECH
The paradox of big data - dataiku / oxalide APEROTECHThe paradox of big data - dataiku / oxalide APEROTECH
The paradox of big data - dataiku / oxalide APEROTECHDataiku
 
OWF 2014 - Take back control of your Web tracking - Dataiku
OWF 2014 - Take back control of your Web tracking - DataikuOWF 2014 - Take back control of your Web tracking - Dataiku
OWF 2014 - Take back control of your Web tracking - DataikuDataiku
 
Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex ChallengeDataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex ChallengeDataiku
 
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...Dataiku
 
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...Dataiku
 
Dataiku big data paris - the rise of the hadoop ecosystem
Dataiku   big data paris - the rise of the hadoop ecosystemDataiku   big data paris - the rise of the hadoop ecosystem
Dataiku big data paris - the rise of the hadoop ecosystemDataiku
 
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku
 
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages JaunesBreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages JaunesDataiku
 

Mehr von Dataiku (20)

Applied Data Science Course Part 2: the data science workflow and basic model...
Applied Data Science Course Part 2: the data science workflow and basic model...Applied Data Science Course Part 2: the data science workflow and basic model...
Applied Data Science Course Part 2: the data science workflow and basic model...
 
The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016 The Rise of the DataOps - Dataiku - J On the Beach 2016
The Rise of the DataOps - Dataiku - J On the Beach 2016
 
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...
Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...Dataiku  -  data driven nyc  - april  2016 - the  solitude of the data team m...
Dataiku - data driven nyc - april 2016 - the solitude of the data team m...
 
How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku) How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
How to Build a Successful Data Team - Florian Douetteau (@Dataiku)
 
The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products The 3 Key Barriers Keeping Companies from Deploying Data Products
The 3 Key Barriers Keeping Companies from Deploying Data Products
 
The US Healthcare Industry
The US Healthcare IndustryThe US Healthcare Industry
The US Healthcare Industry
 
How to Build Successful Data Team - Dataiku ?
How to Build Successful Data Team -  Dataiku ? How to Build Successful Data Team -  Dataiku ?
How to Build Successful Data Team - Dataiku ?
 
Before Kaggle : from a business goal to a Machine Learning problem
Before Kaggle : from a business goal to a Machine Learning problem Before Kaggle : from a business goal to a Machine Learning problem
Before Kaggle : from a business goal to a Machine Learning problem
 
04Juin2015_Symposium_Présentation_Coyote_Dataiku
04Juin2015_Symposium_Présentation_Coyote_Dataiku 04Juin2015_Symposium_Présentation_Coyote_Dataiku
04Juin2015_Symposium_Présentation_Coyote_Dataiku
 
Dataiku productive application to production - pap is may 2015
Dataiku    productive application to production - pap is may 2015 Dataiku    productive application to production - pap is may 2015
Dataiku productive application to production - pap is may 2015
 
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015
Coyote & Dataiku - Séminaire Dixit GFII du 13 04-2015
 
Dataiku - Big data paris 2015 - A Hybrid Platform, a Hybrid Team
Dataiku -  Big data paris 2015 - A Hybrid Platform, a Hybrid Team Dataiku -  Big data paris 2015 - A Hybrid Platform, a Hybrid Team
Dataiku - Big data paris 2015 - A Hybrid Platform, a Hybrid Team
 
The paradox of big data - dataiku / oxalide APEROTECH
The paradox of big data - dataiku / oxalide APEROTECHThe paradox of big data - dataiku / oxalide APEROTECH
The paradox of big data - dataiku / oxalide APEROTECH
 
OWF 2014 - Take back control of your Web tracking - Dataiku
OWF 2014 - Take back control of your Web tracking - DataikuOWF 2014 - Take back control of your Web tracking - Dataiku
OWF 2014 - Take back control of your Web tracking - Dataiku
 
Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex ChallengeDataiku at SF DataMining Meetup - Kaggle Yandex Challenge
Dataiku at SF DataMining Meetup - Kaggle Yandex Challenge
 
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
Lambda Architecture - Storm, Trident, SummingBird ... - Architecture and Over...
 
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...Dataiku   hadoop summit - semi-supervised learning with hadoop for understand...
Dataiku hadoop summit - semi-supervised learning with hadoop for understand...
 
Dataiku big data paris - the rise of the hadoop ecosystem
Dataiku   big data paris - the rise of the hadoop ecosystemDataiku   big data paris - the rise of the hadoop ecosystem
Dataiku big data paris - the rise of the hadoop ecosystem
 
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
 
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages JaunesBreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes
 

Kürzlich hochgeladen

怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制vexqp
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...Health
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格q6pzkpark
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Klinik kandungan
 
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxThe-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxVivek487417
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubaikojalkojal131
 
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制vexqp
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样wsppdmt
 
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样wsppdmt
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxParas Gupta
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...nirzagarg
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...nirzagarg
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowgargpaaro
 
Data Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdfData Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdftheeltifs
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNKTimothy Spann
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制vexqp
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxchadhar227
 
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制vexqp
 

Kürzlich hochgeladen (20)

Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit RiyadhCytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
Cytotec in Jeddah+966572737505) get unwanted pregnancy kit Riyadh
 
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
怎样办理伦敦大学城市学院毕业证(CITY毕业证书)成绩单学校原版复制
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
一比一原版(曼大毕业证书)曼尼托巴大学毕业证成绩单留信学历认证一手价格
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptxThe-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
The-boAt-Story-Navigating-the-Waves-of-Innovation.pptx
 
Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
怎样办理纽约州立大学宾汉姆顿分校毕业证(SUNY-Bin毕业证书)成绩单学校原版复制
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
一比一原版(UCD毕业证书)加州大学戴维斯分校毕业证成绩单原件一模一样
 
Harnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptxHarnessing the Power of GenAI for BI and Reporting.pptx
Harnessing the Power of GenAI for BI and Reporting.pptx
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Data Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdfData Analyst Tasks to do the internship.pdf
Data Analyst Tasks to do the internship.pdf
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
怎样办理伦敦大学毕业证(UoL毕业证书)成绩单学校原版复制
 

Applied Data Science Part 3: Getting dirty; data preparation and feature creation

  • 1. ©2018 dataiku, Inc. Applied Data Science Online Course 3rd Class: Getting dirty; data preparation and feature creation
  • 2. ©2018 dataiku, Inc. ● September 20th at 12PM ET: Learning the Basics, concepts, & your first ML model ● September 27th at 12PM ET: The data science workflow, building a predictive model flow ● October 4th at 12PM ET: Getting dirty; data preparation and feature creation ● October 11th at 12PM ET: Understanding your model - and communicating about it Curriculum Go from Small to Big Data in 4 weeks
  • 3. ©2018 dataiku, Inc. Most Advanced version of the workflow The Data Science Workflow Determine business objectives Assess situation Determine data science goals Produce project plan Collect data Describe data Explore data Assess qualitySelectCleanBuildReport Reformat Select modeling techniques Generate test design Build model Assess model Evaluate results Review process Determine next steps Check common risks Plan deployment Plan monitoring & performance Produce final report Review project
  • 4. ©2018 dataiku, Inc. Data preparation 70% of the work Select Clean Build ReportReformat
  • 6. ©2018 dataiku, Inc. Different types of Data Definitions Structured Unstructured Data stored with clearly defined data types whose pattern makes them easily searchable and linkable - most often tabular format Data is not structured via pre-defined data models or schema. Examples: • Database with columns for name, phone number etc Examples: • Text files • Websites and social media
  • 7. ©2018 dataiku, Inc. Different types of Data Definitions
  • 8. ©2018 dataiku, Inc. Different types of Storage Definitions Relational Databases Unstructured data storage Data is stored in different tables that are connected with unique identifiers Systems that can store and process both structured and unstructured data Examples: • Structured Query Language DB Examples: • Hadoop • NOSQL
  • 9. ©2018 dataiku, Inc. Different type of structured data Data can be one of several categories Examples: • Gender • Nationality • Hair color Data is a number Examples: • Age • Weight • Salary Data is free-form text Examples: • Tweets • Documents • Business name + semi-structured data: json
  • 10. ©2018 dataiku, Inc. Different type of structured data Examples Eye Color (e.g. Blue) Height (e.g. 170 cm) Country of Birth (e.g. France) Postal Code (e.g. 75001) Date (e.g. Wednesday, 15 Jan 1976) Address (e.g. 10 Rue Saint Martin, Paris) Curriculum Vitae
  • 12. ©2018 dataiku, Inc. What to do with missing values?
  • 13. ©2018 dataiku, Inc. Drop values Delete all data from any participant with missing values Few Warnings: • Be sure your sample is large enough, then you likely can drop data without substantial loss of statistical power. • Be sure data is not missing at Random: There is a pattern in the missing data that affect your primary dependent variables. For example, lower-income participants are less likely to respond income column.
  • 14. ©2018 dataiku, Inc. Imputation Replacing missing values with substitute values. 14 Method #1: Common Value For Number: • Average • Median • Constant Value For Category: • Treat like the category « Empty » • Most frequent value • A constant value Method #2: Educated Guess Infer a missing value: • If Age is lower than 20, Income is likely to be 0 • If living in a house in a rich city: income is likely to be higher than average • Nb. of child is likely to not be 0 if age is high and situation not married Method #3: Sub-Model • Create a specific model of machine learning to predict the missing (Regression, Classification)
  • 15. ©2018 dataiku, Inc. Example How to handle the missing data in this doc? Not OK • Drop Rows • Average Better : • Educated guess from SPCategory • Sub-prediction Model
  • 17. ©2018 dataiku, Inc. Join operation Similar to a VLOOKUP / INDEX MATCH
  • 18. ©2018 dataiku, Inc. Group by operations Definition
  • 19. ©2018 dataiku, Inc. Aggregate data from an entity How can you aggregate this dataset?
  • 20. ©2018 dataiku, Inc. Aggregate data from an entity How can you aggregate this dataset? Group By Options: For Number: • Average • Sum • Minimum and Maximum • Standard Deviation For Category and Number: • Count of Value • Count of Distinct Value • First and Last Value • Most Frequent
  • 23. ©2018 dataiku, Inc. ● Some data is in a textual or numerical format but should be understood as a category by your model ● This also allows you to use non numerical data in a linear model > Create a dummy variable that corresponds to these categories: 0-1 for linear model ● Your numerical data can be distributed in a way that will be misunderstood by your model > Change the values to rearrange them on a scale Dummification & Rescaling: the issue The specificity of your data can create bias in your model 15 0 15 0 18 0 18 0 Y= a1 x1 + a2 x2 + a3 x3 + a4 x4 + a5 x5 Father’s Height Your Height
  • 24. ©2018 dataiku, Inc. Dummification for linear models For categorical variables Then your formula will look like this: Y= abatteries x1 + aSoap x2 + aCandy x3… And X1, X2, X3… are either 0 or 1
  • 25. ©2018 dataiku, Inc. Rescaling for numerical variables Without rescaling your formula will look like this: Yrosalindrosenbaum = aincome *28114 + aChild *1… YBrettStamm = aincome *47901 + aChild *3…
  • 26. ©2018 dataiku, Inc. Feature rescaling for linear models Feature scaling is a method used to standardize the range of independent variables Example of Rescaling – Min-Max Rescaling
  • 30. ©2018 dataiku, Inc. About Dataiku - Your Path to Enterprise AI