SlideShare ist ein Scribd-Unternehmen logo
1 von 32
Downloaden Sie, um offline zu lesen
Data Preparation for Data Science
Casey Stella
@casey_stella
2016
Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
Table of Contents
Preliminaries
Demo
Questions
Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
Introduction
Hi, I’m Casey Stella!
Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
Garbage In =⇒ Garbage Out
“80% of the work in any data project is in cleaning the data.”
— D.J. Patel in Data Jujitsu
Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
Data Cleansing =⇒ Data Understanding
There are two ways to understand your data
• Syntactic Understanding
• Semantic Understanding
If you hope to get anything out of your data, you have to have a handle on both.
Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
Syntactic Understanding: True Types
A true type is a label applied to data points xi such that xi are mutually comparable.
• Schemas type != true data type
• A specific column can have many different types
Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
Syntactic Understanding: Density
Data density is an indication of how data is clumped together.
Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
Syntactic Understanding: Density
Data density is an indication of how data is clumped together.
• For numerical data, distributions and statistical characteristics are informative
• For non-numeric data, counts and distinct counts of a canonical representation are
extremely useful.
Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
Syntactic Understanding: Density
Data density is an indication of how data is clumped together.
• For numerical data, distributions and statistical characteristics are informative
• For non-numeric data, counts and distinct counts of a canonical representation are
extremely useful.
Canonical representations are representations which give you an idea at a glance of the
data format
Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
Syntactic Understanding: Density
Data density is an indication of how data is clumped together.
• For numerical data, distributions and statistical characteristics are informative
• For non-numeric data, counts and distinct counts of a canonical representation are
extremely useful.
Canonical representations are representations which give you an idea at a glance of the
data format
• Replacing digits with the character ‘d’
• Stripping whitespace
• Normalizing punctuation
Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
Syntactic Understanding: Density
Data density is an indication of how data is clumped together.
• For numerical data, distributions and statistical characteristics are informative
• For non-numeric data, counts and distinct counts of a canonical representation are
extremely useful.
Canonical representations are representations which give you an idea at a glance of the
data format
• Replacing digits with the character ‘d’
• Stripping whitespace
• Normalizing punctuation
Data density is an assumption underlying any conclusions drawn from your data.
Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
Syntactic Understanding: Density over Time
∆Density
∆t is how data clumps change over time.
Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
Syntactic Understanding: Density over Time
∆Density
∆t is how data clumps change over time.
This kind of analysis can show
• Problems in the data pipeline
Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
Syntactic Understanding: Density over Time
∆Density
∆t is how data clumps change over time.
This kind of analysis can show
• Problems in the data pipeline
• Whether the assumptions of your analysis are violated
Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
Syntactic Understanding: Density over Time
∆Density
∆t is how data clumps change over time.
This kind of analysis can show
• Problems in the data pipeline
• Whether the assumptions of your analysis are violated
∆Density
∆t =⇒
• Automation
Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
Syntactic Understanding: Density over Time
∆Density
∆t is how data clumps change over time.
This kind of analysis can show
• Problems in the data pipeline
• Whether the assumptions of your analysis are violated
∆Density
∆t =⇒
• Automation
• Outlier Alerting
Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
Semantic Understanding: “Do what I mean, not what I say”
Semantic understanding is understanding based on how the data is used rather than
how it is stored.
Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
Semantic Understanding: “Do what I mean, not what I say”
Semantic understanding is understanding based on how the data is used rather than
how it is stored.
• Finding equivalences based on semantic understanding are often context sensitive.
Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
Semantic Understanding: “Do what I mean, not what I say”
Semantic understanding is understanding based on how the data is used rather than
how it is stored.
• Finding equivalences based on semantic understanding are often context sensitive.
• May come from humans (e.g. domain experience and ontologies)
Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
Semantic Understanding: “Do what I mean, not what I say”
Semantic understanding is understanding based on how the data is used rather than
how it is stored.
• Finding equivalences based on semantic understanding are often context sensitive.
• May come from humans (e.g. domain experience and ontologies)
• May come from machine learning (e.g. analyzing usage patterns to find synonyms)
Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
Semantic Understanding: “Do what I mean, not what I say”
Semantic understanding is understanding based on how the data is used rather than
how it is stored.
• Finding equivalences based on semantic understanding are often context sensitive.
• May come from humans (e.g. domain experience and ontologies)
• May come from machine learning (e.g. analyzing usage patterns to find synonyms)
Semantic understanding does not imply SkyNet
Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
DEMO
Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
Implications for Team Structure
To be successful,
Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
Implications for Team Structure
To be successful,
• Your data science teams have to be integrally involved in the data transformation
and understanding.
Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
Implications for Team Structure
To be successful,
• Your data science teams have to be integrally involved in the data transformation
and understanding.
• Your data science teams have to be willing to get their hands dirty
Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
Implications for Team Structure
To be successful,
• Your data science teams have to be integrally involved in the data transformation
and understanding.
• Your data science teams have to be willing to get their hands dirty
• Your data science teams have to be allowed to get their hands dirty
Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
Implications for Team Structure
To be successful,
• Your data science teams have to be integrally involved in the data transformation
and understanding.
• Your data science teams have to be willing to get their hands dirty
• Your data science teams have to be allowed to get their hands dirty
• Your data science teams need software engineering chops.
Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
Questions
Thanks for your attention! Questions?
• Code & scripts for this talk available on my github presentation page.1
• Find me at http://caseystella.com
• Twitter handle: @casey_stella
• Email address: cstella@hortonworks.com
1
http://github.com/cestella/presentations/
Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016

Weitere ähnliche Inhalte

Was ist angesagt?

Data Discovery & Lineage in Enterprise Hadoop
Data Discovery & Lineage in Enterprise HadoopData Discovery & Lineage in Enterprise Hadoop
Data Discovery & Lineage in Enterprise HadoopDataWorks Summit
 
Cortana Analytics Workshop: Azure Data Catalog
Cortana Analytics Workshop: Azure Data CatalogCortana Analytics Workshop: Azure Data Catalog
Cortana Analytics Workshop: Azure Data CatalogMSAdvAnalytics
 
How to Build a Smart Data Lake Using Semantics
How to Build a Smart Data Lake Using SemanticsHow to Build a Smart Data Lake Using Semantics
How to Build a Smart Data Lake Using SemanticsCambridge Semantics
 
Big Data Modeling and Analytic Patterns – Beyond Schema on Read
Big Data Modeling and Analytic Patterns – Beyond Schema on ReadBig Data Modeling and Analytic Patterns – Beyond Schema on Read
Big Data Modeling and Analytic Patterns – Beyond Schema on ReadThink Big, a Teradata Company
 
Data Discoverability at SpotHero
Data Discoverability at SpotHeroData Discoverability at SpotHero
Data Discoverability at SpotHeroMaggie Hays
 
Big data ecosystem
Big data ecosystemBig data ecosystem
Big data ecosystemmagda3695
 
Top Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practicesTop Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practicesSpringPeople
 
Big Data with Not Only SQL
Big Data with Not Only SQLBig Data with Not Only SQL
Big Data with Not Only SQLPhilippe Julio
 
Great Expectations Presentation
Great Expectations PresentationGreat Expectations Presentation
Great Expectations PresentationAdam Doyle
 
Big Data 2.0: YARN Enablement for Distributed ETL & SQL with Hadoop
Big Data 2.0: YARN Enablement for Distributed ETL & SQL with HadoopBig Data 2.0: YARN Enablement for Distributed ETL & SQL with Hadoop
Big Data 2.0: YARN Enablement for Distributed ETL & SQL with HadoopCaserta
 
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Databricks
 
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkCaserta
 
Introduction to Anzo Unstructured
Introduction to Anzo UnstructuredIntroduction to Anzo Unstructured
Introduction to Anzo UnstructuredCambridge Semantics
 
Accelerating Insight - Smart Data Lake Customer Success Stories
Accelerating Insight - Smart Data Lake Customer Success StoriesAccelerating Insight - Smart Data Lake Customer Success Stories
Accelerating Insight - Smart Data Lake Customer Success StoriesCambridge Semantics
 
You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?Precisely
 
Bloor Research & DataStax: How graph databases solve previously unsolvable bu...
Bloor Research & DataStax: How graph databases solve previously unsolvable bu...Bloor Research & DataStax: How graph databases solve previously unsolvable bu...
Bloor Research & DataStax: How graph databases solve previously unsolvable bu...DataStax
 
Bad Data is Polluting Big Data
Bad Data is Polluting Big DataBad Data is Polluting Big Data
Bad Data is Polluting Big DataStreamsets Inc.
 

Was ist angesagt? (20)

Data Discovery & Lineage in Enterprise Hadoop
Data Discovery & Lineage in Enterprise HadoopData Discovery & Lineage in Enterprise Hadoop
Data Discovery & Lineage in Enterprise Hadoop
 
Cortana Analytics Workshop: Azure Data Catalog
Cortana Analytics Workshop: Azure Data CatalogCortana Analytics Workshop: Azure Data Catalog
Cortana Analytics Workshop: Azure Data Catalog
 
How to Build a Smart Data Lake Using Semantics
How to Build a Smart Data Lake Using SemanticsHow to Build a Smart Data Lake Using Semantics
How to Build a Smart Data Lake Using Semantics
 
Big Data Modeling and Analytic Patterns – Beyond Schema on Read
Big Data Modeling and Analytic Patterns – Beyond Schema on ReadBig Data Modeling and Analytic Patterns – Beyond Schema on Read
Big Data Modeling and Analytic Patterns – Beyond Schema on Read
 
Data Discoverability at SpotHero
Data Discoverability at SpotHeroData Discoverability at SpotHero
Data Discoverability at SpotHero
 
Big data ecosystem
Big data ecosystemBig data ecosystem
Big data ecosystem
 
Top Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practicesTop Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practices
 
Exploring Big Data Analytics Tools
Exploring Big Data Analytics ToolsExploring Big Data Analytics Tools
Exploring Big Data Analytics Tools
 
Big Data with Not Only SQL
Big Data with Not Only SQLBig Data with Not Only SQL
Big Data with Not Only SQL
 
Great Expectations Presentation
Great Expectations PresentationGreat Expectations Presentation
Great Expectations Presentation
 
Taming Big Data With Modern Software Architecture
Taming Big Data  With Modern Software ArchitectureTaming Big Data  With Modern Software Architecture
Taming Big Data With Modern Software Architecture
 
Big Data 2.0: YARN Enablement for Distributed ETL & SQL with Hadoop
Big Data 2.0: YARN Enablement for Distributed ETL & SQL with HadoopBig Data 2.0: YARN Enablement for Distributed ETL & SQL with Hadoop
Big Data 2.0: YARN Enablement for Distributed ETL & SQL with Hadoop
 
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...
 
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache Spark
 
Introduction to Anzo Unstructured
Introduction to Anzo UnstructuredIntroduction to Anzo Unstructured
Introduction to Anzo Unstructured
 
Accelerating Insight - Smart Data Lake Customer Success Stories
Accelerating Insight - Smart Data Lake Customer Success StoriesAccelerating Insight - Smart Data Lake Customer Success Stories
Accelerating Insight - Smart Data Lake Customer Success Stories
 
BigData Analysis
BigData AnalysisBigData Analysis
BigData Analysis
 
You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?You Need a Data Catalog. Do You Know Why?
You Need a Data Catalog. Do You Know Why?
 
Bloor Research & DataStax: How graph databases solve previously unsolvable bu...
Bloor Research & DataStax: How graph databases solve previously unsolvable bu...Bloor Research & DataStax: How graph databases solve previously unsolvable bu...
Bloor Research & DataStax: How graph databases solve previously unsolvable bu...
 
Bad Data is Polluting Big Data
Bad Data is Polluting Big DataBad Data is Polluting Big Data
Bad Data is Polluting Big Data
 

Andere mochten auch

#MITXData 2014 - Leveraging Self-Service Business Intelligence to Drive Marke...
#MITXData 2014 - Leveraging Self-Service Business Intelligence to Drive Marke...#MITXData 2014 - Leveraging Self-Service Business Intelligence to Drive Marke...
#MITXData 2014 - Leveraging Self-Service Business Intelligence to Drive Marke...MITX
 
Data MIning: Data processing
Data MIning: Data processingData MIning: Data processing
Data MIning: Data processingDatamining Tools
 
The Stream is the Database - Revolutionizing Healthcare Data Architecture
The Stream is the Database - Revolutionizing Healthcare Data ArchitectureThe Stream is the Database - Revolutionizing Healthcare Data Architecture
The Stream is the Database - Revolutionizing Healthcare Data ArchitectureDataWorks Summit/Hadoop Summit
 
The Role of Analytics in Smart Cities
The Role of Analytics in Smart CitiesThe Role of Analytics in Smart Cities
The Role of Analytics in Smart CitiesManthan
 
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data AnalysisApache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data AnalysisDataWorks Summit/Hadoop Summit
 
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveFaster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveDataWorks Summit/Hadoop Summit
 
Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...
Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...
Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...DataWorks Summit/Hadoop Summit
 
Business Research Methods. data collection preparation and analysis
Business Research Methods. data collection preparation and analysisBusiness Research Methods. data collection preparation and analysis
Business Research Methods. data collection preparation and analysisAhsan Khan Eco (Superior College)
 
Configuring a Secure, Multitenant Cluster for the Enterprise
Configuring a Secure, Multitenant Cluster for the EnterpriseConfiguring a Secure, Multitenant Cluster for the Enterprise
Configuring a Secure, Multitenant Cluster for the EnterpriseCloudera, Inc.
 
Data Preparation and Processing
Data Preparation and ProcessingData Preparation and Processing
Data Preparation and ProcessingMehul Gondaliya
 

Andere mochten auch (17)

#MITXData 2014 - Leveraging Self-Service Business Intelligence to Drive Marke...
#MITXData 2014 - Leveraging Self-Service Business Intelligence to Drive Marke...#MITXData 2014 - Leveraging Self-Service Business Intelligence to Drive Marke...
#MITXData 2014 - Leveraging Self-Service Business Intelligence to Drive Marke...
 
Data MIning: Data processing
Data MIning: Data processingData MIning: Data processing
Data MIning: Data processing
 
The Stream is the Database - Revolutionizing Healthcare Data Architecture
The Stream is the Database - Revolutionizing Healthcare Data ArchitectureThe Stream is the Database - Revolutionizing Healthcare Data Architecture
The Stream is the Database - Revolutionizing Healthcare Data Architecture
 
LEGO: Data Driven Growth Hacking Powered by Big Data
LEGO: Data Driven Growth Hacking Powered by Big Data LEGO: Data Driven Growth Hacking Powered by Big Data
LEGO: Data Driven Growth Hacking Powered by Big Data
 
The Role of Analytics in Smart Cities
The Role of Analytics in Smart CitiesThe Role of Analytics in Smart Cities
The Role of Analytics in Smart Cities
 
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data AnalysisApache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
 
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveFaster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
 
Active Learning for Fraud Prevention
Active Learning for Fraud PreventionActive Learning for Fraud Prevention
Active Learning for Fraud Prevention
 
Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...
Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...
Building a Graph Database in Neo4j with Spark & Spark SQL to gain new insight...
 
H20: A platform for big math
H20: A platform for big math H20: A platform for big math
H20: A platform for big math
 
Business Research Methods. data collection preparation and analysis
Business Research Methods. data collection preparation and analysisBusiness Research Methods. data collection preparation and analysis
Business Research Methods. data collection preparation and analysis
 
Filling the Data Lake
Filling the Data LakeFilling the Data Lake
Filling the Data Lake
 
Evolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage SubsystemEvolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage Subsystem
 
How to build a successful Data Lake
How to build a successful Data LakeHow to build a successful Data Lake
How to build a successful Data Lake
 
Configuring a Secure, Multitenant Cluster for the Enterprise
Configuring a Secure, Multitenant Cluster for the EnterpriseConfiguring a Secure, Multitenant Cluster for the Enterprise
Configuring a Secure, Multitenant Cluster for the Enterprise
 
Managing a Multi-Tenant Data Lake
Managing a Multi-Tenant Data LakeManaging a Multi-Tenant Data Lake
Managing a Multi-Tenant Data Lake
 
Data Preparation and Processing
Data Preparation and ProcessingData Preparation and Processing
Data Preparation and Processing
 

Ähnlich wie Data Preparation of Data Science

MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron
MaaS (Model as a Service): Modern Streaming Data Science with Apache MetronMaaS (Model as a Service): Modern Streaming Data Science with Apache Metron
MaaS (Model as a Service): Modern Streaming Data Science with Apache MetronDataWorks Summit
 
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...DataWorks Summit
 
Spark Summit EU talk by Casey Stella
Spark Summit EU talk by Casey StellaSpark Summit EU talk by Casey Stella
Spark Summit EU talk by Casey StellaSpark Summit
 
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesBig Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesKrishna Sankar
 
Getting to Know Your Data with R
Getting to Know Your Data with RGetting to Know Your Data with R
Getting to Know Your Data with RStephen Withington
 
Data Lifecycle Risks Considerations and Controls
Data Lifecycle Risks Considerations and ControlsData Lifecycle Risks Considerations and Controls
Data Lifecycle Risks Considerations and ControlsCarlos Chalico
 
The Field Guide to Data Science
The Field Guide to Data ScienceThe Field Guide to Data Science
The Field Guide to Data ScienceEMC
 
Bridging the Gap Between Data Science & Engineer: Building High-Performance T...
Bridging the Gap Between Data Science & Engineer: Building High-Performance T...Bridging the Gap Between Data Science & Engineer: Building High-Performance T...
Bridging the Gap Between Data Science & Engineer: Building High-Performance T...ryanorban
 
NC3Rs Publication Bias workshop - Sansone - Better Data = Better Science
NC3Rs Publication Bias workshop - Sansone - Better Data = Better ScienceNC3Rs Publication Bias workshop - Sansone - Better Data = Better Science
NC3Rs Publication Bias workshop - Sansone - Better Data = Better ScienceSusanna-Assunta Sansone
 
Enrich data and rewrite queries with the Elasticsearch percolator
Enrich data and rewrite queries with the Elasticsearch percolatorEnrich data and rewrite queries with the Elasticsearch percolator
Enrich data and rewrite queries with the Elasticsearch percolatorLucian Precup
 
Broad Data (India 2015)
Broad Data (India 2015)Broad Data (India 2015)
Broad Data (India 2015)James Hendler
 
AI Beyond Deep Learning
AI Beyond Deep LearningAI Beyond Deep Learning
AI Beyond Deep LearningAndre Freitas
 
Intro to Data Science
Intro to Data ScienceIntro to Data Science
Intro to Data ScienceTJ Stalcup
 
Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.Paul Groth
 

Ähnlich wie Data Preparation of Data Science (20)

MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron
MaaS (Model as a Service): Modern Streaming Data Science with Apache MetronMaaS (Model as a Service): Modern Streaming Data Science with Apache Metron
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron
 
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...
 
Streaming Outlier Analysis for Fun and Scalability
Streaming Outlier Analysis for Fun and Scalability Streaming Outlier Analysis for Fun and Scalability
Streaming Outlier Analysis for Fun and Scalability
 
Spark Summit EU talk by Casey Stella
Spark Summit EU talk by Casey StellaSpark Summit EU talk by Casey Stella
Spark Summit EU talk by Casey Stella
 
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesBig Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
 
NLP Structured Data Investigation on Non-Text
NLP Structured Data Investigation on Non-TextNLP Structured Data Investigation on Non-Text
NLP Structured Data Investigation on Non-Text
 
NLP Structured Data Investigation on Non-Text
NLP Structured Data Investigation on Non-TextNLP Structured Data Investigation on Non-Text
NLP Structured Data Investigation on Non-Text
 
Getting to Know Your Data with R
Getting to Know Your Data with RGetting to Know Your Data with R
Getting to Know Your Data with R
 
The field-guide-to-data-science
The field-guide-to-data-scienceThe field-guide-to-data-science
The field-guide-to-data-science
 
Data Lifecycle Risks Considerations and Controls
Data Lifecycle Risks Considerations and ControlsData Lifecycle Risks Considerations and Controls
Data Lifecycle Risks Considerations and Controls
 
The Field Guide to Data Science
The Field Guide to Data ScienceThe Field Guide to Data Science
The Field Guide to Data Science
 
The Field Guide to Data Science
The Field Guide to Data ScienceThe Field Guide to Data Science
The Field Guide to Data Science
 
Bridging the Gap Between Data Science & Engineer: Building High-Performance T...
Bridging the Gap Between Data Science & Engineer: Building High-Performance T...Bridging the Gap Between Data Science & Engineer: Building High-Performance T...
Bridging the Gap Between Data Science & Engineer: Building High-Performance T...
 
NC3Rs Publication Bias workshop - Sansone - Better Data = Better Science
NC3Rs Publication Bias workshop - Sansone - Better Data = Better ScienceNC3Rs Publication Bias workshop - Sansone - Better Data = Better Science
NC3Rs Publication Bias workshop - Sansone - Better Data = Better Science
 
Big Data for Library Services (2017)
Big Data for Library Services (2017)Big Data for Library Services (2017)
Big Data for Library Services (2017)
 
Enrich data and rewrite queries with the Elasticsearch percolator
Enrich data and rewrite queries with the Elasticsearch percolatorEnrich data and rewrite queries with the Elasticsearch percolator
Enrich data and rewrite queries with the Elasticsearch percolator
 
Broad Data (India 2015)
Broad Data (India 2015)Broad Data (India 2015)
Broad Data (India 2015)
 
AI Beyond Deep Learning
AI Beyond Deep LearningAI Beyond Deep Learning
AI Beyond Deep Learning
 
Intro to Data Science
Intro to Data ScienceIntro to Data Science
Intro to Data Science
 
Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.
 

Mehr von DataWorks Summit/Hadoop Summit

Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerDataWorks Summit/Hadoop Summit
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformDataWorks Summit/Hadoop Summit
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDataWorks Summit/Hadoop Summit
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...DataWorks Summit/Hadoop Summit
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...DataWorks Summit/Hadoop Summit
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLDataWorks Summit/Hadoop Summit
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)DataWorks Summit/Hadoop Summit
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...DataWorks Summit/Hadoop Summit
 

Mehr von DataWorks Summit/Hadoop Summit (20)

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
 
Hadoop Crash Course
Hadoop Crash CourseHadoop Crash Course
Hadoop Crash Course
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 

Kürzlich hochgeladen

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdfChristopherTHyatt
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 

Kürzlich hochgeladen (20)

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 

Data Preparation of Data Science

  • 1. Data Preparation for Data Science Casey Stella @casey_stella 2016 Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
  • 2. Table of Contents Preliminaries Demo Questions Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
  • 3. Introduction Hi, I’m Casey Stella! Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
  • 4. Garbage In =⇒ Garbage Out “80% of the work in any data project is in cleaning the data.” — D.J. Patel in Data Jujitsu Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
  • 5. Data Cleansing =⇒ Data Understanding There are two ways to understand your data • Syntactic Understanding • Semantic Understanding If you hope to get anything out of your data, you have to have a handle on both. Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
  • 6. Syntactic Understanding: True Types A true type is a label applied to data points xi such that xi are mutually comparable. • Schemas type != true data type • A specific column can have many different types Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
  • 7. Syntactic Understanding: Density Data density is an indication of how data is clumped together. Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
  • 8. Syntactic Understanding: Density Data density is an indication of how data is clumped together. • For numerical data, distributions and statistical characteristics are informative • For non-numeric data, counts and distinct counts of a canonical representation are extremely useful. Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
  • 9. Syntactic Understanding: Density Data density is an indication of how data is clumped together. • For numerical data, distributions and statistical characteristics are informative • For non-numeric data, counts and distinct counts of a canonical representation are extremely useful. Canonical representations are representations which give you an idea at a glance of the data format Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
  • 10. Syntactic Understanding: Density Data density is an indication of how data is clumped together. • For numerical data, distributions and statistical characteristics are informative • For non-numeric data, counts and distinct counts of a canonical representation are extremely useful. Canonical representations are representations which give you an idea at a glance of the data format • Replacing digits with the character ‘d’ • Stripping whitespace • Normalizing punctuation Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
  • 11. Syntactic Understanding: Density Data density is an indication of how data is clumped together. • For numerical data, distributions and statistical characteristics are informative • For non-numeric data, counts and distinct counts of a canonical representation are extremely useful. Canonical representations are representations which give you an idea at a glance of the data format • Replacing digits with the character ‘d’ • Stripping whitespace • Normalizing punctuation Data density is an assumption underlying any conclusions drawn from your data. Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
  • 12. Syntactic Understanding: Density over Time ∆Density ∆t is how data clumps change over time. Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
  • 13. Syntactic Understanding: Density over Time ∆Density ∆t is how data clumps change over time. This kind of analysis can show • Problems in the data pipeline Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
  • 14. Syntactic Understanding: Density over Time ∆Density ∆t is how data clumps change over time. This kind of analysis can show • Problems in the data pipeline • Whether the assumptions of your analysis are violated Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
  • 15. Syntactic Understanding: Density over Time ∆Density ∆t is how data clumps change over time. This kind of analysis can show • Problems in the data pipeline • Whether the assumptions of your analysis are violated ∆Density ∆t =⇒ • Automation Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
  • 16. Syntactic Understanding: Density over Time ∆Density ∆t is how data clumps change over time. This kind of analysis can show • Problems in the data pipeline • Whether the assumptions of your analysis are violated ∆Density ∆t =⇒ • Automation • Outlier Alerting Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
  • 17. Semantic Understanding: “Do what I mean, not what I say” Semantic understanding is understanding based on how the data is used rather than how it is stored. Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
  • 18. Semantic Understanding: “Do what I mean, not what I say” Semantic understanding is understanding based on how the data is used rather than how it is stored. • Finding equivalences based on semantic understanding are often context sensitive. Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
  • 19. Semantic Understanding: “Do what I mean, not what I say” Semantic understanding is understanding based on how the data is used rather than how it is stored. • Finding equivalences based on semantic understanding are often context sensitive. • May come from humans (e.g. domain experience and ontologies) Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
  • 20. Semantic Understanding: “Do what I mean, not what I say” Semantic understanding is understanding based on how the data is used rather than how it is stored. • Finding equivalences based on semantic understanding are often context sensitive. • May come from humans (e.g. domain experience and ontologies) • May come from machine learning (e.g. analyzing usage patterns to find synonyms) Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
  • 21. Semantic Understanding: “Do what I mean, not what I say” Semantic understanding is understanding based on how the data is used rather than how it is stored. • Finding equivalences based on semantic understanding are often context sensitive. • May come from humans (e.g. domain experience and ontologies) • May come from machine learning (e.g. analyzing usage patterns to find synonyms) Semantic understanding does not imply SkyNet Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
  • 22. DEMO Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
  • 23.
  • 24.
  • 25.
  • 26.
  • 27. Implications for Team Structure To be successful, Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
  • 28. Implications for Team Structure To be successful, • Your data science teams have to be integrally involved in the data transformation and understanding. Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
  • 29. Implications for Team Structure To be successful, • Your data science teams have to be integrally involved in the data transformation and understanding. • Your data science teams have to be willing to get their hands dirty Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
  • 30. Implications for Team Structure To be successful, • Your data science teams have to be integrally involved in the data transformation and understanding. • Your data science teams have to be willing to get their hands dirty • Your data science teams have to be allowed to get their hands dirty Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
  • 31. Implications for Team Structure To be successful, • Your data science teams have to be integrally involved in the data transformation and understanding. • Your data science teams have to be willing to get their hands dirty • Your data science teams have to be allowed to get their hands dirty • Your data science teams need software engineering chops. Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016
  • 32. Questions Thanks for your attention! Questions? • Code & scripts for this talk available on my github presentation page.1 • Find me at http://caseystella.com • Twitter handle: @casey_stella • Email address: cstella@hortonworks.com 1 http://github.com/cestella/presentations/ Casey Stella@casey_stella (Hortonworks) Data Preparation for Data Science 2016