Industry Overview and Business Applicability
Why, What and How
Data Wrangling
Ashwini Kuntamukkala
Enterprise Architect @ Vizient, Inc
Twitter: @akuntamukkala
Goal: Better, Faster, Cheaper!
[Chart: Product A, Product B and Product C by year, 2013 to 2016, in million USD]
Insights
Better Marketing Campaign
* Typical Business End Game
My data are 100% accurate, but are they?
Vicious cycle
Bad Data → Incorrect Analysis → Invalid Insights → Wrong Decisions → Poor Outcomes
[Chart: Revenue (million) by year, 2013 to 2016]
Data Quality is an issue…
Data Quality Issue
• Gartner Report
– By 2017, 33% of the largest global companies will experience an information crisis due to their inability to adequately value, govern and trust their enterprise information.
Cartoon made using http://www.toondoo.com/
If you torture the data long enough, it will confess to anything – Darrell Huff
Noise to Signal?
[Diagram: DB, Machine, sensor]
Data has a habit of replicating itself
Data Wrangling is … transforming “raw” data so that it can be analyzed for insights
Data Wrangling: aka…
• Data Preprocessing
• Data Preparation
• Data Cleansing
• Data Scrubbing
• Data Munging
• Data Transformation
• Data Fold, Spindle, Mutilate…
[Figure labels: signal, noise]
Data Wrangling Steps
[Diagram: Obtain, Understand, Transform, Augment, Shape]
“An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem.” – John Tukey
• Iterative process
• Understand
• Explore
• Transform
• Augment
• Visualize
• Share
Let’s take a PDF Invoice…for example
Let’s take an image…
Python + Textract + Tesseract
Understand your data
“Looks like my V8 Chevy is running low on fuel. Didn’t I fill up just the day before?”

Owner  Vehicle  Type  Fuel Level  Engine  Last Fill
AK     Chevy    Gas   5%          V8      05/04/16

DALDFWSFOEWRBOSDCALAXORDJFKMCO
Or
DAL DFW SFO EWR BOS DCA LAX ORD JFK MCO
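The airport-code example can be sketched in a few lines of Python. This is a minimal illustration that assumes every code is exactly three letters wide; the helper name `split_fixed_width` is hypothetical and not from the deck.

```python
def split_fixed_width(text: str, width: int = 3) -> list[str]:
    """Split a run-together string into consecutive chunks of `width` characters."""
    if len(text) % width != 0:
        # A leftover fragment suggests the fixed-width assumption is wrong.
        raise ValueError("text length is not a multiple of the chunk width")
    return [text[i:i + width] for i in range(0, len(text), width)]


codes = split_fixed_width("DALDFWSFOEWRBOSDCALAXORDJFKMCO")
print(" ".join(codes))
```

The length check matters: if the field width is guessed wrong, failing loudly is safer than silently producing misaligned codes.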
Outliers
Age (Years): 75, 80, 65, 55, 67, 78, 88, 90, 45, 58, 69, 80, 110
110 ???
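One common way to flag a value such as the age of 110 is the 1.5 × IQR rule. The sketch below is an assumption about how one might check the slide's numbers, not the method used in the talk; the function name `iqr_outliers`, the multiplier `k=1.5` and the "inclusive" quartile convention are all choices made for illustration.

```python
import statistics

# Ages as listed on the slide.
ages = [75, 80, 65, 55, 67, 78, 88, 90, 45, 58, 69, 80, 110]


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the first or third quartile."""
    # "inclusive" treats the data as the whole population when interpolating quartiles.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


outliers = iqr_outliers(ages)
print(outliers)
```

With these settings only 110 is flagged. Other quartile conventions can give wider fences, so a flagged value is a prompt to check the record against domain knowledge rather than proof of an error.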
Missing Values
Missing with a bias
Missing @ Random
Missing completely
Missing due to inapplicability
Missing due to invalid data and ingestion
Types of data
• Qualitative
– Subjective
• Quantitative
– Discrete
– Continuous
• Categorical
Data Source Selection Criteria
• Credible
• Complete
• Verifiable
• Accurate
• Current
• Compliance
• Accessible
• Cost
• Legal
• Security
• Storage
• Provenance
Tidy Data: Not all tables are created equal

Before (values of the variable "year" appear as column headers):
School           2012  2013  2014
Good Samaritans  2321  4550  1293
Percy Grammar    1540  1400  2949

After (each row is an observation, each column is a variable):
School           Year  Student Count
Good Samaritans  2012  2321
Good Samaritans  2013  4550
Good Samaritans  2014  1293
Percy Grammar    2012  1540
Percy Grammar    2013  1400
Percy Grammar    2014  2949
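The reshaping of the school table from one column per year to one row per school and year can be sketched with the Python standard library. The helper `melt` below is a hypothetical stand-in written for illustration; libraries such as pandas (`DataFrame.melt`) offer the same operation.

```python
# The wide table from the slide: one column per year.
wide = [
    {"School": "Good Samaritans", "2012": 2321, "2013": 4550, "2014": 1293},
    {"School": "Percy Grammar", "2012": 1540, "2013": 1400, "2014": 2949},
]


def melt(rows, id_column, var_name, value_name):
    """Turn every non-id column into its own (id, variable, value) row."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            tidy.append({id_column: row[id_column], var_name: column, value_name: value})
    return tidy


tidy = melt(wide, "School", "Year", "Student Count")
for record in tidy:
    print(record)
```

Each output row now holds one observation (a school in a year), so adding another year means adding rows rather than changing the table's shape.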
Tidy Data: Not all tables are created equal

Year  Comedy-Q1  Thriller-Q1  Action-Q1  …
2014  2          1            0
2015  0          3            2

Find total comedy movies in all of 2014? -> Not easy in current form

Category  Quarter  Year  #Hits
Comedy    Q1       2014  2
Thriller  Q1       2014  1
Action    Q1       2014  0
Comedy    Q1       2015  0
Thriller  Q1       2015  3
Action    Q1       2015  2

Find % of hit comedy movies in 2015?
Very easy to add a new column
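When a column header packs two variables together, as in "Comedy-Q1", the header has to be split while reshaping. The following is a minimal sketch assuming the category and quarter are always separated by a single hyphen; only the Q1 columns visible on the slide are included, and the key "Hits" stands in for the slide's "#Hits".

```python
# Wide table from the slide; only the Q1 columns shown there are included.
wide = [
    {"Year": 2014, "Comedy-Q1": 2, "Thriller-Q1": 1, "Action-Q1": 0},
    {"Year": 2015, "Comedy-Q1": 0, "Thriller-Q1": 3, "Action-Q1": 2},
]

tidy = []
for row in wide:
    for header, hits in row.items():
        if header == "Year":
            continue
        # "Comedy-Q1" carries two variables: category and quarter.
        category, quarter = header.split("-")
        tidy.append({"Category": category, "Quarter": quarter, "Year": row["Year"], "Hits": hits})

# In tidy form the slide's question becomes a simple filter and sum.
comedy_2014 = sum(r["Hits"] for r in tidy if r["Category"] == "Comedy" and r["Year"] == 2014)
print(comedy_2014)
```

The same filter keeps working unchanged once rows for Q2 to Q4 are appended, which is the practical payoff of the tidy layout.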
Tidy Data: Not all tables are created equal

Very messy data: variables in both rows and columns
Category  Rating     Q1  Q2  Q3  …
Comedy    Excellent  1   0   1
Comedy    Good       2   0   2
Thriller  Excellent  0   1   1
Thriller  Good       1   0   3

Each row is a complete observation
Category  Quarter  Excellent  Good
Comedy    Q1       1          2
Comedy    Q2       0          0
Comedy    Q3       1          2
Thriller  Q1       0          1
Thriller  Q2       1          0
Thriller  Q3       1          3
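When variables sit in both rows (Rating) and columns (Q1, Q2, Q3), tidying takes two moves: unpivot the quarter columns, then spread the rating values into columns. The sketch below does both in one pass with a dictionary keyed by (category, quarter); it is an illustration using the slide's numbers, not the talk's own code.

```python
messy = [
    {"Category": "Comedy", "Rating": "Excellent", "Q1": 1, "Q2": 0, "Q3": 1},
    {"Category": "Comedy", "Rating": "Good", "Q1": 2, "Q2": 0, "Q3": 2},
    {"Category": "Thriller", "Rating": "Excellent", "Q1": 0, "Q2": 1, "Q3": 1},
    {"Category": "Thriller", "Rating": "Good", "Q1": 1, "Q2": 0, "Q3": 3},
]

# One entry per observation (category, quarter); ratings become its columns.
observations = {}
for row in messy:
    for quarter in ("Q1", "Q2", "Q3"):
        key = (row["Category"], quarter)
        observations.setdefault(key, {})[row["Rating"]] = row[quarter]

tidy = [
    {"Category": category, "Quarter": quarter, **ratings}
    for (category, quarter), ratings in observations.items()
]
for record in tidy:
    print(record)
```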
Tidy Data: Not all tables are created equal

Invoice  Bill To      Sales %  Total($)  SKU#  Item          Qty  Unit Price ($)
1        Jim Jones    8        8.03      A123  Hammer        1    3.55
1        Jim Jones    8        8.03      Q34   Screw Driver  2    2.05
2        Mike Z’Kale  8        97.20     W23   Hair Dryer    1    59.25
2        Mike Z’Kale  8        97.20     E452  Cologne       3    10.25

Normalize to avoid duplication

Invoice  Bill To      Sales %  Total($)
1        Jim Jones    8        8.03
2        Mike Z’Kale  8        97.20

Invoice  SKU#  Item          Qty  Unit Price ($)
1        A123  Hammer        1    3.55
1        Q34   Screw Driver  2    2.05
2        W23   Hair Dryer    1    59.25
2        E452  Cologne       3    10.25
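Splitting the flat invoice table into an invoice table and a line-item table can be sketched as follows. The shortened field names ("Total", "SKU", "Unit Price") and the consistency check are assumptions added for illustration; the values are the ones shown on the slide.

```python
flat = [
    {"Invoice": 1, "Bill To": "Jim Jones", "Sales %": 8, "Total": 8.03,
     "SKU": "A123", "Item": "Hammer", "Qty": 1, "Unit Price": 3.55},
    {"Invoice": 1, "Bill To": "Jim Jones", "Sales %": 8, "Total": 8.03,
     "SKU": "Q34", "Item": "Screw Driver", "Qty": 2, "Unit Price": 2.05},
    {"Invoice": 2, "Bill To": "Mike Z’Kale", "Sales %": 8, "Total": 97.20,
     "SKU": "W23", "Item": "Hair Dryer", "Qty": 1, "Unit Price": 59.25},
    {"Invoice": 2, "Bill To": "Mike Z’Kale", "Sales %": 8, "Total": 97.20,
     "SKU": "E452", "Item": "Cologne", "Qty": 3, "Unit Price": 10.25},
]

HEADER_FIELDS = ("Invoice", "Bill To", "Sales %", "Total")
LINE_FIELDS = ("Invoice", "SKU", "Item", "Qty", "Unit Price")

invoices = {}    # invoice number -> header row, stored once
line_items = []  # one row per purchased item, linked by invoice number
for row in flat:
    header = {field: row[field] for field in HEADER_FIELDS}
    existing = invoices.setdefault(row["Invoice"], header)
    if existing != header:
        # Duplicated header data that disagrees is itself a data quality issue.
        raise ValueError(f"conflicting header data for invoice {row['Invoice']}")
    line_items.append({field: row[field] for field in LINE_FIELDS})

invoice_table = list(invoices.values())
print(invoice_table)
print(line_items)
```

The invoice number is kept in both tables so that the two can be joined again when needed.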
Tidy Data: Not all tables are created equal

Multiple Tables Divided by Time
Category  Quarter  Year  #Hits
Comedy    Q1       2014  2
Thriller  Q1       2014  1
Action    Q1       2014  0
[Table shown three times on the slide]

Combine all tables, accommodating varying formats
Category  Quarter  Year  #Hits
Comedy    Q1       2014  2
Thriller  Q1       2014  1
Action    Q1       2014  0
Comedy    Q1       2014  2
Thriller  Q1       2014  1
Action    Q1       2014  0
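Combining per-period tables usually means mapping each file's headers onto one canonical set before appending rows. The sketch below uses two in-memory CSV extracts; the differing header spellings, the `CANONICAL` mapping and the function `read_table` are hypothetical, and the 2015 figures are borrowed from the earlier movie-hits table.

```python
import csv
import io

# Hypothetical per-period extracts; the header spellings differ on purpose.
q1_2014 = "Category,Quarter,Year,#Hits\nComedy,Q1,2014,2\nThriller,Q1,2014,1\nAction,Q1,2014,0\n"
q1_2015 = "category,quarter,year,hits\nComedy,Q1,2015,0\nThriller,Q1,2015,3\nAction,Q1,2015,2\n"

# Map every known header variant (lower-cased) to one canonical name.
CANONICAL = {
    "category": "Category",
    "quarter": "Quarter",
    "year": "Year",
    "hits": "#Hits",
    "#hits": "#Hits",
}


def read_table(text):
    """Parse one CSV extract and return rows with canonical headers and numeric types."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = {CANONICAL[name.strip().lower()]: value for name, value in raw.items()}
        row["Year"] = int(row["Year"])
        row["#Hits"] = int(row["#Hits"])
        rows.append(row)
    return rows


combined = read_table(q1_2014) + read_table(q1_2015)
for record in combined:
    print(record)
```

An unknown header raises a `KeyError` here, which is deliberate: a new format should be reviewed and added to the mapping rather than silently dropped.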
Schema-On-Design Vs Schema-On-Read
Spoilt for Choice!
Popular Open Source Options
http://schoolofdata.org/
http://okfnlabs.org/
Commercial Vendors
Hands-On Exercises
Hands on Data Wrangling
• Data Ingestion
– CSV
– PDF
– API/JSON
– HTML Web Scraping
• Data Exploration
– Visual inspection
– Graphing
• Data Shaping
– Tidying Data
• Data Cleansing
– Missing values
– Format
– Outliers
– Data Errors Per Domain
– Fat Fingered Data
• Data Augmenting
– Aggregate data sources
– Fuzzy/Exact match
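The "Fuzzy/Exact match" step of data augmenting can be illustrated with Python's standard `difflib` module. This is a minimal sketch: the function `match_name`, the 0.8 similarity cutoff and the misspelled lookups are assumptions chosen for the example.

```python
import difflib


def match_name(name, known_names, cutoff=0.8):
    """Try an exact match first, then fall back to the closest fuzzy match."""
    if name in known_names:
        return name, "exact"
    close = difflib.get_close_matches(name, known_names, n=1, cutoff=cutoff)
    if close:
        return close[0], "fuzzy"
    return None, "none"


customers = ["Jim Jones", "Mike Z'Kale"]
print(match_name("Jim Jones", customers))  # exact spelling
print(match_name("Jim Jnes", customers))   # fat-fingered spelling
print(match_name("Ann Lee", customers))    # no plausible match
```

Returning the match type alongside the result lets fuzzy matches be reviewed separately, since a high similarity score does not guarantee that two records refer to the same entity.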
R Basics
• Data Types
– Numeric
– Character
– Logical
– Categorical aka Factor
– Date
– List
– Matrix
– Data Frame
– Data Table
• Regular Expressions
• Libraries
– stringr
– dplyr
– tidyr
– readxl, xlsx
– lubridate
– gtools
– plyr
– rvest
• Control Statements
Trifacta Wrangler
OpenRefine (formerly Google Refine)
Why should you care?
• Better Outcomes
• Tooling Innovation
• Increased Productivity
• Ease of use
• Lessened skill gap
• Great skill to have, per Indeed.com
Thank you & See you @
Dallas, May 13-15, 2016
• Las Colinas Convention Center
500 West Las Colinas Boulevard, Irving, TX 75039
Thank you for your participation

