Dirty Data? Clean it up!
Or, how to do data science in the real world.
Dan Lynn
CEO, AgilData
@danklynn
dan@agildata.com
Patrick Russell
Director, Data Science, Craftsy
@patrickrm101
patrick@craftsy.com
© Phil Mislinksi - www.pmimage.com
Patrick Russell - Bass
Director, Data Science, Craftsy
Dan Lynn - Guitar
CEO, AgilData
www.craftsy.com
Learn It. Make it.
Explore expert-led video classes and shop
the best yarn, fabric and supplies for
quilting, sewing, knitting, cake
decorating & more.
EXPERT SOLUTIONS AND SERVICES FOR
COMPLEX DATA PROBLEMS
At AgilData, we help you get the most out of your data. We provide Software and Services to help firms deliver on
the promise of Big Data and complex data infrastructures:
● AgilData Scalable Cluster for MySQL – Massively scalable and performant MySQL databases combined
with 24×7 remote managed services for DBA/DevOps
● Trusted Big Data experts to solve problems, set strategy and develop solutions for BI, data
pipeline orchestration, ETL, APIs and custom applications.
www.agildata.com
Hey, you’re a data scientist, right? Great!
We have millions of users. How can we use email
to monetize our user base better?
— Marketing
1 / (1 + exp(-x))
https://www.etsy.com/shop/NausicaaDistribution
Source: https://www.oreilly.com/ideas/2015-data-science-salary-survey
http://www.lavante.com/the-hub/ap-industry/lavante-and-spend-matters-look-at-how-dirty-vendor-data-impacts-your-bottom-line/
Data Cleansing
● Dates & Times
● Numbers & Strings
● Addresses
● Clickstream Data
● Handling missing data
● Tidy Data
Dates & Times
● Timestamps can mean different things
○ ingested_date, event_timestamp
● Clocks can’t be trusted
○ Server time: which server? Is it synchronized?
○ Client time? Is there a synchronizing time scheme?
● Timezones
○ What tz is your own data in?
○ Your email provider? Your adwords account? Your Google Analytics?
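
A minimal sketch of one way to handle the timezone point: attach the known source offset to a naive timestamp, convert it to UTC, and only then compare it with other timestamps. The helper name `to_utc`, the offset, and the sample times are assumptions for illustration. A fixed offset ignores daylight saving, so a real pipeline would more likely use named IANA zones.

```python
from datetime import datetime, timedelta, timezone


def to_utc(naive_local: datetime, utc_offset_hours: float) -> datetime:
    """Attach a known UTC offset to a naive timestamp and convert it to UTC."""
    source_tz = timezone(timedelta(hours=utc_offset_hours))
    return naive_local.replace(tzinfo=source_tz).astimezone(timezone.utc)


# event_timestamp recorded by a system assumed to log at UTC-6
event_timestamp = to_utc(datetime(2016, 5, 1, 9, 30), utc_offset_hours=-6)

# ingested_date recorded by a warehouse assumed to log in UTC
ingested_date = datetime(2016, 5, 1, 15, 45, tzinfo=timezone.utc)

# Only meaningful because both values are timezone-aware
ingestion_lag = ingested_date - event_timestamp
print(event_timestamp.isoformat(), ingestion_lag)
```

Keeping both `event_timestamp` and `ingested_date` as separate aware values makes it possible to measure ingestion lag instead of silently mixing the two meanings.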
Numbers & Strings
● Use the right types for your numbers (int, bigint, float, numeric
etc)
● Murphy’s Law of text inputs: If a user can put something in a text
field, anything and everything will happen.
● Watch out for floating point precision mistakes
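
As a small illustration of the floating point warning, the sketch below compares binary floats with Python's `decimal.Decimal`. The amounts are made up. Whether `Decimal`, integer cents, or a database `numeric` column is the right fix depends on the data.

```python
import math
from decimal import Decimal

# Binary floats cannot represent 0.1 or 0.2 exactly, so the sum drifts.
float_total = 0.1 + 0.2
print(float_total == 0.3)              # False
print(math.isclose(float_total, 0.3))  # True: compare floats with a tolerance

# Decimal built from strings keeps exact base-10 values (useful for money).
exact_total = Decimal("0.10") + Decimal("0.20")
print(exact_total == Decimal("0.30"))  # True
```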
Addresses
● Parsing / validation is not something you want to do yourself
○ USPS has validation and zip lookup for US addresses: https://www.usps.com/business/web-tools-apis/documentation-updates.htm
● Remember zip codes are strings. And the rest of the world does not
use U.S. zips.
● IP geolocation: Get lat/long, state, city, postal & ISP, from visitor
IPs
○ https://www.maxmind.com/en/geoip2-city
○ This is ALWAYS approximate
● If working with GIS, recommend http://postgis.net/
○ Vanilla postgres also has earthdistance for great circle distance
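
For readers without PostGIS or earthdistance at hand, great circle distance can be sketched with the haversine formula. This is an approximation on a spherical Earth, not what either extension does internally. The coordinates are rough, illustrative values for Denver and Boulder. The last lines show why zip codes should stay strings.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; the sphere itself is an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate city-center coordinates (illustrative values)
denver_to_boulder = haversine_km(39.7392, -104.9903, 40.0150, -105.2705)
print(round(denver_to_boulder, 1), "km")

# Zip codes are identifiers, not numbers: casting drops the leading zero.
zip_code = "02134"
print(str(int(zip_code)))  # "2134", no longer a valid zip
```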
Clickstream Data
● User agent => Device: Don’t do this yourself (we use WURFL and Google
Analytics)
● Query strings follow the rules of text. Everything will show up
○ They might be truncated
○ URL encoding might be missing characters (%2 instead of %20)
○ Use a library to parse params (e.g. Python ships with urllib.parse.parse_qs; in Python 2 it was urlparse.parse_qs)
● If your system creates sessions (tomcat, Google Analytics), don’t be
afraid to create your own sessions on top of the pageview data
○ You’ll capture cross-channel and cross-device behavior this way
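
The two ideas above, parsing query strings with a library and building your own sessions, can be sketched as follows. The 30-minute inactivity cutoff, the function names, and the sample pageviews are assumptions for illustration, not the rules Tomcat or Google Analytics apply.

```python
from datetime import datetime, timedelta
from urllib.parse import parse_qs, urlsplit

SESSION_GAP = timedelta(minutes=30)  # assumed inactivity cutoff


def query_params(url):
    """Parse a URL's query string; keep_blank_values retains empty parameters."""
    return parse_qs(urlsplit(url).query, keep_blank_values=True)


def assign_sessions(pageviews):
    """Number each user's sessions from (user_id, timestamp) pageview pairs.

    A new session starts on a user's first pageview or after a gap
    longer than SESSION_GAP since that user's previous pageview.
    """
    last_seen, session_no, labelled = {}, {}, []
    for user, ts in sorted(pageviews):
        previous = last_seen.get(user)
        if previous is None or ts - previous > SESSION_GAP:
            session_no[user] = session_no.get(user, 0) + 1
        last_seen[user] = ts
        labelled.append((user, ts, session_no[user]))
    return labelled


params = query_params("https://example.com/p?utm_source=email&q=red%20yarn&empty=")
views = [
    ("a", datetime(2016, 5, 1, 10, 0)),
    ("b", datetime(2016, 5, 1, 10, 5)),
    ("a", datetime(2016, 5, 1, 10, 10)),
    ("a", datetime(2016, 5, 1, 11, 0)),
]
sessions = assign_sessions(views)
```

Because sessions are keyed on your own user identifier rather than a cookie, the same logic can stitch together pageviews from several devices or channels.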
Missing / empty data
● Easy to overlook but important
● What does missing data in the context of your analysis mean?
○ Not collected (why not?)
○ Error state
○ N/A or undefined
○ Especially for histograms, missing data lead to very poor conclusions.
● Does your data use sentinel values? (e.g. -9999 or “null”)
○ df['nps_score'] = df['nps_score'].replace(-9999, np.nan)
● Imputation
● Storage
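
To make the sentinel problem concrete, here is a dependency-free sketch showing how one -9999 distorts a mean, and how mapping sentinels to `None` and reporting the missing rate avoids that. The score values and the set of sentinels are made up for illustration.

```python
from statistics import mean

SENTINELS = {-9999, "null", ""}  # assumed sentinel values for this dataset


def mask_sentinels(values):
    """Replace sentinel values with None so they cannot leak into statistics."""
    return [None if v in SENTINELS else v for v in values]


raw_scores = [9, 10, -9999, 7, "null", 8]

# Naive approach: keep every numeric value, sentinel included.
naive_mean = mean(v for v in raw_scores if isinstance(v, int))

cleaned = mask_sentinels(raw_scores)
observed = [v for v in cleaned if v is not None]
clean_mean = mean(observed)
missing_rate = 1 - len(observed) / len(cleaned)

print(naive_mean, clean_mean, round(missing_rate, 2))
```

Reporting `missing_rate` next to the statistic keeps the "why is it missing?" question visible instead of hiding it.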
Tidy Data
● Conceptual framework for structuring data for analysis and fitting
○ Each variable forms a column
○ Each observation is a row
○ Each type of observational unit forms a table
● Pretty much normal form from relational databases for stats
● Tidy can be different depending on the question asked
● R (dplyr, tidyr) and Python (pandas) have functions for making your
long data wide & wide data long (stack, unstack, melt, pivot)
● Paper: http://vita.had.co.nz/papers/tidy-data.pdf
● Python tutorial: http://tomaugspurger.github.io/modern-5-tidy.html
Tidy Data
● For example, marketplace transaction data might have 1 row per
transaction
● To analyze participants, you might want 1 row per participant
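
A minimal sketch of that reshaping in plain Python: the wide buyer/seller columns are melted into one row per (transaction, participant), then aggregated to one row per participant. The column names and sample data are hypothetical. In pandas the first step would typically be a melt.

```python
from collections import defaultdict

# One row per transaction (hypothetical marketplace data)
transactions = [
    {"txn_id": 1, "buyer": "ann", "seller": "bob", "amount": 20.0},
    {"txn_id": 2, "buyer": "bob", "seller": "cat", "amount": 5.0},
    {"txn_id": 3, "buyer": "ann", "seller": "cat", "amount": 12.5},
]


def to_long(rows):
    """Melt the buyer/seller columns into one row per (transaction, participant)."""
    return [
        {"txn_id": row["txn_id"], "participant": row[role],
         "role": role, "amount": row["amount"]}
        for row in rows
        for role in ("buyer", "seller")
    ]


def per_participant(long_rows):
    """Aggregate the long table to one row per participant."""
    summary = defaultdict(lambda: {"bought": 0.0, "sold": 0.0, "n_txns": 0})
    for row in long_rows:
        column = "bought" if row["role"] == "buyer" else "sold"
        summary[row["participant"]][column] += row["amount"]
        summary[row["participant"]]["n_txns"] += 1
    return dict(summary)


long_rows = to_long(transactions)
participants = per_participant(long_rows)
```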
Hey, that’s a great model. How can we build it
into our decision-making process?
— Marketing
Operationalizing Data Science
● Doing an analysis once rarely delivers lasting value.
● The business needs continuous insight, so you need to get this stuff
into production.
○ Hosting
○ ETL
○ Pipelines
Hosting
● Delivering continuous analyses requires operational infrastructure
○ Database(s)
○ Visualization tools (e.g. Chartio, Arcadia Data, Tableau, Looker, Qlik, etc.)
○ REST services / microservices
● These all have uptime requirements. You need to involve your (dev)ops
team earlier rather than later.
● Microservices / REST endpoints have architectural implications
● Visualization tools
○ Local (e.g. Jupyter, Zeppelin)
○ On-premise (Arcadia Data, Tableau, Qlik)
○ Hosted (Chartio)
● Visualization tools often require a SQL interface, thus….
ETL - Extract, Transform, Load
● Often used to herd data into some kind of data warehouse (e.g. RDBMS
+ star schema, Hadoop w/ unstructured data, etc.)
● Not just for data warehousing
● Not just for modeling
● No general solution
● Tooling
○ Apache Spark, Apache Sqoop
○ Commercial Tools: Informatica, Vertica, SQL Server, DataVirtuality etc…
● And then there is Apache Kafka…and the “NoETL” movement
○ Book: “I Heart Logs” by Jay Kreps
○ Replay history from the beginning of time as needed
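
As a toy illustration of the three ETL stages, the sketch below extracts rows from CSV text, cleans them, and loads them into an in-memory SQLite table. The file contents, table name, and cleaning rules are assumptions. A production job would use one of the tools listed above rather than hand-rolled functions.

```python
import csv
import io
import sqlite3
from datetime import datetime

# Hypothetical raw export: one impossible date, one missing zip
RAW_CSV = """user_id,signup_date,zip
1,2016-01-05,02134
2,2016-02-30,80202
3,2016-03-01,
"""


def extract(text):
    """Extract: read CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: validate dates, keep zips as text, map blanks to None."""
    cleaned = []
    for row in rows:
        try:
            signup = datetime.strptime(row["signup_date"], "%Y-%m-%d").date().isoformat()
        except ValueError:
            signup = None  # impossible or malformed date
        cleaned.append((int(row["user_id"]), signup, row["zip"] or None))
    return cleaned


def load(rows, conn):
    """Load: write rows idempotently so the job can be re-run safely."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users "
        "(user_id INTEGER PRIMARY KEY, signup_date TEXT, zip TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO users VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT user_id, signup_date, zip FROM users ORDER BY user_id"
).fetchall()
```

Using `INSERT OR REPLACE` keyed on `user_id` means replaying the same input does not duplicate rows.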
ETL - Extract, Transform, Load - Example
● Not just for production runs
○ For example, Patrick does a lot of time-to-event analysis on email opens,
transactions, visits.
■ Survival functions, etc...
○ Set up ETL that builds tables with the right shape to feed straight into models
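
A sketch of what a "right shape" table for time-to-event analysis might look like: one row per email with a duration and an event flag, where unopened emails are censored at the end of the observation window. The function name, field layout, and sample data are hypothetical, not Craftsy's actual schema.

```python
from datetime import datetime


def time_to_event_rows(sends, first_opens, observed_until):
    """Build (email_id, duration_hours, event_observed) rows.

    sends:        {email_id: sent_at}
    first_opens:  {email_id: first_open_at}, absent if never opened
    Emails not opened by observed_until are right-censored (event_observed = 0).
    """
    rows = []
    for email_id, sent_at in sorted(sends.items()):
        opened_at = first_opens.get(email_id)
        if opened_at is not None and opened_at <= observed_until:
            end, observed = opened_at, 1
        else:
            end, observed = observed_until, 0
        rows.append((email_id, (end - sent_at).total_seconds() / 3600, observed))
    return rows


sends = {1: datetime(2016, 5, 1, 8, 0), 2: datetime(2016, 5, 1, 8, 0)}
first_opens = {1: datetime(2016, 5, 1, 10, 30)}
rows = time_to_event_rows(sends, first_opens, observed_until=datetime(2016, 5, 2, 8, 0))
print(rows)
```

A duration column plus an event flag is the input most survival-analysis routines expect, so a table like this can go into a model without further reshaping.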
Pipelines
● From data to model output
● Define dependencies and a DAG for the work
○ Steps defined by assigning input as output of prior steps
○ Luigi (http://luigi.readthedocs.io/en/stable/index.html)
○ Drake (https://github.com/Factual/drake)
○ scikit-learn has its own Pipeline
■ That can be part of your bigger pipeline
● Scheduling can be trickier than you think
○ Resource contention
○ Loose dependencies
○ Cron is fine but Jenkins works really well for this!
● Don’t be afraid to create and tear down full environments as steps
○ For example, spin up and configure an EMR cluster, do stuff, tear it down*
* make your VP of Infrastructure less miserable
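
The core idea behind these tools, steps whose inputs are the outputs of earlier steps and which run in dependency order, can be sketched with the standard library's `graphlib` (Python 3.9+). This is an illustration of the DAG concept only, not how Luigi or Drake are implemented. The step names and functions are made up.

```python
from graphlib import TopologicalSorter


def run_pipeline(steps):
    """Run steps in dependency order.

    steps maps a step name to (dependency_names, function). Each function
    receives the outputs of its dependencies as keyword arguments.
    """
    graph = {name: deps for name, (deps, _) in steps.items()}
    outputs, order = {}, []
    # static_order() raises graphlib.CycleError if the graph is not a DAG
    for name in TopologicalSorter(graph).static_order():
        deps, func = steps[name]
        outputs[name] = func(**{dep: outputs[dep] for dep in deps})
        order.append(name)
    return order, outputs


steps = {
    "summary": ({"clean"}, lambda clean: sum(clean) / len(clean)),
    "clean": ({"extract"}, lambda extract: [v for v in extract if v is not None]),
    "extract": (set(), lambda: [4, None, 6]),
}

order, outputs = run_pipeline(steps)
print(order, outputs["summary"])
```

Declaring steps out of order, as in the `steps` dict above, still works because the execution order comes from the dependencies rather than from the order of definition.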
Pipelines - Luigi
● Written in Python. Steps implemented by subclassing Task
● Visualize your DAG
● Supports data in relational DBs, Redshift, HDFS, S3, file system
● Flexible and extensible
● Can parallelize jobs
● A workflow runs by executing the last step, which schedules all of its dependencies
Pipelines - Drake
● JVM (written in Clojure)
● Like a Makefile but for data work
● Supports commands in Shell, Python, Ruby, Clojure
Pipelines - More Tools
● Oozie
○ The default job orchestration engine for Hadoop. Can chain together multiple jobs
to form a complete DAG.
○ Open source
● Kettle
○ Old-school, but still relevant.
○ Visual pipeline designer. Execution engine
○ Open source
● Informatica
○ Visual pipeline designer, mature toolset
○ Commercial
● Datavirtuality
○ Treats all your stores (including Google Analytics) like schemas in a single db
○ Great for microservice architectures
○ Commercial
© Patrick Coppinger
Thanks!
dan@agildata.com — patrick@craftsy.com
@danklynn — @patrickrm101
Shameless Plug:
Tonight at Galvanize, join us at the Denver/Boulder Big Data Meetup
to learn about distributed system design! (ask Dan for details)
References
● I Heart Logs
○ http://www.amazon.com/Heart-Logs-Stream-Processing-Integration/dp/1491909382
● Tidy Data
○ http://vita.had.co.nz/papers/tidy-data.pdf
Additional Tools
● Scientific python stack (ipython, numpy, scipy, pandas, matplotlib…)
● Hadleyverse for R (dplyr, ggplot, tidyr, lubridate…)
● csvkit: command line tools (csvcut, csvgrep, csvjoin...) for CSV data
● jq: fast command line tool for working with JSON (e.g. pipe cURL to jq)
● psql (if you use postgresql or Redshift)

Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 

Dirty data? Clean it up! - Datapalooza Denver 2016

  • 1. Dirty Data? Clean it up! Or, how to do data science in the real world. Dan Lynn CEO, AgilData @danklynn dan@agildata.com Patrick Russell Director, Data Science, Craftsy @patrickrm101 patrick@craftsy.com
  • 2. © Phil Mislinksi - www.pmimage.com Patrick Russell - Bass Director, Data Science, Craftsy Dan Lynn - Guitar CEO, AgilData
  • 3. © Phil Mislinksi - www.pmimage.com www.craftsy.com Learn It. Make it. Explore expert-led video classes and shop the best yarn, fabric and supplies for quilting, sewing, knitting, cake decorating & more.
  • 4. © Phil Mislinksi - www.pmimage.com
    EXPERT SOLUTIONS AND SERVICES FOR COMPLEX DATA PROBLEMS
    At AgilData, we help you get the most out of your data. We provide Software and Services to help firms deliver on the promise of Big Data and complex data infrastructures:
    ● AgilData Scalable Cluster for MySQL – Massively scalable and performant MySQL databases combined with 24×7 remote managed services for DBA/DevOps
    ● Trusted Big Data experts to solve problems, set strategy and develop solutions for BI, data pipeline orchestration, ETL, APIs and custom applications.
    www.agildata.com
  • 5. Hey, you’re a data scientist, right? Great! We have millions of users. How can we use email to monetize our user base better? — Marketing
  • 6. 1 / (1 + exp(-x))
  • 15. Data Cleansing
    ● Dates & Times
    ● Numbers & Strings
    ● Addresses
    ● Clickstream Data
    ● Handling missing data
    ● Tidy Data
  • 16. Dates & Times
    ● Timestamps can mean different things
      ○ ingested_date, event_timestamp
    ● Clocks can’t be trusted
      ○ Server time: which server? Is it synchronized?
      ○ Client time? Is there a synchronizing time scheme?
    ● Timezones
      ○ What tz is your own data in?
      ○ Your email provider? Your adwords account? Your Google Analytics?
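One common way to handle the timezone questions on this slide is to convert every timestamp to UTC when it is ingested. The sketch below uses only the Python standard library. The helper name `to_utc` and the assumption that the email provider exports naive timestamps at UTC-6 are hypothetical and chosen purely for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption for illustration: the email provider exports naive timestamps
# in a fixed UTC-6 offset. Check what your own provider actually does.
EMAIL_PROVIDER_TZ = timezone(timedelta(hours=-6))


def to_utc(raw, assumed_tz=None):
    """Parse an ISO 8601 timestamp and return a timezone-aware UTC datetime."""
    parsed = datetime.fromisoformat(raw)
    if parsed.tzinfo is None:
        # A naive timestamp is ambiguous; refuse to guess silently.
        if assumed_tz is None:
            raise ValueError(f"naive timestamp with no known source timezone: {raw!r}")
        parsed = parsed.replace(tzinfo=assumed_tz)
    return parsed.astimezone(timezone.utc)


# Two sources describing the same instant in different conventions.
event_ts = to_utc("2016-06-01T15:30:00+00:00")
email_ts = to_utc("2016-06-01T09:30:00", assumed_tz=EMAIL_PROVIDER_TZ)
```

Raising on naive timestamps without a declared source timezone is a deliberate choice here: it forces the "what tz is this data in?" question to be answered once, at the boundary.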
  • 17. Numbers & Strings
    ● Use the right types for your numbers (int, bigint, float, numeric, etc.)
    ● Murphy’s Law of text inputs: If a user can put something in a text field, anything and everything will happen.
    ● Watch out for floating point precision mistakes
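The floating point and text-input warnings on this slide can be illustrated with a short standard-library sketch. The helper `parse_amount` is a hypothetical name, and stripping commas assumes US-style thousands separators.

```python
import math
from decimal import Decimal, InvalidOperation

# Binary floats cannot represent 0.1 or 0.2 exactly, so equality checks fail.
float_total = 0.1 + 0.2
exact_match = float_total == 0.3              # False
close_match = math.isclose(float_total, 0.3)  # True: compare with a tolerance

# Decimal keeps base-10 values exact, which suits money-like quantities.
decimal_total = Decimal("0.10") + Decimal("0.20")


def parse_amount(raw):
    """Return a Decimal for a free-text amount, or None if it is not a finite number."""
    try:
        # Assumption: commas are thousands separators (US-style input).
        value = Decimal(raw.strip().replace(",", ""))
    except (InvalidOperation, AttributeError):
        return None
    return value if value.is_finite() else None


# Text fields contain everything: padding, separators, placeholders, blanks.
parsed = [parse_amount(text) for text in ["19.99", " 1,250 ", "N/A", "", "NaN"]]
```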
  • 18. Addresses
    ● Parsing / validation is not something you want to do yourself
      ○ USPS has validation and zip lookup for US addresses: https://www.usps.com/business/web-tools-apis/documentation-updates.htm
    ● Remember zip codes are strings. And the rest of the world does not use U.S. zips.
    ● IP geolocation: Get lat/long, state, city, postal code & ISP from visitor IPs
      ○ https://www.maxmind.com/en/geoip2-city
      ○ This is ALWAYS approximate
    ● If working with GIS, we recommend PostGIS: http://postgis.net/
      ○ Vanilla PostgreSQL also has earthdistance for great circle distance
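Two points from this slide can be shown in a few lines of Python: why zip codes should stay strings, and what a great circle distance computes. The haversine function below is a generic spherical approximation written for illustration; it is not the implementation used by PostgreSQL's earthdistance, and the name `great_circle_km` is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; treats the Earth as a sphere


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometres between two lat/long points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# Zip codes are identifiers, not quantities: casting to int drops leading zeros.
zip_as_text = "02134"
zip_as_int = int(zip_as_text)  # 2134, which is no longer a valid zip code

# Sanity check: half way around the equator is half the circumference.
half_equator_km = great_circle_km(0.0, 0.0, 0.0, 180.0)
```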
  • 19. Clickstream Data
    ● User agent => Device: Don’t do this yourself (we use WURFL and Google Analytics)
    ● Query strings follow the rules of text: everything will show up
      ○ They might be truncated
      ○ URL encoding might be missing characters (%2 instead of %20)
      ○ Use a library to parse params (e.g. Python ships with urllib.parse.parse_qs; urlparse.parse_qs in Python 2)
    ● Even if your system creates sessions (Tomcat, Google Analytics), don’t be afraid to create your own sessions on top of the pageview data
      ○ You’ll capture cross-channel and cross-device behavior this way
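As a rough sketch of the two techniques on this slide, the block below parses query strings with the standard library and then builds sessions on top of raw pageviews. The URL, the user IDs, the function name `assign_sessions`, and the 30-minute inactivity threshold are all assumptions made for illustration.

```python
from datetime import datetime, timedelta
from urllib.parse import parse_qs, urlsplit

# --- Query strings: let the library handle decoding and blank values ---
url = ("https://www.example.com/landing"
       "?utm_source=email&utm_campaign=spring%20sale&utm_medium=")
params = parse_qs(urlsplit(url).query, keep_blank_values=True)

# A damaged escape (%2 instead of %20) is passed through rather than decoded,
# so downstream code still has to expect oddities in the values.
damaged = parse_qs("q=red%2yarn")

# --- Sessions: start a new one after a gap in a user's activity ---
SESSION_GAP = timedelta(minutes=30)  # assumed threshold; tune for your site


def assign_sessions(pageviews):
    """Take (user_id, timestamp) pairs; return (user_id, timestamp, session_index)."""
    last_seen = {}
    session_index = {}
    result = []
    for user_id, ts in sorted(pageviews):
        previous = last_seen.get(user_id)
        if previous is None:
            session_index[user_id] = 0
        elif ts - previous > SESSION_GAP:
            session_index[user_id] += 1
        last_seen[user_id] = ts
        result.append((user_id, ts, session_index[user_id]))
    return result


pageviews = [
    ("u1", datetime(2016, 6, 1, 9, 0)),
    ("u2", datetime(2016, 6, 1, 9, 5)),
    ("u1", datetime(2016, 6, 1, 9, 10)),
    ("u1", datetime(2016, 6, 1, 10, 0)),
]
sessions = assign_sessions(pageviews)
```

Keying sessions on your own user ID rather than a cookie or device ID is what allows a single session definition to span channels and devices.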
  • 21. Missing / empty data
    ● Easy to overlook but important
    ● What does missing data mean in the context of your analysis?
      ○ Not collected (why not?)
      ○ Error state
      ○ N/A or undefined
      ○ Especially for histograms, missing data lead to very poor conclusions.
    ● Does your data use sentinel values? (e.g. -9999 or "null")
      ○ df['nps_score'].replace(-9999, np.nan)
    ● Imputation
    ● Storage
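The sentinel-value bullet can be expanded into a small runnable pandas sketch. The data frame contents are made up for illustration; only the `nps_score` column name and the `-9999` sentinel come from the slide.

```python
import numpy as np
import pandas as pd

# Made-up survey data where -9999 marks "no answer".
df = pd.DataFrame({"nps_score": [9, -9999, 7, -9999, 10]})

# Left as-is, the sentinel drags the mean far below any real score.
naive_mean = df["nps_score"].mean()

# replace() returns a new Series, so assign the result back.
df["nps_score"] = df["nps_score"].replace(-9999, np.nan)

# pandas skips NaN by default, so the mean now covers only real answers.
missing_count = int(df["nps_score"].isna().sum())
clean_mean = df["nps_score"].mean()
```

Counting the missing values explicitly, as `missing_count` does, keeps the "why is this missing?" question visible instead of hiding it inside an average.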
  • 22. Tidy Data
    ● Conceptual framework for structuring data for analysis and fitting
      ○ Each variable forms a column
      ○ Each observation is a row
      ○ Each type of observational unit forms a table
    ● Pretty much relational-database normal form, applied to stats
    ● Tidy can be different depending on the question asked
    ● R (dplyr, tidyr) and Python (pandas) have functions for making your long data wide & wide data long (stack, unstack, melt, pivot)
    ● Paper: http://vita.had.co.nz/papers/tidy-data.pdf
    ● Python tutorial: http://tomaugspurger.github.io/modern-5-tidy.html
  • 23. Tidy Data
    ● An example might be marketplace transaction data with 1 row per transaction
    ● You might want to do analysis on participants instead, with 1 row per participant
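The marketplace example can be sketched with pandas `melt` and `groupby`. The column names (`buyer`, `seller`, `amount`) and the sample rows are hypothetical; the point is only the reshaping from one row per transaction to one row per participant.

```python
import pandas as pd

# One row per transaction (made-up data).
transactions = pd.DataFrame({
    "transaction_id": [1, 2, 3],
    "buyer": ["alice", "bob", "alice"],
    "seller": ["carol", "carol", "bob"],
    "amount": [20.0, 35.0, 15.0],
})

# Wide to long: one row per (transaction, participant) pair.
participation = transactions.melt(
    id_vars=["transaction_id", "amount"],
    value_vars=["buyer", "seller"],
    var_name="role",
    value_name="participant",
)

# Aggregate to one row per participant, the shape this question needs.
per_participant = (
    participation.groupby("participant")
    .agg(transactions=("transaction_id", "nunique"),
         total_amount=("amount", "sum"))
    .reset_index()
)
```

Both tables are tidy. Which one is the right shape depends on whether the observational unit of the analysis is the transaction or the participant.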
  • 24. Hey, that’s a great model. How can we build it into our decision-making process? — Marketing
  • 26. Operationalizing Data Science
    ● Doing an analysis once rarely delivers lasting value.
    ● The business needs continuous insight, so you need to get this stuff into production.
      ○ Hosting
      ○ ETL
      ○ Pipelines
  • 27. Hosting
    ● Delivering continuous analyses requires operational infrastructure
      ○ Database(s)
      ○ Visualization tools (e.g. Chartio, Arcadia Data, Tableau, Looker, Qlik, etc.)
      ○ REST services / microservices
    ● These all have uptime requirements. You need to involve your (dev)ops team earlier rather than later.
    ● Microservices / REST endpoints have architectural implications
    ● Visualization tools
      ○ Local (e.g. Jupyter, Zeppelin)
      ○ On-premise (Arcadia Data, Tableau, Qlik)
      ○ Hosted (Chartio)
    ● Visualization tools often require a SQL interface, thus…
  • 28. ETL - Extract, Transform, Load
    ● Often used to herd data into some kind of data warehouse (e.g. RDBMS + star schema, Hadoop w/ unstructured data, etc.)
    ● Not just for data warehousing
    ● Not just for modeling
    ● No general solution
    ● Tooling
      ○ Apache Spark, Apache Sqoop
      ○ Commercial Tools: Informatica, Vertica, SQL Server, DataVirtuality, etc.
    ● And then there is Apache Kafka…and the “NoETL” movement
      ○ Book: “I <3 Logs” by Jay Kreps
      ○ Replay history from the beginning of time as needed
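To make the three ETL stages concrete, here is a minimal standard-library sketch that extracts rows from CSV text, cleans them, and loads them into an in-memory SQLite table. The `users` table, its columns, and the sample rows are hypothetical; a production job would typically use one of the tools listed on the slide.

```python
import csv
import io
import sqlite3

# Made-up raw export with the usual problems: odd casing, padding, blanks.
RAW_CSV = """user_id,email,signup_date
1,Alice@Example.com ,2016-01-05
2,bob@example.com,2016-02-11
3,,2016-03-02
"""


def extract(text):
    """Extract: read raw rows as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: fix types, normalize emails, turn blanks into NULLs."""
    cleaned = []
    for row in rows:
        email = row["email"].strip().lower()
        cleaned.append((int(row["user_id"]), email or None, row["signup_date"]))
    return cleaned


def load(rows, conn):
    """Load: write into the target table; re-running replaces rows by key."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users "
        "(user_id INTEGER PRIMARY KEY, email TEXT, signup_date TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO users VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
```

Using `INSERT OR REPLACE` keyed on `user_id` makes the load step safe to re-run, which matters once the job is scheduled rather than run by hand.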
  • 29. ETL - Extract, Transform, Load - Example
    ● Not just for production runs
      ○ For example, Patrick does a lot of time-to-event analysis on email opens, transactions, visits.
        ■ Survival functions, etc.
      ○ Set up ETL that builds tables with the right shape to feed directly into models
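A table with "the right shape" for time-to-event analysis usually has one row per subject with a duration and a flag saying whether the event was observed. The sketch below builds such rows for email opens. The user IDs, timestamps, observation cutoff, and function name are assumptions for illustration and are not taken from the talk.

```python
from datetime import datetime

# Assumed end of the observation window; unopened emails are censored here.
OBSERVATION_END = datetime(2016, 6, 8, 0, 0)

# Made-up inputs: when each user was emailed, and when (if ever) they opened.
sends = {
    "u1": datetime(2016, 6, 1, 9, 0),
    "u2": datetime(2016, 6, 1, 9, 0),
    "u3": datetime(2016, 6, 2, 9, 0),
}
opens = {
    "u1": datetime(2016, 6, 1, 12, 0),
    "u3": datetime(2016, 6, 3, 9, 0),
}


def time_to_event_rows(sends, opens, observation_end):
    """Build one row per user: hours until open, or until censoring."""
    rows = []
    for user_id, sent_at in sorted(sends.items()):
        opened_at = opens.get(user_id)
        observed = opened_at is not None
        end = opened_at if observed else observation_end
        rows.append({
            "user_id": user_id,
            "duration_hours": (end - sent_at).total_seconds() / 3600,
            "observed": observed,
        })
    return rows


rows = time_to_event_rows(sends, opens, OBSERVATION_END)
```

Rows in this `(duration, observed)` form are what survival-analysis routines generally expect, so the modeling step needs no further reshaping.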
  • 30. Pipelines
    ● From data to model output
    ● Define dependencies and define a DAG for the work
      ○ Steps are defined by assigning the output of prior steps as their input
      ○ Luigi (http://luigi.readthedocs.io/en/stable/index.html)
      ○ Drake (https://github.com/Factual/drake)
      ○ Scikit-learn has its own Pipeline
        ■ That can be part of your bigger pipeline
    ● Scheduling can be trickier than you think
      ○ Resource contention
      ○ Loose dependencies
      ○ Cron is fine but Jenkins works really well for this!
    ● Don’t be afraid to create and tear down full environments as steps
      ○ For example, spin up and configure an EMR cluster, do stuff, tear it down*
    * make your VP of Infrastructure less miserable
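The idea of declaring dependencies and deriving a valid run order from the DAG can be sketched with Python's standard `graphlib` module (Python 3.9+). The step names are hypothetical, and tools such as Luigi or Drake add scheduling, retries, and persistence on top of this basic idea.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose output it consumes.
dependencies = {
    "extract_events": set(),
    "extract_users": set(),
    "build_features": {"extract_events", "extract_users"},
    "fit_model": {"build_features"},
    "score_users": {"fit_model"},
}

# static_order() yields every step after all of its dependencies.
# It raises graphlib.CycleError if the graph is not actually a DAG.
run_order = list(TopologicalSorter(dependencies).static_order())
```

The two extract steps have no dependency on each other, so their relative order is not fixed. That freedom is what lets a real scheduler run them in parallel.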
  • 31. Pipelines - Luigi
    ● Written in Python. Steps are implemented by subclassing Task
    ● Visualize your DAG
    ● Supports data in relational DBs, Redshift, HDFS, S3, file system
    ● Flexible and extensible
    ● Can parallelize jobs
    ● A workflow runs by requesting the last step, which schedules all of its dependencies
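The "request the last step and let it pull in its dependencies" model can be imitated in plain Python. The sketch below is a stand-in written for illustration and is not Luigi's API: the `Step` class, the `build` function, and the step names are all hypothetical. In Luigi itself the base class is `luigi.Task`, and completion is tracked through task outputs rather than an in-memory set.

```python
run_log = []  # records the order in which steps actually ran


class Step:
    """Hypothetical base class: subclasses declare what they require."""

    def requires(self):
        return []

    def run(self):
        run_log.append(type(self).__name__)


class ExtractSends(Step):
    pass


class ExtractOpens(Step):
    pass


class JoinEvents(Step):
    def requires(self):
        return [ExtractSends(), ExtractOpens()]


class FitModel(Step):
    def requires(self):
        return [JoinEvents()]


def build(step, finished=None):
    """Run a step after recursively running everything it requires, once each."""
    finished = set() if finished is None else finished
    name = type(step).__name__
    if name in finished:
        return finished
    for upstream in step.requires():
        build(upstream, finished)
    step.run()
    finished.add(name)
    return finished


# Only the final step is requested; the rest is discovered from requires().
build(FitModel())
```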
  • 33. Pipelines - Drake
    ● JVM (written in Clojure)
    ● Like a Makefile but for data work
    ● Supports commands in Shell, Python, Ruby, Clojure
  • 34. Pipelines - More Tools
    ● Oozie
      ○ The default job orchestration engine for Hadoop. Can chain together multiple jobs to form a complete DAG.
      ○ Open source
    ● Kettle
      ○ Old-school, but still relevant.
      ○ Visual pipeline designer. Execution engine
      ○ Open source
    ● Informatica
      ○ Visual pipeline designer, mature toolset
      ○ Commercial
    ● Datavirtuality
      ○ Treats all your stores (including Google Analytics) like schemas in a single db
      ○ Great for microservice architectures
      ○ Commercial
  • 35. © Patrick Coppinger Thanks! dan@agildata.com — patrick@craftsy.com @danklynn — @patrickrm101 Shameless Plug: Tonight at Galvanize, join us at the Denver/Boulder Big Data Meetup to learn about distributed system design! (ask Dan for details)
  • 36. References
    ● I Heart Logs
      ○ http://www.amazon.com/Heart-Logs-Stream-Processing-Integration/dp/1491909382
    ● Tidy Data
      ○ http://vita.had.co.nz/papers/tidy-data.pdf
  • 37. Additional Tools
    ● Scientific Python stack (ipython, numpy, scipy, pandas, matplotlib…)
    ● Hadleyverse for R (dplyr, ggplot, tidyr, lubridate…)
    ● csvkit: command line tools (csvcut, csvgrep, csvjoin...) for CSV data
    ● jq: fast command line tool for working with JSON (e.g. pipe cURL output to jq)
    ● psql (if you use PostgreSQL or Redshift)