Applied Data Analytics 
Building a real data product
Github Repository 
http://bit.ly/1eLBzki 
Matrix Factorization 
http://slidesha.re/15Qssf0 
Links to various resources
Goals for this Course 
● Apply the ideas and tools learned during all previous program courses 
● Use a real-world data set with actionable predictions 
● Present a completed project to faculty and peers 
● Build a data project portfolio 
What are your goals? 
● Understand the Data Science Pipeline 
● Understand what a complete data product looks like 
● Be able to set up and implement a data product in Python
Some Logistics 
This is a small class, so I’m hoping for lots of participation! 
Course materials can be found in the following places: 
● iPython: http://bit.ly/1gJ73Tt 
● Github: https://github.com/DistrictDataLabs/science-bookclub 
● Slides: on slideshare or on Blackboard 
Recommended Reading: 
● Matrix Factorization: A simple tutorial and implementation 
● http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/
Agenda - Day One 
● Review Data Products 
● Review Data Science Pipeline 
● Discuss architecture of the data product we’re going to build. 
● Setting up our project 
● Ingestion of Goodreads Data 
● Lunch 
● Creating a command line admin program 
● Wrangling of Goodreads Data 
● A computational data store
Agenda - Day Two 
● Review current state of recommender project 
● Matrix math review 
● Introduction to matrix factorization 
● Building a recommender system 
● Reporting with Jinja2 
● Lunch 
● Presentations of Capstone Projects 
● Course wrap-up
Building Data Products
“A data product is a product that is 
based on the combination of data 
and algorithms.” 
Hilary Mason 
“A data application acquires its value from the 
data itself, and creates more data as a result. 
It’s not just an application with data; it’s a 
data product. Data science enables the 
creation of data products.” 
Mike Loukides 
The Data Science Pipeline
1. Data Ingestion 
2. Data Munging and Wrangling 
3. Computation and Analyses 
4. Modeling and Application 
5. Reporting and Visualization
Data Ingestion 
● There is a world of data out 
there. How do we get it? Web 
crawlers, APIs, sensors? Python 
and other web scripting 
languages are well suited to 
this task. 
● The real question is how can we 
deal with such a giant volume 
and velocity of data? 
● Big Data and Data Science often 
require ingestion specialists!
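A minimal sketch of the ingestion step, assuming the goal is simply to fetch a resource and store the response untouched in raw storage. The function name `ingest`, its parameters, and the pluggable `opener` argument are hypothetical choices made for illustration; they are not taken from the course repository.

```python
import os
import urllib.request


def ingest(url, dest_dir, filename, opener=urllib.request.urlopen):
    """Fetch `url` and write the raw response bytes to dest_dir/filename.

    `opener` is any callable that takes a URL and returns a context manager
    with a read() method. It defaults to urllib, and can be swapped out so
    the function can be exercised without network access.
    """
    os.makedirs(dest_dir, exist_ok=True)
    path = os.path.join(dest_dir, filename)

    # Download the payload without parsing it: raw storage stays raw.
    with opener(url) as response:
        payload = response.read()

    with open(path, "wb") as handle:
        handle.write(payload)
    return path
```

Keeping the downloaded bytes unparsed means the wrangling step can be rerun later without fetching the data again.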
Data Wrangling 
● Warehousing the data means 
storing the data in as raw a form 
as possible. 
● Extract, transform, and load 
operations move data to 
operational storage locations. 
● Filtering, aggregation, 
normalization and 
denormalization all ensure data is 
in a form it can be computed on. 
● Annotated training sets must be 
created for ML tasks.
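The extract, transform, and load idea can be sketched with nothing but the standard library. This example assumes a handful of made-up review records and uses `sqlite3` as a stand-in for the operational store; the table layout, column names, and helper functions are hypothetical.

```python
import sqlite3

# Made-up raw records, kept as strings the way scraped data often arrives.
RAW_REVIEWS = [
    {"reader": "ann", "title": "Dune", "rating": "5"},
    {"reader": "bob", "title": "Dune", "rating": "4"},
    {"reader": "ann", "title": "Emma", "rating": ""},  # missing rating
    {"reader": "cat", "title": "Emma", "rating": "2"},
]


def transform(records, max_rating=5):
    """Filter out records without a rating and normalize the rest to 0..1."""
    for record in records:
        if not record["rating"].strip():
            continue
        yield record["reader"], record["title"], int(record["rating"]) / max_rating


def load(rows, conn):
    """Load the transformed rows into a table that can be computed on."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reviews (reader TEXT, title TEXT, rating REAL)"
    )
    conn.executemany("INSERT INTO reviews VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(RAW_REVIEWS), conn)

# Aggregation: average normalized rating per book.
averages = dict(conn.execute("SELECT title, AVG(rating) FROM reviews GROUP BY title"))
```

The raw records are never modified, so the filtering and normalization rules can change without repeating ingestion.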
Computation and Analyses 
● Hypothesis driven computation 
includes design and development 
of predictive models. 
● Many models have to be trained 
or constrained into a 
computational form like a Graph 
database, and this is time 
consuming. 
● Other data products like indices, 
relations, classifications, and 
clusters may be computed.
Modeling and Application 
This is the part we’re most familiar with. 
Supervised classification, Unsupervised 
clustering - Bayes, Logistic Regression, 
Decision Trees, and other models. 
This is also where the money is.
Reporting and Visualization 
● Often overlooked, this part is 
crucial, even if we have data 
products. 
● Humans recognize patterns 
better than machines. Human 
feedback is crucial in Active 
Learning and remodeling (error 
detection). 
● Mashups and collaborations 
generate more data, and 
therefore more value!
Don’t forget feedback! 
(Active Learning for Data 
Products)
What we’re going to build today 
SCIENCE BOOKCLUB!! 
● A book club that chooses what to 
read via a recommender system. 
● Ingests Goodreads data and 
returns feedback on books. 
● Statistical model is a non-negative 
matrix factorization 
● Reporting using Jinja (almost a 
web app)
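As a rough illustration of the statistical model, here is a sketch of non-negative matrix factorization in numpy. It assumes multiplicative updates restricted to observed ratings, with zeros treated as "not yet rated"; the update rule, function names, and toy ratings matrix are assumptions for illustration and not necessarily what the course code does.

```python
import numpy as np


def nmf(ratings, rank=2, iterations=200, eps=1e-9, seed=0):
    """Factor a readers x books matrix into non-negative W (readers x rank)
    and H (rank x books), fitting only the observed (non-zero) entries."""
    rng = np.random.default_rng(seed)
    ratings = np.asarray(ratings, dtype=float)
    mask = (ratings > 0).astype(float)  # 1 where a rating exists, 0 otherwise
    n_readers, n_books = ratings.shape

    W = rng.random((n_readers, rank)) + eps
    H = rng.random((rank, n_books)) + eps

    for _ in range(iterations):
        # Multiplicative updates only ever scale by non-negative factors,
        # so W and H remain non-negative throughout.
        H *= (W.T @ (mask * ratings)) / (W.T @ (mask * (W @ H)) + eps)
        W *= ((mask * ratings) @ H.T) / ((mask * (W @ H)) @ H.T + eps)
    return W, H


def masked_rmse(ratings, W, H):
    """Root mean squared error over the observed ratings only."""
    ratings = np.asarray(ratings, dtype=float)
    observed = ratings > 0
    return float(np.sqrt(np.mean((ratings[observed] - (W @ H)[observed]) ** 2)))


# Toy data: rows are readers, columns are books, 0 means "not yet rated".
RATINGS = [
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
]

W, H = nmf(RATINGS)
predictions = W @ H  # the zero cells now hold predicted ratings
```

The entries of `predictions` that correspond to zeros in `RATINGS` are the candidates a recommender would rank for each reader.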
Workflow 
1. Setting up a Python skeleton 
2. Creating and Running Tests 
3. Wading in with a configuration 
4. Ingestion with urllib and requests 
5. Creating a command line admin with argparse 
6. Wrangling with BeautifulSoup and SQLAlchemy 
7. Modeling with numpy 
8. Reporting with Jinja2
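Step 5 of the workflow can be sketched as follows. The program name, the `ingest` and `wrangle` subcommands, and their options are hypothetical; they only show the shape of an `argparse` admin program with one subcommand per pipeline stage.

```python
import argparse


def build_parser():
    """Build a parser with one subcommand per pipeline stage."""
    parser = argparse.ArgumentParser(
        prog="bookclub-admin",
        description="Administrative commands for the book club data product.",
    )
    subparsers = parser.add_subparsers(dest="command", required=True)

    ingest = subparsers.add_parser("ingest", help="fetch raw data")
    ingest.add_argument("--limit", type=int, default=10,
                        help="maximum number of books to fetch")

    wrangle = subparsers.add_parser("wrangle", help="load raw data into the database")
    wrangle.add_argument("--database", default="bookclub.db",
                         help="path to the database file")
    return parser


def main(argv=None):
    """Parse the arguments and dispatch to the matching stage.

    The stages are stubbed out here and only report what they would do.
    """
    args = build_parser().parse_args(argv)
    if args.command == "ingest":
        return f"ingesting up to {args.limit} books"
    return f"wrangling raw data into {args.database}"
```

Calling `main(["ingest", "--limit", "3"])` exercises the parser directly; a real script would pass no arguments so that `argparse` reads `sys.argv`, and would print or log the result.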
Octavo Architecture (really clear DSP) 
● Ingestion Module: requests.py 
● Raw Data Storage 
● Wrangling Module: BeautifulSoup, SQLAlchemy 
● Computational Data Storage 
● Recommender Module: Numpy 
● Reporting Module: Matplotlib, Jinja2
Let’s dive into some code!

Weitere ähnliche Inhalte

Was ist angesagt?

Data Science Training | Data Science For Beginners | Data Science With Python...
Data Science Training | Data Science For Beginners | Data Science With Python...Data Science Training | Data Science For Beginners | Data Science With Python...
Data Science Training | Data Science For Beginners | Data Science With Python...Simplilearn
 
Introdução a web semântica e o case da globo.com
Introdução a web semântica e o case da globo.comIntrodução a web semântica e o case da globo.com
Introdução a web semântica e o case da globo.comRenan Moreira de Oliveira
 
Data Warehousing with Python
Data Warehousing with PythonData Warehousing with Python
Data Warehousing with PythonMartin Loetzsch
 
Data Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonData Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonMOHITKUMAR1379
 
Credit Card Fraud Detection Using ML In Databricks
Credit Card Fraud Detection Using ML In DatabricksCredit Card Fraud Detection Using ML In Databricks
Credit Card Fraud Detection Using ML In DatabricksDatabricks
 
KNIME Data Science Learnathon: From Raw Data To Deployment
KNIME Data Science Learnathon: From Raw Data To DeploymentKNIME Data Science Learnathon: From Raw Data To Deployment
KNIME Data Science Learnathon: From Raw Data To DeploymentKNIMESlides
 
Recommendation Systems in banking and Financial Services
Recommendation Systems in banking and Financial ServicesRecommendation Systems in banking and Financial Services
Recommendation Systems in banking and Financial ServicesAndrea Gigli
 
Forces and Threats in a Data Warehouse (and why metadata and architecture is ...
Forces and Threats in a Data Warehouse (and why metadata and architecture is ...Forces and Threats in a Data Warehouse (and why metadata and architecture is ...
Forces and Threats in a Data Warehouse (and why metadata and architecture is ...Stefan Urbanek
 
Stumbling stones when migrating from Oracle
 Stumbling stones when migrating from Oracle Stumbling stones when migrating from Oracle
Stumbling stones when migrating from OracleEDB
 
BI and Data Analytics
BI and Data Analytics BI and Data Analytics
BI and Data Analytics Incorta
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks DeltaDatabricks
 
Challenges in Building a Data Pipeline
Challenges in Building a Data PipelineChallenges in Building a Data Pipeline
Challenges in Building a Data PipelineManish Kumar
 
Fraud Detection and Neo4j
Fraud Detection and Neo4j Fraud Detection and Neo4j
Fraud Detection and Neo4j Max De Marzi
 
Big data in transport an international transport forum overview oct 2013
Big data in transport    an international transport forum overview oct 2013Big data in transport    an international transport forum overview oct 2013
Big data in transport an international transport forum overview oct 2013OpenSkyData
 
Great Expectations Presentation
Great Expectations PresentationGreat Expectations Presentation
Great Expectations PresentationAdam Doyle
 

Was ist angesagt? (20)

Data Science Training | Data Science For Beginners | Data Science With Python...
Data Science Training | Data Science For Beginners | Data Science With Python...Data Science Training | Data Science For Beginners | Data Science With Python...
Data Science Training | Data Science For Beginners | Data Science With Python...
 
Introdução a web semântica e o case da globo.com
Introdução a web semântica e o case da globo.comIntrodução a web semântica e o case da globo.com
Introdução a web semântica e o case da globo.com
 
Data Warehousing with Python
Data Warehousing with PythonData Warehousing with Python
Data Warehousing with Python
 
Data mining
Data miningData mining
Data mining
 
Data Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonData Wrangling and Visualization Using Python
Data Wrangling and Visualization Using Python
 
Fraud Analytics
Fraud AnalyticsFraud Analytics
Fraud Analytics
 
Credit Card Fraud Detection Using ML In Databricks
Credit Card Fraud Detection Using ML In DatabricksCredit Card Fraud Detection Using ML In Databricks
Credit Card Fraud Detection Using ML In Databricks
 
KNIME Data Science Learnathon: From Raw Data To Deployment
KNIME Data Science Learnathon: From Raw Data To DeploymentKNIME Data Science Learnathon: From Raw Data To Deployment
KNIME Data Science Learnathon: From Raw Data To Deployment
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Recommendation Systems in banking and Financial Services
Recommendation Systems in banking and Financial ServicesRecommendation Systems in banking and Financial Services
Recommendation Systems in banking and Financial Services
 
Forces and Threats in a Data Warehouse (and why metadata and architecture is ...
Forces and Threats in a Data Warehouse (and why metadata and architecture is ...Forces and Threats in a Data Warehouse (and why metadata and architecture is ...
Forces and Threats in a Data Warehouse (and why metadata and architecture is ...
 
Stumbling stones when migrating from Oracle
 Stumbling stones when migrating from Oracle Stumbling stones when migrating from Oracle
Stumbling stones when migrating from Oracle
 
BI and Data Analytics
BI and Data Analytics BI and Data Analytics
BI and Data Analytics
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
 
Challenges in Building a Data Pipeline
Challenges in Building a Data PipelineChallenges in Building a Data Pipeline
Challenges in Building a Data Pipeline
 
Data Science.pptx
Data Science.pptxData Science.pptx
Data Science.pptx
 
Fraud Detection and Neo4j
Fraud Detection and Neo4j Fraud Detection and Neo4j
Fraud Detection and Neo4j
 
Dados importam, seja data-driven!
Dados importam, seja data-driven!Dados importam, seja data-driven!
Dados importam, seja data-driven!
 
Big data in transport an international transport forum overview oct 2013
Big data in transport    an international transport forum overview oct 2013Big data in transport    an international transport forum overview oct 2013
Big data in transport an international transport forum overview oct 2013
 
Great Expectations Presentation
Great Expectations PresentationGreat Expectations Presentation
Great Expectations Presentation
 

Andere mochten auch

Startup Pitch Decks that Work: Creating a Winning Pitch Deck
Startup Pitch Decks that Work: Creating a Winning Pitch DeckStartup Pitch Decks that Work: Creating a Winning Pitch Deck
Startup Pitch Decks that Work: Creating a Winning Pitch DeckDavid Ehrenberg
 
300 Milligrams - Demo Day Presentation
300 Milligrams - Demo Day Presentation300 Milligrams - Demo Day Presentation
300 Milligrams - Demo Day Presentation500 Startups
 
500’s Demo Day Batch 12 >> Alfred
500’s Demo Day Batch 12 >> Alfred500’s Demo Day Batch 12 >> Alfred
500’s Demo Day Batch 12 >> Alfred500 Startups
 
BrandBoards demo day pitch deck
BrandBoards demo day pitch deckBrandBoards demo day pitch deck
BrandBoards demo day pitch deck500 Startups
 
Standard Treasury Series A Pitch Deck
Standard Treasury Series A Pitch DeckStandard Treasury Series A Pitch Deck
Standard Treasury Series A Pitch DeckZachary Townsend
 
Tealet - DRINK THE TEA
Tealet - DRINK THE TEATealet - DRINK THE TEA
Tealet - DRINK THE TEA500 Startups
 
500’s Demo Day Batch 11 >> Slidebean
500’s Demo Day Batch 11 >> Slidebean 500’s Demo Day Batch 11 >> Slidebean
500’s Demo Day Batch 11 >> Slidebean 500 Startups
 
Kickfolio - 500Startups Batch 5
Kickfolio - 500Startups Batch 5Kickfolio - 500Startups Batch 5
Kickfolio - 500Startups Batch 5500 Startups
 
TouristEye - Personalizing The Travel Experience - 500 Startups
TouristEye - Personalizing The Travel Experience - 500 StartupsTouristEye - Personalizing The Travel Experience - 500 Startups
TouristEye - Personalizing The Travel Experience - 500 Startups500 Startups
 
Pitch deck for Kejahunt
Pitch deck for KejahuntPitch deck for Kejahunt
Pitch deck for KejahuntJoshua Mutua
 
Square pitch deck
Square pitch deckSquare pitch deck
Square pitch deckpitchenvy
 
Contently Pitch Deck
Contently Pitch DeckContently Pitch Deck
Contently Pitch DeckRyan Gum
 

Andere mochten auch (20)

Startup Pitch Decks that Work: Creating a Winning Pitch Deck
Startup Pitch Decks that Work: Creating a Winning Pitch DeckStartup Pitch Decks that Work: Creating a Winning Pitch Deck
Startup Pitch Decks that Work: Creating a Winning Pitch Deck
 
300 Milligrams - Demo Day Presentation
300 Milligrams - Demo Day Presentation300 Milligrams - Demo Day Presentation
300 Milligrams - Demo Day Presentation
 
Cadee
CadeeCadee
Cadee
 
500’s Demo Day Batch 12 >> Alfred
500’s Demo Day Batch 12 >> Alfred500’s Demo Day Batch 12 >> Alfred
500’s Demo Day Batch 12 >> Alfred
 
Binpress
BinpressBinpress
Binpress
 
BrandBoards demo day pitch deck
BrandBoards demo day pitch deckBrandBoards demo day pitch deck
BrandBoards demo day pitch deck
 
Sverve
SverveSverve
Sverve
 
Standard Treasury Series A Pitch Deck
Standard Treasury Series A Pitch DeckStandard Treasury Series A Pitch Deck
Standard Treasury Series A Pitch Deck
 
PinMyPet
PinMyPetPinMyPet
PinMyPet
 
Farmeron
FarmeronFarmeron
Farmeron
 
Tealet - DRINK THE TEA
Tealet - DRINK THE TEATealet - DRINK THE TEA
Tealet - DRINK THE TEA
 
500’s Demo Day Batch 11 >> Slidebean
500’s Demo Day Batch 11 >> Slidebean 500’s Demo Day Batch 11 >> Slidebean
500’s Demo Day Batch 11 >> Slidebean
 
Kickfolio - 500Startups Batch 5
Kickfolio - 500Startups Batch 5Kickfolio - 500Startups Batch 5
Kickfolio - 500Startups Batch 5
 
Kibin
Kibin Kibin
Kibin
 
task.ly pitch deck
task.ly pitch decktask.ly pitch deck
task.ly pitch deck
 
TouristEye - Personalizing The Travel Experience - 500 Startups
TouristEye - Personalizing The Travel Experience - 500 StartupsTouristEye - Personalizing The Travel Experience - 500 Startups
TouristEye - Personalizing The Travel Experience - 500 Startups
 
Daily hundred Pitch Deck 2014
Daily hundred Pitch Deck 2014Daily hundred Pitch Deck 2014
Daily hundred Pitch Deck 2014
 
Pitch deck for Kejahunt
Pitch deck for KejahuntPitch deck for Kejahunt
Pitch deck for Kejahunt
 
Square pitch deck
Square pitch deckSquare pitch deck
Square pitch deck
 
Contently Pitch Deck
Contently Pitch DeckContently Pitch Deck
Contently Pitch Deck
 

Ähnlich wie Building Data Products with Python (Georgetown)

Building Data Apps with Python
Building Data Apps with PythonBuilding Data Apps with Python
Building Data Apps with PythonBenjamin Bengfort
 
Data science presentation
Data science presentationData science presentation
Data science presentationMSDEVMTL
 
Simplified Machine Learning, Text, and Graph Analytics with Pivotal Greenplum
Simplified Machine Learning, Text, and Graph Analytics with Pivotal GreenplumSimplified Machine Learning, Text, and Graph Analytics with Pivotal Greenplum
Simplified Machine Learning, Text, and Graph Analytics with Pivotal GreenplumVMware Tanzu
 
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...Denodo
 
Advanced Project Data Analytics for Improved Project Delivery
Advanced Project Data Analytics for Improved Project DeliveryAdvanced Project Data Analytics for Improved Project Delivery
Advanced Project Data Analytics for Improved Project DeliveryMark Constable
 
Course 8 : How to start your big data project by Eric Rodriguez
Course 8 : How to start your big data project by Eric Rodriguez Course 8 : How to start your big data project by Eric Rodriguez
Course 8 : How to start your big data project by Eric Rodriguez Betacowork
 
Lak2018: Scaling Nationally: Seven Lesson Learned
Lak2018:  Scaling Nationally: Seven Lesson LearnedLak2018:  Scaling Nationally: Seven Lesson Learned
Lak2018: Scaling Nationally: Seven Lesson Learnedmwebbjisc
 
KSU IT Capstone Report 2012-2017.pdf
KSU IT Capstone Report 2012-2017.pdfKSU IT Capstone Report 2012-2017.pdf
KSU IT Capstone Report 2012-2017.pdfJack Zheng
 
How to Prepare for a Career in Data Science
How to Prepare for a Career in Data ScienceHow to Prepare for a Career in Data Science
How to Prepare for a Career in Data ScienceJuuso Parkkinen
 
KSU IT4983 Capstone Projects Report 2017 Update
KSU IT4983 Capstone Projects Report 2017 UpdateKSU IT4983 Capstone Projects Report 2017 Update
KSU IT4983 Capstone Projects Report 2017 UpdateJack Zheng
 
Career in Python and data science
Career in Python and data science Career in Python and data science
Career in Python and data science Sagar Hedau
 
Big Data overview
Big Data overviewBig Data overview
Big Data overviewalexisroos
 
Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...
Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...
Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...Amazon Web Services
 
Agile data science
Agile data scienceAgile data science
Agile data scienceJoel Horwitz
 
Lambda Architecture and open source technology stack for real time big data
Lambda Architecture and open source technology stack for real time big dataLambda Architecture and open source technology stack for real time big data
Lambda Architecture and open source technology stack for real time big dataTrieu Nguyen
 
Visualising montioring and evaluation data
Visualising montioring and evaluation dataVisualising montioring and evaluation data
Visualising montioring and evaluation dataRob Worthington
 

Ähnlich wie Building Data Products with Python (Georgetown) (20)

Building Data Apps with Python
Building Data Apps with PythonBuilding Data Apps with Python
Building Data Apps with Python
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Architecting for Data Science
Architecting for Data ScienceArchitecting for Data Science
Architecting for Data Science
 
2020 | Metadata Day | LinkedIn
2020 | Metadata Day | LinkedIn2020 | Metadata Day | LinkedIn
2020 | Metadata Day | LinkedIn
 
Data science presentation
Data science presentationData science presentation
Data science presentation
 
Simplified Machine Learning, Text, and Graph Analytics with Pivotal Greenplum
Simplified Machine Learning, Text, and Graph Analytics with Pivotal GreenplumSimplified Machine Learning, Text, and Graph Analytics with Pivotal Greenplum
Simplified Machine Learning, Text, and Graph Analytics with Pivotal Greenplum
 
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
Denodo DataFest 2016: Comparing and Contrasting Data Virtualization With Data...
 
Advanced Project Data Analytics for Improved Project Delivery
Advanced Project Data Analytics for Improved Project DeliveryAdvanced Project Data Analytics for Improved Project Delivery
Advanced Project Data Analytics for Improved Project Delivery
 
Course 8 : How to start your big data project by Eric Rodriguez
Course 8 : How to start your big data project by Eric Rodriguez Course 8 : How to start your big data project by Eric Rodriguez
Course 8 : How to start your big data project by Eric Rodriguez
 
Lak2018: Scaling Nationally: Seven Lesson Learned
Lak2018:  Scaling Nationally: Seven Lesson LearnedLak2018:  Scaling Nationally: Seven Lesson Learned
Lak2018: Scaling Nationally: Seven Lesson Learned
 
KSU IT Capstone Report 2012-2017.pdf
KSU IT Capstone Report 2012-2017.pdfKSU IT Capstone Report 2012-2017.pdf
KSU IT Capstone Report 2012-2017.pdf
 
How to Prepare for a Career in Data Science
How to Prepare for a Career in Data ScienceHow to Prepare for a Career in Data Science
How to Prepare for a Career in Data Science
 
KSU IT4983 Capstone Projects Report 2017 Update
KSU IT4983 Capstone Projects Report 2017 UpdateKSU IT4983 Capstone Projects Report 2017 Update
KSU IT4983 Capstone Projects Report 2017 Update
 
Career in Python and data science
Career in Python and data science Career in Python and data science
Career in Python and data science
 
Big Data overview
Big Data overviewBig Data overview
Big Data overview
 
Python and data analytics
Python and data analyticsPython and data analytics
Python and data analytics
 
Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...
Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...
Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...
 
Agile data science
Agile data scienceAgile data science
Agile data science
 
Lambda Architecture and open source technology stack for real time big data
Lambda Architecture and open source technology stack for real time big dataLambda Architecture and open source technology stack for real time big data
Lambda Architecture and open source technology stack for real time big data
 
Visualising montioring and evaluation data
Visualising montioring and evaluation dataVisualising montioring and evaluation data
Visualising montioring and evaluation data
 

Mehr von Benjamin Bengfort

Visual diagnostics for more effective machine learning
Visual diagnostics for more effective machine learningVisual diagnostics for more effective machine learning
Visual diagnostics for more effective machine learningBenjamin Bengfort
 
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...Benjamin Bengfort
 
Dynamics in graph analysis (PyData Carolinas 2016)
Dynamics in graph analysis (PyData Carolinas 2016)Dynamics in graph analysis (PyData Carolinas 2016)
Dynamics in graph analysis (PyData Carolinas 2016)Benjamin Bengfort
 
Visualizing the Model Selection Process
Visualizing the Model Selection ProcessVisualizing the Model Selection Process
Visualizing the Model Selection ProcessBenjamin Bengfort
 
A Primer on Entity Resolution
A Primer on Entity ResolutionA Primer on Entity Resolution
A Primer on Entity ResolutionBenjamin Bengfort
 
An Interactive Visual Analytics Dashboard for the Employment Situation Report
An Interactive Visual Analytics Dashboard for the Employment Situation ReportAn Interactive Visual Analytics Dashboard for the Employment Situation Report
An Interactive Visual Analytics Dashboard for the Employment Situation ReportBenjamin Bengfort
 
Graph Based Machine Learning on Relational Data
Graph Based Machine Learning on Relational DataGraph Based Machine Learning on Relational Data
Graph Based Machine Learning on Relational DataBenjamin Bengfort
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnBenjamin Bengfort
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
Evolutionary Design of Swarms (SSCI 2014)
Evolutionary Design of Swarms (SSCI 2014)Evolutionary Design of Swarms (SSCI 2014)
Evolutionary Design of Swarms (SSCI 2014)Benjamin Bengfort
 
An Overview of Spanner: Google's Globally Distributed Database
An Overview of Spanner: Google's Globally Distributed DatabaseAn Overview of Spanner: Google's Globally Distributed Database
An Overview of Spanner: Google's Globally Distributed DatabaseBenjamin Bengfort
 
Graph Analyses with Python and NetworkX
Graph Analyses with Python and NetworkXGraph Analyses with Python and NetworkX
Graph Analyses with Python and NetworkXBenjamin Bengfort
 
Natural Language Processing with Python
Natural Language Processing with PythonNatural Language Processing with Python
Natural Language Processing with PythonBenjamin Bengfort
 
Beginners Guide to Non-Negative Matrix Factorization
Beginners Guide to Non-Negative Matrix FactorizationBeginners Guide to Non-Negative Matrix Factorization
Beginners Guide to Non-Negative Matrix FactorizationBenjamin Bengfort
 

Mehr von Benjamin Bengfort (18)

Getting Started with TRISA
Getting Started with TRISAGetting Started with TRISA
Getting Started with TRISA
 
Visual diagnostics for more effective machine learning
Visual diagnostics for more effective machine learningVisual diagnostics for more effective machine learning
Visual diagnostics for more effective machine learning
 
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
 
Dynamics in graph analysis (PyData Carolinas 2016)
Dynamics in graph analysis (PyData Carolinas 2016)Dynamics in graph analysis (PyData Carolinas 2016)
Dynamics in graph analysis (PyData Carolinas 2016)
 
Visualizing the Model Selection Process
Visualizing the Model Selection ProcessVisualizing the Model Selection Process
Visualizing the Model Selection Process
 
Data Product Architectures
Data Product ArchitecturesData Product Architectures
Data Product Architectures
 
A Primer on Entity Resolution
A Primer on Entity ResolutionA Primer on Entity Resolution
A Primer on Entity Resolution
 
An Interactive Visual Analytics Dashboard for the Employment Situation Report
An Interactive Visual Analytics Dashboard for the Employment Situation ReportAn Interactive Visual Analytics Dashboard for the Employment Situation Report
An Interactive Visual Analytics Dashboard for the Employment Situation Report
 
Graph Based Machine Learning on Relational Data
Graph Based Machine Learning on Relational DataGraph Based Machine Learning on Relational Data
Graph Based Machine Learning on Relational Data
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-Learn
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Evolutionary Design of Swarms (SSCI 2014)
Evolutionary Design of Swarms (SSCI 2014)Evolutionary Design of Swarms (SSCI 2014)
Evolutionary Design of Swarms (SSCI 2014)
 
An Overview of Spanner: Google's Globally Distributed Database
An Overview of Spanner: Google's Globally Distributed DatabaseAn Overview of Spanner: Google's Globally Distributed Database
An Overview of Spanner: Google's Globally Distributed Database
 
Graph Analyses with Python and NetworkX
Graph Analyses with Python and NetworkXGraph Analyses with Python and NetworkX
Graph Analyses with Python and NetworkX
 
Natural Language Processing with Python
Natural Language Processing with PythonNatural Language Processing with Python
Natural Language Processing with Python
 
Beginners Guide to Non-Negative Matrix Factorization
Beginners Guide to Non-Negative Matrix FactorizationBeginners Guide to Non-Negative Matrix Factorization
Beginners Guide to Non-Negative Matrix Factorization
 
Annotation with Redfox
Annotation with RedfoxAnnotation with Redfox
Annotation with Redfox
 
Rasta processing of speech
Rasta processing of speechRasta processing of speech
Rasta processing of speech
 

Kürzlich hochgeladen

Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Building Data Products with Python (Georgetown)

  • 1. Applied Data Analytics Building a real data product
  • 2. Links to various resources: GitHub repository http://bit.ly/1eLBzki; Matrix Factorization http://slidesha.re/15Qssf0
  • 3. Goals for this Course ● Apply the ideas and tools learned during all previous program courses ● Use a real world data set with actionable prediction ● Present a completed project to faculty and peers ● Build a data project portfolio What are your goals? ● Understand the Data Science Pipeline ● Understand what a complete data product looks like ● Be able to set up and implement a data product in Python
  • 4. Some Logistics This is a small class, so I’m hoping for lots of participation! Course materials can be found in three places: ● iPython: http://bit.ly/1gJ73Tt ● Github: https://github.com/DistrictDataLabs/science-bookclub ● Slides: on slideshare or on Blackboard Recommended Reading: ● Matrix Factorization: A simple tutorial and implementation ● http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/
  • 5. Agenda - Day One ● Review Data Products ● Review Data Science Pipeline ● Discuss architecture of the data product we’re going to build. ● Setting up our project ● Ingestion of Goodreads Data ● Lunch ● Creating a command line admin program ● Wrangling of Goodreads Data ● A computational data store
  • 6. Agenda - Day Two ● Review current state of recommender project ● Matrix math review ● Introduction to matrix factorization ● Building a recommender system ● Reporting with Jinja2 ● Lunch ● Presentations of Capstone Projects ● Course wrap-up
  • 8. “A data product is a product that is based on the combination of data and algorithms.” Hilary Mason
  • 10. “A data application acquires its value from the data itself, and creates more data as a result. It’s not just an application with data; it’s a data product. Data science enables the creation of data products.” Mike Loukides
  • 12. The Data Science Pipeline
  • 13. Data Ingestion → Data Munging and Wrangling → Computation and Analyses → Modeling and Application → Reporting and Visualization
  • 14. Data Ingestion ● There is a world of data out there; how do we get it? Web crawlers, APIs, sensors? Python and other web scripting languages are tailor-made for this task. ● The real question is how we can deal with such a giant volume and velocity of data. ● Big Data and Data Science often require ingestion specialists!
  • 15. Data Wrangling ● Warehousing the data means storing the data in as raw a form as possible. ● Extract, transform, and load operations move data to operational storage locations. ● Filtering, aggregation, normalization and denormalization all ensure data is in a form it can be computed on. ● Annotated training sets must be created for ML tasks.
  • 16. Computation and Analyses ● Hypothesis-driven computation includes the design and development of predictive models. ● Many models have to be trained or constrained into a computational form, such as a graph database, and this is time-consuming. ● Other data products like indices, relations, classifications, and clusters may be computed.
  • 17. Modeling and Application This is the part we’re most familiar with: supervised classification and unsupervised clustering, using Bayes, logistic regression, decision trees, and other models. This is also where the money is.
  • 18. Reporting and Visualization ● Often overlooked, this part is crucial, even if we have data products. ● Humans recognize patterns better than machines. Human feedback is crucial in Active Learning and remodeling (error detection). ● Mashups and collaborations generate more data, and therefore more value!
  • 19. Don’t forget feedback! (Active Learning for Data Products)
  • 20. What we’re going to build today SCIENCE BOOKCLUB!! ● A book club that chooses what to read via a recommender system. ● Uses Goodreads data to ingest and return feedback on books. ● The statistical model is a non-negative matrix factorization. ● Reporting using Jinja (almost a web app).
  • 21. Workflow 1. Setting up a Python skeleton 2. Creating and Running Tests 3. Wading in with a configuration 4. Ingestion with urllib and requests 5. Creating a command line admin with argparse 6. Wrangling with BeautifulSoup and SQLAlchemy 7. Modeling with numpy 8. Reporting with Jinja2
  • 22. Octavo Architecture (really clear DSP) [diagram]: Ingestion Module (requests.py); Raw Data Storage; Wrangling Module (BeautifulSoup, SQLAlchemy); Computational Data Storage; Recommender Module (Numpy); Reporting Module (Jinja2, Matplotlib)
  • 23. Let’s dive into some code!
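
As a warm-up for the modeling step, here is a minimal sketch of the kind of matrix factorization the recommender relies on. It is not the course's implementation: the function names (`factorize`, `predict`), the hyperparameters, and the toy `RATINGS` matrix are all assumptions made for illustration, and the data is invented rather than taken from Goodreads. The slides use numpy for this step; the sketch sticks to plain Python lists so that it runs without extra dependencies. Non-negativity is enforced by clipping each factor at zero after a gradient step (projected stochastic gradient descent), which is one simple way to keep the factors non-negative.

```python
import random

# Toy ratings matrix (rows = readers, columns = books); 0 means "not rated".
# Invented data for illustration only.
RATINGS = [
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
]


def factorize(ratings, n_factors=2, steps=3000, lr=0.01, reg=0.02, seed=0):
    """Approximate ratings as P x Q^T using only the observed (non-zero) cells.

    Returns P (readers x factors) and Q (books x factors), both non-negative.
    """
    rng = random.Random(seed)
    n_users, n_items = len(ratings), len(ratings[0])
    P = [[rng.random() for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.random() for _ in range(n_factors)] for _ in range(n_items)]
    observed = [
        (u, i, r)
        for u, row in enumerate(ratings)
        for i, r in enumerate(row)
        if r > 0
    ]
    for _ in range(steps):
        for u, i, r in observed:
            err = r - sum(P[u][k] * Q[i][k] for k in range(n_factors))
            for k in range(n_factors):
                p, q = P[u][k], Q[i][k]
                # Gradient step with L2 regularization, then clip at zero
                # so the factors stay non-negative.
                P[u][k] = max(0.0, p + lr * (err * q - reg * p))
                Q[i][k] = max(0.0, q + lr * (err * p - reg * q))
    return P, Q


def predict(P, Q, user, item):
    """Predicted rating: dot product of a reader's and a book's factors."""
    return sum(p * q for p, q in zip(P[user], Q[item]))


if __name__ == "__main__":
    P, Q = factorize(RATINGS)
    for user, row in enumerate(RATINGS):
        print([round(predict(P, Q, user, item), 2) for item in range(len(row))])
```

The cells that were 0 in `RATINGS` receive predicted values as well, and those predictions are what a recommender would rank to pick the next book. With numpy, the inner loops could be replaced by vectorized operations on arrays.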