3. Goals for this Course
● Apply the ideas and tools learned during all previous program courses
● Use a real-world data set with actionable predictions
● Present a completed project to faculty and peers
● Build a data project portfolio
What are your goals?
● Understand the Data Science Pipeline
● Understand what a complete data product looks like
● Be able to set up and implement a data product in Python
4. Some Logistics
This is a small class, so I’m hoping for lots of participation!
Course materials can be found in three places:
● IPython: http://bit.ly/1gJ73Tt
● GitHub: https://github.com/DistrictDataLabs/science-bookclub
● Slides: on SlideShare or on Blackboard
Recommended Reading:
● Matrix Factorization: A simple tutorial and implementation
● http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/
5. Agenda - Day One
● Review Data Products
● Review Data Science Pipeline
● Discuss architecture of the data product we’re going to build.
● Setting up our project
● Ingestion of Goodreads Data
● Lunch
● Creating a command line admin program
● Wrangling of Goodreads Data
● A computational data store
6. Agenda - Day Two
● Review current state of recommender project
● Matrix math review
● Introduction to matrix factorization
● Building a recommender system
● Reporting with Jinja2
● Lunch
● Presentations of Capstone Projects
● Course wrap-up
8. “A data product is a product that is
based on the combination of data
and algorithms.”
Hilary Mason
10. “A data application acquires its value from the
data itself, and creates more data as a result.
It’s not just an application with data; it’s a
data product. Data science enables the
creation of data products.”
Mike Loukides
13. Data Ingestion → Data Munging and Wrangling → Computation and Analyses → Modeling and Application → Reporting and Visualization
14. Data Ingestion
● There is a world of data out
there. How do we get it? Web
crawlers, APIs, sensors? Python
and other web scripting
languages are well suited to
this task.
● The real question is how can we
deal with such a giant volume
and velocity of data?
● Big Data and Data Science often
require ingestion specialists!
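A minimal sketch of an ingestion step, under stated assumptions: the endpoint URL, the `HYPOTHETICAL_KEY` value, and the helper names `build_url`, `ingest`, and `fake_fetch` are illustrative and not taken from the course code. The fetch function is passed in so the sketch runs offline; in practice it could be replaced by something like `lambda u: requests.get(u).text`. The response is written to disk unmodified, in keeping with storing data in as raw a form as possible.

```python
import hashlib
import os
import tempfile
from urllib.parse import urlencode


def build_url(base, **params):
    """Build a request URL with sorted, URL-encoded query parameters."""
    return base + "?" + urlencode(sorted(params.items()))


def ingest(url, fetch, raw_dir):
    """Fetch `url` and store the response body, unmodified, in `raw_dir`."""
    body = fetch(url)
    # Name the file after a hash of the URL so a repeated run
    # overwrites the earlier copy instead of duplicating it.
    name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".xml"
    path = os.path.join(raw_dir, name)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(body)
    return path


def fake_fetch(url):
    """Stand-in for a network call (hypothetical response body)."""
    return "<book><title>Example</title></book>"


raw_dir = tempfile.mkdtemp()
url = build_url("https://example.com/book/show", id=42, key="HYPOTHETICAL_KEY")
saved_path = ingest(url, fake_fetch, raw_dir)
```

Injecting the fetch function also makes the ingestion step easy to test without touching the network.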
15. Data Wrangling
● Warehousing the data means
storing the data in as raw a form
as possible.
● Extract, transform, and load
operations move data to
operational storage locations.
● Filtering, aggregation,
normalization and
denormalization all ensure data is
in a form it can be computed on.
● Annotated training sets must be
created for ML tasks.
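The extract, transform, and load steps above can be sketched as follows. This is only an illustration: the XML layout, the `ratings` table, and the rule that a rating of 0 means "not rated" are assumptions, and the standard library (`xml.etree`, `sqlite3`) stands in for BeautifulSoup and SQLAlchemy so the example stays self-contained.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical raw data, as it might sit in raw storage.
RAW_XML = """
<reviews>
  <review><user>ann</user><title> Dune </title><rating>5</rating></review>
  <review><user>bob</user><title>Dune</title><rating>4</rating></review>
  <review><user>ann</user><title>Emma</title><rating>0</rating></review>
</reviews>
"""


def extract(raw_xml):
    """Pull (user, title, rating) string triples out of the raw XML."""
    for node in ET.fromstring(raw_xml.strip()).iter("review"):
        yield node.findtext("user"), node.findtext("title"), node.findtext("rating")


def transform(rows):
    """Normalize whitespace, cast ratings to int, and filter unrated rows."""
    for user, title, rating in rows:
        rating = int(rating)
        if rating > 0:  # assumption: 0 means "not rated"
            yield user.strip(), title.strip(), rating


def load(rows, conn):
    """Load the cleaned rows into operational (computable) storage."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ratings (user TEXT, title TEXT, rating INTEGER)"
    )
    conn.executemany("INSERT INTO ratings VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_XML)), conn)

# Aggregation: average rating per title.
averages = conn.execute(
    "SELECT title, AVG(rating) FROM ratings GROUP BY title ORDER BY title"
).fetchall()
```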
16. Computation and Analyses
● Hypothesis-driven computation
includes design and development
of predictive models.
● Many models have to be trained
or constrained into a
computational form, such as a
graph database, and this is
time-consuming.
● Other data products like indices,
relations, classifications, and
clusters may be computed.
17. Modeling and Application
This is the part we’re most familiar with.
Supervised classification, unsupervised
clustering: Bayes, Logistic Regression,
Decision Trees, and other models.
This is also where the money is.
18. Reporting and Visualization
● Often overlooked, this part is
crucial, even if we have data
products.
● Humans recognize patterns
better than machines. Human
feedback is crucial in Active
Learning and remodeling (error
detection).
● Mashups and collaborations
generate more data, and
therefore more value!
20. What we’re going to build today
SCIENCE BOOKCLUB!!
● A book club that chooses what to
read via a recommender system.
● Ingests Goodreads data and
returns feedback on books.
● Statistical model is a non-negative
matrix factorization
● Reporting using Jinja (almost a
web app)
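A rough sketch of what a non-negative matrix factorization for ratings might look like in numpy. This is not the course's implementation: it assumes masked multiplicative updates (so unrated entries, encoded here as 0, do not pull predictions toward zero), and the ratings matrix, the `nmf` and `masked_error` names, the rank, and the iteration count are all illustrative choices.

```python
import numpy as np


def masked_error(V, mask, W, H):
    """Squared reconstruction error over observed entries only."""
    return float(np.sum((mask * (V - W @ H)) ** 2))


def nmf(V, mask, rank, iterations=500, eps=1e-9, seed=0):
    """Factor V ~ W @ H with W, H >= 0, fitting only entries where mask is True."""
    rng = np.random.default_rng(seed)
    n_users, n_books = V.shape
    M = mask.astype(float)
    # Random non-negative starting factors.
    W = rng.random((n_users, rank)) + eps
    H = rng.random((rank, n_books)) + eps
    for _ in range(iterations):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ (M * V)) / (W.T @ (M * (W @ H)) + eps)
        W *= ((M * V) @ H.T) / ((M * (W @ H)) @ H.T + eps)
    return W, H


# Rows are users, columns are books; 0 means "not yet rated" (assumption).
V = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)
mask = V > 0

W, H = nmf(V, mask, rank=2)
predicted = W @ H  # dense matrix: includes estimates for unrated books
```

The unrated cells of `predicted` are the recommender's guesses; the highest-scoring unrated book for a user would be the suggestion.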
21. Workflow
1. Setting up a Python skeleton
2. Creating and Running Tests
3. Wading in with a configuration
4. Ingestion with urllib and requests
5. Creating a command line admin with argparse
6. Wrangling with BeautifulSoup and SQLAlchemy
7. Modeling with numpy
8. Reporting with Jinja2
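Step 5 can be sketched with argparse subcommands. The program name `bookclub-admin`, the `ingest` and `wrangle` subcommands, and their options are hypothetical, chosen only to show the pattern; the handlers return a message instead of doing real work.

```python
import argparse


def build_parser():
    """Build an admin parser with one subcommand per pipeline stage."""
    parser = argparse.ArgumentParser(
        prog="bookclub-admin",  # hypothetical program name
        description="Administrative commands for the book club project.",
    )
    subparsers = parser.add_subparsers(dest="command", required=True)

    ingest = subparsers.add_parser("ingest", help="download raw book data")
    ingest.add_argument("book_ids", nargs="+", type=int, help="book IDs to fetch")
    ingest.add_argument("--outdir", default="raw", help="raw data directory")

    wrangle = subparsers.add_parser("wrangle", help="load raw data into the database")
    wrangle.add_argument("--database", default="sqlite:///bookclub.db")

    return parser


def main(argv=None):
    """Parse arguments and dispatch; returns a message describing the action."""
    args = build_parser().parse_args(argv)
    if args.command == "ingest":
        return "ingest %d books into %s" % (len(args.book_ids), args.outdir)
    return "wrangle into %s" % args.database
```

Calling `main(["ingest", "1", "2", "--outdir", "data"])` exercises the parser from Python; a script entry point would call `main()` with no arguments so that argparse reads `sys.argv`.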
22. Octavo Architecture (really clear DSP)
[Architecture diagram: Ingestion Module (requests.py) → Raw Data Storage → Wrangling Module (BeautifulSoup, SQLAlchemy) → Computational Data Storage → Recommender Module (Numpy) → Reporting Module (Matplotlib, Jinja2)]