7. Collect the Data
• What kind of data do we need?
• Financial data (budget, box office...)
• Reviews, ratings and scores.
• Awards and nominations.
8. Process the data
• How is the data "dirty" and how can we fix it?
• User input, redundancies, missing data...
• Formatting: adapt the data to meet certain
specifications.
• Cleaning: detecting and correcting
corrupt or inaccurate records.
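A minimal sketch of the formatting and cleaning steps with pandas. The table, its column names (`title`, `budget`, `rating`) and its values are made up for illustration and are not the workshop dataset.

```python
import pandas as pd

# Hypothetical toy data: one duplicated row, one missing budget,
# and ratings typed inconsistently (user input).
movies = pd.DataFrame({
    "title": ["Movie A", "Movie A", "Movie B", "Movie C"],
    "budget": ["1000000", "1000000", None, "250000"],
    "rating": [" pg", " pg", "R", "pg-13 "],
})

# Cleaning: drop exact duplicate rows (redundancies).
clean = movies.drop_duplicates().copy()

# Formatting: strip whitespace and use one consistent case for ratings.
clean["rating"] = clean["rating"].str.strip().str.upper()

# Formatting: convert budget strings to numbers; missing values become NaN.
clean["budget"] = pd.to_numeric(clean["budget"], errors="coerce")

# Cleaning: fill the missing budget with the median of the known budgets.
# (Median imputation is one possible choice, not the only one.)
clean["budget"] = clean["budget"].fillna(clean["budget"].median())

print(clean)
```

Whether to fill, drop, or flag missing values depends on the dataset; the median is used here only because it is simple and robust to outliers.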
9. Explore the data
• What are the meaningful patterns in the
data?
• How meaningful is each data point for our
predictions?
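One way to start answering these questions is to look at summary statistics and correlations. This is a rough sketch on made-up numbers; the column names are assumptions, and correlation only captures linear relationships.

```python
import pandas as pd

# Hypothetical toy data (budget and box office in millions of dollars).
films = pd.DataFrame({
    "budget": [10.0, 50.0, 100.0, 20.0, 70.0],
    "review_score": [6.0, 5.0, 9.0, 7.0, 8.0],
    "box_office": [30.0, 40.0, 350.0, 60.0, 200.0],
})

# Count, mean, spread and quartiles for every numeric column.
summary = films.describe()

# Pearson correlation of each column with the quantity we want to predict.
correlations = films.corr()["box_office"].drop("box_office")

print(summary)
print(correlations)
```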
10. Goals
• Introduction to a data scientist's tools and
methods:
• Jupyter notebooks, numpy, pandas,
sklearn...
• Overview of basic machine learning
concepts:
• Data formatting and cleaning, decision
trees, overfitting, random forests...
11. Jupyter Notebooks
• One of a data scientist's everyday tools.
• Find the links in our classroom tool.
• A notebook contains cells with code.
12. NumPy
• The fundamental package for scientific
computing with Python.
• Provides powerful multi-dimensional array
objects.
• Many methods for fast operations on arrays.
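A short sketch of what array operations look like in practice. The numbers are invented for illustration.

```python
import numpy as np

# Hypothetical budgets and box-office takings, in millions of dollars.
budget = np.array([10.0, 50.0, 100.0])
box_office = np.array([30.0, 40.0, 350.0])

# Element-wise arithmetic works on whole arrays at once, with no Python loop.
profit = box_office - budget          # array([ 20., -10., 250.])
roi = profit / budget                 # return on investment per film

# A two-dimensional array: one row per film, one column per quantity.
table = np.column_stack([budget, box_office])
print(table.shape)                    # (3, 2)
print(table.mean(axis=0))             # mean of each column
```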
13. Pandas
• Fundamental high-level building block for
doing practical, real-world data analysis in
Python.
• Built on top of NumPy.
• Offers data structures and operations for
manipulating numerical tables and time
series.
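A small sketch of the kind of table manipulation pandas offers, using a made-up DataFrame (the column names and values are assumptions).

```python
import pandas as pd

# Hypothetical toy table (money columns in millions of dollars).
films = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "genre": ["Drama", "Comedy", "Drama", "Comedy"],
    "budget": [10.0, 50.0, 100.0, 20.0],
    "box_office": [30.0, 40.0, 350.0, 60.0],
})

# Add a derived column without writing an explicit loop.
films["profit"] = films["box_office"] - films["budget"]

# Filter rows with a boolean condition.
profitable = films[films["profit"] > 0]

# Group by a category and aggregate.
mean_profit_by_genre = films.groupby("genre")["profit"].mean()

print(profitable)
print(mean_profit_by_genre)
```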
14. Scikit-learn
• Python module for machine learning.
• Built on top of NumPy and SciPy.
• Provides tools for classification, regression,
clustering, preprocessing and model
evaluation.
16. Understanding your data
• .head(n) method: returns the first n rows
(5 by default).
• .value_counts() method: returns the count
of each unique value in a column (Series).
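A quick sketch of both methods on a made-up table (the `rating` column and its values are assumptions).

```python
import pandas as pd

# Hypothetical toy table.
films = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"],
    "rating": ["PG", "R", "PG", "PG-13", "PG"],
})

# First 2 rows of the table.
first_two = films.head(2)

# How often each rating occurs, most frequent first.
rating_counts = films["rating"].value_counts()

print(first_two)
print(rating_counts)
```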
18. Formatting your Data
• Rating values are in a non-numeric format.
Thus, we will need to assign each rating a
unique integer so that scikit-learn can handle
the information.
• With the .loc indexer (the older .ix indexer
was removed in pandas 1.0) you select a subset
of rows and assign a value to a certain variable
of that subset of observations.
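A minimal sketch of this encoding step using the `.loc` indexer, the current pandas way to select a subset of rows and assign to one column. The rating labels and the integer codes are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy column of non-numeric ratings.
films = pd.DataFrame({"rating": ["PG", "R", "PG-13", "PG"]})

# Start with a placeholder, then assign one integer per rating.
# Any mapping works as long as each rating gets its own integer.
films["rating_code"] = -1
films.loc[films["rating"] == "PG", "rating_code"] = 0
films.loc[films["rating"] == "PG-13", "rating_code"] = 1
films.loc[films["rating"] == "R", "rating_code"] = 2

# The same result in one step with a dictionary and .map().
codes = {"PG": 0, "PG-13": 1, "R": 2}
films["rating_code_map"] = films["rating"].map(codes)

print(films)
```

The `.map()` variant scales better when there are many categories; the `.loc` variant makes the row selection explicit.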
20. Decision Trees
• A decision tree breaks down a dataset into
smaller and smaller subsets.
• The final result is a model with a tree
structure that has:
• Decision nodes: ask a question and have
two or more branches.
• Leaf nodes: represent a classification or
decision.
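To make the node structure concrete, here is a rough sketch that fits a tiny tree with scikit-learn and prints it as text. The data and feature names are invented; note that scikit-learn trees are binary, so each decision node has exactly two branches.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: [budget in millions, average review score].
X = [[10, 4.0], [20, 8.5], [80, 5.0], [90, 9.0], [15, 3.5], [100, 8.0]]
y = [0, 1, 0, 1, 0, 1]   # 1 = "hit", 0 = "not a hit" (made-up labels)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Each "|---" line with a comparison is a decision node;
# each "class: ..." line is a leaf node.
print(export_text(tree, feature_names=["budget", "review_score"]))

n_leaves = tree.get_n_leaves()
depth = tree.get_depth()
```

In this toy set the review score alone separates the two classes, so the tree needs only one question.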
22. Classification vs Regression
• Classification: predict categories.
• Identifying group membership.
• Regression: predict continuous values.
• Involves estimating or predicting a
numeric response.
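The difference can be sketched with the two tree estimators in scikit-learn: the same inputs, but a categorical target in one case and a numeric target in the other. All values below are made up.

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Hypothetical budgets in millions of dollars (one feature per film).
X = [[10], [20], [80], [90]]

# Classification: the target is a category (0 = flop, 1 = hit).
labels = [0, 0, 1, 1]
clf = DecisionTreeClassifier(random_state=0).fit(X, labels)
predicted_class = clf.predict([[95]])[0]

# Regression: the target is a number (box office in millions).
revenue = [5.0, 15.0, 200.0, 240.0]
reg = DecisionTreeRegressor(random_state=0).fit(X, revenue)
predicted_value = reg.predict([[95]])[0]

print(predicted_class, predicted_value)
```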
25. Creating your first Decision Tree
You will use the scikit-learn and numpy
libraries to build your first decision tree. We
will need the following to build a decision tree:
• target: A one-dimensional numpy array
containing the target from the train data.
• features: A two-dimensional numpy array
(one row per observation, one column per
feature) containing the features/predictors
from the train data.
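A minimal sketch of these two inputs and the fit step. The `train` table, its column names and the `hit` target are hypothetical stand-ins for the workshop's training data.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data.
train = pd.DataFrame({
    "budget": [10, 20, 80, 90, 15, 100],
    "review_score": [4.0, 8.5, 5.0, 9.0, 3.5, 8.0],
    "hit": [0, 1, 0, 1, 0, 1],
})

# target: one-dimensional array with what we want to predict.
target = train["hit"].to_numpy()

# features: two-dimensional array, one row per film, one column per predictor.
features = train[["budget", "review_score"]].to_numpy()

# Create the tree and fit it to the training data.
my_tree = DecisionTreeClassifier(random_state=1)
my_tree = my_tree.fit(features, target)

print(target.shape, features.shape)
```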
27. Importances and Score
• .feature_importances_ attribute: tells us
how important each feature is for the final
result.
• .score() method: returns the mean accuracy
of the model's predictions on the data and
labels passed to it.
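A short sketch of reading both values from a fitted tree. The arrays are made-up toy data, and the feature order (budget, review score) is an assumption.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: columns are [budget, review_score].
features = np.array([[10, 4.0], [20, 8.5], [80, 5.0],
                     [90, 9.0], [15, 3.5], [100, 8.0]])
target = np.array([0, 1, 0, 1, 0, 1])

my_tree = DecisionTreeClassifier(random_state=1).fit(features, target)

# One value per feature, in column order; the values sum to 1.
importances = my_tree.feature_importances_

# Mean accuracy on the data passed in (here: the training data itself).
train_accuracy = my_tree.score(features, target)

print(importances, train_accuracy)
```

Accuracy measured on the training data tends to be optimistic; scoring on data the model has not seen gives a more honest estimate.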
34. Overfitting
• The resulting model is too tied to the training set.
• It doesn't generalize to new data, which is
the point of prediction.
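Overfitting shows up as a gap between accuracy on the training set and accuracy on held-out data. The sketch below uses synthetic noisy data from `make_classification` (not the workshop dataset) and compares a fully grown tree with one limited by `max_depth` and `min_samples_split`; the specific parameter values are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (flip_y) so there is something to overfit.
X, y = make_classification(n_samples=400, n_features=10, n_informative=3,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# A fully grown tree memorizes the training set, noise included.
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Limiting depth and split size keeps the tree simpler.
shallow = DecisionTreeClassifier(max_depth=3, min_samples_split=5,
                                 random_state=0).fit(X_train, y_train)

deep_train, deep_test = deep.score(X_train, y_train), deep.score(X_test, y_test)
shallow_train, shallow_test = (shallow.score(X_train, y_train),
                               shallow.score(X_test, y_test))

print("deep:    train", deep_train, "test", deep_test)
print("shallow: train", shallow_train, "test", shallow_test)
```

The deep tree scores perfectly on the data it was trained on but noticeably worse on the held-out half; how the shallow tree compares depends on the data.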
35. Random Forest Classifier
• Random Forest Classifiers use many
Decision Trees to build a classifier.
• We introduce a bit of randomness: each Tree
is trained on a random sample of the data and
considers a random subset of features at each
split.
• Each Tree can give a different answer (a
vote). The final classification is the most
common amongst the Trees.
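A rough sketch of a random forest on synthetic data (again from `make_classification`, not the workshop dataset). The parameter values are arbitrary. One caveat: scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes, which usually agrees with the majority vote but is not identical to it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic noisy data, split into training and held-out halves.
X, y = make_classification(n_samples=400, n_features=10, n_informative=3,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# 100 trees, each trained on a bootstrap sample of the training data.
forest = RandomForestClassifier(n_estimators=100, max_depth=5,
                                random_state=0).fit(X_train, y_train)

# Accuracy of the whole forest on held-out data.
forest_accuracy = forest.score(X_test, y_test)

# The individual "votes" of each tree for the first held-out observation.
one_observation = X_test[:1]
votes = [int(tree.predict(one_observation)[0]) for tree in forest.estimators_]

print("forest accuracy:", forest_accuracy)
print("votes for class 1:", sum(votes), "of", len(votes))
```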
52. More about Thinkful
• Anyone who's committed can learn to code
• 1-on-1 mentorship is the best way to learn
• Flexibility! Learn anywhere, anytime, & at your
own pace
53. Our Program
You'll learn concepts, practice with drills, and build
capstone projects, all guided by a personal mentor
55. Data Science Syllabus
• Managing data with SQL and Python
• Modeling with both supervised and unsupervised
models
• Data visualization and communicating with data
• Technical interviews + Career prep
57. Special Introductory Offer
• Prep course for 50% off: $250 instead of $500
• Covers math, stats, Python, and data science toolkit
• Option to continue into full program
• Talk to me (or email me) if you're interested