5. STARTING OUT IN DATA ANALYSIS
• Online: blogs, GitHub, MOOCs, Kaggle,
Data Tau, Cross Validated,
Stack Overflow...
• Books
• School work
TOO MANY RESOURCES
8. • Programmed in C
• Used MATLAB at Uni
• Spent a long time playing with the symbolic
languages Mathematica & Maple
START BY WHAT YOU KNOW & ASK YOUR FRIENDS
MY EXPERIENCE
P.S. I had not met the iPython notebook.
9. BIG REVEAL: I AM AN AVID R USER
MY EXPERIENCE (cont)
• Don’t have a web dev background
• Surrounded by people doing Stats
• Pick the right tool for the task at hand
10. TL;DR - CAN BE CONFUSING FOR A NEWBIE
LANGUAGE WARS
Too many articles about:
• “Python Displacing R As The Programming Language
For Data Analysis”
• “Is Python really supplanting R for data work?”
• “10 Reasons Python Rocks for Research”
• “Why Python is steadily eating other languages' lunch”
• “Why I’m betting on Julia”
• “What are the advantages of using Python over R?”
• “Why Python with Coffee is better than R with Ice
Cream”
12. LANGUAGE WARS
However, it is good to have a general
understanding of the + and - of the various data
analysis tools, in order to pick the right tool for the
job.
• R has EVERYTHING you need for performing
statistical analysis.
• R / MATLAB / Python are great for prototyping
• Python is a full featured programming language
• Easier to incorporate Python outcomes into a full
data product workflow
13. DEFINE THE PROBLEM
Time better spent defining the problem and
determining what is the best way to solve it
GOOD TO HAVE A BIG BAG OF TRICKS
Re-do R analysis using Python data analysis stack
WILL IT PYTHON? CREDIT: SLENDER MEANS
14. PYTHON SCIKIT LEARN
IT IS PRETTY AWESOME
• Library of Machine Learning Algorithms
• Open source
• API
• Python, Numpy & Co
• Accessible, many models, documentation &
examples
16. CHOOSING A PROBLEM
1 Look for a data set that is interesting to you -
always a good idea
2 Formulate a question
3 Formulate a hypothesis
4 Build a model to answer the question and test it
SCIENTIFIC METHOD FTW
18. EMI MUSIC
“ONE MILLION INTERVIEW SET”
• One of the largest preference data sets in the
world.
• Extract used in the Data Science London hackathon and
available on Kaggle as four separate data sets.
19. FOUR DATA SETS
• TRAIN / TEST - artist, track, userID, time & ratings
• WORDS - userID, heard_of, own_artist_music,
like_artist, 82 adjectives
• USERS - userID, gender, age, working status, region,
music, list_own (hours per day), list_back (hours
per day), 19 user habits questions (0-100)
20. USERS
KEY STRING
1 “Music is important to me but not necessarily most important”
2 “I like music but it does not feature heavily in my life”
3 “Music means a lot to me and it is a passion of mine”
4 “Music has no particular interest to me”
5 “Music is important to me but not necessarily more important
than other hobbies”
6 “Music is no longer as important as it used to be”
21. WORDS DATASET
UNINSPIRED, AGGRESSIVE, UNATTRACTIVE,
BORING, CHEAP, IRRELEVANT, WAY OUT,
ANNOYING, CHEESY, UNORIGINAL,
OUTDATED, UNAPPROACHABLE...
82 ADJECTIVES
23. USERS
19 MUSIC HABIT QUESTIONS: rate (0-100) whether the user
agrees with statements such as:
“I enjoy actively searching for and discovering
music that I have never heard before”
“I am not willing to pay for music”
“I like to be at the cutting edge of new music”
“I love tech”
27. MOTIVATION
• PRODUCTION - Cheaper to produce (lower barriers to
entry for budding artists).
• DISTRIBUTION - Internet has made music more
accessible. Artists can decide where and how to
sell.
• CONSUMPTION - People’s listening habits have changed
due to the internet and to the change in devices.
TECHNOLOGY HAS BEEN A DISRUPTIVE FORCE IN THE MUSIC INDUSTRY.
28. PROBLEMS
• ARTISTS - Easier to produce music, harder to make
themselves known or earn a living.
• RECORD COMPANIES - People buy per song, easy for
listener to consume without paying. Wider
competition field.
• LISTENERS - Too many choices. Discovery is difficult.
29. QUESTIONS
• Can one predict the rating of a song?
• What factors are important to determine how
much a person likes a song?
• What is the minimal set of factors that are needed
to determine how much a person likes a song?
31. FIRST ATTEMPT
• Regression problem
• Turn categorical variables into numeric variables
• Consider ALL features and pick machine learning
algorithm to do the job.
CAN ONE PREDICT THE RATING OF A SONG?
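As a hedged sketch of the "turn categorical variables into numeric variables" step, here is one way to one-hot encode with pandas. The column names (`gender`, `heard_of`, `rating`) are toy stand-ins, not the actual EMI fields:

```python
import pandas as pd

# Hypothetical toy frame standing in for the EMI data: 'gender' and
# 'heard_of' are categorical, 'rating' is the 0-100 target.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "heard_of": ["Yes", "No", "Yes", "Yes"],
    "rating": [70, 15, 88, 42],
})

# One-hot encode the categorical columns so every feature is numeric.
X = pd.get_dummies(df[["gender", "heard_of"]])
y = df["rating"]

print(list(X.columns))
# ['gender_Female', 'gender_Male', 'heard_of_No', 'heard_of_Yes']
```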
32. FIRST ATTEMPT
• Exploratory analysis revealed that ratings are
highly clustered, so we can look at five different
scores and formulate the problem as a classification
one.
CAN ONE PREDICT THE RATING OF A SONG?
We split ratings 0-100 into 5 intervals,
so each becomes a class and we label these.
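A minimal sketch of the binning step, assuming equal-width intervals (the exact cut points are not given in the deck):

```python
import numpy as np

ratings = np.array([3, 18, 47, 55, 99, 80])  # toy ratings on the 0-100 scale

# Split 0-100 into five equal-width intervals, labelled 0..4.
# Equal-width bins are an assumption; the talk's cut points may differ.
labels = np.minimum(ratings // 20, 4)  # clamp so that 100 lands in the top class

print(labels)  # [0 0 2 2 4 4]
```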
35. RANDOM FORESTS
• Random Forests are built from aggregating trees.
• Can be used for regression & classification problems.
• They are resistant to overfitting and can handle large numbers of features
• They also output a list of features that are believed to be
important in predicting the variable
Highly versatile ensemble method - combines
several models into one.
A.K.A. BEST “BLACK-BOX” METHOD EVER (BREIMAN / CUTLER)
36. RANDOM FORESTS
THE LAYMAN’S INTRO (E. CHEN’s BLOG - 2011)
MOVIES
20 QUESTIONS
WILL JAMIE LIKE X?
BRIENNE IS THE DECISION TREE FOR
JAMIE’S MOVIES PREFERENCES
37. RANDOM FORESTS
THE LAYMAN’S INTRO (E. CHEN’s BLOG - 2011)
Ask Tywin, Cersei, Tyrion...Jamie gives each of them
slightly different info.
THEY FORM A BAGGED FOREST OF JAMIE’S MOVIES PREFERENCES
Jamie demands getting different questions every time.
THEY NOW FORM A RANDOM FOREST OF JAMIE’S MOVIES PREFERENCES
38. RANDOM FORESTS
• A tree of maximal depth is grown on a bootstrap sample of
the training set. There is no pruning.
• A number m << p is specified such that at each node, m
variables are sampled at random out of the p features. The best
split on these variables is used to split the node into two subnodes.
• Final classification is given by majority voting of the
ensemble of trees in the forest.
• Only two “free” parameters: number of trees and number of
variables in random subset at each node.
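The two "free" parameters above map directly onto scikit-learn's `RandomForestClassifier` arguments; a toy sketch on synthetic data (the EMI features are not reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# n_estimators = number of trees; max_features = size of the random
# feature subset tried at each split (sqrt(p) for classification).
clf = RandomForestClassifier(n_estimators=60, max_features="sqrt",
                             random_state=0)
clf.fit(X, y)

# Final classification is the majority vote across trees.
print(clf.predict(X[:5]))
```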
39. RANDOM FORESTS
OUT-OF-BAG (OOB) ERROR
The samples left out of each tree’s bootstrap draw act as a
test set for that tree. The OOB error estimate is
given by the misclassification error (MSE for regression),
averaged over all samples.
VARIABLE IMPORTANCE
Determined by looking at how much prediction error
increases when (OOB) data for that variable is permuted
while all others are left unchanged.
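A sketch of both ideas in scikit-learn: `oob_score=True` computes the out-of-bag estimate, and `feature_importances_` reports importances. Note that scikit-learn's built-in attribute is impurity-based; the permutation scheme described above is available separately as `sklearn.inspection.permutation_importance`. The data below is synthetic:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data with only 3 truly informative features.
X, y = make_regression(n_samples=300, n_features=10, n_informative=3,
                       random_state=0)

# oob_score=True evaluates each tree on the samples left out of its
# bootstrap draw; for regression the reported score is R^2.
rf = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, y)

print(round(rf.oob_score_, 3))           # OOB R^2 estimate
print(rf.feature_importances_.argmax())  # index of the top-ranked feature
```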
40. RANDOM FORESTS IN R & PYTHON
randomForest PACKAGE
• Various implementations - randomForest, caret, party, bigrf
• We follow the KISS procedure - KEEP IT SIMPLE S.
• One can test various values of mtry and the number of
trees.
Used randomForest package 4.6-7 with R 2.15. Defaults are
n=500 trees & mtry= p/3 for regression & sqrt(p) for
classification.
41. RANDOM FORESTS IN R & PYTHON
SCIKIT LEARN
Used SCIKIT LEARN 0.14.1 running Python version 2.7.5.
COMPUTER: Macbook Pro 2.53 GHz Intel Core 2 Duo with 4
GB 1067 MHz DDR3 running OS X 10.6.8
For the comparison we will build “small” forests and
focus on the following simple metrics:
• Training time
• RSQ & RMSE (regression)
• Accuracy (classification)
42. RANDOM FORESTS IN R
RESULTS REGRESSION
Split data into training and test sets. The data frames
have 82,714 rows each and 114 columns.
Parameters: 60 trees, sample of 50,000.
Training time: 39.39 min
RMSE: 14.587
RSQ: 0.581
rf <- randomForest(training, ratings_train, ntree=60,
                   sampsize=50000, importance=TRUE)
43. RANDOM FORESTS IN PYTHON
RESULTS REGRESSION
Split data into training and test sets. The data frames
have 82,714 rows each and 114 columns.
Parameters: 60 trees, sample of 50,000.
Training time: 3 min 7 sec
RMSE: 14.687
RSQ: 0.575
rf = RandomForestRegressor(n_estimators=60, max_features='sqrt')
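For completeness, a self-contained sketch of the fit-and-score loop on synthetic data (the EMI frame is not reproduced here); RMSE and RSQ come from `sklearn.metrics`:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 82,714-row EMI frame.
X, y = make_regression(n_samples=1000, n_features=20, noise=10,
                       random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=60, max_features="sqrt",
                           random_state=0)
rf.fit(X_train, y_train)

pred = rf.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, pred))  # RMSE
rsq = r2_score(y_test, pred)                      # RSQ
print(rmse, rsq)
```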
45. RANDOM FORESTS IN R
FEATURE IMPORTANCE
FEATURE (% INC MSE) | FEATURE (% INC NODE PURITY)
Beautiful | Talented
Boring | Like Artist
Q16 | Catchy
Catchy | Beautiful
Talented | Boring
Q9 | Track
Q19 | Distinctive
None of these | Cool
Age | Q11
Track | Q12
Q16 - I would be willing to pay for the opportunity to
buy new music pre-release
Q9 - I am out of touch with new music
Q19 - I like to know about music before
other people
Q11 - Pop music is fun
Q12 - Pop music helps me escape
Like artist - To what extent do you like or dislike
listening to this artist?
47. RANDOM FORESTS IN PYTHON
FEATURE IMPORTANCE
FEATURE | RANK IN R RANDOM FOREST
Distinctive | 7
Catchy | 3
Like Artist | 2
Fun | -
Talented | 1
Beautiful | 4
Original | -
Unoriginal | -
Q11 | 9
Own Artist Music | -
Own Artist Music - Do you have this artist in
your music collection?
Q11 - Pop music is fun
48. RANDOM FORESTS IN R & PYTHON
MODEL | RMSE
R Random Forest | 14.587
Python Scikit Learn Random Forest | 14.687
Linear Regression | 16.23
Multiple Linear Regressions | 15.53
RESULTS REGRESSION
51. RANDOM FORESTS IN R
FEATURE IMPORTANCE
FEATURE (% INC MSE) | FEATURE (% INC NODE PURITY)
Q9 | Track
Q7 | Q11
Q5 | Q12
Q6 | Age
Age | Q6
Q10 | Q17
listBACK | Q9
Q19 | Q16
listOWN | Q4
Q16 | Q13
Q16 - I would be willing to pay for the opportunity to
buy new music pre-release
Q9 - I am out of touch with new music
Q19 - I like to know about music before
other people
Q11 - Pop music is fun
Q12 - Pop music helps me escape
Q7 - I enjoy music primarily from going out to
dance
Q5 - I used to know where to find music
Q6 - I am not willing to pay for music
Q10 - My music collection is a source of pride
Q4 - I would like to buy new music but I don’t
know what to buy
Q17 - I find seeing a new artist a useful way of
discovering new music
52. RANDOM FORESTS IN PYTHON
FEATURE IMPORTANCE
FEATURE | RANK IN R RANDOM FOREST
Q11 | 2
Q12 | 3
Age | 4
Q6 | 5
Q17 | 6
Q5 | -
Q4 | 9
Q10 | -
Q16 | 7
Q7 | -
Q16 - I would be willing to pay for the opportunity to
buy new music pre-release
Q11 - Pop music is fun
Q12 - Pop music helps me escape
Q5 - I used to know where to find music
Q6 - I am not willing to pay for music
Q10 - My music collection is a source of pride
Q4 - I would like to buy new music but I don’t
know what to buy
Q17 - I find seeing a new artist a useful way of
discovering new music
56. FEATURE SELECTION
PRINCIPAL COMPONENT ANALYSIS - WORDS
Determine which features account for most of the variance.
FEATURE | PC1 | PC2
Distinctive | 0.20 | -0.059
Authentic | 0.19 | -0.046
Talented | 0.19 | -0.083
Credible | 0.19 | -0.084
Stylish | 0.18 | -0.094
Annoying | -0.06 | -0.065
Intrusive | -0.06 | -0.058
Irrelevant | -0.059 | -0.087
Uninspired | -0.056 | -0.092
Noisy | -0.053 | -0.13
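A sketch of the PCA step with scikit-learn, run on a random stand-in for the 82-adjective WORDS matrix (the real data is on Kaggle and is not reproduced here):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Random stand-in for the WORDS matrix (rows = rating events,
# columns = the 82 adjectives, scored 0/1).
X = rng.integers(0, 2, size=(500, 82)).astype(float)

pca = PCA(n_components=2)
pca.fit(X)

# Loadings of each adjective on PC1 and PC2, analogous to the table above.
print(pca.components_.shape)  # (2, 82)
print(pca.explained_variance_ratio_)
```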
57. FEATURE SELECTION
Make a simple model choosing meaningful variables
WORDS - Annoying, Depressing, Boring, Catchy,
Talented, Distinctive, Beautiful, Superstar,
Soulful and Popular.
QUESTIONS - Q4, Q5, Q6, Q9, Q10, Q11 and Q19.
• Running time in R ~ 15 min.
• RMSE = 14.791 / Public leader board 13.076
59. COMMENTS
It is well known that Random Forests have been
shown to be biased towards highly correlated
variables. Using conditional inference trees
ameliorates that bias (see the party package in R).
SCIKIT LEARN’s implementation has an n_jobs parameter
to parallelise training. For a similar feature in R,
see the bigrf package.
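A minimal example of the n_jobs parameter; since every tree is grown independently, the forest parallelises trivially:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data.
X, y = make_classification(n_samples=500, n_features=30, random_state=0)

# n_jobs=-1 trains the trees in parallel across all available cores.
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
clf.fit(X, y)

print(len(clf.estimators_))  # 100 fitted trees
```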
61. CONCLUDING REMARKS
We solved a problem using both R and PYTHON (via SCIKIT
LEARN). The constraints of a given
problem will differ and should dictate the
implementation of choice.
PICK THE TOOL THAT IS BEST FOR THE JOB
WORTH LEARNING ABOUT BOTH IMPLEMENTATIONS
Both R and PYTHON (via SCIKIT LEARN) implementations have
added functions that allow the user to explore the
resulting model and its performance.
62. CONCLUDING REMARKS
RANDOM FORESTS ARE GREAT
KEEP AN EYE OUT FOR INTERESTING DATA
They give great accuracy, can handle many features,
do not require cross validation and even
estimate which variables are important.
Having data that you are interested in, leads to
more interesting questions and reasons to explore
new methods and add a new trick to your bag.
63. CONCLUDING REMARKS
EMI DATASET IS GREAT TO TEST RIDE
TO DO’s - WILL IT PYTHON?
The set has a lot of behavioural information on a
subject about which everyone has some intuition.
Prediction using SVMs and matrix
factorisation techniques; full factor analysis, etc.