5. STARTING OUT IN DATA ANALYSIS
• Online: blogs, GitHub, MOOCs, Kaggle,
Data Tau, Cross Validated,
Stack Overflow...
• Books
• School work
TOO MANY RESOURCES
8. • Programmed in C
• Used MATLAB at Uni
• Spent a long time playing with the symbolic
languages Mathematica & Maple
START BY WHAT YOU KNOW & ASK YOUR FRIENDS
MY EXPERIENCE
P.S. I had not met the iPython notebook.
9. BIG REVEAL: I AM AN AVID R USER
MY EXPERIENCE (cont)
• Don’t have a web dev background
• Surrounded by people doing Stats
• Pick the right tool for the task at hand
10. TL;DR - CAN BE CONFUSING FOR A NEWBIE
LANGUAGE WARS
Too many articles about:
• “Python Displacing R As The Programming Language
For Data Analysis”
• “Is Python really supplanting R for data work?”
• “10 Reasons Python Rocks for Research”
• “Why Python is steadily eating other languages' lunch”
• “Why I’m betting on Julia”
• “What are the advantages of using Python over R?”
• “Why Python with Coffee is better than R with Ice
Cream”
12. LANGUAGE WARS
However, it is good to have a general
understanding of the + and - of the various data
analysis tools, in order to pick the right tool for the
job.
• R has EVERYTHING you need for performing
statistical analysis.
• R / MATLAB / Python are great for prototyping
• Python is a full featured programming language
• Easier to incorporate Python outcomes into a full
data product workflow
13. DEFINE THE PROBLEM
Time better spent defining the problem and
determining what is the best way to solve it
GOOD TO HAVE A BIG BAG OF TRICKS
Re-do R analysis using Python data analysis stack
WILL IT PYTHON? CREDIT: SLENDER MEANS
14. PYTHON SCIKIT LEARN
IT IS PRETTY AWESOME
• Library of Machine Learning Algorithms
• Open source
• API
• Python, Numpy & Co
• Accessible, many models, documentation &
examples
16. CHOOSING A PROBLEM
1 Look for a data set that is interesting to you -
always a good idea
2 Formulate a question
3 Formulate a hypothesis
4 Build a model to answer the question and test it
SCIENTIFIC METHOD FTW
18. EMI MUSIC
“ONE MILLION INTERVIEW SET”
• One of the largest preference data sets in the
world.
• Extract used in the Data Science London hackathon and
available on Kaggle as four separate data sets.
19. FOUR DATA SETS
• TRAIN / TEST - artist, track, userID, time & ratings
• WORDS - userID, heard_of, own_artist_music,
like_artist, 82 adjectives
• USERS - userID, gender, age, working status, region,
music, list_own (hours per day), list_back (hours
per day), 19 user habits questions (0-100)
20. USERS
KEY STRING
1 “Music is important to me but not necessarily most important”
2 “I like music but it does not feature heavily in my life”
3 “Music means a lot to me and it is a passion of mine”
4 “Music has no particular interest to me”
5 “Music is important to me but not necessarily more important
than other hobbies”
6 “Music is no longer as important as it used to be”
21. WORDS DATASET
UNINSPIRED, AGGRESSIVE, UNATTRACTIVE,
BORING, CHEAP, IRRELEVANT, WAY OUT,
ANNOYING, CHEESY, UNORIGINAL,
OUTDATED, UNAPPROACHABLE...
82 ADJECTIVES
23. USERS
19 MUSIC HABIT QUESTIONS: rate (0-100) whether the user
agrees with statements such as:
“I enjoy actively searching for and discovering
music that I have never heard before”
“I am not willing to pay for music”
“I like to be at the cutting edge of new music”
“I love tech”
27. MOTIVATION
• PRODUCTION - Cheaper to produce (lower barriers to
entry for budding artists).
• DISTRIBUTION - Internet has made music more
accessible. Artists can decide where and how to
sell.
• CONSUMPTION - People’s listening habits have changed
due to the internet and to the change in devices.
TECHNOLOGY HAS BEEN A DISRUPTIVE FORCE IN THE MUSIC INDUSTRY.
28. PROBLEMS
• ARTISTS - Easier to produce music, harder to make
themselves known or earn a living.
• RECORD COMPANIES - People buy per song, easy for
listener to consume without paying. Wider
competition field.
• LISTENERS - Too many choices. Discovery is difficult.
29. QUESTIONS
• Can one predict the rating of a song?
• What factors are important to determine how
much a person likes a song?
• What is the minimal set of factors that are needed
to determine how much a person likes a song?
31. FIRST ATTEMPT
• Regression problem
• Turn categorical variables into numeric variables
• Consider ALL features and pick machine learning
algorithm to do the job.
CAN ONE PREDICT THE RATING OF A SONG?
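As a hedged sketch of the "turn categorical variables into numeric variables" step, here is one way to one-hot encode with pandas. The column names (`gender`, `heard_of`, `rating`) are toy stand-ins, not the actual EMI fields:

```python
import pandas as pd

# Hypothetical toy frame standing in for the EMI data: 'gender' and
# 'heard_of' are categorical, 'rating' is the 0-100 target.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "heard_of": ["Yes", "No", "Yes", "Yes"],
    "rating": [70, 15, 88, 42],
})

# One-hot encode the categorical columns so every feature is numeric.
X = pd.get_dummies(df[["gender", "heard_of"]])
y = df["rating"]

print(list(X.columns))
# ['gender_Female', 'gender_Male', 'heard_of_No', 'heard_of_Yes']
```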
32. FIRST ATTEMPT
• Exploratory analysis revealed that ratings are
highly clustered, so we can look at five different
scores and formulate the problem as a classification
one.
CAN ONE PREDICT THE RATING OF A SONG?
We split ratings 0-100 into 5 intervals,
so each becomes a class and we label these.
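A minimal sketch of the binning step, assuming equal-width intervals (the exact cut points are not given in the deck):

```python
import numpy as np

ratings = np.array([3, 18, 47, 55, 99, 80])  # toy ratings on the 0-100 scale

# Split 0-100 into five equal-width intervals, labelled 0..4.
# Equal-width bins are an assumption; the talk's cut points may differ.
labels = np.minimum(ratings // 20, 4)  # clamp so that 100 lands in the top class

print(labels)  # [0 0 2 2 4 4]
```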
35. RANDOM FORESTS
• Random Forests are built from aggregating trees.
• Can be used for regression & classification problems.
• They are resistant to overfitting and can handle large numbers of features
• They also output a list of features that are believed to be
important in predicting the variable
Highly versatile ensemble method - combines
several models into one.
A.K.A. BEST “BLACK-BOX” METHOD EVER (BREIMAN / CUTLER)
36. RANDOM FORESTS
THE LAYMAN’S INTRO (E. CHEN’s BLOG - 2011)
MOVIES
20 QUESTIONS
WILL JAMIE LIKE X?
BRIENNE IS THE DECISION TREE FOR
JAMIE’S MOVIES PREFERENCES
37. RANDOM FORESTS
THE LAYMAN’S INTRO (E. CHEN’s BLOG - 2011)
Ask Tywin, Cersei, Tyrion...Jamie gives each of them
slightly different info.
THEY FORM A BAGGED FOREST OF JAMIE’S MOVIES PREFERENCES
Jamie demands getting different questions every time.
THEY NOW FORM A RANDOM FOREST OF JAMIE’S MOVIES PREFERENCES
38. RANDOM FORESTS
• A tree of maximal depth is grown on a bootstrap sample of
the training set. There is no pruning.
• A number m << p is specified such that at each node, m
variables are sampled at random out of the p features. The best
split on these variables is used to split the node into two subnodes.
• Final classification is given by majority voting of the
ensemble of trees in the forest.
• Only two “free” parameters: number of trees and number of
variables in random subset at each node.
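The two "free" parameters above map directly onto scikit-learn's `RandomForestClassifier` arguments; a toy sketch on synthetic data (the EMI features are not reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# n_estimators = number of trees; max_features = size of the random
# feature subset tried at each split (sqrt(p) for classification).
clf = RandomForestClassifier(n_estimators=60, max_features="sqrt",
                             random_state=0)
clf.fit(X, y)

# Final classification is the majority vote across trees.
print(clf.predict(X[:5]))
```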
39. RANDOM FORESTS
OUT-OF-BAG (OOB) ERROR
The samples left out of each tree’s bootstrap draw act as a
test set for that tree. The OOB error estimate is
given by the misclassification error (MSE for regression),
averaged over all samples.
VARIABLE IMPORTANCE
Determined by looking at how much prediction error
increases when (OOB) data for that variable is permuted
while all others are left unchanged.
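A sketch of both ideas in scikit-learn: `oob_score=True` computes the out-of-bag estimate, and `feature_importances_` reports importances. Note that scikit-learn's built-in attribute is impurity-based; the permutation scheme described above is available separately as `sklearn.inspection.permutation_importance`. The data below is synthetic:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data with only 3 truly informative features.
X, y = make_regression(n_samples=300, n_features=10, n_informative=3,
                       random_state=0)

# oob_score=True evaluates each tree on the samples left out of its
# bootstrap draw; for regression the reported score is R^2.
rf = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, y)

print(round(rf.oob_score_, 3))           # OOB R^2 estimate
print(rf.feature_importances_.argmax())  # index of the top-ranked feature
```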
40. RANDOM FORESTS IN R & PYTHON
randomForest PACKAGE
• Various implementations - randomForest, caret, party, bigrf
• We follow the KISS procedure - KEEP IT SIMPLE S.
• One can test various values of mtry and the number of
trees.
Used randomForest package 4.6-7 with R 2.15. Defaults are
n=500 trees & mtry= p/3 for regression & sqrt(p) for
classification.
41. RANDOM FORESTS IN R & PYTHON
SCIKIT LEARN
Used SCIKIT LEARN 0.14.1 running Python version 2.7.5.
COMPUTER: Macbook Pro 2.53 GHz Intel Core 2 Duo with 4
GB 1067 MHz DDR3 running OS X 10.6.8
For the comparison we will build “small” forests and
focus on the following simple metrics:
• Training time
• RSQ & RMSE (regression)
• Accuracy (classification)
42. RANDOM FORESTS IN R
RESULTS REGRESSION
Split data into training and test sets. The data frames
have 82,714 rows each and 114 columns.
Parameters: 60 trees, sample of 50,000.
Training time: 39.39 min
RMSE: 14.587
RSQ: 0.581
rf <- randomForest(training, ratings_train, ntree=60,
                   sampsize=50000, importance=TRUE)
43. RANDOM FORESTS IN PYTHON
RESULTS REGRESSION
Split data into training and test sets. The data frames
have 82,714 rows each and 114 columns.
Parameters: 60 trees, sample of 50,000.
Training time: 3 min 7 sec
RMSE: 14.687
RSQ: 0.575
rf = RandomForestRegressor(n_estimators=60, max_features='sqrt')
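For completeness, a self-contained sketch of the fit-and-score loop on synthetic data (the EMI frame is not reproduced here); RMSE and RSQ come from `sklearn.metrics`:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 82,714-row EMI frame.
X, y = make_regression(n_samples=1000, n_features=20, noise=10,
                       random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=60, max_features="sqrt",
                           random_state=0)
rf.fit(X_train, y_train)

pred = rf.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, pred))  # RMSE
rsq = r2_score(y_test, pred)                      # RSQ
print(rmse, rsq)
```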
45. RANDOM FORESTS IN R
FEATURE IMPORTANCE
FEATURE (% INC MSE) | FEATURE (% INC NODE PURITY)
Beautiful | Talented
Boring | Like Artist
Q16 | Catchy
Catchy | Beautiful
Talented | Boring
Q9 | Track
Q19 | Distinctive
None of these | Cool
Age | Q11
Track | Q12
Q16 - I would be willing to pay for the opportunity to
buy new music pre-release
Q9 - I am out of touch with new music
Q19 - I like to know about music before
other people
Q11 - Pop music is fun
Q12 - Pop music helps me escape
Like artist - To what extent do you like or dislike
listening to this artist?
47. RANDOM FORESTS IN PYTHON
FEATURE IMPORTANCE
FEATURE | RANK IN R RANDOM FOREST
Distinctive | 7
Catchy | 3
Like Artist | 2
Fun | -
Talented | 1
Beautiful | 4
Original | -
Unoriginal | -
Q11 | 9
Own Artist Music | -
Own Artist Music - Do you have this artist in
your music collection?
Q11 - Pop music is fun
48. RANDOM FORESTS IN R & PYTHON
MODEL | RMSE
R Random Forest | 14.587
Python Scikit Learn Random Forest | 14.687
Linear Regression | 16.23
Multiple Linear Regressions | 15.53
RESULTS REGRESSION
51. RANDOM FORESTS IN R
FEATURE IMPORTANCE
FEATURE (% INC MSE) | FEATURE (% INC NODE PURITY)
Q9 | Track
Q7 | Q11
Q5 | Q12
Q6 | Age
Age | Q6
Q10 | Q17
listBACK | Q9
Q19 | Q16
listOWN | Q4
Q16 | Q13
Q16 - I would be willing to pay for the opportunity to
buy new music pre-release
Q9 - I am out of touch with new music
Q19 - I like to know about music before
other people
Q11 - Pop music is fun
Q12 - Pop music helps me escape
Q7 - I enjoy music primarily from going out to
dance
Q5 - I used to know where to find music
Q6 - I am not willing to pay for music
Q10 - My music collection is a source of pride
Q4 - I would like to buy new music but I don’t
know what to buy
Q17 - I find seeing a new artist a useful way of
discovering new music
52. RANDOM FORESTS IN PYTHON
FEATURE IMPORTANCE
FEATURE | RANK IN R RANDOM FOREST
Q11 | 2
Q12 | 3
Age | 4
Q6 | 5
Q17 | 6
Q5 | -
Q4 | 9
Q10 | -
Q16 | 7
Q7 | -
Q16 - I would be willing to pay for the opportunity to
buy new music pre-release
Q11 - Pop music is fun
Q12 - Pop music helps me escape
Q5 - I used to know where to find music
Q6 - I am not willing to pay for music
Q10 - My music collection is a source of pride
Q4 - I would like to buy new music but I don’t
know what to buy
Q17 - I find seeing a new artist a useful way of
discovering new music
56. FEATURE SELECTION
PRINCIPAL COMPONENT ANALYSIS - WORDS
Determine which features account for most of the variance.
FEATURE | PC1 | PC2
Distinctive | 0.20 | -0.059
Authentic | 0.19 | -0.046
Talented | 0.19 | -0.083
Credible | 0.19 | -0.084
Stylish | 0.18 | -0.094
Annoying | -0.06 | -0.065
Intrusive | -0.06 | -0.058
Irrelevant | -0.059 | -0.087
Uninspired | -0.056 | -0.092
Noisy | -0.053 | -0.13
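A sketch of the PCA step with scikit-learn, run on a random stand-in for the 82-adjective WORDS matrix (the real data is on Kaggle and is not reproduced here):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Random stand-in for the WORDS matrix (rows = rating events,
# columns = the 82 adjectives, scored 0/1).
X = rng.integers(0, 2, size=(500, 82)).astype(float)

pca = PCA(n_components=2)
pca.fit(X)

# Loadings of each adjective on PC1 and PC2, analogous to the table above.
print(pca.components_.shape)  # (2, 82)
print(pca.explained_variance_ratio_)
```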
57. FEATURE SELECTION
Make a simple model choosing meaningful variables
WORDS - Annoying, Depressing, Boring, Catchy,
Talented, Distinctive, Beautiful, Superstar,
Soulful and Popular.
QUESTIONS - Q4, Q5, Q6, Q9, Q10, Q11 and Q19.
• Running time in R ~ 15 min.
• RMSE = 14.791 / Public leader board 13.076
59. COMMENTS
It is well known that Random Forests have been
shown to be biased towards highly correlated
variables. Using conditional inference trees
ameliorates that bias (see the party package in R).
SCIKIT LEARN’s implementation has an n_jobs parameter
to parallelise training. For a similar feature in R,
see the bigrf package.
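A minimal example of the n_jobs parameter; since every tree is grown independently, the forest parallelises trivially:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data.
X, y = make_classification(n_samples=500, n_features=30, random_state=0)

# n_jobs=-1 trains the trees in parallel across all available cores.
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
clf.fit(X, y)

print(len(clf.estimators_))  # 100 fitted trees
```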
61. CONCLUDING REMARKS
We solved a problem using both R and PYTHON (via SCIKIT
LEARN). The constraints of a given
problem will differ and should dictate the
implementation of choice.
PICK THE TOOL THAT IS BEST FOR THE JOB
WORTH LEARNING ABOUT BOTH IMPLEMENTATIONS
Both R and PYTHON (via SCIKIT LEARN) implementations have
added functions that allow the user to explore the
resulting model and its performance.
62. CONCLUDING REMARKS
RANDOM FORESTS ARE GREAT
KEEP AN EYE OUT FOR INTERESTING DATA
They give great accuracy, can handle many features,
do not require cross validation and even
estimate which variables are important.
Having data that you are interested in, leads to
more interesting questions and reasons to explore
new methods and add a new trick to your bag.
63. CONCLUDING REMARKS
EMI DATASET IS GREAT TO TEST RIDE
TO DO’s - WILL IT PYTHON?
The set has a lot of behavioural information on a
subject about which everyone has some intuition.
Prediction using SVMs and matrix
factorisation techniques; full factor analysis, etc.