RANDOM FORESTS
R vs PYTHON
Having fun when starting out in data analysis
WHO
LINDA URUCHURTU
@lindauruchurtu
Consultant at DBi
Web Analytics & Data Consultancy
Physicist by training
OUTLINE OF THIS TALK
• Motivation
• Random Forests: R & Python
• Example: EMI music set
• Concluding remarks
MOTIVATION
STARTING OUT IN DATA ANALYSIS
• Online: blogs, GitHub, MOOCs, Kaggle,
Data Tau, Cross Validated,
Stackoverflow...
• Books
• School work
TOO MANY RESOURCES
WHICH LANGUAGE SHOULD I USE?
POPULAR QUESTION
LET’S ASK GOOGLE
• Programmed in C
• Used MATLAB at Uni
• Spent a long time playing with symbolic
langs Mathematica & Maple
START BY WHAT YOU KNOW & ASK YOUR FRIENDS
MY EXPERIENCE
P.S. I had not met the iPython notebook.
BIG REVEAL: I AM AN AVID R USER
MY EXPERIENCE (cont)
• Don’t have a web dev background
• Surrounded by people doing Stats
• Pick the right tool for the task at hand
TL;DR - CAN BE CONFUSING FOR A NEWBIE
LANGUAGE WARS
Too many articles about:
• “Python Displacing R As The Programming Language
For Data Analysis”
• “Is Python really supplanting R for data work?”
• “10 Reasons Python Rocks for Research”
• “Why Python is steadily eating other languages' lunch”
• “Why I’m betting on Julia”
• “What are the advantages of using Python over R?”
• “Why Python with Coffee is better than R with Ice
Cream”
[FAVE LANG] is BETTER
BECAUSE I SAY SO
LANGUAGE WARS
However, it is good to have a general
understanding of the + and - of the various data
analysis tools, in order to pick the right tool for the
job.
• R has EVERYTHING you need for performing
statistical analysis.
• R / MATLAB / Python are great for prototyping
• Python is a full featured programming language
• Easier to incorporate Python outcomes into a full
data product workflow
DEFINE THE PROBLEM
Time better spent defining the problem and
determining what is the best way to solve it
GOOD TO HAVE A BIG BAG OF TRICKS
Re-do R analysis using Python data analysis stack
WILL IT PYTHON? CREDIT: SLENDER MEANS
PYTHON SCIKIT LEARN
IT IS PRETTY AWESOME
• Library of Machine Learning Algorithms
• Open source
• API
• Python, Numpy & Co
• Accessible, many models, documentation &
examples
EXAMPLE
CHOOSING A PROBLEM
1 Look for a data set that is interesting to you
(always a good idea)
2 Formulate a question
3 Formulate a hypothesis
4 Build a model to answer the question and test it
SCIENTIFIC METHOD FTW
CHOOSING A DATA SET
STEP 1
EMI MUSIC
“ONE MILLION INTERVIEW SET”
• One of the largest preference data sets in the
world.
• Extract used in the Data Science London hackathon and
available on KAGGLE as four separate data sets.
FOUR DATA SETS
• TRAIN / TEST - artist, track, userID, time & ratings
• WORDS - userID, heard_of, own_artist_music,
like_artist, 82 adjectives
• USERS - userID, gender, age, working status, region,
music, list_own (hours per day), list_back (hours
per day), 19 user habits questions (0-100)
USERS
KEY STRING
1 “Music is important to me but not necessarily most important”
2 “I like music but it does not feature heavily in my life”
3 “Music means a lot to me and it is a passion of mine”
4 “Music has no particular interest to me”
5 “Music is important to me but not necessarily more important
than other hobbies”
6 “Music is no longer as important as it used to be”
WORDS DATASET
UNINSPIRED, AGGRESSIVE, UNATTRACTIVE,
BORING, CHEAP, IRRELEVANT, WAY OUT,
ANNOYING, CHEESY, UNORIGINAL,
OUTDATED, UNAPPROACHABLE...
82 ADJECTIVES
WHOLESOME
LEGENDARY
OLD
PIONEER DARK
WORDLY
NOSTALGIC
PROGRESSIVE
ICONIC
USERS
19 MUSIC HABIT QUESTIONS: Rate (0-100) whether the user agrees with
the statements:
“I enjoy actively searching for and discovering
music that I have never heard before”
“I am not willing to pay for music”
“I like to be at the cutting edge of new music”
“I love tech”
FORMULATE A QUESTION
STEP 2
MOTIVATION
• PRODUCTION - Cheaper to produce (lower barriers to
entry for budding artists).
• DISTRIBUTION - Internet has made music more
accessible. Artists can decide where and how to
sell.
• CONSUMPTION - People’s listening habits have changed
due to the internet and to the change in devices.
TECHNOLOGY HAS BEEN A DISRUPTIVE FORCE IN THE MUSIC INDUSTRY.
PROBLEMS
• ARTISTS - Easier to produce music, harder to make
themselves known or earn a living.
• RECORD COMPANIES - People buy per song, easy for
listener to consume without paying. Wider
competition field.
• LISTENERS - Too many choices. Discovery is difficult.
QUESTIONS
• Can one predict the rating of a song?
• What factors are important to determine how
much a person likes a song?
• What is the minimal set of factors that are needed
to determine how much a person likes a song?
FORMULATE AN HYPOTHESIS
STEP 3
FIRST ATTEMPT
• Regression problem
• Turn categorical variables into numeric variables
• Consider ALL features and pick machine learning
algorithm to do the job.
CAN ONE PREDICT THE RATING OF A SONG?
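Turning categorical variables into numeric ones can be sketched with one-hot encoding. The table below is a hypothetical miniature of the USERS data; the column names and values are made up for illustration and are not taken from the EMI files.

```python
import pandas as pd

# Hypothetical miniature of the USERS table (names and values are illustrative only)
users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "gender": ["Female", "Male", "Female"],
    "working": ["Employed", "Student", "Retired"],
    "age": [34, 21, 67],
})

# One 0/1 indicator column per category level; numeric columns pass through unchanged
encoded = pd.get_dummies(users, columns=["gender", "working"], dtype=int)
print(encoded.columns.tolist())
```

One-hot encoding is only one option; the slides do not say which encoding was used in the talk.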
FIRST ATTEMPT
• Because exploratory analysis revealed ratings are
highly clustered, we can look at five different
scores and formulate problem as a classification
one.
CAN ONE PREDICT THE RATING OF A SONG?
We split ratings 0-100 in 5 intervals,
so each becomes a class and we label these.
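A minimal sketch of that split, assuming five equal-width intervals of 20 points each. The slides do not state the actual boundaries, so the edges below (and the function name) are assumptions.

```python
import numpy as np

def rating_to_class(ratings, n_classes=5, max_rating=100):
    """Map ratings in [0, max_rating] to integer class labels 1..n_classes."""
    ratings = np.asarray(ratings, dtype=float)
    width = max_rating / n_classes            # 20 points per class (assumed)
    edges = np.arange(1, n_classes) * width   # interior edges: 20, 40, 60, 80
    # A rating equal to an edge falls in the higher class; 100 stays in class 5
    return np.digitize(ratings, edges, right=False) + 1

labels = rating_to_class([0, 19, 20, 55, 80, 100])
print(labels)
```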
BUILD A MODEL
STEP 4
RANDOM FORESTS
• Random Forests are built from aggregating trees.
• Can be used for regression & classification problems.
• They are robust to overfitting as more trees are added and can handle a large number of features
• They also output a list of features that are believed to be
important in predicting the variable
Highly versatile ensemble method - combines
several models into one.
A.K.A. BEST “BLACK-BOX” METHOD EVER (BREIMAN / CUTLER)
RANDOM FORESTS
THE LAYMAN’S INTRO (E. CHEN’s BLOG - 2011)
MOVIES
20 QUESTIONS
WILL JAMIE LIKE X?
BRIENNE IS THE DECISION TREE FOR
JAMIE’S MOVIES PREFERENCES
RANDOM FORESTS
THE LAYMAN’S INTRO (E. CHEN’s BLOG - 2011)
Ask Tywin, Cersei, Tyrion...Jamie gives each of them
slightly different info.
THEY FORM A BAGGED FOREST OF JAMIE’S MOVIES PREFERENCES
Jamie demands getting different questions every time.
THEY NOW FORM A RANDOM FOREST OF JAMIE’S MOVIES PREFERENCES
RANDOM FORESTS
• A tree of maximal depth is grown on a bootstrap sample of
size n drawn with replacement from the n training cases.
There is no pruning.
• A number m << p is specified such that at each node, m
variables are sampled at random out of p. The best split of
these variables is used to split the node into two subnodes.
• Final classification is given by majority voting of the
ensemble of trees in the forest.
• Only two “free” parameters: number of trees and number of
variables in random subset at each node.
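The steps above can be sketched in a few lines on top of scikit-learn's single decision tree. This is an illustration of the procedure, not the code behind randomForest or scikit-learn's own forest; the function names and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=25, mtry="sqrt", seed=0):
    """Grow unpruned trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)       # bootstrap: n draws with replacement
        tree = DecisionTreeClassifier(
            max_depth=None,                    # maximal depth, no pruning
            max_features=mtry,                 # random subset of variables at each node
            random_state=int(rng.integers(0, 2**31 - 1)),
        )
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    """Majority vote over the trees (labels must be non-negative integers)."""
    votes = np.stack([t.predict(X) for t in trees]).astype(int)  # (n_trees, n_samples)
    return np.array([np.bincount(col).argmax() for col in votes.T])

# Synthetic data, for illustration only
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 9))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

forest = fit_forest(X, y)
train_acc = (predict_forest(forest, X) == y).mean()
print(len(forest), train_acc)
```

The two "free" parameters from the slide appear here as n_trees and mtry.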
RANDOM FORESTS
OUT-OF-BAG (OOB) ERROR
The cases left out of a tree's bootstrap sample become a
test set for that tree. The OOB error estimate is given by
the misclassification error (MSE for regression) of these
out-of-bag predictions, averaged over all cases.
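A short sketch of reading the OOB estimate from scikit-learn, on synthetic data invented for the illustration. The attribute oob_score_ is an accuracy for classifiers, so the OOB error is one minus that value.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

# oob_score=True scores each case using only the trees that did not see it
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, y)

oob_error = 1.0 - rf.oob_score_
print(round(oob_error, 3))
```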
VARIABLE IMPORTANCE
Determined by looking at how much prediction error
increases when (OOB) data for that variable is permuted
while all others are left unchanged.
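The permutation idea can be tried with scikit-learn's permutation_importance helper. Note the difference from the description above: this helper permutes columns of whatever data it is given (a held-out split here), not the OOB cases. The data is synthetic and chosen so that only the first two columns matter.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: columns 2 and 3 are pure noise
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=600)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Shuffle one column at a time and record how much the score drops
result = permutation_importance(rf, X_test, y_test, n_repeats=5, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]   # most important first
print(ranking)
```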
RANDOM FORESTS IN R & PYTHON
randomForest PACKAGE
• Various implementations - randomForest, CARET, PARTY, BIGRF
• We follow the KISS procedure - KEEP IT SIMPLE S.
• One can test various values of mtry and the number of
trees.
Used randomForest package 4.6-7 with R 2.15. Defaults are
ntree = 500 trees & mtry = p/3 for regression & sqrt(p) for
classification (both rounded down).
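The default mtry rule is simple arithmetic. A small sketch, where p is the number of predictor columns; the function name is hypothetical and p = 100 is just an example value, not the talk's feature count.

```python
import math

def default_mtry(p, task):
    """randomForest-style default number of variables tried at each split."""
    if task == "classification":
        return max(math.floor(math.sqrt(p)), 1)
    if task == "regression":
        return max(p // 3, 1)
    raise ValueError("task must be 'classification' or 'regression'")

mtry_reg = default_mtry(100, "regression")
mtry_clf = default_mtry(100, "classification")
print(mtry_reg, mtry_clf)
```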
RANDOM FORESTS IN R & PYTHON
SCIKIT LEARN
Used SCIKIT LEARN 0.14.1 running Python version 2.7.5.
COMPUTER: Macbook Pro 2.53 GHz Intel Core 2 Duo with 4
GB 1067 MHz DDR3 running OS X 10.6.8
For the comparison we will build “small” forests and
focus on the following simple metrics:
• Training Time
• RSQ & RMSE (Regression)
• Accuracy (Classification)
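For reference, these metrics can be written out directly. The definitions below are the usual ones (RSQ taken as 1 - SS_res / SS_tot); whether the talk computed RSQ in exactly this way is an assumption, and the numbers are toy values.

```python
import time
import numpy as np

def rmse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return float(1.0 - ss_res / ss_tot)

def accuracy(y_true, y_pred):
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# Toy values, for illustration only
y_true = [10, 20, 30, 40]
y_pred = [12, 18, 33, 39]

start = time.perf_counter()          # wall-clock timing, as one would wrap around fit()
scores = (rmse(y_true, y_pred), r_squared(y_true, y_pred))
elapsed = time.perf_counter() - start
print(scores, elapsed)
```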
RANDOM FORESTS IN R
RESULTS REGRESSION
Split data in training and test sets. Dataframe has
82,714 rows each and 114 columns.
Parameters: 60 trees, sample of 50,000.
Training time: 39.39 min
RMSE: 14.587
RSQ: 0.581
library(randomForest)
rf <- randomForest(training, ratings_train, ntree = 60, sampsize = 50000, importance = TRUE)
RANDOM FORESTS IN PYTHON
RESULTS REGRESSION
Split data in training and test sets. Dataframe has
82,714 rows each and 114 columns.
Parameters: 60 trees, sample of 50,000.
Training time: 3 min 7 sec
RMSE: 14.687
RSQ: 0.575
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=60, max_features='sqrt')
RANDOM FORESTS IN R & PYTHON
R
PYTHON / SCIKIT LEARN
RANDOM FORESTS IN R
FEATURE IMPORTANCE
FEATURE (% INC MSE)     FEATURE (% INC NODE PURITY)
Beautiful               Talented
Boring                  Like Artist
Q16                     Catchy
Catchy                  Beautiful
Talented                Boring
Q9                      Track
Q19                     Distinctive
None of these           Cool
Age                     Q11
Track                   Q12
Q16 - I would be willing to pay for the opp to
buy new music pre-release
Q9 - I am out of touch with new music
Q19 - I like to know about music before
other people
Q11 -Pop music is fun
Q12 - Pop music helps me escape
Like artist - To what extent do you like or dislike
listening to this artist?
RANDOM FORESTS IN R
FEATURE IMPORTANCE
RANDOM FORESTS IN PYTHON
FEATURE IMPORTANCE
FEATURE                 IMPORTANCE RANK IN R RANDOM FOREST
Distinctive             7
Catchy                  3
Like Artist             2
Fun                     -
Talented                1
Beautiful               4
Original                -
Unoriginal              -
Q11                     9
Own Artist Music        -
Own Artist Music - Do you have this artist in
your music collection?
Q11 -Pop music is fun
RANDOM FORESTS IN R & PYTHON
RESULTS REGRESSION
Model                                   RMSE
R Random Forest                         14.587
Python Scikit Learn Random Forest       14.687
Linear Regression                       16.23
Multiple Linear Regs                    15.53
RANDOM FORESTS IN R
RESULTS CLASSIFICATION
Training time: 8.75 min
OOB error rate: 44.01%
Accuracy: 0.567
# convert the ratings to a factor first so that randomForest fits a classifier
ratings_train <- as.factor(ratings_train)
rf <- randomForest(training, ratings_train, ntree = 60, sampsize = 50000, importance = TRUE)
1 2 3 4 5
1 16777 4863 1633 139 37
2 5760 12411 6213 504 89
3 1485 5559 13144 1880 329
4 176 888 4094 2592 625
5 59 204 1008 856 1388
RANDOM FORESTS IN PYTHON
RESULTS CLASSIFICATION
Training time: 2.56 min
OOB Score: 0.1964
Accuracy: 0.566
rf = sk.RandomForestClassifier(n_estimators=60, compute_importances=True, oob_score=True)
1 2 3 4 5
1 16930 4682 1758 129 53
2 5517 12369 6475 506 106
3 1500 5367 13448 1737 275
4 186 791 4171 2598 561
5 48 161 999 880 1466
Precision: 0.564
Recall: 0.5653
F1 Score: 0.5611
RANDOM FORESTS IN R
FEATURE IMPORTANCE
FEATURE (% INC MSE)     FEATURE (% INC NODE PURITY)
Q9                      Track
Q7                      Q11
Q5                      Q12
Q6                      Age
Age                     Q6
Q10                     Q17
listBACK                Q9
Q19                     Q16
listOWN                 Q4
Q16                     Q13
Q16 - I would be willing to pay for the opp to
buy new music pre-release
Q9 - I am out of touch with new music
Q19 - I like to know about music before
other people
Q11 -Pop music is fun
Q12 - Pop music helps me escape
Q7 - I enjoy music primarily from going out to
dance
Q5 - I used to know where to find music
Q6 - I am not willing to pay for music
Q10 - My music collection is a source of pride
Q4 - I would like to buy new music but I don’t
know what to buy
Q17 - I find seeing a new artist a useful way of
discovering new music
RANDOM FORESTS IN PYTHON
FEATURE IMPORTANCE
FEATURE                 IMPORTANCE RANK IN R RANDOM FOREST
Q11                     2
Q12                     3
Age                     4
Q6                      5
Q17                     6
Q5                      -
Q4                      9
Q10                     -
Q16                     7
Q7                      -
Q16 - I would be willing to pay for the opp to
buy new music pre-release
Q11 -Pop music is fun
Q12 - Pop music helps me escape
Q5 - I used to know where to find music
Q6 - I am not willing to pay for music
Q10 - My music collection is a source of pride
Q4 - I would like to buy new music but I don’t
know what to buy
Q17 - I find seeing a new artist a useful way of
discovering new music
RANDOM FORESTS IN R
CONFUSION MATRIX
        1      2      3     4     5   CLASS ERROR
1   16777   4863   1633   139    37   28.45%
2    5760  12411   6213   504    89   50.31%
3    1485   5559  13144  1880   329   41.31%
4     176    888   4094  2592   625   69.09%
5      59    204   1008   856  1388   60.51%
RANDOM FORESTS IN PYTHON
CONFUSION MATRIX
        1      2      3     4     5   CLASS ERROR
1   16930   4682   1758   129    53   28.12%
2    5517  12369   6475   506   106   50.47%
3    1500   5367  13448  1737   275   39.77%
4     186    791   4171  2598   561   68.73%
5      48    161    999   880  1466   58.75%
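The per-class percentages and the overall accuracy can be recomputed from the matrix itself. The sketch below uses the scikit-learn matrix from the slides and treats each row as one class: the class error is one minus the diagonal entry divided by the row total. That row-wise reading is an inference, but it reproduces the percentages and the 0.566 accuracy quoted in the slides.

```python
import numpy as np

# scikit-learn confusion matrix, copied from the slide
cm = np.array([
    [16930,  4682,  1758,  129,   53],
    [ 5517, 12369,  6475,  506,  106],
    [ 1500,  5367, 13448, 1737,  275],
    [  186,   791,  4171, 2598,  561],
    [   48,   161,   999,  880, 1466],
])

row_totals = cm.sum(axis=1)                    # cases per class (row)
class_error = 1.0 - np.diag(cm) / row_totals   # share of each row off the diagonal
accuracy = np.trace(cm) / cm.sum()             # correct predictions / all cases

print(np.round(100 * class_error, 2), round(float(accuracy), 3))
```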
(Re)FORMULATE AN HYPOTHESIS
STEP 2
FEATURE SELECTION
PRINCIPAL COMPONENT ANALYSIS - WORDS
Determine which features account for most of the variance.
FEATURE PC1 PC2
Distinctive 0.20 -0.059
Authentic 0.19 -0.046
Talented 0.19 -0.083
Credible 0.19 -0.084
Stylish 0.18 -0.094
Annoying -0.06 -0.065
Intrusive -0.06 -0.058
Irrelevant -0.059 -0.087
Uninspired -0.056 -0.092
Noisy -0.053 -0.13
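A table of this kind can be produced from the loadings of a fitted PCA. The sketch below runs on a random 0/1 matrix standing in for the word indicators, so its numbers mean nothing; the feature names are borrowed from the table above only as labels.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data: random 0/1 indicators, NOT the EMI words data
rng = np.random.default_rng(0)
names = ["Distinctive", "Authentic", "Talented", "Annoying", "Intrusive", "Noisy"]
X = rng.integers(0, 2, size=(400, len(names))).astype(float)

pca = PCA(n_components=2).fit(X)
loadings = pca.components_.T      # one row per feature, columns are PC1 and PC2

for name, (pc1, pc2) in zip(names, loadings):
    print(f"{name:<12} {pc1: .3f} {pc2: .3f}")
print(pca.explained_variance_ratio_)
```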
FEATURE SELECTION
Make a simple model choosing meaningful variables
WORDS - Annoying, Depressing, Boring, Catchy,
Talented, Distinctive, Beautiful, Superstar,
Soulful and Popular.
QUESTIONS - Q4, Q5, Q6, Q9, Q10, Q11 and Q19.
RESULTS
• Running time in R ~ 15 min.
• RMSE = 14.791 / Public leaderboard 13.076
FULL MODEL / REDUCED MODEL
COMMENTS
Random Forest variable importance is known to be
biased towards highly correlated variables. Using
conditional inference trees ameliorates that bias
(see the party package in R).
SCIKIT learn’s implementation has n_jobs parameter
to parallelise training. For a similar feature in R,
see bigRF package.
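A minimal sketch of the n_jobs parameter, on synthetic data. Setting n_jobs=-1 asks scikit-learn to use all available cores; the other settings mirror the regression call shown earlier in the talk.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = X[:, 0] * 2.0 + rng.normal(scale=0.5, size=1000)

# Trees are independent of each other, so they can be grown in parallel
rf = RandomForestRegressor(n_estimators=60, max_features="sqrt",
                           n_jobs=-1, random_state=0)
rf.fit(X, y)
print(len(rf.estimators_))
```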
CONCLUDING REMARKS
PICK THE TOOL THAT IS BEST FOR THE JOB
We solved a problem using both R and PYTHON (via Scikit
learn). Clearly, the constraints for addressing a given
problem might differ and would dictate the
implementation of choice.
WORTH LEARNING ABOUT BOTH IMPLEMENTATIONS
Both R and PYTHON (via SCIKIT LEARN) implementations have
added functions that allow the user to explore the
resulting model and its performance.
CONCLUDING REMARKS
RANDOM FORESTS ARE GREAT
They give great accuracy, can handle many features,
do not require a separate cross-validation step (the OOB
error serves as the estimate) and even estimate which
variables are important.
KEEP AN EYE OUT FOR INTERESTING DATA
Having data that you are interested in leads to
more interesting questions and reasons to explore
new methods and add a new trick to your bag.
CONCLUDING REMARKS
EMI DATASET IS GREAT FOR A TEST RIDE
The set has a lot of behavioural information on a
subject about which everyone has some intuition.
TO DO’s - WILL IT PYTHON?
Prediction using SVMs and Matrix Factorisation
techniques. Full factor analysis, etc.
THANKS!

What's the Science in Data Science? - Skipper SeaboldWhat's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper Seabold
 
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
 
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardSolving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
 


Random Forests R vs Python by Linda Uruchurtu

  • 1. RANDOM FORESTS: R & PYTHON (rather than R vs PYTHON). Having fun when starting out in data analysis
  • 2. WHO LINDA URUCHURTU @lindauruchurtu Consultant at DBi Web Analytics & Data Consultancy Physicist by training
  • 3. OUTLINE OF THIS TALK • Motivation • Random Forests: R & Python • Example: EMI music set • Concluding remarks
  • 5. STARTING OUT IN DATA ANALYSIS • Online: blogs, GitHub, MOOCs, Kaggle, Data Tau, Cross Validated, Stackoverflow... • Books • School work TOO MANY RESOURCES
  • 6. WHICH LANGUAGE SHOULD I USE? POPULAR QUESTION
  • 8. • Programmed in C • Used MATLAB at Uni • Spent a long time playing with symbolic langs Mathematica & Maple START BY WHAT YOU KNOW & ASK YOUR FRIENDS MY EXPERIENCE P.S. I had not met the iPython notebook.
  • 9. BIG REVEAL: I AM AN AVID R USER MY EXPERIENCE (cont) P.S. I had not met the iPython notebook. • Don’t have a web dev background • Surrounded by people doing Stats • Pick the right tool for the task at hand
  • 10. TL;DR - CAN BE CONFUSING FOR A NEWBIE LANGUAGE WARS Too many articles about: • “Python Displacing R As The Programming Language For Data Analysis” • “Is Python really supplanting R for data work?” • “10 Reasons Python Rocks for Research” • “Why Python is steadily eating other languages' lunch” • “Why I’m betting on Julia” • “What are the advantages of using Python over R?” • “Why Python with Coffee is better than R with Ice Cream”
  • 11. [FAVE LANG] is BETTER BECAUSE I SAY SO
  • 12. LANGUAGE WARS However, it is good to have a general understanding of the + and - of the various data analysis tools, in order to pick the right tool for the job. • R has EVERYTHING you need for performing statistical analysis. • R / MATLAB / Python are great for prototyping • Python is a full featured programming language • Easier to incorporate Python outcomes into a full data product workflow
  • 13. DEFINE THE PROBLEM Time better spent defining the problem and determining what is the best way to solve it GOOD TO HAVE A BIG BAG OF TRICKS Re-do R analysis using Python data analysis stack WILL IT PYTHON? CREDIT: SLENDER MEANS
  • 14. PYTHON SCIKIT LEARN IT IS PRETTY AWESOME • Library of Machine Learning Algorithms • Open source • API • Python, Numpy & Co • Accessible, many models, documentation & examples
  • 16. CHOOSING A PROBLEM 1 Always a good idea to look for a data set that is interesting to you. 2 Formulate a question 3 Formulate a hypothesis 4 Build model to answer question and test SCIENTIFIC METHOD FTW
  • 17. CHOOSING A DATA SET STEP 1
  • 18. EMI MUSIC “ONE MILLION INTERVIEW SET” • One of the largest preference data sets in the world. • Extract used in Data Science London hackathon and available on Kaggle as four separate data sets.
  • 19. FOUR DATA SETS • TRAIN / TEST - artist, track, userID, time & ratings • WORDS - userID, heard_of, own_artist_music , like_artist, 82 adjectives • USERS - userID, gender, age, working status, region, music, list_own (hours per day), list_back (hours per day), 19 user habits questions (0-100)
  • 20. USERS KEY STRING 1 “Music is important to me but not necessarily most important” 2 “I like music but it does not feature heavily in my life” 3 “Music means a lot to me and it is a passion of mine” 4 “Music has no particular interest to me” 5 “Music is important to me but not necessarily more important than other hobbies” 6 “Music is no longer as important as it used to be”
  • 21. WORDS DATASET UNINSPIRED, AGGRESSIVE, UNATTRACTIVE, BORING, CHEAP, IRRELEVANT, WAY OUT, ANNOYING, CHEESY, UNORIGINAL, OUTDATED, UNAPPROACHABLE... 82 ADJECTIVES
  • 23. USERS 19 MUSIC HABIT QUESTIONS:Rate (0-100) whether user agrees with the statements: “I enjoy actively searching for and discovering music that I have never heard before” “I am not willing to pay for music” “I like to be at the cutting edge of new music” “I love tech”
  • 27. MOTIVATION • PRODUCTION - Cheaper to produce (lower barriers to entry for budding artists). • DISTRIBUTION - Internet has made music more accessible. Artists can decide where and how to sell. • CONSUMPTION - People’s listening habits have changed due to the internet and to the change in devices. TECHNOLOGY HAS BEEN A DISRUPTIVE FORCE IN THE MUSIC INDUSTRY.
  • 28. PROBLEMS • ARTISTS - Easier to produce music, harder to make themselves known or earn a living. • RECORD COMPANIES - People buy per song, easy for listener to consume without paying. Wider competition field. • LISTENERS - Too many choices. Discovery is difficult.
  • 29. QUESTIONS • Can one predict the rating of a song? • What factors are important to determine how much a person likes a song? • What is the minimal set of factors that are needed to determine how much a person likes a song?
  • 31. FIRST ATTEMPT • Regression problem • Turn categorical variables into numeric variables • Consider ALL features and pick machine learning algorithm to do the job. CAN ONE PREDICT THE RATING OF A SONG?
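The slide does not say how the categorical variables were turned into numbers. One simple option, shown here only as a sketch, is integer coding: each distinct category gets the next unused integer. The function name and the example values are made up for illustration and are not taken from the EMI data.

```python
def fit_integer_codes(values):
    """Map each distinct category to an integer, in order of first appearance."""
    codes = {}
    for value in values:
        # setdefault only assigns a new code the first time a category is seen
        codes.setdefault(value, len(codes))
    return codes


def encode(values, codes):
    """Replace each category by its integer code."""
    return [codes[value] for value in values]


# Hypothetical column of categorical answers
working_status = ["Employed", "Student", "Employed", "Retired"]
codes = fit_integer_codes(working_status)
encoded = encode(working_status, codes)
print(codes)
print(encoded)
```

Integer codes impose an arbitrary order on the categories. Tree-based models usually cope with that, but one indicator column per category is a common alternative when the order could mislead the model.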
  • 32. FIRST ATTEMPT • Because exploratory analysis revealed ratings are highly clustered, we can look at five different scores and formulate problem as a classification one. CAN ONE PREDICT THE RATING OF A SONG? We split ratings 0-100 in 5 intervals, so each becomes a class and we label these.
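As a sketch of the relabelling step: the slide says ratings 0-100 are split into 5 intervals but does not give the boundaries, so the code below assumes five equal-width intervals of 20 points each, labelled 1 to 5. Both the boundaries and the function name are assumptions.

```python
def rating_to_class(rating, n_classes=5, max_rating=100):
    """Map a 0-100 rating to a class label 1..n_classes (equal-width bins assumed)."""
    if not 0 <= rating <= max_rating:
        raise ValueError("rating must lie between 0 and max_rating")
    width = max_rating / n_classes
    # The top rating would fall into a bin of its own, so clamp it to the last class
    return min(int(rating // width) + 1, n_classes)


labels = [rating_to_class(r) for r in (0, 19, 20, 55, 99, 100)]
print(labels)
```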
  • 35. RANDOM FORESTS Highly versatile ensemble method - combines several models into one. • Random Forests are built by aggregating trees. • Can be used for regression & classification problems. • They resist overfitting as more trees are added and can handle a large number of features. • They also output a list of features that are believed to be important in predicting the variable. A.K.A. BEST “BLACK-BOX” METHOD EVER (BREIMAN / CUTLER)
  • 36. RANDOM FORESTS THE LAYMAN’S INTRO (E. CHEN’s BLOG - 2011) MOVIES 20 QUESTIONS: WILL JAIME LIKE X? BRIENNE IS THE DECISION TREE FOR JAIME’S MOVIE PREFERENCES
  • 37. RANDOM FORESTS THE LAYMAN’S INTRO (E. CHEN’s BLOG - 2011) Ask Tywin, Cersei, Tyrion... Jaime gives each of them slightly different info. THEY FORM A BAGGED FOREST OF JAIME’S MOVIE PREFERENCES. Jaime demands getting different questions every time. THEY NOW FORM A RANDOM FOREST OF JAIME’S MOVIE PREFERENCES
  • 38. RANDOM FORESTS • A tree of maximal depth is grown on a bootstrap sample of size n of the training set. There is no pruning. • A number m << p is specified such that at each node, m variables are sampled at random out of p. The best split on these variables is used to split the node into two subnodes. • Final classification is given by majority voting of the ensemble of trees in the forest. • Only two “free” parameters: number of trees and number of variables in the random subset at each node.
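The three sources of randomness and aggregation described on this slide can be sketched with the standard library alone. This is not a forest implementation (no trees are grown); it only illustrates the bootstrap draw, the random subset of m out of p variables tried at a node, and the majority vote. All function names are illustrative.

```python
import random
from collections import Counter


def bootstrap_split(n_rows, rng):
    """Draw n_rows row indices with replacement; the rest are out-of-bag."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_rows) if i not in chosen]
    return in_bag, out_of_bag


def candidate_features(n_features, m, rng):
    """Pick m distinct feature indices out of n_features to try at one node."""
    return rng.sample(range(n_features), m)


def majority_vote(votes):
    """Return the class predicted by the largest number of trees."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(42)
in_bag, out_of_bag = bootstrap_split(1000, rng)
features = candidate_features(n_features=114, m=10, rng=rng)
winner = majority_vote([3, 2, 3, 1, 3, 2])
print(len(out_of_bag) / 1000)  # fraction of rows this tree never saw
print(sorted(features), winner)
```

On average a bit more than a third of the rows end up out-of-bag for any one tree, which is what makes the out-of-bag error estimate possible without a separate validation set.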
  • 39. RANDOM FORESTS OUT-OF-BAG (OOB) ERROR: The rows left out of a tree's bootstrap sample become a test set for that tree. The OOB error estimate is given by the misclassification error (MSE for regression), averaged over all samples. VARIABLE IMPORTANCE: Determined by looking at how much prediction error increases when (OOB) data for that variable is permuted while all others are left unchanged.
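The permutation idea behind variable importance can be sketched as follows. A toy prediction function stands in for a fitted forest, and the whole data set stands in for the out-of-bag rows; both are simplifying assumptions, and the names are illustrative.

```python
import random


def mse(model, rows, targets):
    """Mean squared error of model predictions on the given rows."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, column, rng):
    """Increase in MSE after shuffling one column, all others left unchanged."""
    baseline = mse(model, rows, targets)
    shuffled = [r[column] for r in rows]
    rng.shuffle(shuffled)
    permuted = [r[:column] + [v] + r[column + 1:] for r, v in zip(rows, shuffled)]
    return mse(model, permuted, targets) - baseline


rng = random.Random(0)
rows = [[float(i), rng.random()] for i in range(50)]
targets = [2.0 * r[0] for r in rows]


def toy_model(row):
    # Stand-in for a fitted forest: uses column 0 only and ignores column 1
    return 2.0 * row[0]


importance_used = permutation_importance(toy_model, rows, targets, 0, rng)
importance_ignored = permutation_importance(toy_model, rows, targets, 1, rng)
print(importance_used, importance_ignored)
```

Shuffling the column the model relies on raises the error, while shuffling the ignored column leaves it unchanged, which is the signal the importance ranking is built on.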
  • 40. RANDOM FORESTS IN R & PYTHON randomForest PACKAGE • Various implementations - randomForest, CARET, PARTY, BIGRF • We follow the KISS procedure - KEEP IT SIMPLE S. • One can test various values of mtry and the number of trees. Used randomForest package 4.6-7 with R 2.15. Defaults are ntree = 500 trees & mtry = p/3 for regression & sqrt(p) for classification.
  • 41. RANDOM FORESTS IN R & PYTHON SCIKIT LEARN Used SCIKIT LEARN 0.14.1 running Python version 2.7.5. COMPUTER: MacBook Pro 2.53 GHz Intel Core 2 Duo with 4 GB 1067 MHz DDR3 running OS X 10.6.8. For the comparison we will build “small” forests and focus on the following simple metrics: • Training Time • RSQ & RMSE (Regression) • Accuracy (Classification)
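For reference, the two regression metrics can be written out in a few lines. The slides do not say exactly how RSQ was computed, so the sketch assumes the usual coefficient of determination, 1 - SS_res / SS_tot; the numbers are made up.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up ratings and predictions on the 0-100 scale
actual = [10, 30, 50, 70]
predicted = [20, 30, 40, 80]
print(rmse(actual, predicted), r_squared(actual, predicted))
```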
  • 42. RANDOM FORESTS IN R RESULTS REGRESSION Split data in training and test sets. Dataframe has 82,714 rows each and 114 columns. Parameters: 60 trees, sample of 50,000. rf <- randomForest(training, ratings_train, ntree = 60, sampsize = 50000, importance = TRUE) Training time: 39.39 min RMSE: 14.587 RSQ: 0.581
  • 43. RANDOM FORESTS IN PYTHON RESULTS REGRESSION Split data in training and test sets. Dataframe has 82,714 rows each and 114 columns. Parameters: 60 trees, sample of 50,000. rf = RandomForestRegressor(n_estimators=60, max_features='sqrt') Training time: 3 min 7 sec RMSE: 14.687 RSQ: 0.575
  • 44. RANDOM FORESTS IN R & PYTHON R PYTHON / SCIKIT LEARN
  • 45. RANDOM FORESTS IN R FEATURE IMPORTANCE
      RANK  FEATURE (% INC MSE)  FEATURE (% INC NODE PURITY)
       1    Beautiful            Talented
       2    Boring               Like Artist
       3    Q16                  Catchy
       4    Catchy               Beautiful
       5    Talented             Boring
       6    Q9                   Track
       7    Q19                  Distinctive
       8    None of these        Cool
       9    Age                  Q11
      10    Track                Q12
      Q16 - I would be willing to pay for the opp to buy new music pre-release
      Q9 - I am out of touch with new music
      Q19 - I like to know about music before other people
      Q11 - Pop music is fun
      Q12 - Pop music helps me escape
      Like artist - To what extent do you like or dislike listening to this artist?
  • 46. RANDOM FORESTS IN R FEATURE IMPORTANCE
  • 47. RANDOM FORESTS IN PYTHON FEATURE IMPORTANCE
      FEATURE            IMPORTANCE IN R RANDOM FOREST
      Distinctive        7
      Catchy             3
      Like Artist        2
      Fun                -
      Talented           1
      Beautiful          4
      Original           -
      Unoriginal         -
      Q11                9
      Own Artist Music   -
      Own Artist Music - Do you have this artist in your music collection?
      Q11 - Pop music is fun
  • 48. RANDOM FORESTS IN R & PYTHON RESULTS REGRESSION
      Model                               RMSE
      R Random Forest                     14.587
      Python Scikit Learn Random Forest   14.687
      Linear Regression                   16.23
      Multiple Linear Regs                15.53
  • 49. RANDOM FORESTS IN R RESULTS CLASSIFICATION ratings_train <- as.factor(ratings_train) rf <- randomForest(training, ratings_train, ntree = 60, sampsize = 50000, importance = TRUE) Training time: 8.75 min OOB error rate: 44.01% Accuracy: 0.567
               1      2      3     4     5
        1  16777   4863   1633   139    37
        2   5760  12411   6213   504    89
        3   1485   5559  13144  1880   329
        4    176    888   4094  2592   625
        5     59    204   1008   856  1388
  • 50. RANDOM FORESTS IN PYTHON RESULTS CLASSIFICATION rf = sk.RandomForestClassifier(n_estimators=60, compute_importances=True, oob_score=True) Training time: 2.56 min OOB Score: 0.1964 Accuracy: 0.566 Precision: 0.564 Recall: 0.5653 F1 Score: 0.5611
               1      2      3     4     5
        1  16930   4682   1758   129    53
        2   5517  12369   6475   506   106
        3   1500   5367  13448  1737   275
        4    186    791   4171  2598   561
        5     48    161    999   880  1466
  • 51. RANDOM FORESTS IN R FEATURE IMPORTANCE
      RANK  FEATURE (% INC MSE)  FEATURE (% INC NODE PURITY)
       1    Q9                   Track
       2    Q7                   Q11
       3    Q5                   Q12
       4    Q6                   Age
       5    Age                  Q6
       6    Q10                  Q17
       7    listBACK             Q9
       8    Q19                  Q16
       9    listOWN              Q4
      10    Q16                  Q13
      Q16 - I would be willing to pay for the opp to buy new music pre-release
      Q9 - I am out of touch with new music
      Q19 - I like to know about music before other people
      Q11 - Pop music is fun
      Q12 - Pop music helps me escape
      Q7 - I enjoy music primarily from going out to dance
      Q5 - I used to know where to find music
      Q6 - I am not willing to pay for music
      Q10 - My music collection is a source of pride
      Q4 - I would like to buy new music but I don’t know what to buy
      Q17 - I find seeing a new artist a useful way of discovering new music
  • 52. RANDOM FORESTS IN PYTHON FEATURE IMPORTANCE
      FEATURE   IMPORTANCE IN R RANDOM FOREST
      Q11       2
      Q12       3
      Age       4
      Q6        5
      Q17       6
      Q5        -
      Q4        9
      Q10       -
      Q16       7
      Q7        -
      Q16 - I would be willing to pay for the opp to buy new music pre-release
      Q11 - Pop music is fun
      Q12 - Pop music helps me escape
      Q5 - I used to know where to find music
      Q6 - I am not willing to pay for music
      Q10 - My music collection is a source of pride
      Q4 - I would like to buy new music but I don’t know what to buy
      Q17 - I find seeing a new artist a useful way of discovering new music
  • 53. RANDOM FORESTS IN R CONFUSION MATRIX
               1      2      3     4     5   CLASS ERROR
        1  16777   4863   1633   139    37   28.45%
        2   5760  12411   6213   504    89   50.31%
        3   1485   5559  13144  1880   329   41.31%
        4    176    888   4094  2592   625   69.09%
        5     59    204   1008   856  1388   60.51%
  • 54. RANDOM FORESTS IN PYTHON CONFUSION MATRIX
               1      2      3     4     5   CLASS ERROR
        1  16930   4682   1758   129    53   28.12%
        2   5517  12369   6475   506   106   50.47%
        3   1500   5367  13448  1737   275   39.77%
        4    186    791   4171  2598   561   68.73%
        5     48    161    999   880  1466   58.75%
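The last column of the two confusion matrices can be reproduced from the counts. The sketch below uses the Python matrix from slide 54 and assumes each row holds one true class and each column one predicted class, so the class error is one minus the diagonal entry over the row total. That orientation is an assumption, although it reproduces the percentages on the slide.

```python
# Counts copied from slide 54 (rows assumed to be true classes 1..5)
confusion = [
    [16930, 4682, 1758, 129, 53],
    [5517, 12369, 6475, 506, 106],
    [1500, 5367, 13448, 1737, 275],
    [186, 791, 4171, 2598, 561],
    [48, 161, 999, 880, 1466],
]


def class_errors(matrix):
    """Share of each row that falls off the diagonal."""
    return [1.0 - row[i] / sum(row) for i, row in enumerate(matrix)]


def accuracy(matrix):
    """Diagonal total divided by the grand total."""
    correct = sum(row[i] for i, row in enumerate(matrix))
    return correct / sum(sum(row) for row in matrix)


errors = class_errors(confusion)
overall = accuracy(confusion)
print([round(100 * e, 2) for e in errors], round(overall, 3))
```

The classes with few examples (4 and 5) have by far the highest error, which the single accuracy figure of about 0.566 hides.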
  • 56. FEATURE SELECTION PRINCIPAL COMPONENT ANALYSIS - WORDS Determine which features account for most of the variance.
      FEATURE       PC1      PC2
      Distinctive    0.20    -0.059
      Authentic      0.19    -0.046
      Talented       0.19    -0.083
      Credible       0.19    -0.084
      Stylish        0.18    -0.094
      Annoying      -0.06    -0.065
      Intrusive     -0.06    -0.058
      Irrelevant    -0.059   -0.087
      Uninspired    -0.056   -0.092
      Noisy         -0.053   -0.13
  • 57. FEATURE SELECTION Make a simple model choosing meaningful variables. WORDS - Annoying, Depressing, Boring, Catchy, Talented, Distinctive, Beautiful, Superstar, Soulful and Popular. QUESTIONS - Q4, Q5, Q6, Q9, Q10, Q11 and Q19. • Running time in R ~ 15 min. • RMSE = 14.791 / Public leader board 13.076
  • 59. COMMENTS Random Forests have been shown to be biased towards highly correlated variables. Using conditional inference trees ameliorates that bias (see the Party package in R). SCIKIT learn’s implementation has an n_jobs parameter to parallelise training. For a similar feature in R, see the bigRF package.
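A minimal sketch of the n_jobs parameter in use, written against a recent scikit-learn release rather than the 0.14.1 used in the talk. The data is synthetic and the variable names are illustrative. The compute_importances argument from the slides is left out because later releases dropped it; feature_importances_ is available after fitting without it.

```python
import random

from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 300 rows, 4 features, and a label that depends on feature 0 only
rng = random.Random(0)
X = [[rng.random() for _ in range(4)] for _ in range(300)]
y = [1 if row[0] > 0.5 else 0 for row in X]

rf = RandomForestClassifier(
    n_estimators=60,   # same "small" forest size as in the talk
    oob_score=True,    # keep the out-of-bag accuracy estimate
    n_jobs=2,          # build trees in two parallel jobs
    random_state=0,    # make the run repeatable
)
rf.fit(X, y)

oob_accuracy = rf.oob_score_
importances = list(rf.feature_importances_)
print(oob_accuracy, importances)
```

Setting n_jobs=-1 instead asks scikit-learn to use all available cores.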
  • 61. CONCLUDING REMARKS We solved a problem using both R and PYTHON (via Scikit learn). PICK THE TOOL THAT IS BEST FOR THE JOB: Clearly constraints for addressing a given problem might differ and would dictate the implementation of choice. WORTH LEARNING ABOUT BOTH IMPLEMENTATIONS: Both R and PYTHON (via SCIKIT LEARN) implementations have added functions that allow the user to explore the resulting model and its performance.
  • 62. CONCLUDING REMARKS RANDOM FORESTS ARE GREAT: They give great accuracy, can handle many features, do not require cross validation and even estimate which variables are important. KEEP AN EYE OUT FOR INTERESTING DATA: Having data that you are interested in leads to more interesting questions and reasons to explore new methods and add a new trick to your bag.
  • 63. CONCLUDING REMARKS EMI DATASET IS GREAT TO TEST RIDE: The set has a lot of behavioural information on a subject about which everyone has some intuition. TO DO’s - WILL IT PYTHON?: Prediction using SVMs and Matrix Factorisation techniques. Full factor analysis, etc.