Decision tree upload
1. Machine Learning & Decision Trees
Nithum Thain
January 12th, 2013
2. Overview
• The Data Science Value Chain
• Common Uses for Machine Learning
• The Art of Prediction & Classification
• Introduction to Decision Trees
• The Data Ninja Methodology
3. The Data Science Value Chain
[Diagram: Collection → Storage & Maintenance → Visualization & Analysis → Strategy, Marketing, Product, Operations. A callout on the diagram reads "Machine Learning lives here".]
5. Machine Learning vs. Artificial Intelligence
• Artificial Intelligence is a set of tools that allow machines to
perform higher order functions. These include natural language
processing, robotics, knowledge representation, etc.
• Machine Learning is a subset of artificial intelligence. It is a set of
(usually statistical) tools that allow machines to detect and extract
patterns from data.
16. What is a Prediction Problem?
• A set of known input variables.
• An unknown output variable.
• A training set of data for which both the inputs and
outputs are known.
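As a minimal sketch of these three ingredients in Python: the data below is made up, and the names training_set, new_inputs and predict_baseline are hypothetical, not from the slides. Each training row pairs known inputs with a known output, and the simplest possible predictor ignores the inputs and returns the mean training output.

```python
# Made-up training set: each entry is (known inputs, known output).
# The output here is "games bought per year".
training_set = [
    ({"owns_console": "yes", "hours_played": 10}, 8),
    ({"owns_console": "no", "hours_played": 12}, 9),
    ({"owns_console": "no", "hours_played": 2}, 1),
    ({"owns_console": "yes", "hours_played": 3}, 2),
]

# A new data point: the inputs are known, the output is not.
new_inputs = {"owns_console": "yes", "hours_played": 11}


def predict_baseline(training_set, inputs):
    """Ignore the inputs and predict the mean of the training outputs."""
    outputs = [output for _, output in training_set]
    return sum(outputs) / len(outputs)


print(predict_baseline(training_set, new_inputs))  # prints 5.0
```

A useful algorithm should beat this baseline by actually using the inputs.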
18. The Algorithms Are Many
• Regression
• Decision Trees
• Neural Networks
• Support Vector Machines
• Random Forests
• Naive Bayes Classifier
Each has its own strengths and weaknesses.
24. I Did!
[Figure: example decision tree. Node labels recovered from the slide: Internet, Friends?, Video Games?. Leaf predictions: XBOX 360, PS3, Wii, PC.]
25. How Our Algorithm Works
1. Start with the “root” node.
2. Check whether every row in the data has the same output value. If so,
you are done.
3. Check how every possible input variable splits the data.
4. Choose the split that separates the data best:
- The one that most reduces the variance of the output variable
in the resulting sets.
5. Repeat the process for the resulting “true” node and “false”
node.
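Step 4 can be illustrated with a small calculation. This is a sketch under assumptions: the outputs are made-up numbers, weighted_variance is a hypothetical helper, and weighting each resulting set by its size is one common convention that the slides do not spell out.

```python
from statistics import pvariance


def weighted_variance(true_outputs, false_outputs):
    """Variance left after a split, weighting each set by its size."""
    n = len(true_outputs) + len(false_outputs)
    return (len(true_outputs) * pvariance(true_outputs)
            + len(false_outputs) * pvariance(false_outputs)) / n


outputs = [8, 9, 1, 2]       # output variable before any split
split_a = ([8, 9], [1, 2])   # candidate split that separates high from low
split_b = ([8, 2], [9, 1])   # candidate split that mixes high and low

print(pvariance(outputs))            # 12.5
print(weighted_variance(*split_a))   # 0.25
print(weighted_variance(*split_b))   # 12.5
```

Split A removes almost all of the variance while split B removes none, so the algorithm would choose split A.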
33. The Classes and Functions We Will Build:
Classes
• decisionnode: The basic building block of our tree
Functions
• divideset: Splits a data set into two sets based on the value of a variable
• variance: Calculates the variance of the output variable in a set
• buildtree: Builds the tree according to the algorithm described
• classify: For any new data points, uses the tree to predict their value
• printree: Prints a text-based version of the full decision tree
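One way these pieces might fit together is sketched below. The class and function names come from the list above; everything else is an assumption for illustration: rows are lists whose last column is a numeric output, each input column holds a single type, numeric inputs are tested with >= and other inputs with ==, the passes helper is hypothetical, and the toy data is made up.

```python
class decisionnode:
    """A question node (col, value, tb, fb) or a leaf (result)."""

    def __init__(self, col=None, value=None, result=None, tb=None, fb=None):
        self.col = col        # index of the input variable tested at this node
        self.value = value    # value the input is compared against
        self.result = result  # predicted output; set only on leaves
        self.tb = tb          # subtree for rows where the test is true
        self.fb = fb          # subtree for rows where the test is false


def passes(row, col, value):
    """Hypothetical helper: numeric inputs use >=, all others use ==."""
    if isinstance(value, (int, float)):
        return row[col] >= value
    return row[col] == value


def divideset(rows, col, value):
    """Split rows into (true_rows, false_rows) on one input variable."""
    true_rows = [r for r in rows if passes(r, col, value)]
    false_rows = [r for r in rows if not passes(r, col, value)]
    return true_rows, false_rows


def variance(rows):
    """Population variance of the output (last) column."""
    outputs = [r[-1] for r in rows]
    mean = sum(outputs) / len(outputs)
    return sum((y - mean) ** 2 for y in outputs) / len(outputs)


def buildtree(rows):
    """Recursively pick the split that most reduces weighted variance."""
    if len({r[-1] for r in rows}) == 1:  # all outputs equal: make a leaf
        return decisionnode(result=rows[0][-1])
    current, best_gain, best = variance(rows), 0.0, None
    for col in range(len(rows[0]) - 1):
        for value in sorted({r[col] for r in rows}):
            tb, fb = divideset(rows, col, value)
            if not tb or not fb:
                continue  # a split must leave rows on both sides
            p = len(tb) / len(rows)
            gain = current - p * variance(tb) - (1 - p) * variance(fb)
            if gain > best_gain:
                best_gain, best = gain, (col, value, tb, fb)
    if best is None:  # no useful split: fall back to the mean output
        return decisionnode(result=sum(r[-1] for r in rows) / len(rows))
    col, value, tb, fb = best
    return decisionnode(col=col, value=value, tb=buildtree(tb), fb=buildtree(fb))


def classify(row, tree):
    """Walk from the root to a leaf and return its prediction."""
    while tree.result is None:
        tree = tree.tb if passes(row, tree.col, tree.value) else tree.fb
    return tree.result


def printree(tree, indent=""):
    """Print a text version of the tree, one question or leaf per line."""
    if tree.result is not None:
        print(f"{indent}predict {tree.result}")
        return
    op = ">=" if isinstance(tree.value, (int, float)) else "=="
    print(f"{indent}column {tree.col} {op} {tree.value!r}?")
    print(f"{indent}  true:")
    printree(tree.tb, indent + "    ")
    print(f"{indent}  false:")
    printree(tree.fb, indent + "    ")


# Made-up toy data: [owns_console, hours_played_per_week, games_bought_per_year]
rows = [["yes", 10, 8], ["no", 12, 9], ["no", 2, 1], ["yes", 3, 2]]
tree = buildtree(rows)
printree(tree)
print(classify(["yes", 11], tree))  # prints 8
```

On this toy data the root question is whether hours played is at least 10, because that split removes the most variance. A new point only needs the input columns, since classify never reads the output column.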
37. The Data Ninja Methodology
1. Find the appropriate data
2. Play with the data (plot, sort, examine)
3. Clean the data
4. Choose the appropriate tool for analysis
5. Apply the tool
6. Repeat steps 2-5 until something works
7. ...
8. Profit!