Poo Kuan Hoong
Build an effective
Machine Learning
Model with LightGBM
Agenda
• Introduction
• Decision Tree
• Ensemble Method
• Gradient Boosting
• Motivation for Gradient Boosting on Decision Trees
• LightGBM
• Demo
About Me
Poo Kuan Hoong
• Google Developer Expert (GDE) in Machine
Learning
• Founder and manager of the Malaysia R User Group
and the TensorFlow & Deep Learning Malaysia User
Group
Malaysia R User Group
https://www.facebook.com/groups/MalaysiaRUserGroup/
Questions?
www.sli.do #X490
Introduction
• Everyone is jumping on the
Deep Learning hype.
• However, Deep Learning is not always
the best model.
• Deep Learning requires a lot of data,
hyperparameter tuning and training
time.
• Often, the best model is the simplest
model.
Decision Tree
Goal
1. Partition input space
2. Pure class distribution in each partition
Decision Trees: Guillotine cuts
Finding The Best Split
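The split search can be sketched in a few lines of plain Python. This is a minimal illustration, assuming Gini impurity as the purity measure and a single numeric feature; the slides do not say which criterion was used, and the names gini and best_split are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(xs, ys):
    """Return (threshold, weighted_impurity) of the best cut on one feature.

    Every midpoint between two distinct neighbouring values is tried.
    If no cut improves on the unsplit node, the threshold is None.
    """
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best = (None, gini(ys))  # baseline: impurity of the unsplit node
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between identical feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, score)
    return best

threshold, impurity = best_split(
    [1.0, 2.0, 3.0, 10.0, 11.0, 12.0],
    ["a", "a", "a", "b", "b", "b"],
)
print(threshold, impurity)  # 6.5 0.0
```

A full tree builder would repeat this search over every feature and then recurse into the two resulting partitions.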
Greedily Constructing A Decision Tree
Ensemble Methods
1. Weighted combination of
weak learners
2. Prediction is based on
committee votes
3. Boosting:
   1. Train the ensemble one weak
   learner at a time
   2. Focus new learners on
   wrongly predicted examples
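A weighted committee vote can be sketched as follows. The function name weighted_vote, the labels and the weights are made up for illustration; in a boosted ensemble the weights would come from training.

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Return the label whose supporting learners carry the most total weight."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)

# Three weak learners vote; the first one is trusted more than the other two together.
print(weighted_vote(["spam", "ham", "ham"], [0.6, 0.25, 0.25]))  # spam
# With equal weights the vote reduces to a simple majority.
print(weighted_vote(["spam", "ham", "ham"], [1.0, 1.0, 1.0]))    # ham
```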
Gradient Boosting
1. Learn a regressor
2. Compute the error residual (for squared-error loss this is the negative gradient of the loss, the same quantity that drives training in deep learning)
3. Then build a new model to predict that residual
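The three steps above can be sketched in plain Python. This is a toy version under stated assumptions: squared-error loss, one numeric feature with at least two distinct values, depth-1 trees (stumps) as weak learners, and a shrinkage factor learning_rate. The helper names are hypothetical and this is not how LightGBM is implemented.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree by minimising the squared error."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    base = sum(ys) / len(ys)                  # step 1: a constant regressor
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # step 2: residuals
        stump = fit_stump(xs, residuals)                # step 3: model the residuals
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = gradient_boost(xs, ys)

def mse(predict):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

mean_y = sum(ys) / len(ys)
print(mse(lambda x: mean_y), mse(model))  # the boosted model has the lower error
```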
Motivation for gradient boosting on Decision Trees
A single decision tree can easily overfit the data.
Naïve Gradient Boosting
Gradient boosting on decision trees
• Let’s define our objective functions
Gradient boosting on decision trees – regularization
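The formulas from these slides did not survive extraction. For orientation only, a commonly used regularized objective for boosted trees (the form popularized by XGBoost, assumed here rather than taken from the slides) is shown below, where l is the training loss, f_k is the k-th tree, T is its number of leaves, w_j are its leaf weights, and γ and λ are regularization strengths.

```latex
\[
\mathrm{Obj} = \sum_{i=1}^{n} l\bigl(y_i, \hat{y}_i\bigr) + \sum_{k=1}^{K} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\,\lambda \sum_{j=1}^{T} w_j^{2}
\]
```

The first sum rewards fitting the data; the second penalizes trees with many leaves or large leaf weights, which is what keeps a single tree from overfitting.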
Tricks from XGBoost
• The tree is grown breadth-first (as opposed to depth-first,
as in the original C4.5 implementation). This makes it possible to
sort and traverse the data only once per level.
• Furthermore, the sorted features can be cached, so the sort
does not have to be repeated.
LightGBM
• LightGBM is a fast, distributed, high-performance gradient boosting
framework based on decision tree algorithms, used for ranking,
classification and many other machine learning tasks.
• New library, developed by Microsoft, part of the Distributed Machine
Learning Toolkit.
• Main idea: make training faster
• First release: April 24th, 2017
Gradient Boosting Machine (GBM)
Why LightGBM?
• LightGBM grows trees vertically while
other algorithms grow trees horizontally:
LightGBM grows trees leaf-wise, while
other algorithms grow them level-wise.
• It chooses the leaf with the maximum delta
loss to grow. For the same number of
leaves, a leaf-wise algorithm can reduce
more loss than a level-wise algorithm.
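The difference between the two growth strategies can be sketched on a toy tree of made-up gains. Everything here is hypothetical: TREE stores, for each node, the loss reduction obtained by splitting it, and both functions are given the same split budget. A real level-wise learner would split a whole level at once; the budget is only there to make the comparison fair.

```python
import heapq

# Each node is (loss_reduction_if_split, left_child, right_child).
# None means that the child cannot be split further.
TREE = (10.0,
        (6.0, (5.0, None, None), (1.0, None, None)),
        (0.5, (0.2, None, None), (0.1, None, None)))

def leaf_wise(root, n_splits):
    """Always split the leaf with the largest loss reduction (best-first)."""
    heap = [(-root[0], 0, root)]
    counter = 1  # tie-breaker so the heap never has to compare node tuples
    total = 0.0
    for _ in range(n_splits):
        if not heap:
            break
        neg_gain, _, node = heapq.heappop(heap)
        total += -neg_gain
        for child in node[1:]:
            if child is not None:
                heapq.heappush(heap, (-child[0], counter, child))
                counter += 1
    return total

def level_wise(root, n_splits):
    """Split level by level, left to right, until the budget runs out."""
    level, total, used = [root], 0.0, 0
    while level and used < n_splits:
        next_level = []
        for node in level:
            if used == n_splits:
                break
            total += node[0]
            used += 1
            next_level.extend(c for c in node[1:] if c is not None)
        level = next_level
    return total

print(leaf_wise(TREE, 3))   # 21.0 (splits with gains 10, 6 and 5)
print(level_wise(TREE, 3))  # 16.5 (splits with gains 10, 6 and 0.5)
```

With three splits (four leaves) in both cases, the leaf-wise strategy removes more loss because it skips the low-gain node on the second level.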
Features
Speed
• The ‘Light’ in LightGBM refers to its high speed. LightGBM
can handle large datasets and needs less memory to
run.
Accuracy
• LightGBM focuses on accuracy of results.
Distributed/Parallel Computing
• LightGBM also supports GPU learning
Tips to fine tune LightGBM
• The following practices can be used to improve your model's
efficiency.
• num_leaves: This is the main parameter to control the complexity of the
tree model. Ideally, the value of num_leaves should be less than or equal
to 2^(max_depth). Larger values tend to cause overfitting.
• min_data_in_leaf: Setting it to a large value can avoid growing too deep
a tree, but may cause under-fitting. In practice, setting it to hundreds or
thousands is enough for a large dataset.
• max_depth: You can also use max_depth to limit the tree depth
explicitly.
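The num_leaves bound from the tips above is easy to check before training. The helper names and the example values are hypothetical; only the parameter names come from the slides.

```python
def max_leaves_for_depth(max_depth):
    """A binary tree of depth max_depth has at most 2^max_depth leaves."""
    return 2 ** max_depth

def check_num_leaves(num_leaves, max_depth):
    """Return True when num_leaves respects the 2^(max_depth) bound."""
    return num_leaves <= max_leaves_for_depth(max_depth)

# Illustrative values only; good settings depend on the dataset.
params = {"num_leaves": 70, "max_depth": 7, "min_data_in_leaf": 100}

print(max_leaves_for_depth(params["max_depth"]))                     # 128
print(check_num_leaves(params["num_leaves"], params["max_depth"]))   # True
print(check_num_leaves(255, 7))                                      # False
```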
Tips to fine tune LightGBM
• For Faster Speed:
• Use bagging by setting bagging_fraction and
bagging_freq
• Use feature sub-sampling by setting
feature_fraction
• Use a small max_bin
• Use save_binary to speed up data loading in future
training runs
• Use parallel learning; refer to the Parallel Learning Guide.
Tips to fine tune LightGBM
• For better accuracy:
• Use a large max_bin (may be slower)
• Use a small learning_rate with a large num_iterations
• Use a large num_leaves (may cause over-fitting)
• Use more training data
• Try dart
• Try using categorical features directly
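The speed and accuracy tips can be collected into two parameter sets. The parameter names are the ones listed on the slides; every value below is an illustrative guess rather than a recommendation, and merge_params is a hypothetical helper. A dictionary like this could then be passed as the params argument of the train function in the LightGBM Python or R package.

```python
# Shared settings; "binary" is assumed here because the demo is a binary classification task.
base_params = {"objective": "binary"}

# Tips for faster training: subsample rows and columns, use fewer histogram bins.
speed_params = {
    "bagging_fraction": 0.8,
    "bagging_freq": 5,
    "feature_fraction": 0.8,
    "max_bin": 63,
}

# Tips for better accuracy: more bins, a small learning rate with many rounds, more leaves.
accuracy_params = {
    "max_bin": 511,
    "learning_rate": 0.01,
    "num_iterations": 2000,
    "num_leaves": 127,
    "boosting": "dart",
}

def merge_params(base, overrides):
    """Return a new dict in which overrides take precedence over base."""
    return {**base, **overrides}

fast = merge_params(base_params, speed_params)
accurate = merge_params(base_params, accuracy_params)
print(fast["max_bin"], accurate["max_bin"])  # 63 511
```

The two sets pull max_bin in opposite directions, which is the trade-off the slides describe: fewer bins train faster, more bins can fit the data more closely.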
Conclusion
• LightGBM works well on
multiple datasets, and its
accuracy is as good as or even
better than that of other boosting
algorithms.
• Given its speed and
accuracy, it is recommended
to try LightGBM.
To install LightGBM R Package
• Build and install the R package with the following commands:
git clone --recursive https://github.com/Microsoft/LightGBM
cd LightGBM
Rscript build_r.R
https://github.com/Microsoft/LightGBM/tree/master/R-package
DEMO
Data
• Porto Seguro’s Safe Driver Prediction
• https://www.kaggle.com/c/porto-seguro-safe-driver-prediction
Poo Kuan Hoong
kuanhoong@gmail.com
http://www.linkedin.com/in/kuanhoong
https://twitter.com/kuanhoong
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 

Build an efficient Machine Learning model with LightGBM

  • 1. Poo Kuan Hoong Build an effective Machine Learning Model with LightGBM
  • 2. Agenda • Introduction • Decision Tree • Ensemble Method • Gradient Boosting • Motivation for Gradient Boosting on Decision Trees • LightGBM • Demo
  • 3. About Me Poo Kuan Hoong • Google Developer Expert (GDE) in Machine Learning • Founder and manager of the Malaysia R User Group and the TensorFlow & Deep Learning Malaysia User Group
  • 4. Malaysia R User Group https://www.facebook.com/groups/MalaysiaRUserGroup/
  • 6. Introduction • Everyone is jumping on the Deep Learning hype. • However, Deep Learning is not always the best model. • Deep Learning requires a lot of data, hyperparameter tuning and training time. • Often, the best model is the simplest model.
  • 8. Goal 1. Partition input space 2. Pure class distribution in each partition
  • 16. Greedily Constructing A Decision Tree
  • 20. Ensemble Methods 1. Weighted combination of weak learners 2. Prediction is based on committee votes 3. Boosting: 1. Train the ensemble one weak learner at a time 2. Focus new learners on wrongly predicted examples
  • 21. Gradient Boosting 1. Learn a regressor 2. Compute the error residual (for squared-error loss this is the negative gradient of the loss, analogous to the gradient in deep learning) 3. Then build a new model to predict that residual
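The three steps on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, assuming squared-error loss, a single numeric feature with at least two distinct values, and one-split trees ("stumps") as weak learners. The names `fit_stump`, `boost` and `mean_squared_error`, and the default `learning_rate`, are hypothetical and not taken from the slides or from LightGBM.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree (stump) to the residuals by least squares."""
    best = None
    # Try every threshold that leaves at least one point on each side.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Return a predictor built by repeatedly fitting stumps to the residuals."""
    base = sum(ys) / len(ys)          # step 1: start from a simple regressor
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]   # step 2: error residual
        stump = fit_stump(xs, residuals)                 # step 3: model the residual
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mean_squared_error(model, xs, ys):
    """Average squared difference between the model's predictions and the targets."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)
```

Calling `boost(xs, ys)` returns a function that predicts a value for a new `x`. On the training data, adding rounds lowers the squared error because each stump is fitted to what the previous rounds left unexplained.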
  • 22. Motivation for gradient boosting on Decision Trees • A single decision tree can easily overfit the data.
  • 24. Gradient boosting on decision trees • Let’s define our objective functions
  • 25. Gradient boosting on decision trees – regularization
  • 26. Tricks from XGBoost • The tree is grown in breadth first fashion (as opposed to depth first like in the original C4.5 implementation). This provides a possibility of sorting and traversing data only once on each level • Furthermore, the sorted features can be cached – no need to sort that many times
  • 27. LightGBM • LightGBM is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks. • New library, developed by Microsoft as part of the Distributed Machine Learning Toolkit. • Main idea: make training faster. • First release: April 24th, 2017
  • 29. Why LightGBM? • LightGBM grows trees vertically while other algorithms grow trees horizontally: LightGBM grows trees leaf-wise, while other algorithms grow them level-wise. • It chooses the leaf with the maximum delta loss to grow. For the same number of leaves, a leaf-wise algorithm can reduce more loss than a level-wise algorithm.
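Leaf-wise growth can be illustrated with a toy regression tree on one feature: at every step, the leaf whose best split reduces the squared error the most is the one that gets split. This is a simplified sketch of the idea, not LightGBM's implementation (which works on histograms and gradients). The helper names `sse`, `best_split`, `grow_leaf_wise` and `total_sse` are hypothetical.

```python
def sse(ys):
    """Sum of squared errors around the mean (the loss of a single leaf)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_split(points):
    """Return (gain, left, right) for the best threshold split of a leaf, or None."""
    points = sorted(points)
    parent = sse([y for _, y in points])
    best = None
    for i in range(1, len(points)):
        if points[i][0] == points[i - 1][0]:
            continue  # cannot split between identical feature values
        left, right = points[:i], points[i:]
        gain = parent - sse([y for _, y in left]) - sse([y for _, y in right])
        if best is None or gain > best[0]:
            best = (gain, left, right)
    return best


def grow_leaf_wise(points, num_leaves):
    """Repeatedly split the leaf with the largest loss reduction."""
    leaves = [list(points)]
    while len(leaves) < num_leaves:
        candidates = [(best_split(leaf), i) for i, leaf in enumerate(leaves)]
        candidates = [(s, i) for s, i in candidates if s is not None and s[0] > 0]
        if not candidates:
            break  # no leaf can be improved any further
        (gain, left, right), i = max(candidates, key=lambda c: c[0][0])
        leaves[i:i + 1] = [left, right]  # replace the chosen leaf by its children
    return leaves


def total_sse(leaves):
    """Total loss of the tree: the sum of the per-leaf squared errors."""
    return sum(sse([y for _, y in leaf]) for leaf in leaves)
```

A level-wise learner would instead split every leaf on the current level before moving deeper, even where the gain is small. The leaf-wise loop spends each split where it helps the most, which is also why such trees can become deep and need limits such as `num_leaves` or `max_depth`.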
  • 30. Features Speed • LightGBM is prefixed with ‘Light’ because of its high speed. It can handle large datasets and uses less memory to run. Accuracy • LightGBM focuses on accuracy of results. Distributed/Parallel Computing • LightGBM also supports GPU learning.
  • 31. Tips to fine tune LightGBM • The following practices can be used to improve your model. • num_leaves: This is the main parameter controlling the complexity of the tree model. The value of num_leaves should be less than or equal to 2^(max_depth), the largest number of leaves a tree of that depth can have; values close to this bound tend to overfit. • min_data_in_leaf: Setting it to a large value can avoid growing too deep a tree, but may cause under-fitting. In practice, setting it to hundreds or thousands is enough for a large dataset. • max_depth: You can also use max_depth to limit the tree depth explicitly.
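The num_leaves rule above is simple arithmetic and can be checked before training. The helper below is a hypothetical sketch (the function name is not part of LightGBM). It assumes LightGBM's convention that a max_depth of zero or less means "no depth limit", and the values in `params` are illustrative only.

```python
def num_leaves_within_depth(num_leaves, max_depth):
    """Return True if num_leaves respects the 2^(max_depth) bound."""
    if max_depth <= 0:
        # Assumed convention: max_depth <= 0 means the depth is unlimited,
        # so the bound does not apply.
        return True
    return num_leaves <= 2 ** max_depth


# Illustrative settings for a large dataset (values are assumptions).
params = {
    "num_leaves": 31,         # main complexity control
    "max_depth": 6,           # 2**6 = 64 leaves at most, so 31 is within bounds
    "min_data_in_leaf": 100,  # larger values guard against very deep trees
}
```

For example, `num_leaves_within_depth(params["num_leaves"], params["max_depth"])` is true for the settings shown, while 64 leaves with `max_depth` 5 would exceed the bound of 32.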
  • 32. Tips to fine tune LightGBM • For faster speed: • Use bagging by setting bagging_fraction and bagging_freq • Use feature sub-sampling by setting feature_fraction • Use small max_bin • Use save_binary to speed up data loading in future learning • Use parallel learning; refer to the parallel learning guide.
  • 33. Tips to fine tune LightGBM • For better accuracy: • Use large max_bin (may be slower) • Use small learning_rate with large num_iterations • Use large num_leaves (may cause over-fitting) • Use bigger training data • Try dart • Try to use categorical feature directly
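The speed and accuracy tips from the last two slides can be written down as two parameter sets. The parameter names come from the slides; every value is an illustrative assumption, not a recommendation from the talk, and should be tuned per dataset. save_binary is omitted because it concerns data loading rather than training.

```python
# Settings leaning towards faster training (values are assumptions).
speed_params = {
    "bagging_fraction": 0.8,  # train each iteration on 80% of the rows
    "bagging_freq": 5,        # re-sample the rows every 5 iterations
    "feature_fraction": 0.8,  # consider 80% of the features per tree
    "max_bin": 63,            # fewer histogram bins: faster, coarser splits
}

# Settings leaning towards better accuracy (values are assumptions).
accuracy_params = {
    "max_bin": 511,           # more bins: finer splits, slower training
    "learning_rate": 0.01,    # small steps ...
    "num_iterations": 5000,   # ... compensated by many boosting rounds
    "num_leaves": 127,        # larger trees; may over-fit
    "boosting": "dart",       # the "try dart" tip
}
```

In the Python package, a dictionary like this would typically be passed as the `params` argument of `lightgbm.train`; the R package accepts an equivalent named list.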
  • 34. Conclusion • LightGBM works well on multiple datasets, and its accuracy is as good as or even better than that of other boosting algorithms. • Given its speed and accuracy, LightGBM is worth trying.
  • 35. To install LightGBM R Package • Build and install the R package with the following commands: git clone --recursive https://github.com/Microsoft/LightGBM; cd LightGBM; Rscript build_r.R • See https://github.com/Microsoft/LightGBM/tree/master/R-package
  • 36. DEMO
  • 37. Data • Porto Seguro’s Safe Driver Prediction • https://www.kaggle.com/c/porto-seguro-safe-driver-prediction