Ruifeng Zheng, JD.COM
Yanbo Liang, Hortonworks
Apache Spark and Machine Learning Boosts Revenue Growth for Online Retailers
#MLSAIS16
About us
Ruifeng Zheng ruifengz@foxmail.com
– Senior Software Engineer in Intelligent Advertising Lab at JD.COM
– Apache Spark, Scikit-Learn & XGBoost contributor
– SparkLibFM & SparkGBM Author
Yanbo Liang ybliang8@gmail.com
– Staff Software Engineer at Hortonworks
– Apache Spark PMC member
– TensorFlow & XGBoost contributor
Outline
• What are the problems?
• How do we solve them?
• The lessons learned.
• Where is the gap?
• Enhancements
– ALS with warm start
– SparkGBM – a new GBM implementation atop Spark
• Future work
About JD.com & WiWin
JD.com
China’s largest online retailer
China’s largest e-commerce delivery system
300+ million active users
Billions of SKUs on shelves, in thousands of categories
WiWin Team in Business Growth Dept.
Supply Data-Mining Services for Top Brands
Business scenarios
– User Segmentation
– Cross Selling
– Purchase Prediction
User Segmentation
Demand: help brands measure marketing campaigns (beyond ROI and GMV)
Input: Impression, Click, View, Search, Ask, Follow, Order, Comment, Cart, etc.
-> 4A Model
-> 4A status of each user: Aware, Appeal, Act, Advocate
User Segmentation
V1: RDD only
V2: DataFrame + RDD (RDD used only for complex operations)
– ~3.2x speedup
– 40% smaller memory footprint
Cross-Selling
Demand: help brands find potential partners among millions of brands
Cross-Selling
Flexible pattern (#SPARK-13385): rules like {A,B,C} -> {X,Y}
Lift: measures how many more times X and Y occur together than expected
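The measure described above is commonly called lift: the observed support of the antecedent and consequent together, divided by the support expected if they were independent. A minimal sketch in plain Python, assuming toy baskets of brand labels (the data and the helper names `support` and `lift` are made up for illustration, not taken from the talk):

```python
from typing import FrozenSet, List


def support(transactions: List[FrozenSet[str]], items: FrozenSet[str]) -> float:
    """Fraction of transactions that contain every item in `items`."""
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def lift(transactions: List[FrozenSet[str]],
         antecedent: FrozenSet[str],
         consequent: FrozenSet[str]) -> float:
    """Observed co-occurrence divided by co-occurrence expected under independence."""
    joint = support(transactions, antecedent | consequent)
    expected = support(transactions, antecedent) * support(transactions, consequent)
    return joint / expected


# Toy baskets (hypothetical data).
baskets = [
    frozenset({"A", "B", "X"}),
    frozenset({"A", "B", "X"}),
    frozenset({"A", "B"}),
    frozenset({"X"}),
    frozenset({"C"}),
    frozenset({"C"}),
    frozenset({"C"}),
    frozenset({"C"}),
]

rule_lift = lift(baskets, frozenset({"A", "B"}), frozenset({"X"}))
```

A lift above 1 means the two sides appear together more often than chance would suggest, which is the signal a brand would look for in a potential partner.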
Purchase Prediction
Demand: help brands better target potential users
Different from traditional recommendation:
– Traditional recommendation: for each user, select several items
– Purchase prediction: for several items, select millions of users
Purchase Prediction - Ranking
Pipeline
[Diagram: CRISP-DM process built on an in-house data processing toolchain, Spark-Shell (Jupyter), Spark-SQL, MLlib, GraphX, and Spark-Streaming]
Lessons Learned - 1
Multi-column processing
– Imputer #SPARK-21690
– ApproxQuantile #SPARK-14352
– Bucketizer #SPARK-22797
[Chart: Training time of Imputer (sec) for 1, 10, and 100 columns; One-Col vs. Multi-Col]
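One way to see why multi-column processing can help: fitting one column at a time scans the data once per column, while a multi-column fit can collect the statistics for every column in a single scan. The sketch below shows that idea in plain Python for mean imputation. It is an illustration only, not Spark's Imputer implementation, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def column_means_single_pass(rows: Sequence[Sequence[float]]) -> List[float]:
    """Mean of every column, skipping NaN, computed in one scan over the rows."""
    n_cols = len(rows[0])
    sums = [0.0] * n_cols
    counts = [0] * n_cols
    for row in rows:
        for j, value in enumerate(row):
            if not math.isnan(value):
                sums[j] += value
                counts[j] += 1
    return [s / c if c else math.nan for s, c in zip(sums, counts)]


def impute(rows: Sequence[Sequence[float]], means: Sequence[float]) -> List[List[float]]:
    """Replace each NaN with the mean of its column."""
    return [[means[j] if math.isnan(v) else v for j, v in enumerate(row)]
            for row in rows]


nan = math.nan
data = [
    [1.0, nan, 10.0],
    [3.0, 4.0, nan],
    [nan, 8.0, 30.0],
]
means = column_means_single_pass(data)
filled = impute(data, means)
```

With 100 columns, the per-column approach would repeat the outer loop 100 times; here the cost of reading the rows is paid once.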
Lessons Learned - 2
RDD & DataFrame are Complementary
ETL and data transformation -> DataFrame
Complex logic containing lots of aggregation -> RDD
Lessons Learned - 3
Parallelized Cross-Validation
https://bryancutler.github.io/cv-parallel/
[Chart: Serial vs. Parallel=5; x-axis 0M to 8M, y-axis 0 to 60000]
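Parallelized cross-validation evaluates several hyperparameter settings at the same time instead of one after another. In Spark this is exposed through the `parallelism` parameter of `CrossValidator`; the sketch below only mimics the structure in plain Python with a thread pool. The tiny ridge model, the data, and the function names are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Point = Tuple[float, float]


def fit_ridge(train: List[Point], reg: float) -> float:
    """Closed-form 1-D ridge regression through the origin."""
    sxy = sum(x * y for x, y in train)
    sxx = sum(x * x for x, _ in train)
    return sxy / (sxx + reg)


def mse(w: float, data: List[Point]) -> float:
    return sum((y - w * x) ** 2 for x, y in data) / len(data)


def cv_score(data: List[Point], reg: float, k: int) -> float:
    """Average validation MSE over k folds for one hyperparameter value."""
    total = 0.0
    for fold in range(k):
        valid = data[fold::k]
        train = [p for i, p in enumerate(data) if i % k != fold]
        total += mse(fit_ridge(train, reg), valid)
    return total / k


def cross_validate(data: List[Point], grid: List[float], k: int = 3,
                   parallelism: int = 1) -> Tuple[float, List[float]]:
    """Score every grid point; with parallelism > 1 the fits run concurrently."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        scores = list(pool.map(lambda reg: cv_score(data, reg, k), grid))
    best = min(range(len(grid)), key=scores.__getitem__)
    return grid[best], scores


data = [(float(x), 2.0 * x) for x in range(1, 13)]  # noiseless y = 2x (made up)
grid = [0.0, 1.0, 10.0, 100.0]
best_serial, scores_serial = cross_validate(data, grid, parallelism=1)
best_parallel, scores_parallel = cross_validate(data, grid, parallelism=5)
```

The serial and parallel runs select the same model; only the wall-clock time differs. In Spark, each concurrent fit is a separate job submitted from the driver, so the gain depends on spare cluster capacity.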
Gap
Warm start
– Resume training
– Accelerate convergence
– Stable solution
Callback after each iteration
– Early stop
– Model checkpoint
Compact Numeric Format
Algorithms labeled on the slide: ALS, GBM
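The "callback after each iteration" gap can be pictured as a training loop that hands the current iteration and metric to user-supplied hooks, any of which may ask the loop to stop. A minimal sketch in plain Python, assuming a lower-is-better validation metric; the `train`, `EarlyStop`, and `Checkpoint` names are hypothetical and do not correspond to any Spark API:

```python
from typing import Callable, List


def train(n_iters: int,
          step: Callable[[int], float],
          callbacks: List[Callable[[int, float], bool]]) -> List[float]:
    """Run `step` each iteration, then all callbacks; stop if any returns True."""
    history: List[float] = []
    for it in range(n_iters):
        metric = step(it)
        history.append(metric)
        # Call every callback (no short-circuit) so checkpoints still fire
        # on the iteration that triggers the stop.
        results = [cb(it, metric) for cb in callbacks]
        if any(results):
            break
    return history


class EarlyStop:
    """Stop when the metric has not improved for `patience` iterations."""

    def __init__(self, patience: int) -> None:
        self.patience = patience
        self.best = float("inf")
        self.stale = 0

    def __call__(self, it: int, metric: float) -> bool:
        if metric < self.best:
            self.best, self.stale = metric, 0
        else:
            self.stale += 1
        return self.stale >= self.patience


class Checkpoint:
    """Record a checkpoint every `every` iterations; never requests a stop."""

    def __init__(self, every: int) -> None:
        self.every = every
        self.saved: List[int] = []

    def __call__(self, it: int, metric: float) -> bool:
        if (it + 1) % self.every == 0:
            self.saved.append(it)  # a real callback would persist the model here
        return False


errors = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.6]  # made-up validation errors
ckpt = Checkpoint(every=2)
history = train(len(errors), lambda it: errors[it], [EarlyStop(patience=3), ckpt])
```

Training stops after the sixth iteration because the error has not beaten 0.7 for three iterations in a row, so the final value 0.6 is never reached. That trade-off is controlled by `patience`.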
ALS
ALS – Warm start
Item factors in solution T are used as the initial item factors to train solution T+1
– C: off shelves
– D: on shelves, randomized
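The warm-start step above can be sketched as a small function that builds the initial item factors for run T+1 from the factors of run T. This is an illustrative reading of the slide, in plain Python: items still on the shelves keep their old factors, items that went off the shelves are left out, and new items get random factors. The function name and the random initialization scale are assumptions.

```python
import random
from typing import Dict, Iterable, List


def warm_start_item_factors(previous: Dict[str, List[float]],
                            current_items: Iterable[str],
                            rank: int,
                            seed: int = 0) -> Dict[str, List[float]]:
    """Initial item factors for the next ALS run."""
    rng = random.Random(seed)
    init: Dict[str, List[float]] = {}
    for item in current_items:
        if item in previous:
            # Item was present in solution T: reuse its factors.
            init[item] = list(previous[item])
        else:
            # Newly listed item: random initialization (scale is an assumption).
            init[item] = [rng.gauss(0.0, 0.1) for _ in range(rank)]
    return init


# Solution T knows items A, B, C; in T+1, C is off shelves and D is new.
factors_t = {"A": [0.1, 0.2], "B": [0.3, 0.4], "C": [0.5, 0.6]}
init_t1 = warm_start_item_factors(factors_t, ["A", "B", "D"], rank=2)
```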
ALS – Warm start
[Chart: RMSE (0.8 to 1.6) over 1 to 10, ALS vs. ALS (warm start); labeled values 0.83 and 0.80]
Save ~40% training time
GBM - Life is short, you need GBM
Objective in the t-th iteration:
$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$$
– $l$: training loss, how well the model fits the training data
– $\hat{y}_i^{(t-1)}$: previous prediction
– $f_t$: base model to be added in iteration t
– $\Omega(f_t)$: regularization, controls model complexity
GBM - Impls
GBT
• Tree as base model
• First-order approximation
XGBoost
• Second-order approximation
• L1 & L2 regularization
• Shrinkage
• Column sampling
• Sparsity-aware split finding
LightGBM
• Histogram subtraction
• Discrete bins
Dedicated ML frameworks: DMLC-Rabit, MS-DMTK
Dedicated ML frameworks result in extra costs in deployment, maintenance & monitoring
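To make "second-order approximation" concrete: the loss is expanded to second order around the previous prediction, so each sample contributes a gradient g and a Hessian h, and the best constant weight for a leaf has a closed form. The sketch below assumes squared loss and a plain L2 penalty of 0.5 * reg_lambda * w^2 per leaf, which is a simplification of XGBoost's full regularizer; the function names are hypothetical.

```python
from typing import List, Tuple


def grad_hess_squared(label: float, prediction: float) -> Tuple[float, float]:
    """Gradient and Hessian of 0.5 * (prediction - label)^2 w.r.t. prediction."""
    return prediction - label, 1.0


def leaf_weight(grads: List[float], hesss: List[float], reg_lambda: float) -> float:
    """Minimizer of sum_i (g_i * w + 0.5 * h_i * w^2) + 0.5 * reg_lambda * w^2."""
    return -sum(grads) / (sum(hesss) + reg_lambda)


labels = [3.0, 5.0, 4.0]          # samples that fall into one leaf (made up)
previous = [0.0, 0.0, 0.0]        # predictions before this iteration
pairs = [grad_hess_squared(y, p) for y, p in zip(labels, previous)]
grads = [g for g, _ in pairs]
hesss = [h for _, h in pairs]

w_no_reg = leaf_weight(grads, hesss, reg_lambda=0.0)  # equals the mean residual
w_l2 = leaf_weight(grads, hesss, reg_lambda=1.0)      # shrunk toward zero
```

Without regularization the leaf weight is the mean residual (4.0); with reg_lambda = 1 it shrinks to 3.0, which shows how L2 regularization limits the step each tree takes.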
SparkGBM https://github.com/zhengruifeng/SparkGBM
Goal: a scalable and efficient GBM atop Spark
Second-order approximation
L1 & L2 regularization
Shrinkage
Column sampling
Sparsity-aware
Binned data
Histogram subtraction
Codebase legacy:
Model save/load
Periodic checkpointer
…
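Histogram subtraction, listed among the features above, relies on the fact that a parent node's gradient histogram is the sum of its children's histograms. After building the histogram for one child, the sibling's histogram follows by subtraction instead of another scan over its rows. A minimal sketch in plain Python over binned data; the helper names and toy values are assumptions, and this is not SparkGBM's actual code:

```python
from typing import List


def build_histogram(bins: List[int], grads: List[float], n_bins: int) -> List[float]:
    """Sum of gradients per discrete bin."""
    hist = [0.0] * n_bins
    for b, g in zip(bins, grads):
        hist[b] += g
    return hist


def subtract(parent: List[float], child: List[float]) -> List[float]:
    """Sibling histogram = parent histogram - child histogram."""
    return [p - c for p, c in zip(parent, child)]


bins = [0, 1, 1, 2, 0, 2]                 # bin index of one feature per row
grads = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]    # gradient per row (made up)
in_left = [True, True, False, False, True, False]  # split assignment per row

parent = build_histogram(bins, grads, n_bins=3)
left = build_histogram([b for b, m in zip(bins, in_left) if m],
                       [g for g, m in zip(grads, in_left) if m], n_bins=3)
right_by_subtraction = subtract(parent, left)

# Reference: build the right child directly, which costs a second scan.
right_direct = build_histogram([b for b, m in zip(bins, in_left) if not m],
                               [g for g, m in zip(grads, in_left) if not m], n_bins=3)
```

In practice the histogram is built for the smaller child and the larger one is derived, so the saved work grows with the imbalance of the split.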
SparkGBM - Features
Compatible with MLlib pipeline
Warm start
Early stop
User-defined functions (RDD only)
– Objective
– Evaluation
– Callback: early stopping, model checkpoint
SparkGBM – API 1
GBMRegressor & GBMClassifier
val gbmr = new GBMRegressor
gbmr.setBoostType("dart")                  // gradient boosting & DART
  .setObjectiveFunc("square")              // objective
  .setEvaluateFunc(Array("rmse", "mae"))
  .setRegAlpha(0.1)                        // regularization
  .setRegLambda(0.5)                       // regularization
  .setDropRate(0.1)
  .setEarlyStopIters(10)                   // early stop
  .setInitialModelPath(path)               // warm start
SparkGBM – API 2
GBMRegressionModel & GBMClassificationModel
val model1 = gbmr.fit(train)          // train without validation; early stop is disabled
val model2 = gbmr.fit(train, test)    // train with validation; early stop is enabled
model2.setFirstTrees(5)               // use the first 5 trees for the following computation
model2.transform(test)                // prediction
model2.setEnableOneHot(true)
model2.leaf(test)                     // feature transformation by index of leaf/path
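Feature transformation by leaf index means that each tree maps a sample to the leaf it lands in, and those leaf indices (optionally one-hot encoded) become new features for a downstream model. The sketch below illustrates the general technique in plain Python with hand-written toy trees. How SparkGBM encodes leaves internally is not shown in the slides, so the tree representation and function names here are assumptions.

```python
from typing import List, Sequence, Tuple, Union

# A tree is either a leaf id (int) or
# (feature_index, threshold, left_subtree, right_subtree).
Tree = Union[int, Tuple[int, float, "Tree", "Tree"]]


def leaf_index(tree: Tree, x: Sequence[float]) -> int:
    """Walk the tree until a leaf id is reached."""
    while not isinstance(tree, int):
        feature, threshold, left, right = tree
        tree = left if x[feature] <= threshold else right
    return tree


def leaf_features(trees: List[Tree], n_leaves: List[int],
                  x: Sequence[float], one_hot: bool) -> List[float]:
    """Leaf index per tree, either raw or one-hot encoded and concatenated."""
    indices = [leaf_index(t, x) for t in trees]
    if not one_hot:
        return [float(i) for i in indices]
    encoded: List[float] = []
    for idx, size in zip(indices, n_leaves):
        block = [0.0] * size
        block[idx] = 1.0
        encoded.extend(block)
    return encoded


trees: List[Tree] = [
    (0, 0.5, 0, (1, 2.0, 1, 2)),  # tree with 3 leaves
    (1, 1.0, 0, 1),               # tree with 2 leaves
]
x = [0.9, 1.5]
raw = leaf_features(trees, [3, 2], x, one_hot=False)
hot = leaf_features(trees, [3, 2], x, one_hot=True)
```

The one-hot form is convenient as input to a linear model, since each leaf then gets its own weight.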
SparkGBM – Performance
[Chart: Training time (sec), 0 to 3000, for MLlib-GBT, SparkGBM, and XGBoost4J]
Chart annotations: Allreduce in Rabit; Reduce & Broadcast
Future work
• Warm start in other algorithms
– Use K-Means to initialize GMM
• ALS enhancements
– Improve the solution stability
• SparkGBM enhancements
– Add features from XGBoost & LightGBM, e.g. softmax to support multi-class classification
Thank you!