"The advent of pre-trained language models such as Google’s BERT promises a high performance transfer learning (HPTL) paradigm for many natural language understanding tasks. One such task is email classification. Given the complexity of content and context of sales engagement, the lack of a standardized large corpus and benchmarks, limited labeled examples, and the heterogeneous context of intent, this real-world use case poses both a challenge and an opportunity for adopting an HPTL approach. This talk presents an experimental investigation to evaluate transfer learning with pre-trained language models and embeddings for classifying sales engagement emails arising from digital sales engagement platforms (e.g., Outreach.io).
We will present our findings on evaluating BERT, ELMo, Flair and GloVe embeddings with both feature-based and fine-tuning transfer learning implementation strategies, and their scalability on a GPU cluster with a progressively increasing number of labeled samples. Databricks’ MLflow was used to track hundreds of experiments with different parameters, metrics and models (TensorFlow, PyTorch, etc.). While this talk focuses on the email classification task, the approach described is generic and can be used to evaluate the applicability of HPTL to other machine learning tasks. We hope our findings will help practitioners better understand the capabilities and limitations of transfer learning and how to implement transfer learning at scale with Databricks for their scenarios."
2. Yong Liu
Corey Zumar
High Performance Transfer Learning for
Classifying Intent of Sales Engagement
Emails: An Experimental Study
#UnifiedAnalytics #SparkAISummit
3. Outline
• Data Science Research Objectives
• Sales Engagement Platform (SEP)
• Use Cases and Technical Challenges
• Experiments and Datasets
• Results
• MLflow Integration and Experiments Tracking
• Summary and Future Work
4. Data Science Research Objectives
• Establish a high performance transfer learning
evaluation framework for email classification
• Three research questions:
– Which embeddings and pre-trained LMs should be used?
– Which transfer learning implementation strategies (feature-based vs. fine-tuning) should be used?
– How many labeled samples are needed?
5. Sales Engagement Platform (SEP)
• A new category of software
[Diagram: Sales Reps ↔ Sales Engagement Platform (SEP) (e.g., Outreach) ↔ CRMs (e.g., Salesforce, Microsoft Dynamics, SAP)]
6. SEP Encodes and Automates Sales
Activities into Workflows/Pipelines
• Automates execution and capture of activities (e.g., emails) and records them in a CRM
• Schedules and reminds the rep when it is the right time to do manual tasks (e.g., phone call, custom manual email)
• Enables reps to perform one-on-one personalized outreach to up to 10x more prospects than before
7. Why Email Intent Classification Is
Needed
• Email content is critical for driving results for
prospecting and other stages of the sales process
• A replier’s email intent-based metric (e.g., positive,
objection, unsubscription) is much better than a
simple “reply rate”
• A/B testing using a better metric can pick winners of
the email content/template more confidently
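The intent-based reply metric described above can be sketched in a few lines of Python; the intent labels here are illustrative stand-ins, not the platform's actual taxonomy:

```python
from collections import Counter

def intent_reply_metrics(reply_intents):
    """Turn a list of predicted reply intents for one email template
    (e.g. "positive", "objection", "unsubscribe") into per-intent rates.

    Unlike a raw reply rate, these rates let an A/B test compare
    templates by positive-reply rate instead of treating every reply
    as equally valuable.
    """
    counts = Counter(reply_intents)
    total = len(reply_intents)
    return {intent: n / total for intent, n in counts.items()}
```

Two templates with identical raw reply rates can then be separated by their positive-reply rate, which is the point the slide makes.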
8. Why Email Intent Classification is
Challenging @ SEP
• Different context and players: different roles are involved throughout the sales process and across different orgs
• Limited labeled sales engagement domain emails: GDPR and privacy/compliance constraints make labeling time-consuming, and in many orgs on a SEP not possible at all
9. Why Transfer Learning?
• Using pretrained language models opens doors for high
performance transfer learning (HPTL):
– Fewer training samples
– Better accuracy
– Reduced model training time and engineering complexity
• Pretrained language models such as BERT have achieved state-of-the-art scores on the GLUE NLP benchmark leaderboard (https://gluebenchmark.com/)
– However, whether such benchmark success readily translates to practical applications is still unknown
10. A List of Pretrained LMs and
Embeddings for Experiments
• GloVe
– count-based context-free word embeddings released in 2014
• ELMo
– context-aware character-based embeddings based on a recurrent neural network (RNN) architecture, released in 2018
• Flair
– contextual string embedding released in 2018
• BERT
– state-of-the-art transformer-based deep bidirectional language
model released in late 2018 by Google
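In the feature-based strategy, embeddings from any of these models are kept frozen and pooled into a fixed-size document vector that a lightweight classifier consumes. A minimal sketch, with toy vectors standing in for real pretrained embeddings:

```python
def mean_pool(token_vectors):
    """Average per-token embedding vectors into one document vector.

    In a feature-based setup the pretrained embeddings (GloVe, ELMo,
    Flair, or BERT features) are frozen; only a small classifier on
    top of pooled vectors like this one is trained.
    """
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]
```

Fine-tuning, by contrast, updates the pretrained model's own weights during training rather than pooling frozen features.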
12. Example Intents and Emails
• Positive: “Actually, I'd be interested in talking Friday. Do you have some time around 10am?”
• Objection: “Thanks for reaching out. This is not something I am interested in at this time.”
• Unsubscribe: “Please remove me from your email list.”
• Not-sure: “Mike, in regards to? John”
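A naive keyword baseline over intents like these (labels and keywords below are illustrative, not the talk's actual model) shows why the task needs more than surface matching; such rules break as soon as wording or context shifts:

```python
# Hypothetical keyword rules for the example intents; checked in order,
# so more specific intents come first.
INTENT_KEYWORDS = {
    "unsubscribe": ["remove me", "unsubscribe", "email list"],
    "objection": ["not interested", "not something"],
    "positive": ["interested", "do you have some time"],
}

def keyword_intent(email_text):
    """Return the first intent whose keyword appears in the email,
    falling back to "not_sure" when nothing matches."""
    text = email_text.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(kw in text for kw in keywords):
            return intent
    return "not_sure"
```

A transfer-learned classifier replaces these brittle rules with representations that generalize across phrasings.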
13. Two Sets of Experiment Runs
• Using different pretrained language models (LMs)
and embeddings: feature-based vs. fine-tuning
– Using the full set of training examples
• Different labeled training sizes with the feature-based and fine-tuning approaches
– Increasingly larger training size: 50, 100, 200, 300, 500,
1000, 2000, 3000
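The second experiment set amounts to a learning-curve loop over those subset sizes. A sketch, where `train_and_eval` is a hypothetical placeholder for the real feature-based or fine-tuning training routine (its formula is synthetic, not a real result):

```python
def train_and_eval(train_subset, test_set):
    # Placeholder: a synthetic score that grows with training size.
    # A real implementation would train a model and return its f1 score.
    return min(0.95, 0.4 + 0.1 * len(train_subset) ** 0.25)

def learning_curve(train_set, test_set,
                   sizes=(50, 100, 200, 300, 500, 1000, 2000, 3000)):
    """Train on progressively larger labeled subsets and record a
    score per size, mirroring the talk's second set of runs."""
    scores = {}
    for n in sizes:
        if n > len(train_set):
            break
        scores[n] = train_and_eval(train_set[:n], test_set)
    return scores
```

Each (size, score) pair would be logged as an MLflow run so curves for different embeddings can be compared later.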
14. Result (1): Different Embeddings
• BERT fine-tuning has the best F1 score
• Among feature-based approaches, GloVe performs slightly better
• Classical ML baselines such as LightGBM + TF-IDF underperform both BERT fine-tuning and the feature-based approaches
15. Result (2): Scaling Effect with
Different Training Sample Sizes
• BERT fine-tuning outperforms all feature-based approaches when the training set is larger than 300 examples
• When the training set is small (< 100), the feature-based BERT+Flair approach performs better
• To achieve an F1 score > 0.8, BERT fine-tuning needs at least 500 training examples, while the feature-based approach needs at least 2000
16. Introducing MLflow
Open machine learning platform
• Works with any ML library & language
• Runs the same way anywhere (e.g. any cloud)
• Designed to be useful for 1 or 1000+ person orgs
• Integrates with Databricks
18. Key Concepts in Tracking
Parameters: key-value inputs to your code
Metrics: numeric values (can update over time)
Artifacts: arbitrary files, including data and models
Source: training code that ran
Version: version of the training code
Tags and Notes: any additional info
19. MLflow Tracking: Example Code
Tracking: record and query experiments (code, configs, results, etc.)
import mlflow

with mlflow.start_run():
    mlflow.log_param("layers", layers)
    mlflow.log_param("alpha", alpha)
    # train model ...
    mlflow.log_metric("mse", model.mse())
    model.plot(test_df).savefig("plot.png")  # save the figure locally first;
    mlflow.log_artifact("plot.png")          # log_artifact takes a local file path
    mlflow.tensorflow.log_model(model, "model")
21. MLflow to Manage Hundreds of
Experiments
• PyTorch models for the feature-based approach
– Using the Flair framework
• TensorFlow for BERT fine-tuning
– Using the bert-tensorhub framework
24. Images Can Be Logged as Artifacts
mlflow.log_artifact(tSNE_img, 'run_{0}'.format(run_id))  # tSNE_img is a local image file path; the second argument is the artifact subdirectory
25. Summary
• Transfer learning by fine-tuning BERT outperforms all feature-based approaches using different embeddings/pretrained LMs when the training set is larger than 300 examples
• Pretrained language models mitigate the cold-start problem when there is very little training data
– E.g., with as few as 50 labeled examples, the F1 score reaches 0.67 with BERT+Flair using the feature-based approach
• However, to reach an F1 score > 0.8, one may still need one to two thousand examples for a feature-based approach, or about 500 examples for fine-tuning a pre-trained BERT language model
• MLflow proved useful and powerful for tracking all experiments
26. Future Work
• MLflow: from experimentation to production
– Pick the best model for deployment
• Extend to cross-org transfer learning
– Using data from one or multiple orgs for training and then applying the model to other orgs