Meet TransmogrifAI, Open Source AutoML That Powers Einstein Predictions

mtovbin@salesforce.com, @tovbinm
Matthew Tovbin, Principal Engineer, Einstein

Forward Looking Statement
Statement under the Private Securities Litigation Reform Act of 1995:
This presentation may contain forward-looking statements that involve risks, uncertainties, and assumptions. If any such uncertainties
materialize or if any of the assumptions proves incorrect, the results of salesforce.com, inc. could differ materially from the results
expressed or implied by the forward-looking statements we make. All statements other than statements of historical fact could be deemed
forward-looking, including any projections of product or service availability, subscriber growth, earnings, revenues, or other financial items
and any statements regarding strategies or plans of management for future operations, statements of belief, any statements concerning
new, planned, or upgraded services or technology developments and customer contracts or use of our services.
The risks and uncertainties referred to above include – but are not limited to – risks associated with developing and delivering new
functionality for our service, new products and services, our new business model, our past operating losses, possible fluctuations in our
operating results and rate of growth, interruptions or delays in our Web hosting, breach of our security measures, the outcome of any
litigation, risks associated with completed and any possible mergers and acquisitions, the immature market in which we operate, our
relatively limited operating history, our ability to expand, retain, and motivate our employees and manage our growth, new releases of our
service and successful customer deployment, our limited history reselling non-salesforce.com products, and utilization and selling to larger
enterprise customers. Further information on potential factors that could affect the financial results of salesforce.com, inc. is included in
our annual report on Form 10-K for the most recent fiscal year and in our quarterly report on Form 10-Q for the most recent fiscal quarter.
These documents and others containing important disclosures are available on the SEC Filings section of the Investor Information section
of our Web site.
Any unreleased services or features referenced in this or other presentations, press releases or public statements are not currently
available and may not be delivered on time or at all. Customers who purchase our services should make the purchase decisions based
upon features that are currently available. Salesforce.com, inc. assumes no obligation and does not intend to update these
forward-looking statements.

Machine Learning is Hard and Even Harder for the Enterprise
Lessons our Data Scientists Learned while Building Einstein
1. Customer-specific models beat global models
2. Majority of business data is structured
3. Too many use cases, too few data scientists

1. Customer-specific Models Beat Global Models
● Customers care about data privacy
● Every customer’s data is different
Enterprise Machine Learning

2. Majority of Business Data is Structured
https://www.kaggle.com/surveys/2017

3. Too Many Use Cases, Too Few Data Scientists
The standard approach to building an ML model:
Data Prep → Feature Engineering → Feature Selection → Model Training → Model

3. Too Many Use Cases, Too Few Data Scientists
ML is exponentially harder in the Enterprise with many customer-specific models
[Figure: the pipeline Data Prep → Feat. Eng → Feat. Selection → Model Training → Model, repeated eight times, once per customer-specific model]

Introducing TransmogrifAI
Automated Machine Learning for Structured Data
Customer-specific models + Structured, transactional data + Data science at scale

Introducing TransmogrifAI
Automated Machine Learning for Structured Data
● Automated feature engineering, feature selection & model selection
● ML abstractions that improve developer productivity & collaboration
● Model explainability to improve debuggability and transparency
>90% accuracy with 100x reduction in time

What’s in a name?
transmogrify: transform in a surprising or magical manner

The AutoML Engine in the Einstein Platform
[Figure: Einstein Platform architecture, serving 5B+ predictions per day]
Applications: Lead Scoring, Engagement Scoring, Case Classification, Prediction Builder, ...
Components shown: Machine Learning (DL, TransmogrifAI), Compute, Orchestration, Data Store, Model Lifecycle Management, Data Science Experience, Configuration Services, Infrastructure, Metrics, Health Monitoring, ETL/GDPR/Data Processing

Einstein Prediction Builder
• Product: Point. Click. Predict.
• Engineering: any customer can create any number of ML applications on any data?! Impossible!

Under the Hood
● Automated Feature Engineering
● Automated Feature Selection
● Automated Model Selection

Type Hierarchy For Machine Learning
[Figure: FeatureType class hierarchy. Legend: bold - abstract type, regular - concrete type, italic - trait, solid line - inheritance, dashed line - trait mixin]
Types shown: FeatureType, OPNumeric, OPCollection, OPSet, OPList, OPVector, OPMap, NonNullable, Text, Email, Base64, Phone, ID, URL, ComboBox, PickList, TextArea, BinaryMap, IntegralMap, RealMap, TextMap, StateMap, DateList, DateTimeList, TextList, Integral, Real, RealNN, Binary, Percent, Currency, Date, DateTime, MultiPickList, City, Street, Country, PostalCode, State, Geolocation, Location, SingleResponse, MultiResponse, Categorical, Prediction, ...
https://developer.salesforce.com/docs/atlas.en-us.api.meta/api/field_types.htm

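Automatic Feature Engineering
[Figure: transmogrify() turns raw typed features into a single Feature Vector]
Input features: Email, Phone, Subject, Lat Lon, Age
Derived features: Email Is Spammy, Top Email Domains, Country Code, Phone Is Valid, Top TF-IDF Terms, City, State, Age [0-15], Age [15-35], Age [>35]
Output: Feature Vector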

Numeric – Imputation and Null value tracking
Before:
Feature
34,200.03
14,001.02
(null)
22,430.11
(null)
47,895.66
After:
Feature      Null Indicator
34,200.03    0
14,001.02    0
16,045.21    1
22,430.11    0
16,045.21    1
47,895.66    0
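The imputation slide can be illustrated with a minimal Python sketch. It is not TransmogrifAI's implementation: the function name `impute_with_null_indicator` is hypothetical, and filling with the mean is an assumption, since the slide does not say how its fill value was chosen.

```python
from statistics import mean
from typing import List, Optional, Tuple


def impute_with_null_indicator(
    values: List[Optional[float]], fill: Optional[float] = None
) -> List[Tuple[float, int]]:
    """Replace missing values and record where they were missing.

    Assumption: the default fill value is the mean of the present values.
    """
    present = [v for v in values if v is not None]
    if fill is None:
        fill = mean(present) if present else 0.0
    # Each row becomes (value, null_indicator); the indicator is 1 for imputed rows.
    return [(fill, 1) if v is None else (v, 0) for v in values]


if __name__ == "__main__":
    column = [34200.03, 14001.02, None, 22430.11, None, 47895.66]
    for value, was_null in impute_with_null_indicator(column):
        print(f"{value:>12,.2f}  {was_null}")
```

Keeping the indicator column lets a model learn from the fact that a value was missing, which would be lost if the gap were only filled in.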

Temporal: Circular Statistics
Circular distributions are those that have no true zero. They are a good fit for temporal features and deal with seasonality:
● Hours of the Day
● Weeks of the Month
● Months of the Year
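One common way to represent such circular features is to project them onto the unit circle, so that hour 23 and hour 1 end up close together. The sketch below assumes this sine/cosine encoding; the slides do not state which representation TransmogrifAI uses, and the function names are hypothetical.

```python
import math
from typing import Tuple


def circular_encode(value: float, period: float) -> Tuple[float, float]:
    """Map a periodic value (e.g. hour of day, period=24) to a point on the unit circle."""
    angle = 2 * math.pi * (value % period) / period
    return math.cos(angle), math.sin(angle)


def circular_distance(a: float, b: float, period: float) -> float:
    """Shortest distance between two values when the axis wraps around at `period`."""
    diff = abs(a - b) % period
    return min(diff, period - diff)


if __name__ == "__main__":
    late, early, noon = (circular_encode(h, 24) for h in (23, 1, 12))
    # 23:00 is closer to 01:00 than to 12:00 once the values are encoded.
    print(math.dist(late, early) < math.dist(late, noon))
```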

Automatic Feature Engineering
Numeric: Imputation; Track null value; Scaling - zNormalize, log, linear; Smart Binning
Categorical: Imputation; Track null value; One Hot Encoding; Dynamic Top K pivot
Spatial: Reverse Geocoding; Nearest POI
Temporal: Time difference; Circular Statistics; Time extraction (day, week, month, year)
Text: Language Detection; Language-wise Tokenization; Hash Encoding; Tf-Idf; Word2Vec; Name Entity Resolution; Smart Categorical
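As an illustration of the categorical row, here is a small sketch of a top-K pivot with null tracking: the K most frequent values get their own one-hot column, everything else falls into an OTHER column, and missing values get a NULL column. The fixed `k`, the column names and the function name are assumptions for the example; how TransmogrifAI picks K dynamically is not described in the slides.

```python
from collections import Counter
from typing import List, Optional, Tuple


def top_k_pivot(
    values: List[Optional[str]], k: int
) -> Tuple[List[str], List[List[int]]]:
    """One-hot encode the k most frequent values, plus OTHER and NULL columns."""
    counts = Counter(v for v in values if v is not None)
    # Sort by frequency (descending), then alphabetically so ties are deterministic.
    top = [v for v, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
    columns = top + ["OTHER", "NULL"]  # assumes no real value is named OTHER or NULL
    rows = []
    for v in values:
        key = "NULL" if v is None else (v if v in top else "OTHER")
        rows.append([1 if c == key else 0 for c in columns])
    return columns, rows


if __name__ == "__main__":
    cols, encoded = top_k_pivot(["web", "email", "web", "phone", None], k=1)
    print(cols)
    for row in encoded:
        print(row)
```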

Problems with doing Machine
Learning on Enterprise Data
1. Hindsight Bias
2. Field Usage Changes
3. Bulk Uploads
4. Field Type Abuse
5. More...

Problem #1 – Hindsight Bias (aka Label Leakage)
[Figure: Lead Before Conversion vs. Lead At Conversion]

In layman's terms, it is like Marty McFly traveling to the future, getting his hands on the Sports Almanac, and using it to bet on the games of the present.

Problem #2 – Field Usage Changes Over Time

Problem #3 – Bulk Upload by Business Workflow
A business process bulk-updated records that have a different distribution from the rest of the data, biased towards the negative outcome.

Problem #4 – Feature types abused
Typical Text Feature: free-form prose (the slide shows a block of placeholder pangram sentences)
‘Last Open Stage’ Text Feature: a small set of repeated values
align
answer
collect
contracting
negotiate
opportunity won
qualify
qualify/align

Problem #4 – Feature types abused
The "Opportunity Won" value of this feature is a leaker: it reveals the outcome/label.

Automatic Feature Selection
● Analyze every feature and output descriptive statistics
○ Mean
○ Min
○ Max
○ Variance
○ Number of Nulls
● Ensure features have acceptable ranges
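The statistics listed above can be sketched in a few lines of Python. The acceptance check is only one possible reading of "acceptable ranges": the thresholds `max_null_fraction` and `min_variance`, like the function names, are hypothetical and not taken from TransmogrifAI.

```python
from statistics import mean, pvariance
from typing import Dict, List, Optional


def describe(values: List[Optional[float]]) -> Dict[str, Optional[float]]:
    """Compute the descriptive statistics named on the slide for one feature."""
    present = [v for v in values if v is not None]
    return {
        "mean": mean(present) if present else None,
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "variance": pvariance(present) if present else None,  # population variance
        "nulls": len(values) - len(present),
    }


def is_acceptable(
    stats: Dict[str, Optional[float]],
    n_rows: int,
    max_null_fraction: float = 0.9,  # assumed threshold
    min_variance: float = 1e-12,     # assumed threshold
) -> bool:
    """Reject features that are almost always null or (nearly) constant."""
    if stats["variance"] is None or n_rows == 0:
        return False
    return stats["nulls"] / n_rows <= max_null_fraction and stats["variance"] > min_variance


if __name__ == "__main__":
    feature = [1.0, 3.0, None]
    summary = describe(feature)
    print(summary, is_acceptable(summary, len(feature)))
```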

Automatic Feature Selection
● Analyze each feature's correlation with the label: which features have the most and the least predictive power?
● Drop features with low predictive power
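A minimal sketch of correlation-based selection follows. It uses the Pearson correlation as the measure of predictive power, which is an assumption, and it also drops features whose correlation with the label is suspiciously high, on the reasoning from the hindsight-bias slides that such features are likely leakers. The thresholds `min_corr` and `max_corr` and the function names are hypothetical.

```python
import math
from typing import Dict, List


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def select_features(
    features: Dict[str, List[float]],
    label: List[float],
    min_corr: float = 0.1,   # assumed: below this, the feature is treated as noise
    max_corr: float = 0.95,  # assumed: above this, the feature is treated as a leaker
) -> Dict[str, float]:
    """Keep features whose absolute correlation with the label is within bounds."""
    kept = {}
    for name, column in features.items():
        strength = abs(pearson(column, label))
        if min_corr <= strength <= max_corr:
            kept[name] = strength
    return kept


if __name__ == "__main__":
    label = [0, 1, 0, 1]
    candidates = {"leaky": [0, 1, 0, 1], "noise": [1, 1, 0, 0], "useful": [0, 1, 1, 1]}
    print(select_features(candidates, label))
```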

Auto Bucketize
training vs scoring
Feature Lineage

Evaluating Models
Need to know the true label to evaluate the model
● Usually do a random train/holdout split on the labeled data and use cross-validation on the training set
[Figure: labeled data divided into Training set and Holdout set]
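The split described above can be sketched as follows, using only the standard library. The holdout fraction, the seed, and the simple interleaved fold assignment are assumptions for illustration; the slides do not specify how TransmogrifAI splits its data.

```python
import random
from typing import Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def train_holdout_split(
    rows: List[T], holdout_fraction: float = 0.2, seed: int = 42
) -> Tuple[List[T], List[T]]:
    """Randomly split labeled rows into a training set and a holdout set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def k_folds(rows: List[T], k: int) -> Iterator[Tuple[List[T], List[T]]]:
    """Yield (training, validation) pairs; every row is validated exactly once."""
    for i in range(k):
        validation = rows[i::k]
        training = [r for j, r in enumerate(rows) if j % k != i]
        yield training, validation


if __name__ == "__main__":
    train, holdout = train_holdout_split(list(range(10)))
    print(len(train), len(holdout))
    for fold_train, fold_val in k_folds(train, 4):
        print(len(fold_train), len(fold_val))
```

The holdout set stays untouched during cross-validation and is only used for the final evaluation.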

Evaluating Models
● Time-based evaluation dataset is the true test of how well a model is performing
○ Wait for existing (or new) records to have their label determined
○ Predict from older state of that record and compare to the true label
● Biggest problem is usually waiting for enough data to be available
● We can also switch over to constructing the model from the true event sequence rather than a snapshot
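One way to picture the time-based evaluation is a cutoff date: records whose label was already known at the cutoff are used for training, and records that existed at the cutoff but were labeled later are used for evaluation, with predictions made from their older state. The `Record` fields, the integer day numbers and the function name below are hypothetical; this is a sketch of the idea, not of any TransmogrifAI API.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Record:
    record_id: str
    snapshot_day: int          # day the feature snapshot was taken (assumed field)
    label_day: Optional[int]   # day the true label became known, None if still unknown
    label: Optional[int]


def time_based_split(
    records: List[Record], cutoff_day: int
) -> Tuple[List[Record], List[Record]]:
    """Train on labels known by the cutoff; evaluate on labels determined afterwards."""
    train = [r for r in records if r.label_day is not None and r.label_day <= cutoff_day]
    evaluate = [
        r for r in records
        if r.snapshot_day <= cutoff_day and r.label_day is not None and r.label_day > cutoff_day
    ]
    return train, evaluate


if __name__ == "__main__":
    data = [
        Record("a", snapshot_day=1, label_day=5, label=1),
        Record("b", snapshot_day=3, label_day=12, label=0),
        Record("c", snapshot_day=8, label_day=None, label=None),
    ]
    train, evaluate = time_based_split(data, cutoff_day=10)
    print([r.record_id for r in train], [r.record_id for r in evaluate])
```

Records such as "c" above, whose label is still unknown, are exactly the data the slide says one has to wait for.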

What does label leakage look like?

AutoML vs Hand Tuned – Showdown
Leakers removed by AutoML: 73
Leakers removed by data scientist hand tuning: 42
[Figure: two word clouds of the field names removed as leakers]
Department, mkto_si__Last_Interesting_Moment__c, Description, OtherPostalCode, et4ae5__Mobile_Country_Code__c, Title, mkto2__Acquisition_Program_Id__c, JigsawContactId, ReportsToId, OtherCity, pi__last_activity__c, MailingLongitude, pi__first_activity__c, AssistantPhone, HomePhone, Fax, OtherStreet, Partner_Last_Name__c, mkto_si__Last_Interesting_Moment_Desc__c, mkto2__Acquisition_Program__c, Jigsaw, Company__c, OtherLongitude, AssistantName, Salutation, OtherLatitude, Purchase_Motivation__c, Secondary_Email__c, TimetoPurchase__c, mkto_si__Last_Interesting_Moment_Source__c, MailingGeocodeAccuracy, MailingLatitude, pi__created_date__c, CommentCapture__c, Preferred_Communication_Method__c, TopPriorityValue__c, mkto_si__Last_Interesting_Moment_Type__c, OtherState, TopPriorityProcess__c, OtherCountry, MasterRecordId, OtherGeocodeAccuracy, TopPriorityProduct__c
emailbounceddate, lastcurequestdate, lastcuupdatedate, lastreferenceddate, lastvieweddate, mkto2__acquisition_date__c, mkto_si__hidedate__c, pi__grade__c, pi__notes__c, pi__utm_content__c, account_link_easy_closets__c, csat_survey_completed_date__c, csat_survey_net_promoter_score__c, csat_survey_results_link__c, birthdate, mkto_si__last_interesting_moment_date__c, pi__campaign__c, pi__comments__c, pi__first_search_term__c, pi__first_search_type__c, pi__first_touch_url__c, pi__score__c, pi__url__c, pi__utm_campaign__c, pi__utm_medium__c, pi__utm_source__c, historical_lead_score__c, pi__utm_term__c, first_activity_timestamp__c, predicted_likelihood_to_purchase_2__c, best_time_to_call_date__c, total_lead_score__c, csat_customer_service_survey_disallowed__c, referral_credit_applied__c, referral_days_til_purchase__c, predicted_likelihood_to_purchase__c, createdbyid, createddate, lastactivitydate, lastmodifieddate, last_activity_date__c, systemmodstamp

AutoML vs Hand Tuned – Showdown
Live Prediction Results

Automated Model Selection
● Many hyperparameters for each algorithm
● Automated hyperparameter tuning
○ Faster model creation with improved metrics
○ Search algorithms to find the optimal hyperparameters, e.g. grid search, random search
[Figure: Grid Search, Random Search, Bayesian Search]
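The difference between the first two search strategies can be sketched in plain Python. Grid search scores every combination in the search space, while random search scores only a fixed number of randomly drawn combinations. The parameter names `reg_param` and `max_depth` and the toy scoring function are made up for the example; a real run would train and validate a model inside `score`.

```python
import itertools
import random
from typing import Callable, Dict, List

Params = Dict[str, float]


def grid_search(space: Dict[str, List[float]], score: Callable[[Params], float]) -> Params:
    """Score every combination of hyperparameter values and return the best one."""
    names = sorted(space)
    candidates = [
        dict(zip(names, combo))
        for combo in itertools.product(*(space[n] for n in names))
    ]
    return max(candidates, key=score)


def random_search(
    space: Dict[str, List[float]],
    score: Callable[[Params], float],
    n_trials: int,
    seed: int = 0,
) -> Params:
    """Score n_trials randomly drawn combinations and return the best one."""
    rng = random.Random(seed)
    candidates = [{n: rng.choice(vs) for n, vs in space.items()} for _ in range(n_trials)]
    return max(candidates, key=score)


def toy_score(params: Params) -> float:
    """Stand-in for a validation metric; peaks at reg_param=0.1, max_depth=6."""
    return -((params["reg_param"] - 0.1) ** 2) - (params["max_depth"] - 6) ** 2


if __name__ == "__main__":
    search_space = {"reg_param": [0.01, 0.1, 1.0], "max_depth": [3, 6, 12]}
    print(grid_search(search_space, toy_score))
    print(random_search(search_space, toy_score, n_trials=4))
```

Random search trades the guarantee of finding the best grid point for a fixed, usually much smaller, number of model fits.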

Automated Model Selection
Competing algorithms and evaluation metric per problem type:
Binary Classification (AuROC): Random Forests, Decision Trees, Logistic Regression w/ ElasticNet Regularization, Naive Bayes, Gradient Boosted Trees
Regression (RMSE): Decision Trees, Random Forests, Linear Regression w/ ElasticNet Regularization
Multi-Class Classification (Accuracy): Random Forests, Decision Trees, Multinomial Logistic Regression w/ ElasticNet, Naive Bayes
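To make the competition concrete, the sketch below implements the three metrics named on the slide and a helper that picks the winning model from a table of validation scores. Note that RMSE is minimized while AuROC and accuracy are maximized. The function names and the example scores are hypothetical, and the pairwise AuROC computation is a simple quadratic-time version chosen for clarity.

```python
import math
from typing import Dict, List


def rmse(actual: List[float], predicted: List[float]) -> float:
    """Root mean squared error (lower is better)."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def accuracy(actual: List[int], predicted: List[int]) -> float:
    """Fraction of exactly matching predictions (higher is better)."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def au_roc(labels: List[int], scores: List[float]) -> float:
    """Area under the ROC curve as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def pick_winner(results: Dict[str, float], larger_is_better: bool) -> str:
    """Return the name of the model with the best validation metric."""
    chooser = max if larger_is_better else min
    return chooser(results, key=results.get)


if __name__ == "__main__":
    print(au_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
    print(pick_winner({"logistic_regression": 0.75, "random_forest": 0.9}, larger_is_better=True))
    print(pick_winner({"linear_regression": 1.2, "random_forest": 0.8}, larger_is_better=False))
```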

Different Permutation of Thresholds Leads to Different Results

How well does it work?
• TransmogrifAI empowers:
• Predictive Journeys
• Lead Scoring
• Prediction Builder
• Case Classification
• Most of the models deployed in production are completely hands free
• Serves 5B+ predictions per day

Where do WE go next?
• Deeper model & score insights – LOCO, LIME
• Hyperparameter search strategies – Bayesian, Bandit-based
• Feature engineering – text embeddings, model specific
• Model portability
• Enable more applications – recommenders, unsupervised learning
• Perf tuning, bug fixes, docs, examples
• <Your requirements / feedback>

Where do YOU go next?
• Read the blog post - https://www.sfdc.co/open-sourcing-transmogrifai
• Try it out - https://transmogrif.ai
• Reach out and contribute - https://sfdc.co/transmogrifai-contributing
• Student? Apply to Google Summer of Code (GSoC) 2019 to work with us!
• Feeling creative? We need a logo.
