Despite huge progress in machine learning over the past decade, building production-ready machine learning systems is still hard. Three years ago, when we set out to build machine learning capabilities into the Salesforce platform, we learned that building enterprise-scale machine learning systems is even harder. To solve the problems we encountered, we built TransmogrifAI (https://transmogrif.ai) (pronounced trans-mog-ri-phi), an end-to-end automated machine learning library for structured data that is used in production today to help power our Salesforce Einstein AI platform. This talk highlights key capabilities of the TransmogrifAI library and demonstrates them in action on a real-life machine learning application.
Meet TransmogrifAI, Open Source AutoML That Powers Einstein Predictions
1. Meet TransmogrifAI, Open Source AutoML That Powers Einstein Predictions
Matthew Tovbin, Principal Engineer, Einstein
mtovbin@salesforce.com, @tovbinm
2. Forward Looking Statement
Statement under the Private Securities Litigation Reform Act of 1995:
This presentation may contain forward-looking statements that involve risks, uncertainties, and assumptions. If any such uncertainties
materialize or if any of the assumptions proves incorrect, the results of salesforce.com, inc. could differ materially from the results
expressed or implied by the forward-looking statements we make. All statements other than statements of historical fact could be deemed
forward-looking, including any projections of product or service availability, subscriber growth, earnings, revenues, or other financial items
and any statements regarding strategies or plans of management for future operations, statements of belief, any statements concerning
new, planned, or upgraded services or technology developments and customer contracts or use of our services.
The risks and uncertainties referred to above include, but are not limited to, risks associated with developing and delivering new
functionality for our service, new products and services, our new business model, our past operating losses, possible fluctuations in our
operating results and rate of growth, interruptions or delays in our Web hosting, breach of our security measures, the outcome of any
litigation, risks associated with completed and any possible mergers and acquisitions, the immature market in which we operate, our
relatively limited operating history, our ability to expand, retain, and motivate our employees and manage our growth, new releases of our
service and successful customer deployment, our limited history reselling non-salesforce.com products, and utilization and selling to larger
enterprise customers. Further information on potential factors that could affect the financial results of salesforce.com, inc. is included in
our annual report on Form 10-K for the most recent fiscal year and in our quarterly report on Form 10-Q for the most recent fiscal quarter.
These documents and others containing important disclosures are available on the SEC Filings section of the Investor Information section
of our Web site.
Any unreleased services or features referenced in this or other presentations, press releases or public statements are not currently
available and may not be delivered on time or at all. Customers who purchase our services should make the purchase decisions based
upon features that are currently available. Salesforce.com, inc. assumes no obligation and does not intend to update these
forward-looking statements.
4. Machine Learning is Hard and Even Harder for the Enterprise
Lessons our Data Scientists Learned while Building Einstein
1. Customer-specific models beat global models
2. Majority of business data is structured
3. Too many use cases, too few data scientists
5. 1. Customer-specific Models Beat Global Models
- Customers care about data privacy
- Every customer's data is different
Enterprise Machine Learning
6. 2. Majority of Business Data is Structured
https://www.kaggle.com/surveys/2017
8. 3. Too Many Use Cases, Too Few Data Scientists
ML is exponentially harder in the Enterprise with many customer-specific models
[Figure: the pipeline Data Prep -> Feat. Eng -> Feat. Selection -> Model Training -> Model, repeated once per customer-specific model]
10. Introducing TransmogrifAI
Automated Machine Learning for Structured Data
- Automated feature engineering, feature selection & model selection
- ML abstractions that improve developer productivity & collaboration
- Model explainability to improve debuggability and transparency
>90% accuracy with 100x reduction in time
11. What's in a name?
transmogrify: transform in a surprising or magical manner
12. The AutoML Engine in the Einstein Platform
5B+ predictions per day
[Diagram, top to bottom]
Applications: Lead Scoring, Engagement Scoring, Case Classification, Prediction Builder, ...
Machine Learning: DL, TransmogrifAI
Einstein Platform: Compute, Orchestration, Data Store, Model Lifecycle Management, Data Science Experience, Configuration Services, Infrastructure, Metrics, Health Monitoring, ETL/GDPR/Data Processing
13. Einstein Prediction Builder
• Product: Point. Click. Predict.
• Engineering: any customer can create any number of ML applications on any data?! Impossible!
14. Under the Hood
- Automated Feature Engineering
- Automated Feature Selection
- Automated Model Selection
16. Type Hierarchy For Machine Learning
[Figure: inheritance diagram rooted at FeatureType. Legend: bold - abstract type, regular - concrete type, italic - trait, solid line - inheritance, dashed line - trait mixin]
Types shown: FeatureType, OPNumeric, OPCollection, OPSet, OPList, OPVector, OPMap, NonNullable, Text, Email, Base64, Phone, ID, URL, ComboBox, PickList, TextArea, City, Street, Country, PostalCode, State, Location, Geolocation, Integral, Real, RealNN, Binary, Percent, Currency, Date, DateTime, MultiPickList, TextList, DateList, DateTimeList, BinaryMap, IntegralMap, RealMap, TextMap, StateMap, ..., SingleResponse, MultiResponse, Categorical, Prediction
https://developer.salesforce.com/docs/atlas.en-us.api.meta/api/field_types.htm
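17. Automatic Feature Engineering
[Figure: the raw features Email, Phone, Subject, Age, Lat and Lon pass through transmogrify() and become the derived features Email Is Spammy, Top Email Domains, Country Code, Phone Is Valid, Top TF-IDF Terms, Age [0-15], Age [15-35], Age [>35] and City, State, which are combined into a single Feature Vector]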
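The idea behind the figure, choosing a default encoding from the type of each feature, can be sketched in a few lines of Python. This is not the TransmogrifAI API (the library itself is written in Scala): the `Email` and `Age` wrappers, the `vectorize` helper, the domain list and this `transmogrify` function are hypothetical stand-ins that only mirror the slide.

```python
from dataclasses import dataclass
from typing import List, Optional, Union

# Hypothetical feature types; the real library has a much richer hierarchy.
@dataclass
class Email:
    value: Optional[str]

@dataclass
class Age:
    value: Optional[float]

TOP_DOMAINS = ["gmail.com", "salesforce.com"]  # assumed "top email domains"

def vectorize(feature: Union[Email, Age]) -> List[float]:
    """Pick a default encoding based only on the feature's type."""
    if isinstance(feature, Email):
        domain = feature.value.split("@")[-1].lower() if feature.value else None
        # One slot per top domain, then "other", then "missing".
        slots = [1.0 if domain == d else 0.0 for d in TOP_DOMAINS]
        other = 1.0 if domain is not None and domain not in TOP_DOMAINS else 0.0
        missing = 1.0 if domain is None else 0.0
        return slots + [other, missing]
    if isinstance(feature, Age):
        age = feature.value
        # Buckets from the slide: [0-15], [15-35], [>35], plus a null indicator.
        if age is None:
            return [0.0, 0.0, 0.0, 1.0]
        return [float(age < 15), float(15 <= age <= 35), float(age > 35), 0.0]
    raise TypeError(f"no default vectorizer for {type(feature).__name__}")

def transmogrify(features) -> List[float]:
    """Concatenate the per-feature encodings into one feature vector."""
    vector: List[float] = []
    for f in features:
        vector.extend(vectorize(f))
    return vector

print(transmogrify([Email("jane@salesforce.com"), Age(42)]))
```

The caller never says how to encode an email or an age; the type carries that decision, which is what lets the same call work across many different customer datasets.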
21. Temporal: Circular Statistics
Circular distributions are those that have no true zero. They are well suited to temporal features and capture seasonality:
- Hours of the Day
- Weeks of the Month
- Months of the Year
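One common way to respect this circularity is to place each value on the unit circle and use its sine and cosine as features. The sketch below assumes that encoding; the slide does not say which circular statistics the library computes, and the function names are illustrative.

```python
import math
from typing import Tuple

def circular_encode(value: float, period: float) -> Tuple[float, float]:
    """Map a cyclic value (e.g. hour 0-23 with period 24) onto the unit circle."""
    angle = 2.0 * math.pi * (value % period) / period
    return (math.sin(angle), math.cos(angle))

def circular_distance(a: float, b: float, period: float) -> float:
    """Euclidean distance between two encoded values."""
    (sa, ca), (sb, cb) = circular_encode(a, period), circular_encode(b, period)
    return math.hypot(sa - sb, ca - cb)

# 23:00 and 01:00 are two hours apart on the clock, so after encoding they are
# exactly as close as 11:00 and 13:00, unlike the raw values 23 and 1.
print(circular_distance(23, 1, 24))
print(circular_distance(11, 13, 24))
```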
22. Automatic Feature Engineering
Numeric: Imputation; Track null value; Scaling - zNormalize, log, linear; Smart Binning
Categorical: Imputation; Track null value; One Hot Encoding; Dynamic Top K pivot
Spatial: Reverse Geocoding; Nearest POI
Temporal: Time difference; Circular Statistics; Time extraction (day, week, month, year)
Text: Language Detection; Language-wise Tokenization; Hash Encoding; Tf-Idf; Word2Vec; Name Entity Resolution; Smart Categorical
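Two of the techniques listed on this slide, imputation with null tracking and a top-K pivot, can be illustrated as follows. This is a minimal sketch under my own assumptions (mean imputation, an explicit "other" slot); the function names are hypothetical and not taken from the library.

```python
from collections import Counter
from typing import List, Optional

def impute_with_null_flag(values: List[Optional[float]]) -> List[List[float]]:
    """Fill missing numbers with the mean and add a 0/1 'was null' column."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present) if present else 0.0
    return [[mean, 1.0] if v is None else [float(v), 0.0] for v in values]

def top_k_pivot(values: List[Optional[str]], k: int) -> List[List[float]]:
    """One-hot encode the k most frequent categories, plus 'other' and 'null'."""
    counts = Counter(v for v in values if v is not None)
    top = [category for category, _ in counts.most_common(k)]
    rows = []
    for v in values:
        row = [1.0 if v == category else 0.0 for category in top]
        row.append(1.0 if v is not None and v not in top else 0.0)  # other
        row.append(1.0 if v is None else 0.0)                       # null
        rows.append(row)
    return rows

print(impute_with_null_flag([10.0, None, 30.0]))
print(top_k_pivot(["web", "web", "email", None, "phone"], k=1))
```

Keeping the null indicator as its own column means the model can still learn from the fact that a value was missing, even after the gap has been filled.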
24. Problems with doing Machine
Learning on Enterprise Data
1. Hindsight Bias
2. Field Usage Changes
3. Bulk Uploads
4. Field Type Abuse
5. More...
25. Problem #1 - Hindsight Bias (aka Label Leakage)
[Figure: Lead Before Conversion vs. Lead At Conversion]
26. In layman's terms, it is like Marty McFly traveling to the future, getting his hands on
the Sports Almanac, and using it to bet on the games of the present.
28. Problem #3 - Bulk Upload by Business Workflow
A business process bulk-updated records that have a different distribution, biased towards the negative outcome
29. The quick, brown fox jumps over a lazy dog. DJs flock by when MTV ax quiz prog. Junk
MTV quiz graced by fox whelps. Bawds jog, flick quartz, vex nymphs. Waltz, bad nymph,
for quick jigs vex! Fox nymphs grab quick-jived waltz. Brick quiz whangs jumpy veldt fox.
Bright vixens jump; dozy fowl quack. Quick wafting zephyrs vex bold Jim. Quick zephyrs
blow, vexing daft Jim. Sex-charged fop blew my junk TV quiz. How quickly daft jumping
zebras vex. Two driven jocks help fax my big quiz. Quick, Baz, get my woven flax
jodhpurs! "Now fax quiz Jack!" my brave ghost pled. Five quacking zephyrs jolt my wax
bed. Flummoxed by job, kvetching W. zaps Iraq. Cozy sphinx waves quart jug of bad milk.
A very bad quack might jinx zippy fowls. Few quips galvanized the mock jury box. Quick
brown dogs jump over the lazy fox. The jay, pig, fox, zebra, and my wolves quack! Blowzy
red vixens fight for a quick jump. Joaquin Phoenix was gazed by MTV for luck. A wizard's
job is to vex chumps quickly in fog. Watch "Jeopardy!", Alex Trebek's fun TV quiz game.
Woven silk pyjamas exchanged for blue quartz. Brawny gods just
Typical Text Feature vs. "Last Open Stage" Text Feature
align
answer
collect
contracting
negotiate
opportunity won
qualify
qualify/align
Problem #4 - Feature types abused
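A simple way to catch this kind of abuse is to count distinct values: a text field with only a handful of them behaves like a picklist. The check below is a sketch of that idea; the `looks_categorical` name and the cardinality threshold of 10 are assumptions, not values from the talk.

```python
from typing import Iterable, Optional

def looks_categorical(values: Iterable[Optional[str]], max_cardinality: int = 10) -> bool:
    """Treat a text column as categorical when it has few distinct non-null values."""
    distinct = {v.strip().lower() for v in values if v is not None and v.strip()}
    return 0 < len(distinct) <= max_cardinality

# The "Last Open Stage" values from the slide, repeated across many records.
stages = ["align", "answer", "collect", "contracting", "negotiate",
          "opportunity won", "qualify", "qualify/align"] * 50
# Free-form text, where almost every record is unique.
notes = [f"call notes for account {i}" for i in range(400)]

print(looks_categorical(stages))
print(looks_categorical(notes))
```

A column flagged this way would be pivoted like a categorical feature, while the free-form column would go through tokenization and hashing or TF-IDF.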
31. Automatic Feature Selection
- Analyze every feature and output descriptive statistics
  - Mean
  - Min
  - Max
  - Variance
  - Number of Nulls
- Ensure Features have acceptable ranges
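The statistics listed above are cheap to compute per feature. In the sketch below, `describe` returns them for one numeric column, and `is_acceptable` applies a hypothetical rule (not too many nulls, not constant); the thresholds are illustrative assumptions, since the slide does not state what counts as an acceptable range.

```python
import math
from typing import Dict, List, Optional

def describe(values: List[Optional[float]]) -> Dict[str, float]:
    """Mean, min, max, population variance and null count for one numeric feature."""
    present = [v for v in values if v is not None]
    n = len(present)
    if n == 0:
        nan = math.nan
        return {"mean": nan, "min": nan, "max": nan, "variance": nan, "nulls": len(values)}
    mean = sum(present) / n
    variance = sum((v - mean) ** 2 for v in present) / n
    return {"mean": mean, "min": min(present), "max": max(present),
            "variance": variance, "nulls": len(values) - n}

def is_acceptable(stats: Dict[str, float], total_rows: int,
                  max_null_fraction: float = 0.9, min_variance: float = 1e-9) -> bool:
    """Hypothetical rule: reject features that are almost all null or constant."""
    if total_rows == 0 or stats["nulls"] / total_rows > max_null_fraction:
        return False
    return stats["variance"] > min_variance

ages = [25.0, 31.0, None, 40.0]
stats = describe(ages)
print(stats)
print(is_acceptable(stats, total_rows=len(ages)))
```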
32. Automatic Feature Selection
- Analyse each feature's correlation with the label: which features have the most and the least predictive power?
- Drop features with low predictive power
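As an illustration of this step, the sketch below scores each feature by its Pearson correlation with the label and keeps only those above a cutoff. Pearson correlation and the 0.1 cutoff are my assumptions; the slide names neither the statistic nor a threshold, and the feature names are made up.

```python
import math
from typing import Dict, List

def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; a constant column is treated as carrying no signal."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def select_features(features: Dict[str, List[float]], label: List[float],
                    min_abs_corr: float = 0.1) -> Dict[str, float]:
    """Return the features whose |correlation with the label| meets the cutoff."""
    kept = {}
    for name, column in features.items():
        corr = pearson(column, label)
        if abs(corr) >= min_abs_corr:
            kept[name] = corr
    return kept

label = [0, 0, 1, 1, 0, 1]
features = {
    "num_emails_opened": [1, 0, 5, 4, 1, 6],  # tracks the label closely
    "constant": [3, 3, 3, 3, 3, 3],           # no variance at all
    "noise": [1, 2, 1, 2, 3, 3],              # uncorrelated with the label
}
print(select_features(features, label))
```

The same correlation scores can be read in the other direction: a feature that is almost perfectly correlated with the label is a candidate for the hindsight bias problem described earlier.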
34. Evaluating Models
Need to know the true label to evaluate the model
- Usually do a random train/holdout split on the labeled data and use cross-validation on the training set
[Figure: labeled data split into a Training set and a Holdout set]
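The split described here can be sketched with the standard library alone. The 20% holdout fraction, the fixed seed and the function names are illustrative assumptions.

```python
import random
from typing import Iterator, List, Sequence, Tuple

def train_holdout_split(rows: Sequence, holdout_fraction: float = 0.2,
                        seed: int = 42) -> Tuple[List, List]:
    """Shuffle the labeled rows and set a fraction aside as the holdout set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]

def k_folds(train: List, k: int = 3) -> Iterator[Tuple[List, List]]:
    """Yield (fit, validate) pairs; each training row is validated exactly once."""
    for i in range(k):
        validate = train[i::k]
        fit = [row for j, row in enumerate(train) if j % k != i]
        yield fit, validate

rows = list(range(100))
train, holdout = train_holdout_split(rows)
print(len(train), len(holdout))
for fit, validate in k_folds(train, k=4):
    print(len(fit), len(validate))
```

Cross-validation only ever sees the training set, so the holdout set remains an untouched check on the model that is finally selected.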
35. Evaluating Models
- Time-based evaluation dataset is the true test of how well a model is performing
- Wait for existing (or new) records to have their label determined
- Predict from older state of that record and compare to the true label
- Biggest problem is usually waiting for enough data to be available
- We can also switch over to constructing the model from the true event sequence rather than a snapshot
41. Automated Model Selection
- Many hyperparameters for each algorithm
- Automated hyperparameter tuning
- Faster model creation with improved metrics
- Search algorithms to find the optimal hyperparameters, e.g. grid search, random search
[Figure: Grid Search, Random Search, Bayesian Search]
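The difference between the first two strategies can be shown with a toy example: grid search scores every combination, while random search scores a fixed number of sampled ones. The search space and `toy_score` below are hypothetical; in practice the score would be a cross-validated metric. Bayesian search is not covered by this sketch.

```python
import itertools
import random
from typing import Callable, Dict, List

Params = Dict[str, int]

def grid_search(space: Dict[str, List[int]], score: Callable[[Params], float]) -> Params:
    """Evaluate every combination in the grid and return the best one."""
    names = list(space)
    candidates = [dict(zip(names, combo)) for combo in itertools.product(*space.values())]
    return max(candidates, key=score)

def random_search(space: Dict[str, List[int]], score: Callable[[Params], float],
                  n_trials: int, seed: int = 0) -> Params:
    """Evaluate n_trials randomly sampled combinations and return the best one."""
    rng = random.Random(seed)
    candidates = [{name: rng.choice(options) for name, options in space.items()}
                  for _ in range(n_trials)]
    return max(candidates, key=score)

# Hypothetical search space; the toy score peaks at max_depth=6, num_trees=50.
space = {"max_depth": [3, 6, 12], "num_trees": [10, 50, 100]}

def toy_score(params: Params) -> float:
    return -abs(params["max_depth"] - 6) - abs(params["num_trees"] - 50) / 10

print(grid_search(space, toy_score))
print(random_search(space, toy_score, n_trials=4))
```

Grid search costs one evaluation per combination (nine here), so it grows quickly with each added hyperparameter; random search keeps the budget fixed at the price of possibly missing the optimum.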
42. Automated Model Selection
Compete Algorithms
Binary Classification (AuROC): Random Forests; Decision Trees; Logistic Regression w/ ElasticNet Regularization; Naive Bayes; Gradient Boosted trees
Regression (RMSE): Decision Trees; Random Forests; Linear Regression w/ ElasticNet Regularization
Multi-Class Classification (Accuracy): Random Forests; Decision Trees; Multinomial Logistic Regression w/ ElasticNet; Naive Bayes
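For the binary classification case, letting algorithms compete comes down to scoring each candidate on the same validation data and keeping the one with the best AuROC. The sketch below computes AuROC directly from its definition; the candidate names and their scores are made-up numbers for illustration, not results from the talk.

```python
from typing import Dict, List

def au_roc(labels: List[int], scores: List[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def pick_winner(candidates: Dict[str, List[float]], labels: List[int]) -> str:
    """Return the name of the candidate with the highest AuROC."""
    return max(candidates, key=lambda name: au_roc(labels, candidates[name]))

labels = [0, 0, 1, 1]
# Hypothetical validation scores produced by two already-trained candidates.
candidates = {
    "logistic_regression": [0.1, 0.4, 0.35, 0.8],
    "random_forest": [0.2, 0.3, 0.6, 0.9],
}
for name, scores in candidates.items():
    print(name, au_roc(labels, scores))
print(pick_winner(candidates, labels))
```

The pairwise comparison is quadratic in the number of rows, which is fine for an illustration; a production implementation would sort the scores instead.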
45. How well does it work?
• TransmogrifAI powers:
  • Predictive Journeys
  • Lead Scoring
  • Prediction Builder
  • Case Classification
• Most of the models deployed in production are completely hands free
• Serves 5B+ predictions per day
46. Where do WE go next?
• Deeper model & score insights - LOCO, LIME
• Hyperparameter search strategies - Bayesian, Bandit-based
• Feature engineering - text embeddings, model specific
• Model portability
• Enable more applications - recommenders, unsupervised learning
• Perf tuning, bug fixes, docs, examples
• <Your requirements / feedback>
47. Where do YOU go next?
• Read the blog post - https://www.sfdc.co/open-sourcing-transmogrifai
• Try it out - https://transmogrif.ai
• Reach out and contribute - https://sfdc.co/transmogrifai-contributing
• Student? Apply to Google Summer of Code (GSoC) 2019 to work with us!
• Feeling creative? We need a logo.