Today, an increasingly large number of products use machine learning and AI to deliver a great personalized user experience, and workplace software is no exception. Spoke goes beyond traditional ticketing with a friendly, AI-powered chatbot that gives workplace teams hours of time back by automatically responding to questions on Slack, email, SMS, and the web. Learn how Spoke uses MongoDB to train models dynamically in real time from user-interaction data and to serve thousands of models, with multiple customized models per client.
2. What is Spoke
Spoke is a modern ticketing system for managing workplace requests that uses
machine learning to automatically answer questions and assign requests to the
right teams.
● Started in August 2016.
● Based in SF. Funded by Greylock and Accel.
● Small, fast-moving engineering team.
4. Spoke: Overview of Challenges
● Problems: natural language processing
● Challenge: customized machine learning models for every client
○ Need to learn quickly (near real time) from user interactions
○ 1000s of ML models
● MongoDB: very useful in scaling up our ML
6. Machine Learning Problem: Team Triaging
● Problem: pick the right team based on the text and context of the request
● Challenge: each client has different teams, so pretraining is not possible;
the system must learn from demonstration
● Implication => a separate ML model for each client
7. Traditional ML Pipeline vs. Adaptive ML Pipeline
● Claim: most ML-driven early teams and startups are in the second bucket
○ Low-data and low-query-volume domain
○ Startups must build quickly and adapt to users to show utility
8. Adapting with Online Machine Learning
● Online learning: update the model at each time step, as the data arrives
sequentially
● For its first year, Spoke built quickly by using online learning to deliver a
slick product experience
○ Users see the utility because the system learns in real time!
● Easy to serve and scale using MongoDB
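The online update described above can be sketched as a tiny logistic model trained one interaction at a time. This is an illustrative Node.js example, not Spoke's actual code; the feature names and learning rate are assumptions:

```javascript
// Online learning sketch: a binary logistic model whose sparse weights
// are updated by one SGD step per arriving example.
function predict(weights, features) {
  // Probability of class 1 for a sparse {feature: value} object.
  let z = 0;
  for (const [f, v] of Object.entries(features)) z += (weights[f] || 0) * v;
  return 1 / (1 + Math.exp(-z));
}

function update(weights, features, label, lr = 0.1) {
  // One gradient step on a single example; label is 0 or 1.
  const error = predict(weights, features) - label;
  for (const [f, v] of Object.entries(features)) {
    weights[f] = (weights[f] || 0) - lr * error * v;
  }
}

const weights = {};
// The model improves immediately after each user interaction:
for (let i = 0; i < 200; i++) {
  update(weights, { reset: 1, password: 1 }, 1);   // e.g. routed to IT
  update(weights, { payroll: 1, benefits: 1 }, 0); // e.g. routed to HR
}
```

Because each update touches only the features in one example, the model object stays small and can be written back to MongoDB after every interaction.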
10. Storing ML Models in MongoDB
● Simple schema for storing client ML models
for team routing and other product features
● Sub-500 ms fetch for models up to 5 MB
● Tip: keep this collection on a separate shard
to isolate the rest of the application DB from
model traffic
clientMLModelSchema:
{
  client: {
    ref: <clientId>
  },
  onlineModels: {
    teamRouting: {
      model: {}
    },
    …
  },
}
11. Online Learning: Tips
● Use feature hashing to get a bounded model size
○ E.g. a linear model with a hash size of 10k for a 5-way classification problem
=> model size = 200 KB (5 classes × 10k buckets × 4-byte weights)
● Load test your setup to ensure it works at your QPS
● Gotchas:
○ Concurrency: if two training events arrive at the same time, one may be
ignored (can be avoided using queues)
○ No guarantees for deep neural nets
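Feature hashing as described above can be sketched as follows; the hash function (djb2) and sizes are illustrative, not Spoke's actual implementation:

```javascript
// Feature-hashing sketch: map an unbounded token vocabulary into a fixed
// number of buckets so the stored model size stays bounded.
const HASH_SIZE = 10000;  // buckets, as in the 10k example above
const NUM_CLASSES = 5;

function bucket(token) {
  // Simple deterministic string hash (djb2 variant) into [0, HASH_SIZE).
  let h = 5381;
  for (let i = 0; i < token.length; i++) {
    h = ((h * 33) ^ token.charCodeAt(i)) >>> 0;
  }
  return h % HASH_SIZE;
}

// One fixed-size weight vector per class: 5 * 10k weights.
const weights = Array.from({ length: NUM_CLASSES }, () =>
  new Float32Array(HASH_SIZE)
);

function score(tokens, cls) {
  return tokens.reduce((s, t) => s + weights[cls][bucket(t)], 0);
}

// 5 classes * 10k buckets * 4-byte floats = 200 KB, matching the slide.
const modelBytes = NUM_CLASSES * HASH_SIZE * 4;
```

Unseen tokens hash into existing buckets, so new vocabulary never grows the stored document past the 200 KB bound.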
12. Augmenting TensorFlow with MongoDB
● Years later, we developed a batch training environment with TensorFlow
○ Batch learning is maintainable and retrainable, and allows deep NNs
● Still, online learning provides a better UX due to immediate updates
● Achieved a good compromise: use the Mongo-based online learning model
for the first few hundred responses, then silently switch to batch
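The handoff above can be sketched as a small routing function; the threshold value and object shapes are illustrative assumptions, not Spoke's actual code:

```javascript
// Online-to-batch handoff sketch: serve the Mongo-backed online model for
// a client's first few hundred responses, then silently switch to the
// batch-trained (TensorFlow) model once one exists.
const SWITCH_THRESHOLD = 300; // "first few hundred responses"; illustrative

function pickModel(client, onlineModel, batchModel) {
  // Fall back to the online model while batch training hasn't produced
  // a model yet, or while the client has too little data.
  if (batchModel === null || client.responseCount < SWITCH_THRESHOLD) {
    return onlineModel;
  }
  return batchModel;
}
```

The switch is invisible to users: both models answer the same routing question, only the serving path changes.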
17. Example: Text Classification
Data Model 1 (key-value): raw text input for the whole corpus, keyed by
document hash (e.g. 0b917217ae, 7fef14c0b3, cb9eadad9a)

=> TF-IDF Vectorization =>

Data Model 2 (tabular): a matrix with one row per article and one column per
word in the article

          word1  word2  word3
article1      1      0      2
article2      0      1      0
article3      0      1      1
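The Data Model 1 to Data Model 2 step can be sketched as follows; the toy corpus and the whitespace tokenizer are illustrative:

```javascript
// TF-IDF sketch: turn the key-value corpus (Data Model 1) into the tabular
// matrix of Data Model 2: one row per article, one column per word.
const corpus = {
  "0b917217ae": "join embed join",
  "7fef14c0b3": "embed shard",
};

const docs = Object.values(corpus).map((t) => t.split(" "));
const vocab = [...new Set(docs.flat())].sort();
const nDocs = docs.length;
// Document frequency: in how many documents each word appears.
const df = Object.fromEntries(
  vocab.map((w) => [w, docs.filter((d) => d.includes(w)).length])
);

function tfidfRow(text) {
  // Term frequency times inverse document frequency, per vocabulary word.
  const tokens = text.split(" ");
  return vocab.map((w) => {
    const tf = tokens.filter((t) => t === w).length;
    return tf * Math.log(nDocs / df[w]);
  });
}

const matrix = Object.fromEntries(
  Object.entries(corpus).map(([id, text]) => [id, tfidfRow(text)])
);
```

Note how words that occur in every document (here "embed") get an IDF of zero and carry no signal, which is exactly why TF-IDF beats raw counts.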
18. Example: Text Classification
Data Model 2 (tabular): a matrix with one row per article and one column per
word in the article

          word1  word2  word3
article1      1      0      2
article2      0      1      0
article3      0      1      1

=> LDA Topic Extraction =>

Data Model 3 (JSON documents): extract keywords and topics and enrich

{
  "_id": "0b917217ae",
  "title": "Document Model Design Patterns",
  "text": blob,
  "topics": [ "Models", "MVC" ],
  "top_words": [ "join", "embed", "one-to-many" ],
  "model": {
    "location": ...,
    "last_updated": Timestamp("05-29-19 00:00:00"),
    "confidence": Decimal128("0.9123"),
    ...
  },
  ...
}
19. Example: Text Classification & Graph Traversal
Data Model 3 (JSON documents): extract keywords and topics and enrich

=> Hierarchical Clustering =>

Data Model 4 (graph): the tree/hierarchy of topics modeled as a graph,
traversable with $graphLookup

{
  "_id": "0b917217ae",
  "title": "Document Model Design Patterns",
  "text": blob,
  "parent": "Databases",
  "topics": [ "Models", "MVC" ],
  "top_words": [ "join", "embed", "one-to-many" ],
  "model": {
    "location": ...,
    "last_updated": Timestamp("05-29-19 00:00:00"),
    "confidence": Decimal128("0.9123"),
    ...
  },
  ...
}

db.topics.insert( { _id: "Models", parent: "Databases" } )
db.topics.insert( { _id: "Storage", parent: "Databases" } )
db.topics.insert( { _id: "MVCC", parent: "Databases" } )
db.topics.insert( { _id: "Databases", parent: "Programming" } )
db.topics.insert( { _id: "Languages", parent: "Programming" } )
db.topics.insert( { _id: "Programming", parent: null } )

Topic tree:
Programming
├─ Languages
└─ Databases
   ├─ Models
   ├─ Storage
   └─ MVCC
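The hierarchy built by those inserts can be walked upward with $graphLookup; a mongo-shell sketch against that topics collection (requires a running deployment):

```javascript
// Find the chain of ancestors for the "Models" topic by repeatedly
// following each document's parent field back to the root.
db.topics.aggregate([
  { $match: { _id: "Models" } },
  { $graphLookup: {
      from: "topics",
      startWith: "$parent",
      connectFromField: "parent",
      connectToField: "_id",
      as: "ancestors"
  } }
])
// For "Models", ancestors contains the "Databases" and "Programming" docs.
```

Modeling the tree as parent references keeps each topic document small while still allowing the whole ancestry to be fetched in a single aggregation.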
20. Indexing in MongoDB

Index Types
• Primary Index
  – Every collection has a primary key index
• Compound Index
  – Index against multiple keys in the document
• Multikey Index
  – Index into arrays
• Text Indexes
  – Support for text searches
• Geospatial Indexes
  – 2d & 2dsphere indexes for spatial geometries
• Hashed Indexes
  – Hash-based values for sharding

Index Features
• TTL Indexes
  – Single-field indexes; when expired, delete the document
• Unique Indexes
  – Ensure a value is not duplicated
• Partial Indexes
  – Expression-based indexes, allowing indexes on subsets of data
• Case-Insensitive Indexes
  – Support text search using case-insensitive search
• Sparse Indexes
  – Only index documents which have the given field
21. Scalability & Distributed Processing
Process large volumes of data in parallel across a sharded cluster:
queries and aggregations run in parallel, and data is returned in parallel.
• Automatically scale beyond the constraints of a single node
• Optimized for query patterns and data locality
• Transparent to applications and tools
22. Intelligent Data Distribution: Workload Isolation
Enable different workloads (transactional, operational analytics, and
ML & AI) on the same data, served by a single replica set.
• Combine operational and analytical workloads on a single platform
• No data movement or duplication
• Extract insights in real time to enrich applications
• MongoDB Atlas – Analytics Nodes
25. Text Indexing
Index any field whose value is a string or an array of string elements.
A collection can only have one text search index, but that index can cover
multiple fields.
Optionally specify a language.
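A mongo-shell sketch of creating such an index (the collection and field names here are illustrative):

```javascript
// One text index per collection, but it may cover multiple string fields.
// The language option controls stemming and stop-word handling.
db.articles.createIndex(
  { title: "text", body: "text" },
  { default_language: "english" }
)
```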
26. Text Matching
$text tokenizes the search string using whitespace and most punctuation as
delimiters, and performs a logical OR of all such tokens in the search string.
● Search for a single word
● Match any of the words
● Search for a phrase
● Negations
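Each of those query forms looks like this in the mongo shell (collection name illustrative; requires a text index):

```javascript
// Single word
db.articles.find({ $text: { $search: "mongodb" } })

// Any of the words: tokens are OR-ed together
db.articles.find({ $text: { $search: "mongodb index shard" } })

// Exact phrase: wrap it in escaped quotes
db.articles.find({ $text: { $search: "\"document model\"" } })

// Negation: matches "mongodb" but excludes documents containing "sharding"
db.articles.find({ $text: { $search: "mongodb -sharding" } })
```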
27. Scoring and Sorting: Control Search Results with Weights
Weight is the significance of the field (default = 1).
For each indexed field, MongoDB multiplies the number of matches by the
weight and sums the results => the score of the document.
Use "textScore" metadata for projections, sorts, and conditions in stages
subsequent to the $match stage that includes the $text operation.
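The weighted-sum rule can be sketched in plain JavaScript; MongoDB's real textScore also applies stemming and normalization, so this is a simplification with illustrative field weights:

```javascript
// Scoring sketch: per indexed field, multiply the number of matched
// terms by the field's weight and sum across fields.
const WEIGHTS = { title: 10, body: 2, keywords: 6 }; // illustrative

function textScore(doc, queryTerms) {
  let score = 0;
  for (const [field, weight] of Object.entries(WEIGHTS)) {
    const tokens = (doc[field] || "").toLowerCase().split(/\s+/);
    const matches = queryTerms.reduce(
      (n, t) => n + tokens.filter((tok) => tok === t).length, 0);
    score += weight * matches;
  }
  return score;
}

const doc = { title: "reset your password", body: "how to reset a password" };
// "password" matches once in title (weight 10) and once in body (weight 2).
```

A title match is worth five body matches here, which is how weights let you steer ranking toward the most informative fields.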
28. Spoke: Knowledge Base Search with ML
● A user asks Spoke a question and expects a real-time response
○ Search for the best knowledge-base answer among 1000s of answers
● We use a combination of ML algorithms to determine the right answer
○ Scoring each answer independently is not an option due to latency
● Candidate generation to the rescue!
29. How Spoke Uses Text Search
● Use MongoDB text search to select the top k highest-scoring articles
● Only run the expensive ML-based search on those k articles
○ Works as long as the right answer is in the top k
○ Allows us to build the latest ML algos without worrying too much about latency
● Tip: set your MongoDB text index weights carefully by fine-tuning, e.g.
{ title: 10,
  body: 2,
  keywords: 6, ... }
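The two-stage pattern can be sketched as follows; cheapScore stands in for the MongoDB text index and mlScore for Spoke's ML models, and all names are illustrative:

```javascript
// Candidate generation sketch: a cheap text score picks the top-k
// articles, then only those k are reranked by an expensive ML scorer.
function cheapScore(article, query) {
  const words = article.text.split(" ");
  return query.split(" ").filter((w) => words.includes(w)).length;
}

function bestAnswer(articles, query, mlScore, k = 2) {
  const candidates = [...articles]
    .sort((a, b) => cheapScore(b, query) - cheapScore(a, query))
    .slice(0, k);
  // The expensive scorer runs on k articles only, keeping latency bounded.
  return candidates.reduce((best, a) =>
    mlScore(a, query) > mlScore(best, query) ? a : best);
}

const articles = [
  { id: 1, text: "reset password email" },
  { id: 2, text: "vpn setup guide" },
  { id: 3, text: "password policy rules" },
  { id: 4, text: "office snacks" },
];
// A toy stand-in for the ML reranker:
const exampleMlScore = (a) => (a.text.includes("reset") ? 1 : 0.5);
```

Latency now scales with k rather than with the size of the knowledge base, which is what makes heavier ML models affordable per query.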
31. What Spoke is working on
● Understand user queries and take actions
○ "I need access to Salesforce" => "issue_license(user, software=salesforce)"
● Assign custom labels to user questions
○ Allow customers to add labels to their requests
■ {"hardware", "software", "licensing", "urgent"},
■ {"benefits", "payroll", "immigration"}
○ Specific custom labels for each client are stored in MongoDB
○ Automatically predict the right labels for requests
32. What MongoDB is working on: Full Text Search (Beta)
● Based on Apache Lucene 8
● Integrated into MongoDB Atlas
● Separate process co-located with mongod
● Shard-aware
● Indexing starts with a collection scan, then maintains a steady state
How do I use it?
● Create a cluster on MongoDB Atlas using 4.2 RC (M30+)
● Create a Full Text Search index via the MongoDB Atlas UI or API
● Query the index via the $searchBeta operator using MongoDB Compass or the
shell, or add it to your existing aggregation pipelines
33. What MongoDB is working on: Atlas Data Lake (Beta)
● Serverless: no infrastructure to set up and manage
● Usage-based pricing: only pay for the queries you run
● On-demand: no need to load data; bring your own S3 bucket
● Auto-scalable: parallel execution delivers performance for large and
complex queries across multiple user sessions
● Multi-format: JSON, BSON, CSV, TSV, Avro, Parquet
● Integrated with Atlas: users are managed by Atlas and enabled via the
Atlas console
● The best tools to work with your data: the MongoDB Query Language
enables flexible and efficient data access; integrates with Compass, the
MongoDB Shell, and MongoDB drivers
34. What MongoDB is working on: Atlas Data Lake (Beta)
(Diagram: a replica set with a primary, secondaries, and analytics nodes,
alongside a Data Lake; transactional in-app analytics, operational analytics,
aggregations, and machine learning & AI workloads each run against the
appropriate nodes.)