Helixa uses serverless machine learning architectures to power an audience intelligence platform. It ingests large datasets and uses machine learning models to provide insights. Helixa's machine learning system is built on AWS serverless services like Lambda, Glue, Athena and S3. It features a data lake for storage, a feature store for preprocessed data, and uses techniques like map-reduce to parallelize tasks. Helixa aims to build scalable and cost-effective machine learning pipelines without having to manage servers.
Serverless machine learning architectures at Helixa
1. Serverless machine learning architectures at Helixa
Data Science Milan meetup
15th December, 2020
Gianmario Spacagna, Luc Mioulet
AI team at Helixa
2. About Us
Gianmario Spacagna
@gm_spacagna
MBA, MSc in Software Engineering of
Distributed Systems
Chief Scientist, Helixa
Luc Mioulet
@lmioulet
PhD Signal Processing
Machine Learning Engineer, Helixa
3. In the next hour you will learn about
1. Overview of serverless services in AWS
2. The Helixa ML system powering a platform used
by thousands of marketers around the globe
3. Map/reduce serverless architectures
4. Cloud Providers Disclaimer
The following examples focus on the AWS stack, but other cloud providers offer similar services.
Comparing different cloud solutions is beyond the scope of this talk.
5. AWS Disclaimer
The content of this talk was updated a month ago, before the recent changes introduced by Amazon.
7. Dynamic allocation of resources by the cloud provider
Traditional serverful way vs serverless way.
Source: https://serverless-stack.com/chapters/what-is-serverless.html
8. Philosophy behind serverless
"If a tree falls in a forest and no one is around to hear it, does it make a sound?"
"If a server runs in the cloud and no one is around to use it, does it need to incur any costs?"
WinterClouds
9. Major serverless services available in AWS
▪ Fargate: Docker container execution.
▪ Lambda: script execution in response to events.
▪ Step Functions: orchestration of components and microservices.
▪ SQS / SNS: queuing and publisher/subscriber messaging services.
▪ DynamoDB: NoSQL key-value database.
▪ API Gateway: REST API management service.
▪ Athena: query service to analyze data at scale using standard SQL (like PrestoDB).
▪ Glue: ETL service to crawl and process large datasets on a fully managed Spark environment.
Full list available at https://aws.amazon.com/serverless/
10. Lambda function: listing files in a specified S3 directory
Python script, event object and result object (shown on slide).
Lambda cost: $1.04 / million requests
S3 LIST request cost: $5 / million requests
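The slide's Python script is shown only as an image; a minimal sketch of such a Lambda, assuming the event carries illustrative `bucket` and `prefix` keys, could look like:

```python
def list_keys(event, s3_client=None):
    """Return all object keys under event['prefix'] in event['bucket']."""
    if s3_client is None:
        import boto3  # lazy import so the listing logic is testable offline
        s3_client = boto3.client("s3")
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    # Paginate so results are not capped at the 1000-key limit of one LIST call.
    for page in paginator.paginate(Bucket=event["bucket"], Prefix=event["prefix"]):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return {"keys": keys, "count": len(keys)}


def lambda_handler(event, context):
    # Entry point configured for the Lambda function
    return list_keys(event)
```

Each paginated page is one billable LIST request, which is why the S3 LIST cost above matters at scale.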
14. About Helixa
Helixa is an audience intelligence platform that uses machine learning to provide accurate and timely consumer insights for modern market research.
15. Audience:
Size: 1.5M / 223M represented population
Top Influencers: François Chollet (201x), Ben Hamner (114x), George Hotz (106x)
Top Media: Cifar News (65x), The Hacker News (31x), AngelList (28x)
Top Products and Companies: Tensorflow (107x), Waymo (66x), Airbnb Engineering (55x)
Demographics: 18-40 years old, male, U.S. and India
16. Platform Requirements
▪ Multiple datasets
▪ Accurate consumer insights
▪ Fast real-time analytics
▪ Always available
▪ Minimum infrastructure maintenance
▪ Cost effective
18. Helixa end-to-end pipeline
Data Contents → Data Processing → Data Integrations → Common Data Model, feeding the machine learning jobs:
▪ Embedding
▪ Entity Resolution
▪ Taxonomy Categorization
▪ Users Digital DNA
▪ Traits Classifiers
▪ Latent Interests Augmentation
▪ Audience Projection
These power the Insights Engine, real-time analytics applications and other analytics tools.
19. Helixa architecture
Data ingestions feed a data lake. ML pipelines combine ML cloud services, pre-trained models, external APIs and ML libraries, and publish to a model repository. Batch jobs populate the production DB, which serves microservices and analytics applications.
20. Tech stack and tools
▪ Batch inference
▪ Model repository and evaluation metrics
▪ Training and hyper-parameter tuning
▪ Analysis and research
▪ ML libraries
▪ Data labeling
▪ Feature store
▪ Feature engineering
▪ Data lake
In this talk we will focus on a subset of these layers.
23. Data Lake(house) using Glue and Athena
Artifacts are saved in S3 and crawled by Glue.
Athena is used to build logical views on top of them, such as:
▪ Retrieve the latest version of the artifact
▪ Aggregate multiple partitions of the same artifact
▪ Filter and merge with other tables
▪ Export snapshots of the views as versioned parquet datasets
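A hedged sketch of how such a "latest version" view could be defined through Athena from Python; the view/table names are illustrative, not Helixa's actual schema:

```python
def latest_partition_view_sql(view, table, ts_col="timestamp"):
    """SQL for a view exposing only the most recent partition of a table.
    The partition column is double-quoted because `timestamp` is a
    reserved word in Athena/Presto."""
    return (
        f'CREATE OR REPLACE VIEW {view} AS '
        f'SELECT * FROM {table} '
        f'WHERE "{ts_col}" = (SELECT max("{ts_col}") FROM {table})'
    )


def run_athena_query(sql, database, output_s3):
    """Submit the query; Athena writes results to the given S3 location."""
    import boto3  # lazy import: only needed when talking to real AWS
    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )
    return resp["QueryExecutionId"]
```

Because Glue-crawled partitions register as columns, the same pattern also covers the "aggregate multiple partitions" case by relaxing the WHERE clause.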
24. Feature Store Partitions (X)
S3 bucket
├── users
│   └── features
│       ├── feature_family=text_embedding
│       │   ├── timestamp=2020-10-14-12-58
│       │   │   ├── _metadata.json
│       │   │   ├── part000.parquet
│       │   │   ├── part001.parquet
│       │   │   └── …
│       │   └── timestamp=2020-09-18-18-35
│       │       └── ...
│       ├── feature_family=picture_embedding
│       │   └── ...
│       └── feature_family=category_counts
│           └── ...
├── items
└── other entities
Parquet data indexed by user_id.
_metadata.json contains info on how the features were created.
Partitioned by the set of features generated by the same job (feature_family) and by creation time (timestamp).
25. Label Store Partitions (y)
S3 bucket
├── users
│   └── labels
│       ├── variable=gender
│       │   ├── source=first_name
│       │   │   └── timestamp=2020-10-14-12-58
│       │   │       ├── _metadata.json
│       │   │       ├── part000.parquet
│       │   │       ├── part001.parquet
│       │   │       └── …
│       │   └── source=public_profile
│       │       └── ...
│       └── variable=age
├── items
└── other entities
Partitioned by the variable we are trying to predict, then by the source of ground truth.
Label management for weak supervision is done via Snorkel.
26. Prediction Store Partitions (y_pred)
S3 bucket
├── users
│   └── predictions
│       ├── variable=gender
│       │   ├── model=xgbc
│       │   │   └── timestamp=2020-11-05-17-22
│       │   │       ├── _metadata.json
│       │   │       ├── part000.parquet
│       │   │       ├── part001.parquet
│       │   │       └── …
│       │   └── model=cnn
│       │       └── ...
│       └── variable=age
│           └── ...
├── items
└── other entities
Partitioned by the identifier of the model used to predict.
28. Platforms for managing the ML lifecycle
Production:
▪ Training
▪ Predictions
▪ Model serving
▪ Model repository
▪ Experiments tracking
▪ Evaluation metrics
R&D:
▪ Dev data versioning and linkage
▪ Automated evaluation reports
▪ Collaborative experiments
▪ Deep learning computing environment
29. R&D workflow
▪ Gitflow branching model, with feature branches matching the Jira key
▪ Notebook name matching the branch ID; notebooks and data stored and shared in S3
▪ Dev unix machine in the cloud with a data cache
▪ Pull and install the latest version of the code
▪ Develop code locally using professional IDEs
▪ Commit and push
30. Cache configuration
▪ EC2 memory-optimized machines (r4 or r5 family)
▪ EBS volume with 250GB of storage
▪ Alluxio and Jupyter services start at boot time
▪ 200GB reserved for the Alluxio cache
▪ S3 buckets mounted locally in --readonly mode using the FUSE API
▪ Parquet data read in multi-processing with Dask directly from the local file system instead of via the S3 boto API
31. Cache benefits for R&D
▪ Research & development data: ~1TB
▪ We only focus on 15% of the data every month (~150GB)
▪ Data is re-read at every kernel restart (~5 times a day)
▪ Data science team of ~5 people
▪ Datasets split into files of ~120MB each
=> roughly 1.2k files and 500k read requests every month
We observed a 3x to 5x speed-up using Alluxio, plus all the benefits of accessing S3 data through the POSIX API.
32. Processing large datasets with EMR (+ SparkMagic)
Picture source: https://dimensionless.in/different-ways-to-manage-apache-spark-applications-on-amazon-emr/
Ephemeral clusters on spot instances can dramatically reduce the cost of operations.
34. Automate code with task-oriented containerized jobs
All analysis findings are moved into production-quality modules, with entry points declared in makefiles for tasks such as:
▪ Data preparation
▪ Feature extraction
▪ Model selection / tuning
▪ Evaluations
▪ Model inference
▪ Predictions post-processing
Picture source: https://medium.com/@davidstevens_16424/make-my-day-ta-science-easier-e16bc50e719c
35. Automate task execution using Continuous Integration (CI)
On commit: code tests. On release: builds & deployment, evaluation reports.
Picture source: https://deploybot.com/blog/the-expert-guide-to-continuous-integration
36. Automated code testing pyramid
Unit tests (70%)
▪ Single methods of data processing utils and major components
▪ Replace "assertEqual" with uncertainty ranges on predictions
Integration tests (20%)
▪ Black-box testing of single jobs
▪ Subset of component integrations (e.g. transformers followed by model predictions)
End-to-end tests (10%)
▪ Static and small dataset
▪ Dry runs of the execution plan
▪ Check APIs work seamlessly through every stage of the pipeline
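A minimal sketch of the "uncertainty ranges instead of assertEqual" idea; the model outputs and tolerance below are made up for illustration:

```python
def assert_within(value, expected, tol):
    """Tolerance-based check replacing an exact assertEqual on predictions."""
    assert abs(value - expected) <= tol, f"{value} not within {expected} +/- {tol}"


def test_mean_prediction_in_range():
    preds = [0.48, 0.52, 0.50, 0.49]  # stand-in for real model outputs
    mean = sum(preds) / len(preds)
    # Accept any mean inside the tolerance band instead of an exact match,
    # so retraining with a different seed does not break the suite.
    assert_within(mean, expected=0.50, tol=0.05)
```

This keeps unit tests stable across model retrains while still catching genuine regressions.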
38. Different model prediction serving patterns, different architectures
Depending on whether training is offline or online, predictions can be served on-demand (microservices, REST API), as real-time streaming analysis (online learning), or in batch (batch jobs, AutomatedML).
We will only cover serverless offline training patterns in this presentation.
39. Batch model serving: embarrassingly parallel data processing with AWS Batch
Data batches (~a few GBs each) → jobs → output storage
Source: https://spotinst.com/blog/cost-efficient-batch-computing-on-spot-instances-aws-batch-integration/
40. Model serving via microservices
SERVERLESS CHOICE
Cheap and simple solution for deploying containers without having to care about the infrastructure.
Limits as of today: max 4 vCPUs and 30GB of RAM.
OR
SERVERFUL CHOICE
Advanced, customizable, powerful and widespread solutions for container orchestration on pools of AWS EC2 instances.
Requires infrastructure management.
41. How do containers scale for a real-time varying request load?
The number of requests per second can exceed the provisioned capacity during an unexpected sudden burst, while over-provisioning has a cost.
42. Training pipeline and real-time serverless model serving
Training: a trigger starts the pipeline, which gets the users' features, trains and saves the model, updates meta-info and configs, packages the requirements, and builds and deploys (libraries stored on EFS).
Serving: a REST request looks up the user and model info, gets the model, reads the libraries from EFS, and returns the predictions.
43. Comparison for real-time applications (serverful vs serverless)
▪ Horizontal scaling: autoscaling rules based on predicted load and capacity vs elastic, based on real-time demand
▪ Provisioning time: minutes vs immediate (or seconds on a cold start)
▪ Burst concurrency: depends on available resources vs 3000, plus an additional 500 every minute
▪ Cost efficiency: pay for the over-provisioning vs pay only for what you use (10x cheaper in our use cases)
▪ Vertical scaling: limited by instance types vs limited to 3GB and 2 CPUs
▪ Execution timeout: unlimited vs 15 minutes
45. Orchestrating functions and microservices with Step Functions
Workflows are defined as a finite state machine, with plug-and-play integration with most AWS services: AWS Batch, ECS, Sagemaker, …
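An illustrative Amazon States Language definition for such a fan-out workflow, generated from Python; the ARNs are placeholders and this is a sketch, not the deck's actual state machine:

```python
def fan_out_definition(mapper_arn, n_mappers, reducer_arn):
    """Build an ASL state machine: a Parallel state fanning out to
    n mapper branches, followed by a single reduce task."""
    branches = [
        {
            "StartAt": f"Mapper{i}",
            "States": {
                f"Mapper{i}": {"Type": "Task", "Resource": mapper_arn, "End": True}
            },
        }
        for i in range(n_mappers)
    ]
    return {
        "Comment": "Map/reduce fan-out sketch",
        "StartAt": "Mappers",
        "States": {
            # The Parallel state waits for every branch before moving on,
            # which is what makes it usable as a map/reduce barrier.
            "Mappers": {"Type": "Parallel", "Branches": branches, "Next": "Reduce"},
            "Reduce": {"Type": "Task", "Resource": reducer_arn, "End": True},
        },
    }
```

The dict can be serialized with `json.dumps` and passed to `create_state_machine` via boto3.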
47. Data sanity checks
What to check:
- Value ranges
- Null values
- Anomalies
- Data distribution
Tools:
- Apache Griffin
- Amazon Deequ
- Great Expectations
48. Centralized logging with the ELK stack
Generate logs → aggregation & transformation → storage & indexing → visualization & analysis
49. Infrastructure Monitoring and Alerting
Basic Monitoring
AWS resources and custom
metrics generated by your
applications and services
General Infra Monitoring
Cloud-scale monitoring of
logs, metrics and traces from
distributed, dynamic and
hybrid infrastructure.
Serverless Monitoring
All-in-one performance management tool, down to single lines of code, specifically designed for serverless applications.
50. KPIs and Metrics Dashboard
KPIs over time such as:
â Distribution shifts
â Model drift
â Utilization
â Coverage
Analytics dashboard on top of Athena SQL queries
Custom programmatic dashboards
with interactive charts
52. MapReduce with PyWren futures
PyWren serializes local Python code, ships it to be executed in lambda functions in the cloud, and returns the list of deserialized results back to the driver.
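A minimal sketch of this pattern using the classic pywren API (`default_executor` / `map` / `get_all_results`); the word-count mapper is illustrative:

```python
def word_count(chunk):
    # mapper: executed inside a Lambda function by PyWren
    return len(chunk.split())


def reduce_counts(counts):
    # reducer: runs back on the driver once results are deserialized
    return sum(counts)


def run_on_lambda(chunks):
    """Fan each chunk out to its own Lambda, then reduce on the driver."""
    import pywren  # lazy import: requires AWS credentials and a pywren setup
    pwex = pywren.default_executor()
    futures = pwex.map(word_count, chunks)  # one Lambda invocation per chunk
    return reduce_counts(pywren.get_all_results(futures))
```

The mapper must be picklable and self-contained, since PyWren serializes it along with its dependencies before shipping it to the cloud.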
53. MapReduce with SFN parallel sync tasks
As many parallel mappers as we want, but a maximum of 10 concurrent synchronous lambda invocations.*
* A single Lambda function only supports up to 10 concurrent executions when invoked synchronously.
54. MapReduce with SFN queue polling
The driver polls an SQS queue while the mappers (Mapper1, Mapper2, … Mapper n) push their results to it.
No limitations on async lambda invocations, but:
▪ max 1000 transitions/second
▪ max 25k events in the execution history
* StepFunctions has a limit of 1000 transitions/second and a max execution history size of 25k events.
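A hedged sketch of the driver's polling loop for this pattern; the queue wiring and names are assumptions, and the drain logic is kept separate from boto3 so it can run without AWS:

```python
import json


def drain(receive_batch, expected, max_polls=10_000):
    """Collect `expected` mapper results from a queue.
    `receive_batch` is any callable returning a (possibly empty) list of
    JSON message bodies, so the loop is testable without AWS."""
    results = []
    for _ in range(max_polls):
        results.extend(json.loads(body) for body in receive_batch())
        if len(results) >= expected:
            return results
    raise TimeoutError("mappers did not finish in time")


def sqs_receive_batch(queue_url):
    """Build a receive_batch callable backed by a real SQS queue."""
    import boto3  # lazy import: only needed against real AWS
    sqs = boto3.client("sqs")

    def receive_batch():
        resp = sqs.receive_message(QueueUrl=queue_url,
                                   MaxNumberOfMessages=10,
                                   WaitTimeSeconds=5)  # long polling
        messages = resp.get("Messages", [])
        for m in messages:  # delete after reading so results are not re-polled
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=m["ReceiptHandle"])
        return [m["Body"] for m in messages]

    return receive_batch
```

Long polling (`WaitTimeSeconds`) keeps the driver cheap while it waits for slow mappers.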
55. MapReduce with SFN activity callbacks
The driver gets an activity token and waits for the mapper activity to complete. Each mapper (mapper1 … mapper n) is started asynchronously with the corresponding token and sends an activity task success when done.
Unlimited parallel executions.
Source: https://semantive.com/part-2-asynchronous-actions-within-aws-step-functions-without-servers/
56. MapReduce with S3 events
No limitations on async lambda invocations, but the pattern relies on I/O side effects.
57. MapReduce with DynamoDB events
Job DynamoDB table:
job_id     | task_id | task_type | task_dependencies | function_name | payload
mr_example | init    | lambda    | []                | lambda_init   | {}
mr_example | map_1   | lambda    | [init]            | lambda_map    | {input_path: chunk1, output_path: dir}
mr_example | map_2   | lambda    | [init]            | lambda_map    | {input_path: chunk2, output_path: dir}
mr_example | reduce  | lambda    | [map_1, map_2]    | lambda_reduce | {dir_path: dir}
An external service fills the table with the job meta-information and dependencies and runs the job entry point. New or update events trigger the Coordinator, which moves each task (Init, Map_1, Map_2, Reduce) from submitted to completed and fires the job completed callback at the end.
No limitations on async invocations, but:
▪ DynamoDB read/write throughput can be throttled
▪ Job metadata limited to 400KB
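The dependency check the Coordinator performs over this table can be sketched as follows (a simplification for illustration, not Helixa's actual coordinator):

```python
def ready_tasks(tasks):
    """Return the ids of tasks whose dependencies are all completed and
    that have not started yet. Each task is a dict mirroring a DynamoDB
    item with task_id, task_status and task_dependencies."""
    done = {t["task_id"] for t in tasks if t["task_status"] == "completed"}
    return [
        t["task_id"]
        for t in tasks
        if t["task_status"] == "submitted"
        and all(dep in done for dep in t["task_dependencies"])
    ]
```

On every DynamoDB stream event the Coordinator would recompute `ready_tasks` and asynchronously invoke the `function_name` of each ready task with its `payload`.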
63. Download the Non-Technical Guide
Topics covered:
▪ Getting started with understanding the technology
▪ Designing the right ML product
▪ Planning under uncertainty
▪ Building a balanced ML team
www.helixa.ai/machine-learning-guide-2020
64. Software and Machine Learning Engineer
We are looking for engineers:
▪ Interested in deploying ML to production
▪ Willing to learn about cloud, code optimization and serverless technologies
Requisites:
▪ Bachelor's degree or above in computer science or software/computer/IT engineering fields
▪ Knowledge of the Pydata stack
▪ Knowledge of SQL
▪ Understanding of ML concepts
Contact: lmioulet@helixa.ai
67. Steps to Managing the ML Product Lifecycle
1. Familiarize yourself with the whole lifecycle and the most popular tools and libraries.
2. Adopt a platform such as MLflow to track and version models and experiments.
3. Notebooks are good for exploration, but the implementation should live in a codebase.
4. Make analysis, code and infrastructure reproducible, and avoid manual operations.
5. Communicate analysis results effectively, summarizing only what is relevant.
6. Invest in automated tests at different integration levels.
7. Exploit Continuous Integration (CI) for automating builds and releases.
8. Deliver models and components inside Docker containers, when possible.
9. Centralize log collection for debugging and troubleshooting.
10. Monitor the infrastructure health using dedicated tools.
11. Consider a strategy for implementing governance and auditability.
68. Steps to migrate to Serverless architectures
1. Reverse Conway's law: "Organizations produce software that resembles their organizational communication structures."
2. Divide your architecture into separate and simple services.
3. Adopt the serverless.com framework to make it easier to develop lambda functions.
4. Pick the most suitable serverless MapReduce architecture for your needs.
5. Enjoy your team having fun with simplified and scalable deployments.
6. Report the substantial cost savings to your boss.
Editor's notes
Secure
Scalable
Cheaper
Always available
Worry Free
No maintenance needed
Helixa is a platform using machine learning to provide audience insights for modern market
Helixa is a forward-thinking audience intelligence platform that uses ethical AI and Machine Learning technology to connect data sources.
We build research platforms for the 21st century with more depth and detail than you would ever get from
a single source research platform.
The results are incredibly nuanced, timely and meaningful insights about the audiences that matter to your business.
Ben Hamner - co-founder and CTO of Kaggle
George Hotz - American security hacker (iOS and PlayStation jailbreaks)
François Chollet - deep learning expert and creator of Keras
CIFAR - Canadian research organization tackling science and humanity problems, including with the use of AI
The Hacker News: cybersecurity news
Waymo: self-driving transportation startup
Multiple datasets (source agnostic)
Anonymous privacy-preserving (no personal identifiers)
Ethical and responsible design (mitigate biases and fairness)
Accurate estimations of consumer behaviour (statistical significance)
Fast analytics calculation and aggregation of insights (a few seconds)
Always available, no downtime
Common Data Model
Twitter Data Processing
ML System Overview
Entity Resolution
Content Embeddings
Automated Categorization
Users' Digital DNA
Universal Trait Classifiers
Latent Interests Augmentation
Audience projection model
Insights Engine
Data lake: unstructured data, raw data, DS
Data warehouse: highly structured, BI
Data lakehouse: ACID, BI support, easily accessible
HDFS: Apache data storage system; the name node keeps track of the data and manages the data nodes, which handle loading and storing. Requires managing a cluster and costly EBS volumes.
S3: AWS object storage. Managed scaling.
The only drawback of S3 is some limitations: file size (5TB) and
Ingestion, crawl to discover partitions, update of catalog, Partitions are exactly like Hive
In Helixa we care about various objects, such as users, items, entities… Users have multiple features describing them. Every time the feature extraction process is run, it generates a new timestamp. Each timestamp contains...
The next iteration of Snorkel is Snorkel Flow: an E2E ML platform integrating the concepts of Snorkel.
Luc
Not a serverless part, but reduces costs compared to using sagemaker. Also reduces boundaries from notebook to code, because we use code from a repo within the notebook
EMR recommended configurations:
Favor r-family instance types
Use a dedicated instance for the driver and spot instances for workers
Set the "maximizeResourceAllocation": "true" property (it calculates the maximum compute and memory resources available for an executor on an instance in the core instance group, then sets the corresponding spark-defaults settings based on this information)
Avoid dynamic allocation; run one job at a time.
Luc
Makefiles make your code task-oriented, explicitly stating which steps are necessary to perform a given task and making it easy for any user to run those tasks without having to copy and paste verbose shell commands.
Make commands should have as few arguments as possible, if not none.
Alternative to this is to use DVC pipelines.
Leveraging commands in the Makefile, the Continuous Integration (CI) system can automate the build and test at every commit and the release and deployment at every pull request.
What we mean by code tests are…
Online training is not part of the talk.
In the case of offline batch forecasting, a good deployment model is AWS Batch.
In the case of offline on-demand prediction, use microservices.
Docker runtime support will be dropped by kubernetes in favor of containers that use the container runtime interface
Docker can now replace the EFS volume
This architecture requires smart orchestration and development of map-reduce architecture specific to a task
Docker jobs will be for long-running memory-intensive jobs.
AWS lambdas can do equivalent job if well written.
Luc
Split in half drifts and shifts, utilization
PyWren serializes and runs local Python code and returns the results back to the driver.
Unfortunately a single Lambda function only supports up to 10 concurrent executions when invoked synchronously.
* StepFunctions has a limit of 1000 transitions/second and a max execution history size of 25k events.