Helixa uses serverless machine learning architectures to power an audience intelligence platform. It ingests large datasets and uses machine learning models to provide insights. Helixa's machine learning system is built on AWS serverless services like Lambda, Glue, Athena and S3. It features a data lake for storage, a feature store for preprocessed data, and uses techniques like map-reduce to parallelize tasks. Helixa aims to build scalable and cost-effective machine learning pipelines without having to manage servers.
Serverless machine learning architectures at Helixa
1. Serverless machine learning architectures at Helixa
Data Science Milan meetup
15th December, 2020
Gianmario Spacagna, Luc Mioulet
AI team at Helixa
2. About Us
Gianmario Spacagna
@gm_spacagna
MBA, MSc in Software Engineering of
Distributed Systems
Chief Scientist, Helixa
Luc Mioulet
@lmioulet
PhD Signal Processing
Machine Learning Engineer, Helixa
3. In the next hour you will learn about
1. Overview of serverless services in AWS
2. The Helixa ML system powering a platform used
by thousands of marketers around the globe
3. Map/reduce serverless architectures
4. Cloud Providers Disclaimer
The following examples focus on the AWS stack, but other cloud providers offer similar services.
Comparing different cloud solutions is beyond the scope of this talk.
5. AWS Disclaimer
The content of this talk was updated a month ago, before the recent changes introduced by Amazon.
7. Dynamic allocation of resources by the cloud provider
Traditional serverful way vs serverless way.
Source: https://serverless-stack.com/chapters/what-is-serverless.html
8. Philosophy behind serverless
"If a tree falls in a forest and no one is around to hear it, does it make a sound?"
"If a server runs in the cloud and no one is around to use it, does it need to incur any costs?"
WinterClouds
9. Major serverless services available in AWS
▪ Fargate: Docker container execution.
▪ Lambda: script execution in response to events.
▪ Step Functions: orchestration of components and microservices.
▪ SQS / SNS: queuing and publisher/subscriber messaging services.
▪ DynamoDB: NoSQL key-value database.
▪ API Gateway: REST API management service.
▪ Athena: query service to analyze data at scale using standard SQL (like PrestoDB).
▪ Glue: ETL service to crawl and process large datasets on a fully managed Spark environment.
Full list available at https://aws.amazon.com/serverless/
10. Lambda function: listing files in a specified S3 directory
Python script, event object and result object (shown on slide).
Lambda cost: $1.04 / million requests
S3 LIST request cost: $5 / million requests
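The slide's Python script is shown only as an image; a minimal sketch of such a Lambda, assuming the event carries illustrative `bucket` and `prefix` keys, could look like:

```python
def list_keys(event, s3_client=None):
    """Return all object keys under event['prefix'] in event['bucket']."""
    if s3_client is None:
        import boto3  # lazy import so the listing logic is testable offline
        s3_client = boto3.client("s3")
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    # Paginate so results are not capped at the 1000-key limit of one LIST call.
    for page in paginator.paginate(Bucket=event["bucket"], Prefix=event["prefix"]):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return {"keys": keys, "count": len(keys)}


def lambda_handler(event, context):
    # Entry point configured for the Lambda function
    return list_keys(event)
```

Each paginated page is one billable LIST request, which is why the S3 LIST cost above matters at scale.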
14. About Helixa
Helixa is an audience intelligence platform that uses machine learning to provide accurate and timely consumer insights for modern market research.
15. Audience:
Size: 1.5M / 223M represented population
Top Influencers: François Chollet (201x), Ben Hamner (114x), George Hotz (106x)
Top Media: Cifar News (65x), The Hacker News (31x), AngelList (28x)
Top Products and Companies: Tensorflow (107x), Waymo (66x), Airbnb Engineering (55x)
Demographics: 18-40 years old, male, U.S. and India
16. Platform Requirements
▪ Multiple datasets
▪ Accurate consumer insights
▪ Fast real-time analytics
▪ Always available
▪ Minimum infrastructure maintenance
▪ Cost effective
18. Helixa end-to-end pipeline
Data Contents → Data Processing → Data Integrations → Common Data Model, feeding the machine learning jobs:
▪ Embedding
▪ Entity Resolution
▪ Taxonomy Categorization
▪ Users Digital DNA
▪ Traits Classifiers
▪ Latent Interests Augmentation
▪ Audience Projection
These power the Insights Engine, real-time analytics applications and other analytics tools.
19. Helixa architecture
Data ingestions feed a data lake. ML pipelines combine ML cloud services, pre-trained models, external APIs and ML libraries, and publish to a model repository. Batch jobs populate the production DB, which serves microservices and analytics applications.
20. Tech stack and tools
▪ Batch inference
▪ Model repository and evaluation metrics
▪ Training and hyper-parameter tuning
▪ Analysis and research
▪ ML libraries
▪ Data labeling
▪ Feature store
▪ Feature engineering
▪ Data lake
In this talk we will focus on a subset of these layers.
23. Data Lake(house) using Glue and Athena
Artifacts are saved in S3 and crawled by Glue.
Athena is used to build logical views on top of them, such as:
▪ Retrieve the latest version of the artifact
▪ Aggregate multiple partitions of the same artifact
▪ Filter and merge with other tables
▪ Export snapshots of the views as versioned parquet datasets
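A hedged sketch of how such a "latest version" view could be defined through Athena from Python; the view/table names are illustrative, not Helixa's actual schema:

```python
def latest_partition_view_sql(view, table, ts_col="timestamp"):
    """SQL for a view exposing only the most recent partition of a table.
    The partition column is double-quoted because `timestamp` is a
    reserved word in Athena/Presto."""
    return (
        f'CREATE OR REPLACE VIEW {view} AS '
        f'SELECT * FROM {table} '
        f'WHERE "{ts_col}" = (SELECT max("{ts_col}") FROM {table})'
    )


def run_athena_query(sql, database, output_s3):
    """Submit the query; Athena writes results to the given S3 location."""
    import boto3  # lazy import: only needed when talking to real AWS
    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )
    return resp["QueryExecutionId"]
```

Because Glue-crawled partitions register as columns, the same pattern also covers the "aggregate multiple partitions" case by relaxing the WHERE clause.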
24. Feature Store Partitions (X)
S3 bucket
├── users
│   └── features
│       ├── feature_family=text_embedding
│       │   ├── timestamp=2020-10-14-12-58
│       │   │   ├── _metadata.json
│       │   │   ├── part000.parquet
│       │   │   ├── part001.parquet
│       │   │   └── …
│       │   └── timestamp=2020-09-18-18-35
│       │       └── ...
│       ├── feature_family=picture_embedding
│       │   └── ...
│       └── feature_family=category_counts
│           └── ...
├── items
└── other entities
Parquet data indexed by user_id.
_metadata.json contains info on how the features were created.
Partitioned by the set of features generated by the same job (feature_family) and by creation time (timestamp).
25. Label Store Partitions (y)
S3 bucket
├── users
│   └── labels
│       ├── variable=gender
│       │   ├── source=first_name
│       │   │   └── timestamp=2020-10-14-12-58
│       │   │       ├── _metadata.json
│       │   │       ├── part000.parquet
│       │   │       ├── part001.parquet
│       │   │       └── …
│       │   └── source=public_profile
│       │       └── ...
│       └── variable=age
├── items
└── other entities
Partitioned by the variable we are trying to predict, then by the source of ground truth.
Label management for weak supervision is done via Snorkel.
26. Prediction Store Partitions (y_pred)
S3 bucket
├── users
│   └── predictions
│       ├── variable=gender
│       │   ├── model=xgbc
│       │   │   └── timestamp=2020-11-05-17-22
│       │   │       ├── _metadata.json
│       │   │       ├── part000.parquet
│       │   │       ├── part001.parquet
│       │   │       └── …
│       │   └── model=cnn
│       │       └── ...
│       └── variable=age
│           └── ...
├── items
└── other entities
Partitioned by the identifier of the model used to predict.
28. Platforms for managing the ML lifecycle
Production:
▪ Training
▪ Predictions
▪ Model serving
▪ Model repository
▪ Experiments tracking
▪ Evaluation metrics
R&D:
▪ Dev data versioning and linkage
▪ Automated evaluation reports
▪ Collaborative experiments
▪ Deep learning computing environment
29. R&D workflow
▪ Gitflow branching model, with feature branches matching the Jira key
▪ Notebook name matching the branch ID; notebooks and data stored and shared in S3
▪ Dev unix machine in the cloud with a data cache
▪ Pull and install the latest version of the code
▪ Develop code locally using professional IDEs
▪ Commit and push
30. Cache configuration
▪ EC2 memory-optimized machines (r4 or r5 family)
▪ EBS volume with 250GB of storage
▪ Alluxio and Jupyter services start at boot time
▪ 200GB reserved for the Alluxio cache
▪ S3 buckets mounted locally in --readonly mode using the FUSE API
▪ Parquet data read in multi-processing with Dask directly from the local file system instead of via the S3 boto API
31. Cache benefits for R&D
▪ Research & development data: ~1TB
▪ We only focus on 15% of the data every month (~150GB)
▪ Data is re-read at every kernel restart (~5 times a day)
▪ Data science team of ~5 people
▪ Datasets split into files of ~120MB each
=> roughly 1.2k files and 500k read requests every month
We observed a 3x to 5x speed-up using Alluxio, plus all the benefits of accessing S3 data through the POSIX API.
32. Processing large datasets with EMR (+ SparkMagic)
Picture source: https://dimensionless.in/different-ways-to-manage-apache-spark-applications-on-amazon-emr/
Ephemeral clusters on spot instances can dramatically reduce the cost of operations.
34. Automate code with task-oriented containerized jobs
All analysis findings are moved into production-quality modules, with entry points declared in makefiles for tasks such as:
▪ Data preparation
▪ Feature extraction
▪ Model selection / tuning
▪ Evaluations
▪ Model inference
▪ Predictions post-processing
Picture source: https://medium.com/@davidstevens_16424/make-my-day-ta-science-easier-e16bc50e719c
35. Automate task execution using Continuous Integration (CI)
On commit: code tests. On release: builds & deployment, evaluation reports.
Picture source: https://deploybot.com/blog/the-expert-guide-to-continuous-integration
36. Automated code testing pyramid
Unit tests (70%)
▪ Single methods of data processing utils and major components
▪ Replace "assertEqual" with uncertainty ranges on predictions
Integration tests (20%)
▪ Black-box testing of single jobs
▪ Subset of component integrations (e.g. transformers followed by model predictions)
End-to-end tests (10%)
▪ Static and small dataset
▪ Dry runs of the execution plan
▪ Check APIs work seamlessly through every stage of the pipeline
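A minimal sketch of the "uncertainty ranges instead of assertEqual" idea; the model outputs and tolerance below are made up for illustration:

```python
def assert_within(value, expected, tol):
    """Tolerance-based check replacing an exact assertEqual on predictions."""
    assert abs(value - expected) <= tol, f"{value} not within {expected} +/- {tol}"


def test_mean_prediction_in_range():
    preds = [0.48, 0.52, 0.50, 0.49]  # stand-in for real model outputs
    mean = sum(preds) / len(preds)
    # Accept any mean inside the tolerance band instead of an exact match,
    # so retraining with a different seed does not break the suite.
    assert_within(mean, expected=0.50, tol=0.05)
```

This keeps unit tests stable across model retrains while still catching genuine regressions.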
38. Different model prediction serving patterns, different architectures
Depending on whether training is offline or online, predictions can be served on-demand (microservices, REST API), as real-time streaming analysis (online learning), or in batch (batch jobs, AutomatedML).
We will only cover serverless offline training patterns in this presentation.
39. Batch model serving: embarrassingly parallel data processing with AWS Batch
Data batches (~a few GBs each) → jobs → output storage
Source: https://spotinst.com/blog/cost-efficient-batch-computing-on-spot-instances-aws-batch-integration/
40. Model serving via microservices
SERVERLESS CHOICE
Cheap and simple solution for deploying containers without having to care about the infrastructure.
Limits as of today: max 4 vCPUs and 30GB of RAM.
OR
SERVERFUL CHOICE
Advanced, customizable, powerful and widespread solutions for container orchestration on pools of AWS EC2 instances.
Requires infrastructure management.
41. How do containers scale for a real-time varying request load?
The number of requests per second can exceed the provisioned capacity during an unexpected sudden burst, while over-provisioning has a cost.
42. Training pipeline and real-time serverless model serving
Training: a trigger starts the pipeline, which gets the users' features, trains and saves the model, updates meta-info and configs, packages the requirements, and builds and deploys (libraries stored on EFS).
Serving: a REST request looks up the user and model info, gets the model, reads the libraries from EFS, and returns the predictions.
43. Comparison for real-time applications (serverful vs serverless)
▪ Horizontal scaling: autoscaling rules based on predicted load and capacity vs elastic, based on real-time demand
▪ Provisioning time: minutes vs immediate (or seconds on a cold start)
▪ Burst concurrency: depends on available resources vs 3000, plus an additional 500 every minute
▪ Cost efficiency: pay for the over-provisioning vs pay only for what you use (10x cheaper in our use cases)
▪ Vertical scaling: limited by instance types vs limited to 3GB and 2 CPUs
▪ Execution timeout: unlimited vs 15 minutes
45. Orchestrating functions and microservices with Step Functions
Workflows are defined as a finite state machine, with plug-and-play integration with most AWS services: AWS Batch, ECS, Sagemaker, …
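An illustrative Amazon States Language definition for such a fan-out workflow, generated from Python; the ARNs are placeholders and this is a sketch, not the deck's actual state machine:

```python
def fan_out_definition(mapper_arn, n_mappers, reducer_arn):
    """Build an ASL state machine: a Parallel state fanning out to
    n mapper branches, followed by a single reduce task."""
    branches = [
        {
            "StartAt": f"Mapper{i}",
            "States": {
                f"Mapper{i}": {"Type": "Task", "Resource": mapper_arn, "End": True}
            },
        }
        for i in range(n_mappers)
    ]
    return {
        "Comment": "Map/reduce fan-out sketch",
        "StartAt": "Mappers",
        "States": {
            # The Parallel state waits for every branch before moving on,
            # which is what makes it usable as a map/reduce barrier.
            "Mappers": {"Type": "Parallel", "Branches": branches, "Next": "Reduce"},
            "Reduce": {"Type": "Task", "Resource": reducer_arn, "End": True},
        },
    }
```

The dict can be serialized with `json.dumps` and passed to `create_state_machine` via boto3.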
47. Data sanity checks
What to check:
- Value ranges
- Null values
- Anomalies
- Data distribution
Tools:
- Apache Griffin
- Amazon Deequ
- Great Expectations
48. Centralized logging with the ELK stack
Generate logs → aggregation & transformation → storage & indexing → visualization & analysis
49. Infrastructure Monitoring and Alerting
Basic Monitoring
AWS resources and custom
metrics generated by your
applications and services
General Infra Monitoring
Cloud-scale monitoring of
logs, metrics and traces from
distributed, dynamic and
hybrid infrastructure.
Serverless Monitoring
All-in-one performance management tool, down to single lines of code, specifically designed for serverless applications.
50. KPIs and Metrics Dashboard
KPIs over time such as:
â Distribution shifts
â Model drift
â Utilization
â Coverage
Analytics dashboard on top of Athena SQL queries
Custom programmatic dashboards
with interactive charts
52. MapReduce with PyWren futures
PyWren serializes local Python code, ships it to be executed in lambda functions in the cloud, and returns the list of deserialized results back to the driver.
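A minimal sketch of this pattern using the classic pywren API (`default_executor` / `map` / `get_all_results`); the word-count mapper is illustrative:

```python
def word_count(chunk):
    # mapper: executed inside a Lambda function by PyWren
    return len(chunk.split())


def reduce_counts(counts):
    # reducer: runs back on the driver once results are deserialized
    return sum(counts)


def run_on_lambda(chunks):
    """Fan each chunk out to its own Lambda, then reduce on the driver."""
    import pywren  # lazy import: requires AWS credentials and a pywren setup
    pwex = pywren.default_executor()
    futures = pwex.map(word_count, chunks)  # one Lambda invocation per chunk
    return reduce_counts(pywren.get_all_results(futures))
```

The mapper must be picklable and self-contained, since PyWren serializes it along with its dependencies before shipping it to the cloud.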
53. MapReduce with SFN parallel sync tasks
As many parallel mappers as we want, but a maximum of 10 concurrent synchronous lambda invocations.*
* A single Lambda function only supports up to 10 concurrent executions when invoked synchronously.
54. MapReduce with SFN queue polling
The driver polls an SQS queue while the mappers (Mapper1, Mapper2, … Mapper n) push their results to it.
No limitations on async lambda invocations, but:
▪ max 1000 transitions/second
▪ max 25k events in the execution history
* StepFunctions has a limit of 1000 transitions/second and a max execution history size of 25k events.
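A hedged sketch of the driver's polling loop for this pattern; the queue wiring and names are assumptions, and the drain logic is kept separate from boto3 so it can run without AWS:

```python
import json


def drain(receive_batch, expected, max_polls=10_000):
    """Collect `expected` mapper results from a queue.
    `receive_batch` is any callable returning a (possibly empty) list of
    JSON message bodies, so the loop is testable without AWS."""
    results = []
    for _ in range(max_polls):
        results.extend(json.loads(body) for body in receive_batch())
        if len(results) >= expected:
            return results
    raise TimeoutError("mappers did not finish in time")


def sqs_receive_batch(queue_url):
    """Build a receive_batch callable backed by a real SQS queue."""
    import boto3  # lazy import: only needed against real AWS
    sqs = boto3.client("sqs")

    def receive_batch():
        resp = sqs.receive_message(QueueUrl=queue_url,
                                   MaxNumberOfMessages=10,
                                   WaitTimeSeconds=5)  # long polling
        messages = resp.get("Messages", [])
        for m in messages:  # delete after reading so results are not re-polled
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=m["ReceiptHandle"])
        return [m["Body"] for m in messages]

    return receive_batch
```

Long polling (`WaitTimeSeconds`) keeps the driver cheap while it waits for slow mappers.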
55. MapReduce with SFN activity callbacks
The driver gets an activity token and waits for the mapper activity to complete. Each mapper (mapper1 … mapper n) is started asynchronously with the corresponding token and sends an activity task success when done.
Unlimited parallel executions.
Source: https://semantive.com/part-2-asynchronous-actions-within-aws-step-functions-without-servers/
56. MapReduce with S3 events
No limitations on async lambda invocations, but the pattern relies on I/O side effects.
57. MapReduce with DynamoDB events
Job DynamoDB table:
job_id     | task_id | task_type | task_dependencies | function_name | payload
mr_example | init    | lambda    | []                | lambda_init   | {}
mr_example | map_1   | lambda    | [init]            | lambda_map    | {input_path: chunk1, output_path: dir}
mr_example | map_2   | lambda    | [init]            | lambda_map    | {input_path: chunk2, output_path: dir}
mr_example | reduce  | lambda    | [map_1, map_2]    | lambda_reduce | {dir_path: dir}
An external service fills the table with the job meta-information and dependencies and runs the job entry point. New or update events trigger the Coordinator, which moves each task (Init, Map_1, Map_2, Reduce) from submitted to completed and fires the job completed callback at the end.
No limitations on async invocations, but:
▪ DynamoDB read/write throughput can be throttled
▪ Job metadata limited to 400KB
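The dependency check the Coordinator performs over this table can be sketched as follows (a simplification for illustration, not Helixa's actual coordinator):

```python
def ready_tasks(tasks):
    """Return the ids of tasks whose dependencies are all completed and
    that have not started yet. Each task is a dict mirroring a DynamoDB
    item with task_id, task_status and task_dependencies."""
    done = {t["task_id"] for t in tasks if t["task_status"] == "completed"}
    return [
        t["task_id"]
        for t in tasks
        if t["task_status"] == "submitted"
        and all(dep in done for dep in t["task_dependencies"])
    ]
```

On every DynamoDB stream event the Coordinator would recompute `ready_tasks` and asynchronously invoke the `function_name` of each ready task with its `payload`.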
63. Download the Non-Technical Guide
Topics covered:
▪ Getting started with understanding the technology
▪ Designing the right ML product
▪ Planning under uncertainty
▪ Building a balanced ML team
www.helixa.ai/machine-learning-guide-2020
64. Software and Machine Learning Engineer
We are looking for engineers:
▪ Interested in deploying ML to production
▪ Willing to learn about cloud, code optimization and serverless technologies
Requisites:
▪ Bachelor's degree or above in computer science or software/computer/IT engineering fields
▪ Knowledge of the Pydata stack
▪ Knowledge of SQL
▪ Understanding of ML concepts
Contact: lmioulet@helixa.ai
67. Steps to Managing the ML Product Lifecycle
1. Familiarize yourself with the whole lifecycle and the most popular tools and libraries.
2. Adopt a platform such as MLflow to track and version models and experiments.
3. Notebooks are good for exploration, but the implementation should live in a codebase.
4. Make analysis, code and infrastructure reproducible, and avoid manual operations.
5. Communicate analysis results effectively, summarizing only what is relevant.
6. Invest in automated tests at different integration levels.
7. Exploit Continuous Integration (CI) for automating builds and releases.
8. Deliver models and components inside Docker containers, when possible.
9. Centralize log collection for debugging and troubleshooting.
10. Monitor the infrastructure health using dedicated tools.
11. Consider a strategy for implementing governance and auditability.
68. Steps to migrate to Serverless architectures
1. Reverse Conway's law: "Organizations produce software that resembles their organizational communication structures."
2. Divide your architecture into separate and simple services.
3. Adopt the serverless.com framework to make it easier to develop lambda functions.
4. Pick the most suitable serverless MapReduce architecture for your needs.
5. Enjoy your team having fun with simplified and scalable deployments.
6. Report the substantial cost savings to your boss.
Editor's notes
Secure
Scalable
Cheaper
Always available
Worry Free
No maintenance needed
Helixa is a platform using machine learning to provide audience insights for modern market
Helixa is a forward-thinking audience intelligence platform that uses ethical AI and Machine Learning technology to connect data sources.
We build research platforms for the 21st century with more depth and detail than you would ever get from
a single source research platform.
The results are incredibly nuanced, timely and meaningful insights about the audiences that matter to your business.
Ben Hamner - co-founder and CTO of Kaggle
George Hotz - American security hacker (iOS and PlayStation jailbreaks)
François Chollet - deep learning expert and creator of Keras
CIFAR - Canadian research organization tackling science and humanity problems, including with the use of AI
The Hacker News: cybersecurity news
Waymo: self-driving transportation startup
Multiple datasets (source agnostic)
Anonymous privacy-preserving (no personal identifiers)
Ethical and responsible design (mitigate biases and fairness)
Accurate estimations of consumer behaviour (statistical significance)
Fast analytics calculation and aggregation of insights (a few seconds)
Always available, no downtime
Common Data Model
Twitter Data Processing
ML System Overview
Entity Resolution
Content Embeddings
Automated Categorization
Users' Digital DNA
Universal Trait Classifiers
Latent Interests Augmentation
Audience projection model
Insights Engine
Data lake: unstructured data, raw data, DS
Data warehouse: highly structured, BI
Data lakehouse: ACID, BI support, easily accessible
HDFS: Apache data storage system; the name node keeps track of the data and manages the data nodes, which handle loading and storing. Requires managing a cluster and costly EBS volumes.
S3: AWS object storage. Managed scaling.
The only drawback of S3 is some limitations: file size (5TB) and
Ingestion, crawl to discover partitions, update of catalog, Partitions are exactly like Hive
In Helixa we care about various objects, such as users, items, entities… Users have multiple features describing them. Every time the feature extraction process is run, it generates a new timestamp. Each timestamp contains...
The next iteration of Snorkel is Snorkel Flow: an E2E ML platform integrating the concepts of Snorkel.
Luc
Not a serverless part, but reduces costs compared to using sagemaker. Also reduces boundaries from notebook to code, because we use code from a repo within the notebook
EMR recommended configurations:
Favor r-family instance types
Use a dedicated instance for the driver and spot instances for workers
Set the "maximizeResourceAllocation": "true" property (it calculates the maximum compute and memory resources available for an executor on an instance in the core instance group, then sets the corresponding spark-defaults settings based on this information)
Avoid dynamic allocation; run one job at a time.
Luc
Makefiles make your code task-oriented, explicitly stating which steps are necessary to perform a given task and making it easy for any user to run those tasks without having to copy and paste verbose shell commands.
Make commands should have as few arguments as possible, if not none.
Alternative to this is to use DVC pipelines.
Leveraging commands in the Makefile, the Continuous Integration (CI) system can automate the build and test at every commit and the release and deployment at every pull request.
What we mean by code tests are…
Online training is not part of the talk.
In the case of offline batch forecasting, a good deployment model is AWS Batch.
In the case of offline on-demand prediction, use microservices.
Docker runtime support will be dropped by kubernetes in favor of containers that use the container runtime interface
Docker can now replace the EFS volume
This architecture requires smart orchestration and development of map-reduce architecture specific to a task
Docker jobs will be for long-running memory-intensive jobs.
AWS lambdas can do equivalent job if well written.
Luc
Split in half drifts and shifts, utilization
PyWren serializes and runs local Python code and returns the results back to the driver.
Unfortunately a single Lambda function only supports up to 10 concurrent executions when invoked synchronously.
* StepFunctions has a limit of 1000 transitions/second and a max execution history size of 25k events.