SlideShare ist ein Scribd-Unternehmen logo
1 von 24
Downloaden Sie, um offline zu lesen
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Julien Simon
Principal Technical Evangelist, AI and Machine Learning
@julsimon
Using Apache Spark
with Amazon SageMaker
October 2018
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Agenda
• Apache Spark on AWS
• Amazon SageMaker
• Combining Spark and SageMaker
• Demos with the SageMaker SDK for Spark
• Getting started
Services covered: Amazon EMR, Amazon SageMaker
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Apache Spark on AWS
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Apache Spark
• Open-source, distributed processing system.
• In-memory caching and optimized execution for fast performance
(typically 100x faster than Hadoop).
• Batch processing, streaming analytics, machine learning, graph
databases and ad hoc queries.
• API for Java, Scala, Python, R, and SQL.
https://spark.apache.org/
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Apache Spark – DataFrame
• Distributed collection of data organized into named columns.
• Conceptually equivalent to a table in a relational database.
• Wide array of sources: structured files, databases.
• Wide array of formats: text, CSV, JSON, Avro, ORC, Parquet.
{"name": "Jeff"}
{"name": "Boaz", "age":72}
{"name": "Julien", "age":12}
df = spark.read.json("people.json")
df.show()
+----+-------+
| age| name |
+----+-------+
|null| Jeff |
| 72 | Boaz |
| 12 | Julien|
+----+-------+
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
MLlib – Machine learning library
• ML algorithms: classification, regression, clustering, collaborative filtering.
• Featurization: feature extraction, transformation, dimensionality reduction.
• Tools for constructing, evaluating and tuning ML pipelines
• Transformer – a transform function that maps a DataFrame into a new one
• Adding a column, changing the rows of a specific column, etc.
• Predicting the label based on the feature vector.
• Estimator – an algorithm that trains on data
• Consists of a fit() function that maps a DataFrame into a Model.
https://spark.apache.org/docs/latest/ml-guide.html
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Spark ML on Amazon EMR: spam detector
Adapted from https://github.com/databricks/learning-spark/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/MLlib.scala
ETL
Train
Predict
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Apache Spark on Amazon EMR
• Spark is natively supported in Amazon EMR.
• Amazon S3 connectivity using the EMR File System (EMRFS).
• Amazon Kinesis, Redshift and DynamoDB as data sources.
• Integration with the AWS Glue Data Catalog.
• Auto Scaling to add or remove instances from your cluster.
• Integration with the Amazon EC2 Spot market.
https://aws.amazon.com/emr/
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon SageMaker
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon SageMaker
Collect and prepare
training data
Choose and optimize
your ML algorithm
Set up and manage
environments for
training
Train and tune model
(trial and error)
Deploy model
in production
Scale and manage
the production
environment
Easily build, train, and deploy Machine Learning models
FREE
TIER
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon SageMaker
Pre-built
notebooks for
common
problems
K-Means Clustering
Principal Component Analysis
Neural Topic Modelling
Factorization Machines
Linear Learner
XGBoost
Latent Dirichlet Allocation
Image Classification
Seq2Seq,
And more!
ALGORITHMS
Apache MXNet, Chainer
TensorFlow, PyTorch
Caffe2, CNTK,
Torch
FRAMEWORKS
S e t u p a n d m a n a g e
e n v i r o n m e n t s f o r
t r a i n i n g
T r a i n a n d t u n e
m o d e l ( t r i a l a n d
e r r o r )
D e p l o y m o d e l
i n p r o d u c t i o n
S c a l e a n d m a n a g e t h e
p r o d u c t i o n e n v i r o n m e n t
Built-in, high-
performance
algorithms
Build
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon SageMaker
Pre-built
notebooks for
common
problems
Built-in, high-
performance
algorithms
One-click
training
Hyperparameter
optimization
Build Train
Deploy model
in production
Scale and manage
the production
environment
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon SageMaker
Fully managed
hosting with auto-
scaling
One-click
deployment
Pre-built
notebooks for
common
problems
Built-in, high-
performance
algorithms
One-click
training
Hyperparameter
optimization
Build Train Deploy
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon ECR
Model Training (on EC2)
Model Hosting (on EC2)
Trainingdata
Modelartifacts
Training code Helper code
Helper codeInference code
GroundTruth
Client application
Inference code
Training code
Inference requestInference response
Inference Endpoint
Amazon SageMaker
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Combining Spark and SageMaker
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Decouple ETL and Machine Learning
• Different workloads require different instance types.
• Say, M4 for ETL, P3 for training and C5 for prediction?
• If you need GPUs for training, running your EMR cluster on GPU
instances wouldn’t be cost-efficient.
• Scale them independently.
• Avoid oversizing your Spark cluster.
• Avoid time-consuming resizing operations on EMR.
• Run ETL once, train many models in parallel.
• SageMaker terminates training instances automatically.
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Run any ML algorithm in any language
• Spark MLlib is great, but you may need something else.
• SageMaker built-in algorithms (ML, DL, NLP).
• Deep Learning libraries, like TensorFlow or Apache MXNet.
• Your own custom code in any language.
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Deploy ML models in production
• Perform ML predictions without using Spark.
• Save the overhead of the Spark framework.
• Save loading your data in a DataFrame.
• Improve latency for small-batch predictions.
• It can be difficult to achieve low-latency predictions with Spark ML models.
• You can get real-time predictions with models hosted in SageMaker.
• You can use very powerful instances for prediction endpoints.
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Sample use cases for Spark+SageMaker
• Data preparation and feature engineering before training.
• Data transformation before batch prediction (model reuse).
• Data enrichment with predictions.
• Predict missing values instead of using median.
• Add new predicted features.
• Train on extremely large datasets with built-in algos.
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
SageMaker SDK for Spark
• Python and Scala SDK, for Apache Spark 2.1.1 and 2.2.
• Pre-installed on EMR 5.11 and later.
• Train, import, deploy and predict with SageMaker models directly from
your Spark application.
• Standalone,
• Integration in Spark MLlib pipelines.
• DataFrames in, DataFrames out:
automatic conversion to and from protobuf (crowd goes wild!)
https://github.com/aws/sagemaker-spark
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
SageMaker SDK for Spark – built-in algorithms
• High-level API for:
• Linear Learner
• Factorization Machines
• K-Means
• PCA
• LDA
• XGBoost
• The SageMakerEstimator object lets you use other built-in containers
as well any containerized code stored in Amazon ECR (just like the
regular SageMaker SDK).
https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html
Infinitely scalable algorithms:
no limit to the amount of data that they can process
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Demos
https://gitlab.com/juliensimon/dlnotebooks
1 – Classifying MNIST in Python with XGBoost (SageMaker)
2 – Clustering MNIST in Scala with K-Means (SageMaker)
3 – Clustering MNIST in Scala with a Pipeline: PCA (MLlib) + K-Means (SageMaker)
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Getting started
https://ml.aws
https://aws.amazon.com/sagemaker
https://aws.amazon.com/emr
https://github.com/aws/sagemaker-python-sdk
https://github.com/aws/sagemaker-spark
https://medium.com/@julsimon
https://gitlab.com/juliensimon/dlnotebooks
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Thank you!
Julien Simon
Principal Technical Evangelist, AI and Machine Learning
@julsimon

Weitere ähnliche Inhalte

Was ist angesagt?

Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
 
Introducing AWS AppSync: serverless data driven apps with real-time and offli...
Introducing AWS AppSync: serverless data driven apps with real-time and offli...Introducing AWS AppSync: serverless data driven apps with real-time and offli...
Introducing AWS AppSync: serverless data driven apps with real-time and offli...Amazon Web Services
 
Webinar slides: An Introduction to Performance Monitoring for PostgreSQL
Webinar slides: An Introduction to Performance Monitoring for PostgreSQLWebinar slides: An Introduction to Performance Monitoring for PostgreSQL
Webinar slides: An Introduction to Performance Monitoring for PostgreSQLSeveralnines
 
How to track and improve Customer Experience with LEO CDP
How to track and improve Customer Experience with LEO CDPHow to track and improve Customer Experience with LEO CDP
How to track and improve Customer Experience with LEO CDPTrieu Nguyen
 
Downscaling: The Achilles heel of Autoscaling Apache Spark Clusters
Downscaling: The Achilles heel of Autoscaling Apache Spark ClustersDownscaling: The Achilles heel of Autoscaling Apache Spark Clusters
Downscaling: The Achilles heel of Autoscaling Apache Spark ClustersDatabricks
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSAmazon Web Services
 
Diving into the Deep End - Kafka Connect
Diving into the Deep End - Kafka ConnectDiving into the Deep End - Kafka Connect
Diving into the Deep End - Kafka Connectconfluent
 
Big Query - Utilizing Google Data Warehouse for Media Analytics
Big Query - Utilizing Google Data Warehouse for Media AnalyticsBig Query - Utilizing Google Data Warehouse for Media Analytics
Big Query - Utilizing Google Data Warehouse for Media Analyticshafeeznazri
 
Dynatrace Cloud-Native Workshop Slides
Dynatrace Cloud-Native Workshop SlidesDynatrace Cloud-Native Workshop Slides
Dynatrace Cloud-Native Workshop SlidesVMware Tanzu
 
Modern real-time streaming architectures
Modern real-time streaming architecturesModern real-time streaming architectures
Modern real-time streaming architecturesArun Kejariwal
 
Supercharging Applications with GraphQL and AWS AppSync
Supercharging Applications with GraphQL and AWS AppSyncSupercharging Applications with GraphQL and AWS AppSync
Supercharging Applications with GraphQL and AWS AppSyncAmazon Web Services
 
Data Observability Best Pracices
Data Observability Best PracicesData Observability Best Pracices
Data Observability Best PracicesAndy Petrella
 
Introduction to GraphQL and AWS Appsync on AWS - iOS
Introduction to GraphQL and AWS Appsync on AWS - iOSIntroduction to GraphQL and AWS Appsync on AWS - iOS
Introduction to GraphQL and AWS Appsync on AWS - iOSAmazon Web Services
 
Build a Serverless Application using GraphQL & AWS AppSync
Build a Serverless Application using GraphQL & AWS AppSyncBuild a Serverless Application using GraphQL & AWS AppSync
Build a Serverless Application using GraphQL & AWS AppSyncAmazon Web Services
 
AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ec...
AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ec...AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ec...
AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ec...Amazon Web Services
 
GCP for Apache Kafka® Users: Stream Ingestion and Processing
GCP for Apache Kafka® Users: Stream Ingestion and ProcessingGCP for Apache Kafka® Users: Stream Ingestion and Processing
GCP for Apache Kafka® Users: Stream Ingestion and Processingconfluent
 
Best Practices for Building Your Data Lake on AWS
Best Practices for Building Your Data Lake on AWSBest Practices for Building Your Data Lake on AWS
Best Practices for Building Your Data Lake on AWSAmazon Web Services
 

Was ist angesagt? (20)

Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
Introducing AWS AppSync: serverless data driven apps with real-time and offli...
Introducing AWS AppSync: serverless data driven apps with real-time and offli...Introducing AWS AppSync: serverless data driven apps with real-time and offli...
Introducing AWS AppSync: serverless data driven apps with real-time and offli...
 
Webinar slides: An Introduction to Performance Monitoring for PostgreSQL
Webinar slides: An Introduction to Performance Monitoring for PostgreSQLWebinar slides: An Introduction to Performance Monitoring for PostgreSQL
Webinar slides: An Introduction to Performance Monitoring for PostgreSQL
 
How to track and improve Customer Experience with LEO CDP
How to track and improve Customer Experience with LEO CDPHow to track and improve Customer Experience with LEO CDP
How to track and improve Customer Experience with LEO CDP
 
Apache Kafka at LinkedIn
Apache Kafka at LinkedInApache Kafka at LinkedIn
Apache Kafka at LinkedIn
 
Downscaling: The Achilles heel of Autoscaling Apache Spark Clusters
Downscaling: The Achilles heel of Autoscaling Apache Spark ClustersDownscaling: The Achilles heel of Autoscaling Apache Spark Clusters
Downscaling: The Achilles heel of Autoscaling Apache Spark Clusters
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
Diving into the Deep End - Kafka Connect
Diving into the Deep End - Kafka ConnectDiving into the Deep End - Kafka Connect
Diving into the Deep End - Kafka Connect
 
Big Query - Utilizing Google Data Warehouse for Media Analytics
Big Query - Utilizing Google Data Warehouse for Media AnalyticsBig Query - Utilizing Google Data Warehouse for Media Analytics
Big Query - Utilizing Google Data Warehouse for Media Analytics
 
Dynatrace Cloud-Native Workshop Slides
Dynatrace Cloud-Native Workshop SlidesDynatrace Cloud-Native Workshop Slides
Dynatrace Cloud-Native Workshop Slides
 
Google BigQuery
Google BigQueryGoogle BigQuery
Google BigQuery
 
Modern real-time streaming architectures
Modern real-time streaming architecturesModern real-time streaming architectures
Modern real-time streaming architectures
 
Supercharging Applications with GraphQL and AWS AppSync
Supercharging Applications with GraphQL and AWS AppSyncSupercharging Applications with GraphQL and AWS AppSync
Supercharging Applications with GraphQL and AWS AppSync
 
Data Observability Best Pracices
Data Observability Best PracicesData Observability Best Pracices
Data Observability Best Pracices
 
Introduction to GraphQL and AWS Appsync on AWS - iOS
Introduction to GraphQL and AWS Appsync on AWS - iOSIntroduction to GraphQL and AWS Appsync on AWS - iOS
Introduction to GraphQL and AWS Appsync on AWS - iOS
 
Build a Serverless Application using GraphQL & AWS AppSync
Build a Serverless Application using GraphQL & AWS AppSyncBuild a Serverless Application using GraphQL & AWS AppSync
Build a Serverless Application using GraphQL & AWS AppSync
 
AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ec...
AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ec...AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ec...
AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ec...
 
AWS API Gateway
AWS API GatewayAWS API Gateway
AWS API Gateway
 
GCP for Apache Kafka® Users: Stream Ingestion and Processing
GCP for Apache Kafka® Users: Stream Ingestion and ProcessingGCP for Apache Kafka® Users: Stream Ingestion and Processing
GCP for Apache Kafka® Users: Stream Ingestion and Processing
 
Best Practices for Building Your Data Lake on AWS
Best Practices for Building Your Data Lake on AWSBest Practices for Building Your Data Lake on AWS
Best Practices for Building Your Data Lake on AWS
 

Ähnlich wie Building Machine Learning models with Apache Spark and Amazon SageMaker | AWS Floor28

Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...AWS Summits
 
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...Amazon Web Services
 
Building machine learning inference pipelines at scale (March 2019)
Building machine learning inference pipelines at scale (March 2019)Building machine learning inference pipelines at scale (March 2019)
Building machine learning inference pipelines at scale (March 2019)Julien SIMON
 
Building Machine Learning Inference Pipelines at Scale (July 2019)
Building Machine Learning Inference Pipelines at Scale (July 2019)Building Machine Learning Inference Pipelines at Scale (July 2019)
Building Machine Learning Inference Pipelines at Scale (July 2019)Julien SIMON
 
Build, Deploy, and Serve Machine-Learning Models on Streaming Data Using Amaz...
Build, Deploy, and Serve Machine-Learning Models on Streaming Data Using Amaz...Build, Deploy, and Serve Machine-Learning Models on Streaming Data Using Amaz...
Build, Deploy, and Serve Machine-Learning Models on Streaming Data Using Amaz...Amazon Web Services
 
Build, Train, and Deploy ML Models with Amazon SageMaker (AIM410-R2) - AWS re...
Build, Train, and Deploy ML Models with Amazon SageMaker (AIM410-R2) - AWS re...Build, Train, and Deploy ML Models with Amazon SageMaker (AIM410-R2) - AWS re...
Build, Train, and Deploy ML Models with Amazon SageMaker (AIM410-R2) - AWS re...Amazon Web Services
 
AWS re:Invent 2018 - ENT321 - SageMaker Workshop
AWS re:Invent 2018 - ENT321 - SageMaker WorkshopAWS re:Invent 2018 - ENT321 - SageMaker Workshop
AWS re:Invent 2018 - ENT321 - SageMaker WorkshopJulien SIMON
 
Big Data Meets Machine Learning: Architecting Spark Environment for Data Scie...
Big Data Meets Machine Learning: Architecting Spark Environment for Data Scie...Big Data Meets Machine Learning: Architecting Spark Environment for Data Scie...
Big Data Meets Machine Learning: Architecting Spark Environment for Data Scie...Amazon Web Services
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueAmazon Web Services
 
Build, Train, and Deploy Machine Learning for the Enterprise with Amazon Sage...
Build, Train, and Deploy Machine Learning for the Enterprise with Amazon Sage...Build, Train, and Deploy Machine Learning for the Enterprise with Amazon Sage...
Build, Train, and Deploy Machine Learning for the Enterprise with Amazon Sage...Amazon Web Services
 
Serverless AI with Scikit-Learn (GPSWS405) - AWS re:Invent 2018
Serverless AI with Scikit-Learn (GPSWS405) - AWS re:Invent 2018Serverless AI with Scikit-Learn (GPSWS405) - AWS re:Invent 2018
Serverless AI with Scikit-Learn (GPSWS405) - AWS re:Invent 2018Amazon Web Services
 
End to End Model Development to Deployment using SageMaker
End to End Model Development to Deployment using SageMakerEnd to End Model Development to Deployment using SageMaker
End to End Model Development to Deployment using SageMakerAmazon Web Services
 
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018Amazon Web Services
 
BDA301 Working with Machine Learning in Amazon SageMaker: Algorithms, Models,...
BDA301 Working with Machine Learning in Amazon SageMaker: Algorithms, Models,...BDA301 Working with Machine Learning in Amazon SageMaker: Algorithms, Models,...
BDA301 Working with Machine Learning in Amazon SageMaker: Algorithms, Models,...Amazon Web Services
 
Machine Learning e Amazon SageMaker: Algoritmos, Modelos e Inferências - MCL...
Machine Learning e Amazon SageMaker: Algoritmos, Modelos e Inferências -  MCL...Machine Learning e Amazon SageMaker: Algoritmos, Modelos e Inferências -  MCL...
Machine Learning e Amazon SageMaker: Algoritmos, Modelos e Inferências - MCL...Amazon Web Services
 
Automate all your EMR related activities
Automate all your EMR related activitiesAutomate all your EMR related activities
Automate all your EMR related activitiesEitan Sela
 
AWS Machine Learning Week SF: End to End Model Development Using SageMaker
AWS Machine Learning Week SF: End to End Model Development Using SageMakerAWS Machine Learning Week SF: End to End Model Development Using SageMaker
AWS Machine Learning Week SF: End to End Model Development Using SageMakerAmazon Web Services
 
Enabling Deep Learning in IoT Applications with Apache MXNet
Enabling Deep Learning in IoT Applications with Apache MXNetEnabling Deep Learning in IoT Applications with Apache MXNet
Enabling Deep Learning in IoT Applications with Apache MXNetAmazon Web Services
 
Apache MXNet and Gluon
Apache MXNet and GluonApache MXNet and Gluon
Apache MXNet and GluonSoji Adeshina
 
Build, Train, & Deploy ML Models Using SageMaker
Build, Train, & Deploy ML Models Using SageMakerBuild, Train, & Deploy ML Models Using SageMaker
Build, Train, & Deploy ML Models Using SageMakerAmazon Web Services
 

Ähnlich wie Building Machine Learning models with Apache Spark and Amazon SageMaker | AWS Floor28 (20)

Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
 
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ...
 
Building machine learning inference pipelines at scale (March 2019)
Building machine learning inference pipelines at scale (March 2019)Building machine learning inference pipelines at scale (March 2019)
Building machine learning inference pipelines at scale (March 2019)
 
Building Machine Learning Inference Pipelines at Scale (July 2019)
Building Machine Learning Inference Pipelines at Scale (July 2019)Building Machine Learning Inference Pipelines at Scale (July 2019)
Building Machine Learning Inference Pipelines at Scale (July 2019)
 
Build, Deploy, and Serve Machine-Learning Models on Streaming Data Using Amaz...
Build, Deploy, and Serve Machine-Learning Models on Streaming Data Using Amaz...Build, Deploy, and Serve Machine-Learning Models on Streaming Data Using Amaz...
Build, Deploy, and Serve Machine-Learning Models on Streaming Data Using Amaz...
 
Build, Train, and Deploy ML Models with Amazon SageMaker (AIM410-R2) - AWS re...
Build, Train, and Deploy ML Models with Amazon SageMaker (AIM410-R2) - AWS re...Build, Train, and Deploy ML Models with Amazon SageMaker (AIM410-R2) - AWS re...
Build, Train, and Deploy ML Models with Amazon SageMaker (AIM410-R2) - AWS re...
 
AWS re:Invent 2018 - ENT321 - SageMaker Workshop
AWS re:Invent 2018 - ENT321 - SageMaker WorkshopAWS re:Invent 2018 - ENT321 - SageMaker Workshop
AWS re:Invent 2018 - ENT321 - SageMaker Workshop
 
Big Data Meets Machine Learning: Architecting Spark Environment for Data Scie...
Big Data Meets Machine Learning: Architecting Spark Environment for Data Scie...Big Data Meets Machine Learning: Architecting Spark Environment for Data Scie...
Big Data Meets Machine Learning: Architecting Spark Environment for Data Scie...
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS Glue
 
Build, Train, and Deploy Machine Learning for the Enterprise with Amazon Sage...
Build, Train, and Deploy Machine Learning for the Enterprise with Amazon Sage...Build, Train, and Deploy Machine Learning for the Enterprise with Amazon Sage...
Build, Train, and Deploy Machine Learning for the Enterprise with Amazon Sage...
 
Serverless AI with Scikit-Learn (GPSWS405) - AWS re:Invent 2018
Serverless AI with Scikit-Learn (GPSWS405) - AWS re:Invent 2018Serverless AI with Scikit-Learn (GPSWS405) - AWS re:Invent 2018
Serverless AI with Scikit-Learn (GPSWS405) - AWS re:Invent 2018
 
End to End Model Development to Deployment using SageMaker
End to End Model Development to Deployment using SageMakerEnd to End Model Development to Deployment using SageMaker
End to End Model Development to Deployment using SageMaker
 
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
 
BDA301 Working with Machine Learning in Amazon SageMaker: Algorithms, Models,...
BDA301 Working with Machine Learning in Amazon SageMaker: Algorithms, Models,...BDA301 Working with Machine Learning in Amazon SageMaker: Algorithms, Models,...
BDA301 Working with Machine Learning in Amazon SageMaker: Algorithms, Models,...
 
Machine Learning e Amazon SageMaker: Algoritmos, Modelos e Inferências - MCL...
Machine Learning e Amazon SageMaker: Algoritmos, Modelos e Inferências -  MCL...Machine Learning e Amazon SageMaker: Algoritmos, Modelos e Inferências -  MCL...
Machine Learning e Amazon SageMaker: Algoritmos, Modelos e Inferências - MCL...
 
Automate all your EMR related activities
Automate all your EMR related activitiesAutomate all your EMR related activities
Automate all your EMR related activities
 
AWS Machine Learning Week SF: End to End Model Development Using SageMaker
AWS Machine Learning Week SF: End to End Model Development Using SageMakerAWS Machine Learning Week SF: End to End Model Development Using SageMaker
AWS Machine Learning Week SF: End to End Model Development Using SageMaker
 
Enabling Deep Learning in IoT Applications with Apache MXNet
Enabling Deep Learning in IoT Applications with Apache MXNetEnabling Deep Learning in IoT Applications with Apache MXNet
Enabling Deep Learning in IoT Applications with Apache MXNet
 
Apache MXNet and Gluon
Apache MXNet and GluonApache MXNet and Gluon
Apache MXNet and Gluon
 
Build, Train, & Deploy ML Models Using SageMaker
Build, Train, & Deploy ML Models Using SageMakerBuild, Train, & Deploy ML Models Using SageMaker
Build, Train, & Deploy ML Models Using SageMaker
 

Mehr von Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Mehr von Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Building Machine Learning models with Apache Spark and Amazon SageMaker | AWS Floor28

  • 1. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Julien Simon Principal Technical Evangelist, AI and Machine Learning @julsimon Using Apache Spark with Amazon SageMaker October 2018
  • 2. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda • Apache Spark on AWS • Amazon SageMaker • Combining Spark and SageMaker • Demos with the SageMaker SDK for Spark • Getting started Services covered: Amazon EMR, Amazon SageMaker
  • 3. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Apache Spark on AWS
  • 4. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Apache Spark • Open-source, distributed processing system. • In-memory caching and optimized execution for fast performance (typically 100x faster than Hadoop). • Batch processing, streaming analytics, machine learning, graph databases and ad hoc queries. • API for Java, Scala, Python, R, and SQL. https://spark.apache.org/
  • 5. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Apache Spark – DataFrame • Distributed collection of data organized into named columns. • Conceptually equivalent to a table in a relational database. • Wide array of sources: structured files, databases. • Wide array of formats: text, CSV, JSON, Avro, ORC, Parquet. {"name": "Jeff"} {"name": "Boaz", "age":72} {"name": "Julien", "age":12} df = spark.read.json("people.json") df.show() +----+-------+ | age| name | +----+-------+ |null| Jeff | | 72 | Boaz | | 12 | Julien| +----+-------+
  • 6. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. MLlib – Machine learning library • ML algorithms: classification, regression, clustering, collaborative filtering. • Featurization: feature extraction, transformation, dimensionality reduction. • Tools for constructing, evaluating and tuning ML pipelines • Transformer – a transform function that maps a DataFrame into a new one • Adding a column, changing the rows of a specific column, etc. • Predicting the label based on the feature vector. • Estimator – an algorithm that trains on data • Consists of a fit() function that maps a DataFrame into a Model. https://spark.apache.org/docs/latest/ml-guide.html
  • 7. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Spark ML on Amazon EMR: spam detector Adapted from https://github.com/databricks/learning-spark/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/MLlib.scala ETL Train Predict
  • 8. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Apache Spark on Amazon EMR • Spark is natively supported in Amazon EMR. • Amazon S3 connectivity using the EMR File System (EMRFS). • Amazon Kinesis, Redshift and DynamoDB as data sources. • Integration with the AWS Glue Data Catalog. • Auto Scaling to add or remove instances from your cluster. • Integration with the Amazon EC2 Spot market. https://aws.amazon.com/emr/
  • 9. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon SageMaker
  • 10. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon SageMaker Collect and prepare training data Choose and optimize your ML algorithm Set up and manage environments for training Train and tune model (trial and error) Deploy model in production Scale and manage the production environment Easily build, train, and deploy Machine Learning models FREE TIER
  • 11. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon SageMaker Pre-built notebooks for common problems K-Means Clustering Principal Component Analysis Neural Topic Modelling Factorization Machines Linear Learner XGBoost Latent Dirichlet Allocation Image Classification Seq2Seq, And more! ALGORITHMS Apache MXNet, Chainer TensorFlow, PyTorch Caffe2, CNTK, Torch FRAMEWORKS S e t u p a n d m a n a g e e n v i r o n m e n t s f o r t r a i n i n g T r a i n a n d t u n e m o d e l ( t r i a l a n d e r r o r ) D e p l o y m o d e l i n p r o d u c t i o n S c a l e a n d m a n a g e t h e p r o d u c t i o n e n v i r o n m e n t Built-in, high- performance algorithms Build
  • 12. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon SageMaker Pre-built notebooks for common problems Built-in, high- performance algorithms One-click training Hyperparameter optimization Build Train Deploy model in production Scale and manage the production environment
  • 13. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon SageMaker Fully managed hosting with auto- scaling One-click deployment Pre-built notebooks for common problems Built-in, high- performance algorithms One-click training Hyperparameter optimization Build Train Deploy
  • 14. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon ECR Model Training (on EC2) Model Hosting (on EC2) Trainingdata Modelartifacts Training code Helper code Helper codeInference code GroundTruth Client application Inference code Training code Inference requestInference response Inference Endpoint Amazon SageMaker
  • 15. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Combining Spark and SageMaker
  • 16. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Decouple ETL and Machine Learning • Different workloads require different instance types. • Say, M4 for ETL, P3 for training and C5 for prediction? • If you need GPUs for training, running your EMR cluster on GPU instances wouldn’t be cost-efficient. • Scale them independently. • Avoid oversizing your Spark cluster. • Avoid time-consuming resizing operations on EMR. • Run ETL once, train many models in parallel. • SageMaker terminates training instances automatically.
  • 17. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Run any ML algorithm in any language • Spark MLlib is great, but you may need something else. • SageMaker built-in algorithms (ML, DL, NLP). • Deep Learning libraries, like TensorFlow or Apache MXNet. • Your own custom code in any language.
  • 18. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Deploy ML models in production • Perform ML predictions without using Spark. • Save the overhead of the Spark framework. • Save loading your data in a DataFrame. • Improve latency for small-batch predictions. • It can be difficult to achieve low-latency predictions with Spark ML models. • You can get real-time predictions with models hosted in SageMaker. • You can use very powerful instances for prediction endpoints.
  • 19. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Sample use cases for Spark+SageMaker • Data preparation and feature engineering before training. • Data transformation before batch prediction (model reuse). • Data enrichment with predictions. • Predict missing values instead of using median. • Add new predicted features. • Train on extremely large datasets with built-in algos.
  • 20. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. SageMaker SDK for Spark • Python and Scala SDK, for Apache Spark 2.1.1 and 2.2. • Pre-installed on EMR 5.11 and later. • Train, import, deploy and predict with SageMaker models directly from your Spark application. • Standalone, • Integration in Spark MLlib pipelines. • DataFrames in, DataFrames out: automatic conversion to and from protobuf (crowd goes wild!) https://github.com/aws/sagemaker-spark
  • 21. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. SageMaker SDK for Spark – built-in algorithms • High-level API for: • Linear Learner • Factorization Machines • K-Means • PCA • LDA • XGBoost • The SageMakerEstimator object lets you use other built-in containers as well any containerized code stored in Amazon ECR (just like the regular SageMaker SDK). https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html Infinitely scalable algorithms: no limit to the amount of data that they can process
  • 22. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Demos https://gitlab.com/juliensimon/dlnotebooks 1 – Classifying MNIST in Python with XGBoost (SageMaker) 2 – Clustering MNIST in Scala with K-Means (SageMaker) 3 – Clustering MNIST in Scala with a Pipeline: PCA (MLlib) + K-Means (SageMaker)
  • 23. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Getting started https://ml.aws https://aws.amazon.com/sagemaker https://aws.amazon.com/emr https://github.com/aws/sagemaker-python-sdk https://github.com/aws/sagemaker-spark https://medium.com/@julsimon https://gitlab.com/juliensimon/dlnotebooks
  • 24. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Thank you! Julien Simon Principal Technical Evangelist, AI and Machine Learning @julsimon