Clipper: A Low-Latency Online Prediction Serving System

Machine learning is being deployed in a growing number of applications that demand real-time, accurate, and robust predictions under heavy serving loads. However, most machine learning frameworks and systems address only model training, not deployment.

Clipper is a general-purpose model-serving system that addresses these challenges. Interposing between the applications that consume predictions and the machine learning models that produce them, Clipper simplifies model deployment by isolating each model in its own container and communicating with it over a lightweight RPC system. This architecture allows models to be served in the same runtime environment used during training. It also provides simple mechanisms for scaling models out to meet increased throughput demands and for fine-grained physical resource allocation per model.

In this talk, I will give an overview of the Clipper serving system and then discuss how to get started using Clipper to serve Spark and TensorFlow models in a production serving environment.

Clipper: A Low-Latency Online Prediction Serving System

  1. Clipper: A Low-Latency Online Prediction Serving System. Dan Crankshaw, crankshaw@cs.berkeley.edu. http://clipper.ai, https://github.com/ucbrise/clipper. June 5, 2017
  2. Collaborators: Joseph Gonzalez, Ion Stoica, Michael Franklin, Xin Wang, Giulio Zhou, Alexey Tumanov, Corey Zumar, Nishad Singh, Yujia Luo
  3. [Diagram] Big Data → Learning (training) → Complex Model
  4. Learning produces a trained model. [Diagram] Query → Model → Decision ("CAT")
  5. Serving: [Diagram] Big Data → Learning → Model; Application → Query → ? → Decision
  6. Prediction-serving for interactive applications. Timescale: ~10s of milliseconds. [Diagram] Big Data → Learning → Model → Serving; Application → Query → Decision
  7. In this talk:
     - Challenges of prediction serving
     - Clipper overview and architecture
     - Deploy and serve Spark models with Clipper
  8. Prediction-serving raises new challenges.
  9. Prediction-serving challenges: support low-latency, high-throughput serving workloads; navigate a large and growing ecosystem of ML models and frameworks (Caffe, VW, Create, ...).
  10. Models are getting more complex (10s of GFLOPs per prediction [1]) and are deployed on the critical path, so they must maintain SLOs under heavy load, using specialized hardware for predictions. [1] Deep Residual Learning for Image Recognition. He et al., CVPR 2016.
  11. Google Translate: serving 140 billion words a day [1] on 82,000 GPUs running 24/7. Invented new hardware: the Tensor Processing Unit (TPU). [1] https://www.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html
  12. Big companies build one-off systems. Problems:
     - Expensive to build and maintain
     - Highly specialized, requiring both ML and systems expertise
     - Tightly coupled model and application
     - Difficult to change or update the model
     - Supports only a single ML framework
  13. Prediction-serving challenges (recap): support low-latency, high-throughput serving workloads; large and growing ecosystem of ML models and frameworks.
  14. Large and growing ecosystem of ML models and frameworks: which one (???) should serve fraud detection?
  15. Choosing is hard:
     - Different programming languages: TensorFlow is Python and C++; Spark is JVM-based
     - Different hardware: CPUs and GPUs
     - Different costs: DNNs in TensorFlow are generally much slower
  16. Large and growing ecosystem of ML models and frameworks (Create, VW, Caffe, ...) serving many applications: content recommendation, fraud detection, personal assistants, robotic control, machine translation.
  17. But building a new serving system for each framework is expensive...
  18. Use existing systems: offline scoring. [Diagram] Big Data → Model Training → Batch Analytics
  19. Use existing systems: offline scoring. [Diagram] Big Data → Model Training → Batch Analytics → Scoring → (X, Y) pairs stored in a datastore
  20. Use existing systems: offline scoring. Low-latency serving: the application looks the decision up in the datastore. [Diagram] Application → Query → datastore lookup (X → Y) → Decision
  21. Problems with offline scoring:
     - Requires the full set of queries ahead of time: small and bounded input domain
     - Wasted computation and space: can render and store unneeded predictions
     - Costly to update: must re-run the batch job
  22. Prediction-serving challenges (recap): support low-latency, high-throughput serving workloads; large and growing ecosystem of ML models and frameworks.
  24. 24. What would we like this to look like? ??? Content Rec. Fraud Detection Personal Asst. Robotic Control Machine Translation Create VW Caffe 24
  25. 25. Encapsulate each system in isolated environment ??? Content Rec. Fraud Detection Personal Asst. Robotic Control Machine Translation Create VW Caffe 25
  26. 26. Decouple models and applications Content Rec. Fraud Detection Personal Asst. Robotic Control Machine Translation 26 Caffe ???
  27. 27. Clipper Predict MC MC RPC RPC RPC RPC Clipper Decouples Applications and Models Applications Model Container (MC) MC Caffe
  28. 28. Clipper Implementation Clipper MC MC MC RPC RPC RPC RPC Model Abstraction Layer Caching Adaptive Batching Model Container (MC) Core system: 35,000 lines of C++ Open source system: https://github.com/ucbrise/clipper Goal: Community-owned platform for model deployment and serving Applications Predict
  29. 29. Clipper Query Processor Redis Clipper Management Applications Model Container Model Container Predict Clipper Implementation clipper admin
  30. 30. Driver Program SparkContext Worker Node Executor Task Task Worker Node Executor Task Task Worker Node Executor Task Task Web Server Database Cache Clipper Clipper Connects Analytics and Serving
  31. 31. Getting Started with Clipper Docker images available on DockerHub Clipper admin is distributed as pip package: pip install clipper_admin Get up and running without cloning or compiling!
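
To make this concrete, here is a minimal sketch of bringing up Clipper and registering an application with the 0.1-era clipper_admin package. The class, method, and argument names below are assumptions based on the alpha-release API and may differ between versions; consult http://clipper.ai for the current interface.

    # Minimal sketch, assuming the 0.1-era clipper_admin API; names and
    # signatures are assumptions and may differ between versions.
    from clipper_admin import Clipper

    # Connect to a host running Docker and start Clipper's containers.
    clipper = Clipper("localhost")
    clipper.start()

    # Register an application: it receives a REST endpoint, a fallback
    # output, and a latency objective (here 100 ms, in microseconds).
    clipper.register_application(
        "digits",         # application (endpoint) name
        "digits_model",   # model to route queries to
        "doubles",        # type of query inputs
        "0",              # default output if the SLO would be missed
        100000)           # latency objective in microseconds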
  32. Technologies inside Clipper:
     - Containerized frameworks: unified abstraction and isolation
     - Cross-framework caching and batching: optimize throughput and latency
     - Cross-framework model composition: improved accuracy through ensembles and bandits
  33. Clipper decouples applications and models. [Diagram] Applications → Predict/Feedback → RPC/REST interface → Clipper → RPC → model containers (MC), e.g. Caffe.
  34. [Diagram] Clipper → RPC → model containers (MC), e.g. Caffe.
  35. Common interface → simplifies deployment: evaluate models using their original code and systems.
  36. Container-based model deployment. Implement the model API: class ModelContainer with __init__(model_data) and predict_batch(inputs).
  37. Implement the model API (as above), with API support for many programming languages: Python, Java, C/C++.
  38. Container-based model deployment: package the model implementation and its dependencies into a model container (MC). A concrete sketch follows.
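
To illustrate the container API from slides 36-38, here is a minimal sketch of a scikit-learn model container. The pickle-based loading and the exact method signatures are illustrative assumptions, not Clipper's shipped container code.

    # Sketch of a model container implementing the two-method API above.
    # The pickle-based loading is an illustrative assumption.
    import pickle

    class SklearnModelContainer:
        def __init__(self, model_data):
            # model_data: path to a pickled scikit-learn estimator, packaged
            # into the container together with its dependencies.
            with open(model_data, "rb") as f:
                self.model = pickle.load(f)

        def predict_batch(self, inputs):
            # inputs: a batch of feature vectors; Clipper batches queries
            # before invoking the container (see the batching slides).
            return [str(p) for p in self.model.predict(inputs)]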
  39. Container-based model deployment. [Diagram] Clipper → RPC → model containers (MC), e.g. Caffe.
  40. Common interface → simplifies deployment:
     - Evaluate models using original code and systems
     - Models run in separate processes as Docker containers
     - Resource isolation: cutting-edge ML frameworks can be buggy
  41. Common interface → simplifies deployment (continued):
     - Evaluate models using original code and systems
     - Models run in separate processes as Docker containers
     - Resource isolation: cutting-edge ML frameworks can be buggy
     - Scale out by replicating model containers
  42. Overhead of the decoupled architecture: TensorFlow Serving (Predict over an RPC interface) vs. Clipper (Predict/Feedback over an RPC/REST interface with model containers).
  43. Overhead of the decoupled architecture. [Figure] Throughput (QPS, higher is better) and P99 latency (ms, lower is better). Model: AlexNet trained on CIFAR-10.
  44. Technologies inside Clipper (recap): containerized frameworks; cross-framework caching and batching; cross-framework model composition.
  45. Clipper architecture. [Diagram] Applications → Predict/Observe → RPC/REST interface → Model Abstraction Layer (caching, latency-aware batching) → RPC → model containers (MC).
  46. Batching to improve throughput. A single page load may generate many queries. Why batching helps: frameworks are throughput-optimized, so throughput grows with batch size. The optimal batch size depends on the hardware configuration, the model and framework, and the system load.
  47. Latency-aware batching. Clipper's solution: adaptively trade off latency and throughput, as sketched below.
     - Increase the batch size until the latency objective is exceeded (additive increase)
     - If latency exceeds the SLO, cut the batch size by a fraction (multiplicative decrease)
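
A minimal sketch of the additive-increase/multiplicative-decrease (AIMD) policy described on slide 47; the step size and backoff factor are illustrative assumptions, not Clipper's tuned values.

    # AIMD batch-size control, as described on slide 47. The step size and
    # backoff factor are illustrative assumptions.
    class AIMDBatchSizer:
        def __init__(self, slo_ms, step=1, backoff=0.9):
            self.slo_ms = slo_ms    # latency objective for this model
            self.step = step        # additive increase per well-behaved batch
            self.backoff = backoff  # multiplicative decrease factor
            self.batch_size = 1

        def update(self, observed_latency_ms):
            if observed_latency_ms > self.slo_ms:
                # Latency exceeded the SLO: cut the batch size by a fraction.
                self.batch_size = max(1, int(self.batch_size * self.backoff))
            else:
                # Under the SLO: probe for more throughput.
                self.batch_size += self.step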
  48. Latency-aware batching to improve throughput. [Figure] Throughput (QPS, higher is better) with adaptive batching vs. no batching: No-Op 48,386 vs. 8,963; Random Forest (SKLearn) 22,859 vs. 317; Linear SVM (PySpark) 29,350 vs. 7,206; Linear SVM (SKLearn) 48,934 vs. 1,920; Kernel SVM (SKLearn) 197 vs. 203; Log Regression (SKLearn) 47,219 vs. 1,921.
  49. Overhead of the decoupled architecture. [Diagram] Applications → Predict/Feedback → RPC/REST interface → Clipper → RPC → model containers (MC), e.g. Caffe.
  50. Technologies inside Clipper (recap): containerized frameworks; cross-framework caching and batching; cross-framework model composition for improved accuracy through ensembles and bandits.
  51. Clipper architecture. [Diagram] Applications → Predict/Observe → RPC/REST interface → Model Selection Layer (selection policy) → Model Abstraction Layer → RPC → model containers (MC).
  52. The Model Selection Layer (selection policy) supports periodic retraining (version 1, version 2, version 3) and experimenting with new models and frameworks.
  53. Selection policy: estimate confidence. When all model versions agree ("CAT", "CAT", "CAT", "CAT", "CAT"), the policy is CONFIDENT.
  54. Selection policy: estimate confidence. When the versions disagree ("CAT", "MAN", "CAT", "SPACE", "CAT"), the policy is UNSURE.
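
A minimal sketch of the agreement-based confidence idea from slides 53 and 54; the majority vote and the 0.8 threshold are illustrative assumptions, not Clipper's actual selection policy.

    # Agreement-based confidence estimation across model versions.
    # The vote aggregation and threshold are illustrative assumptions.
    from collections import Counter

    def estimate_confidence(predictions, threshold=0.8):
        # predictions: labels from the deployed model versions, e.g.
        #   ["CAT", "CAT", "CAT", "CAT", "CAT"]   -> ("CAT", "CONFIDENT")
        #   ["CAT", "MAN", "CAT", "SPACE", "CAT"] -> ("CAT", "UNSURE")
        label, votes = Counter(predictions).most_common(1)[0]
        agreement = votes / len(predictions)
        return label, ("CONFIDENT" if agreement >= threshold else "UNSURE")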
  55. Model Selection Layer (selection policy):
     - Exploit multiple models to estimate confidence
     - Use multi-armed bandit algorithms to learn the optimal model selection online
     - Online personalization across ML frameworks
     *See the paper for details [NSDI 2017]
  56. Serving Spark Models with Clipper
  57. Deploy models from Spark to Clipper. [Diagram] A Spark cluster (driver program with SparkContext; worker nodes with executors running tasks) alongside Clipper, a web server, a database, and a cache.
  58. You can always implement your own model container (MC): class ModelContainer with __init__(model_data) and predict_batch(inputs).
  59. The Clipper admin will automatically deploy Spark models: Clipper.deploy_pyspark_model(model, predict_function)
  60. Clipper.deploy_pyspark_model(model, predict_function), where model is an instance of a PySpark model: pyspark.mllib.* or pyspark.ml.pipeline.PipelineModel.
  61. Clipper.deploy_pyspark_model(model, predict_function), where predict_function is a Python closure: def predict_function(spark, model, inputs). Params: spark (a SparkSession), model (an ml or mllib model instance), inputs (a list of inputs to predict). Returns: a list of str. It can be used for business logic, featurization, post-processing, ...
  62. pyspark.mllib example: Clipper.deploy_pyspark_model(model, predict_function). (A sketch follows.)
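
The example code on this slide was an image and is not in the transcript; here is a hedged sketch of what deploying a pyspark.mllib model with the closure signature from slide 61 might look like. The model choice and training data are illustrative assumptions.

    # Sketch of a pyspark.mllib deployment; the slide's actual code is not
    # in the transcript, so the model and training data are assumptions.
    from pyspark.mllib.classification import LogisticRegressionWithSGD

    # training_data: an RDD of LabeledPoint, assumed to exist.
    model = LogisticRegressionWithSGD.train(training_data)

    def predict_function(spark, model, inputs):
        # Closure signature from slide 61; must return a list of str.
        return [str(model.predict(x)) for x in inputs]

    Clipper.deploy_pyspark_model(model, predict_function)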
  63. pyspark.ml.pipeline example: Clipper.deploy_pyspark_model(model, predict_function). (A sketch follows.)
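
Similarly hedged, a sketch of deploying a pyspark.ml PipelineModel; the pipeline stages and column names are illustrative assumptions.

    # Sketch of a pyspark.ml.pipeline deployment; stages and column names
    # are illustrative assumptions.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import HashingTF, Tokenizer

    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    tf = HashingTF(inputCol="words", outputCol="features")
    # training_df: a DataFrame with "text" and "label" columns, assumed to exist.
    pipeline_model = Pipeline(stages=[tokenizer, tf, LogisticRegression()]).fit(training_df)

    def predict_function(spark, model, inputs):
        # A PipelineModel predicts on DataFrames, so build one from the raw inputs.
        df = spark.createDataFrame([(x,) for x in inputs], ["text"])
        return [str(row.prediction) for row in model.transform(df).collect()]

    Clipper.deploy_pyspark_model(pipeline_model, predict_function)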
  64. Deploy models from Spark to Clipper. [Diagram] The Spark cluster alongside Clipper, a web server, a database, and a cache.
  65. Deploy models from Spark to Clipper. [Diagram] The trained model moves from the Spark driver toward Clipper.
  66. Deploy models from Spark to Clipper. [Diagram] The model now runs inside Clipper as a model container (MC).
  67. Clipper can automatically deploy other models (see the sketch below):
     - Arbitrary Python closures: scikit-learn models; pred_function can contain any Python code that is pickle-able
     - Deploy Spark models from Scala or Java: supports both the MLlib and Pipelines APIs
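
A minimal sketch of the arbitrary-closure path for a scikit-learn model; the admin call deploy_predict_function is a hypothetical placeholder name, since the slide does not give the exact API.

    # Sketch: deploying a pickle-able scikit-learn closure. The name
    # `deploy_predict_function` is a hypothetical placeholder; the slide
    # does not name the exact admin call.
    from sklearn.linear_model import LogisticRegression

    # X_train, y_train: training data, assumed to exist.
    sk_model = LogisticRegression().fit(X_train, y_train)

    def pred_function(inputs):
        # Any pickle-able Python code can run here: featurization,
        # business logic, post-processing, ...
        return [str(p) for p in sk_model.predict(inputs)]

    Clipper.deploy_predict_function(pred_function)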
  68. DEMO: http://clipper.ai/examples/spark-demo/
  69. Check out Clipper:
     - Alpha release (0.1.3) available
     - First-class support for scikit-learn and Spark models
     - Deploy directly from a training job without writing Docker containers
     - Actively working with early users
     - Post issues and questions on GitHub and subscribe to our mailing list: clipper-dev@googlegroups.com
     http://clipper.ai
  70. What's next:
     - Ongoing efforts to support additional frameworks: TensorFlow (supported, but requires writing Docker containers), R (open pull request using a Python-R interop library), Caffe2
     - Kubernetes integration (early internal prototype)
     - Bandit methods, online personalization, and cascaded predictions: studied in research prototypes but still need to be integrated into the system
     - Actively looking for contributors: https://github.com/ucbrise/clipper
