
Deploying Models

Models are designed to help decision making through predictions, so they're only useful when deployed and available for an application to consume. In this module, learn how to deploy models for real-time inferencing and for batch inferencing.

  1. Deploying Models
  2. Eng Teong Cheah, Microsoft MVP
  3. Inferencing?
     In machine learning, inferencing refers to using a trained model to predict labels for new data on which the model has not been trained. Often, the model is deployed as part of a service that enables applications to request immediate, or real-time, predictions for individual data observations or small numbers of them. In Azure Machine Learning, you can create real-time inferencing solutions by deploying a model as a service hosted on a containerized platform, such as Azure Kubernetes Service (AKS).
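To make this concrete: in the Azure Machine Learning Python SDK (v1), a real-time service wraps the model behind an entry script with an init() and a run() function. The sketch below is illustrative only; the model name "diabetes_model" and the JSON input shape are assumptions, not values from the slides.

# score.py - entry script for a real-time inferencing service (minimal sketch).
# The model name "diabetes_model" and the input format are assumed for illustration.
import json
import joblib
import numpy as np
from azureml.core.model import Model

def init():
    # Runs once when the service container starts: load the registered model.
    global model
    model_path = Model.get_model_path("diabetes_model")
    model = joblib.load(model_path)

def run(raw_data):
    # Runs for each scoring request: parse the JSON payload, predict, return labels.
    data = np.array(json.loads(raw_data)["data"])
    predictions = model.predict(data)
    return json.dumps({"predictions": predictions.tolist()})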
  4. Deploying a Real-Time Inferencing Service
  5. Machine learning inference during deployment
     When deploying your AI model to production, you need to consider how it will make predictions. The two main processes for AI models are:
     • Batch inference: an asynchronous process that bases its predictions on a batch of observations. The predictions are stored as files or in a database for end users or business applications.
     • Real-time (or interactive) inference: frees the model to make predictions at any time and trigger an immediate response. This pattern can be used to analyze streaming and interactive application data.
  6. Machine learning inference during deployment
     Consider the following questions to evaluate your model, compare the two processes, and select the one that suits your model:
     • How often should predictions be generated?
     • How soon are the results needed?
     • Should predictions be generated individually, in small batches, or in large batches?
     • Is latency to be expected from the model?
     • How much compute power is needed to execute the model?
     • Are there operational implications and costs to maintain the model?
  7. Batch inference
     Batch inference, sometimes called offline inference, is a simpler inference process that runs models at timed intervals and stores predictions for business applications. Consider the following best practices for batch inference:
     • Trigger batch scoring: use Azure Machine Learning pipelines and the ParallelRunStep feature in Azure Machine Learning to set up schedule-based or event-based automation.
     • Compute options for batch inference: since batch inference processes don't run continuously, it's recommended to automatically start, stop, and scale reusable clusters that can handle a range of workloads. Different models require different environments, and your solution needs to be able to deploy a specific environment and remove it when inference is over, so the compute is available for the next model.
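As a sketch of the trigger and compute guidance above, the following builds a batch scoring pipeline with ParallelRunStep in the Azure Machine Learning Python SDK (v1). The cluster name, dataset name, curated environment, and script names are assumptions for illustration; publishing the pipeline then allows schedule- or event-based runs.

from azureml.core import Workspace, Environment
from azureml.pipeline.core import Pipeline, PipelineData
from azureml.pipeline.steps import ParallelRunConfig, ParallelRunStep

ws = Workspace.from_config()

# Curated sklearn environment; assumed to be available in the workspace.
batch_env = Environment.get(ws, "AzureML-sklearn-0.24-ubuntu18.04-py37-cpu")

# Output location for the appended prediction rows.
output_dir = PipelineData(name="inferences", datastore=ws.get_default_datastore())

parallel_run_config = ParallelRunConfig(
    source_directory="batch_scripts",       # assumed folder holding the scoring code
    entry_script="batch_score.py",          # script exposing init() and run(mini_batch)
    mini_batch_size="5",
    error_threshold=10,
    output_action="append_row",
    environment=batch_env,
    compute_target="batch-cluster",         # auto-scaling AmlCompute cluster, started and stopped on demand
    node_count=2,
)

batch_step = ParallelRunStep(
    name="batch-scoring",
    parallel_run_config=parallel_run_config,
    inputs=[ws.datasets["batch_data"].as_named_input("batch_data")],
    output=output_dir,
    allow_reuse=True,
)

# Publishing the pipeline lets you trigger it on a schedule or from an event.
pipeline = Pipeline(workspace=ws, steps=[batch_step])
published = pipeline.publish(name="batch-scoring-pipeline")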
  8. Real-time inference
     Real-time, or interactive, inference is an architecture where model inference can be triggered at any time, and an immediate response is expected. This pattern can be used to analyze streaming data, interactive application data, and more. This mode lets you take advantage of your machine learning model in real time and avoids the cold-start delay associated with batch inference. The following considerations and best practices apply if real-time inference is right for your model:
     • The challenges of real-time inference: latency and performance requirements make real-time inference architecture more complex for your model. A system might need to respond in 100 milliseconds or less, during which it needs to retrieve the data, perform inference, validate and store the model results, run any required business logic, and return the results to the system or application.
  9. Real-time inference
     • Compute options for real-time inference: the best way to implement real-time inference is to deploy the model in container form to Docker or an Azure Kubernetes Service (AKS) cluster and expose it as a web service with a REST API. This way, the model runs in its own isolated environment and can be managed like any other web service. Docker and AKS capabilities can then be used for management, monitoring, scaling, and more. The model can be deployed on-premises, in the cloud, or on the edge.
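To make the compute option above concrete, here is a minimal sketch of deploying a registered model to an existing AKS cluster as a web service with the Azure Machine Learning Python SDK (v1). The model, environment, cluster, and service names are assumptions for illustration.

from azureml.core import Workspace, Environment
from azureml.core.model import Model, InferenceConfig
from azureml.core.webservice import AksWebservice
from azureml.core.compute import AksCompute

ws = Workspace.from_config()

# Pair the entry script shown earlier with an environment that has its dependencies.
inference_config = InferenceConfig(
    entry_script="score.py",
    environment=Environment.get(ws, "AzureML-sklearn-0.24-ubuntu18.04-py37-cpu"),
)

# Resources and key-based authentication for the REST endpoint.
deployment_config = AksWebservice.deploy_configuration(
    cpu_cores=1,
    memory_gb=2,
    auth_enabled=True,
)

aks_target = AksCompute(ws, "aks-cluster")   # existing attached AKS cluster (assumed)

service = Model.deploy(
    workspace=ws,
    name="realtime-inference-svc",
    models=[ws.models["diabetes_model"]],
    inference_config=inference_config,
    deployment_config=deployment_config,
    deployment_target=aks_target,
)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)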
  10. Real-time inference
     • Multiregional deployment and high availability: regional deployment and high-availability architectures need to be considered in real-time inference scenarios, because latency and the model's performance are critical. To reduce latency in multiregional deployments, it's recommended to locate the model as close as possible to the consumption point. The model and its supporting infrastructure should follow the business's high-availability and disaster recovery (DR) principles and strategy.
  11. Create a real-time inference service
     https://ceteongvanness.wordpress.com/2022/11/01/create-a-real-time-inference-service/
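Once the service is deployed (see the linked walkthrough), an application consumes it through its REST endpoint. The sketch below assumes the AKS service object from the previous sketch and an input shape matching the entry script; both are illustrative.

import json
import requests

# Endpoint URI and authentication key for the deployed AKS service.
scoring_uri = service.scoring_uri
primary_key, secondary_key = service.get_keys()

headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer " + primary_key,
}

# The payload shape must match what the entry script's run() expects (assumed here).
payload = json.dumps({"data": [[0.05, 1.2, 3.1, 0.7]]})

response = requests.post(scoring_uri, data=payload, headers=headers)
print(response.json())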
  12. References
     Microsoft Docs
