Models are designed to support decision making through predictions, so they're only useful when deployed and available for an application to consume. In this module, you learn how to deploy models for real-time inferencing and for batch inferencing.
3. What is inferencing?
In machine learning, inferencing refers to the use of a trained model to predict labels for
new data on which the model has not been trained. Often, the model is deployed as part
of a service that enables applications to request immediate, or real-time, predictions for
individual observations or small numbers of observations.
In Azure Machine Learning, you can create real-time inferencing solutions by deploying a
model as a service, hosted in a containerized platform such as Azure Kubernetes Service
(AKS).
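Such a deployed service wraps an entry script with an `init()` function that loads the model once and a `run()` function that scores each request. A minimal sketch of that pattern, using a trivial stand-in model in place of a real registered model:

```python
import json

model = None  # populated by init()

def init():
    # In a real Azure Machine Learning deployment, init() would load a
    # registered model from disk. A trivial stand-in is used here.
    global model
    model = lambda values: [v * 2 for v in values]  # placeholder "prediction"

def run(raw_data):
    # The service passes the request body to run() as a JSON string.
    data = json.loads(raw_data)["data"]
    predictions = model(data)
    return json.dumps({"predictions": predictions})
```

In a real deployment, a script like this is packaged with its environment and attached to the service, so every scoring request flows through `run()`.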
5. Machine learning inference during deployment
When deploying your AI model to production, you need to consider how it will make
predictions. The two main inference processes are:
•Batch inference: An asynchronous process that bases its predictions on a batch of
observations. The predictions are stored as files or in a database for end users or business
applications.
•Real-time (or interactive) inference: Allows the model to make predictions at any time
and trigger an immediate response. This pattern can be used to analyze streaming and
interactive application data.
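As a rough illustration of the difference (the `predict` function is a stand-in for any trained model), the two patterns can be sketched as:

```python
import csv
import json

def predict(observation):
    # Stand-in model: any trained model's predict() could go here.
    return 1 if observation["value"] > 10 else 0

def batch_inference(observations, output_path):
    # Batch: score a whole set of observations asynchronously and
    # persist the predictions to a file for downstream applications.
    with open(output_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "prediction"])
        for obs in observations:
            writer.writerow([obs["id"], predict(obs)])

def real_time_inference(request_body):
    # Real-time: score a single observation on request and
    # return the prediction immediately.
    observation = json.loads(request_body)
    return json.dumps({"prediction": predict(observation)})
```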
Consider the following questions to evaluate your model, compare the two processes,
and select the one that suits your model:
•How often should predictions be generated?
•How soon are the results needed?
•Should predictions be generated individually, in small batches, or in large batches?
•How much latency can be tolerated when the model generates predictions?
•How much compute power is needed to execute the model?
•Are there operational implications and costs to maintain the model?
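Purely as an illustration, two of these answers can be mapped to a recommended mode in a toy helper (not an Azure Machine Learning API):

```python
def choose_inference_mode(needs_immediate_response, scores_individual_requests):
    # Toy decision helper: if results are needed right away, or requests
    # arrive as individual observations, real-time inference fits better;
    # otherwise batch inference is usually simpler and cheaper.
    if needs_immediate_response or scores_individual_requests:
        return "real-time"
    return "batch"
```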
7. Batch inference
Batch inference, sometimes called offline inference, is a simpler inference process in
which models run at timed intervals and predictions are stored for business applications.
Consider the following best practices for batch inference:
•Trigger batch scoring: Use Azure Machine Learning pipelines with
the ParallelRunStep feature to set up schedule-based or event-based automation.
•Compute options for batch inference: Since batch inference processes don't run
continuously, it's recommended to automatically start, stop, and scale reusable clusters
that can handle a range of workloads. Different models require different environments,
and your solution needs to be able to deploy a specific environment and remove it when
inference is over, so that the compute is available for the next model.
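The entry script for ParallelRunStep roughly follows the same init()/run() contract, except that run() receives a mini-batch of inputs and returns one result per item. A local sketch of that contract, with a stand-in model in place of a registered one:

```python
model = None  # populated by init()

def init():
    # A real script would load the registered model here;
    # a stand-in doubling function is used instead.
    global model
    model = lambda x: x * 2

def run(mini_batch):
    # ParallelRunStep calls run() once per mini-batch and collects the
    # returned rows into the aggregated scoring output.
    results = []
    for item in mini_batch:
        results.append(f"{item},{model(item)}")
    return results
```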
8. Real-time inference
Real-time, or interactive, inference is an architecture in which model inference can be
triggered at any time, and an immediate response is expected. This pattern can be used to
analyze streaming data, interactive application data, and more. This mode lets you take
advantage of your machine learning model in real time and avoids the cold-start delay of
batch inference, where compute must be started before scoring can begin.
Consider the following challenges and best practices if real-time inference is right
for your model:
•The challenges of real-time inference: Latency and performance requirements make
real-time inference architecture more complex for your model. A system might need to
respond in 100 milliseconds or less, during which it needs to retrieve the data, perform
inference, validate and store the model results, run any required business logic, and
return the results to the system or application.
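As an illustration of that latency budget, a toy scoring path (stand-in model, hypothetical stage names) can check its end-to-end time against a limit:

```python
import time

def handle_request(data, budget_ms=100.0):
    # Toy sketch (not an Azure Machine Learning API): one real-time
    # request, with each step standing in for the work named above.
    start = time.perf_counter()

    observation = list(data)                          # retrieve the data
    prediction = sum(observation) / len(observation)  # inference (stand-in model)
    record = {"prediction": prediction}               # store result / business logic

    elapsed_ms = (time.perf_counter() - start) * 1000
    record["within_budget"] = elapsed_ms <= budget_ms
    return record                                     # return to the caller
```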
•Compute options for real-time inference: The best way to implement real-time
inference is to deploy the model as a container to a Docker host or an Azure Kubernetes
Service (AKS) cluster and expose it as a web service with a REST API. This way, the model
runs in its own isolated environment and can be managed like any other web service.
Docker and AKS capabilities can then be used for management, monitoring, scaling, and
more. The model can be deployed on-premises, in the cloud, or on the edge.
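The web-service-with-a-REST-API pattern can be sketched with the Python standard library alone; a real deployment would use the Azure Machine Learning SDK and AKS, so everything below (the stand-in predict function, the request shape) is illustrative:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(values):
    # Stand-in for a trained model's predict() call.
    return [v * 2 for v in values]

class ScoreHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body, score it, and return predictions.
        length = int(self.headers["Content-Length"])
        body = json.loads(self.rfile.read(length))
        response = json.dumps({"predictions": predict(body["data"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(response)

    def log_message(self, *args):
        pass  # keep the sketch quiet

def serve(port=0):
    # Port 0 lets the OS pick a free port; the caller reads the chosen
    # port from server.server_address and shuts the server down when done.
    server = HTTPServer(("127.0.0.1", port), ScoreHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A client then POSTs observations as JSON to the service endpoint and receives predictions in the response body, exactly as it would with any other web service.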
•Multiregional deployment and high availability: Regional deployment and high-
availability architectures need to be considered in real-time inference scenarios, because
latency and the model's performance are critical. To reduce latency in multiregional
deployments, it's recommended to locate the model as close as possible to the
consumption point. The model and its supporting infrastructure should follow the
business's high availability and disaster recovery (DR) principles and strategy.
11. Create a real-time inference service
https://ceteongvanness.wordpress.com/2022/11/01/create-a-real-time-inference-service/