From the course: MLOps Essentials: Model Deployment and Monitoring

Model serving patterns

- [Instructor] We will discuss model serving considerations and best practices for MLOps in this chapter. Let's start with model serving patterns. There are multiple ways to serve ML models, depending on the use case. We will explore some popular patterns in this video.

We first start with the batch inference job pattern, which is a simple implementation. In this case, the features needed in production for inference are uploaded by clients into a production features database. The pending data records, with features, are collected over time. A batch ML job runs periodically, say, every hour or every day. This job reads the pending records from the database and performs predictions. The predictions are uploaded into a predictions database, and the pending records are then cleared from the features database. Clients later read the results from the predictions database. This is a simple implementation, but it is applicable only to use cases that do not need real-time predictions.

The next pattern is the inference API for real-time use. The API can be a REST API, served by a web server, or an embedded function. Clients call the API when inference is needed in real time. The clients pass the input features, which the ML service processes to perform predictions and return the results to the clients in real time. This pattern is useful for delivering third-party ML services as packages or in the cloud. It's also useful in a microservices architecture for real-time predictions. Inference APIs may suffer from load issues when they have to service multiple concurrent requests in real time.

A third, more advanced pattern is the real-time stream processor. In this case, the clients push the prediction request, along with the input features, into a queue, like Kafka, in real time. An ML stream processor services these requests in real time: it reads from the request queue, performs predictions, and pushes the outputs into a predictions queue. Clients then watch the predictions queue in real time and pick up the results when available. This system can manage real-time traffic, but it can also handle load spikes by using the queues as a back-pressure buffer. It can also scale well by creating parallel instances of the stream processors.

How do we select the right serving pattern? Patterns should be selected based on the specific use case: whether it is batch or real time, synchronous or asynchronous, and whether we want to optimize for average or peak loads. Each of these patterns has specific advantages and drawbacks when it comes to latency, scaling, and costs. Choose a pattern that is simple to implement, while satisfying the performance and operational requirements of the ML solution.
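
The sketches below illustrate the three patterns described above. First, a minimal sketch of the batch inference job pattern. It assumes a SQLite database with hypothetical pending_features and predictions tables, and a stand-in predict() function in place of a real trained model; the table and column names are illustrative only.

```python
import sqlite3

# Hypothetical stand-in for a trained model's predict(); replace with your real model.
def predict(features):
    return sum(features)  # dummy score for illustration

def run_batch_inference(db_path="serving.db"):
    """One periodic batch job: read pending feature rows, score them,
    write predictions, then clear the processed rows."""
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()

    # 1. Collect the pending records that clients uploaded with their features.
    rows = cur.execute("SELECT id, f1, f2, f3 FROM pending_features").fetchall()

    # 2. Score each record.
    results = [(row_id, predict([f1, f2, f3])) for row_id, f1, f2, f3 in rows]

    # 3. Upload predictions and clear the pending records.
    cur.executemany("INSERT INTO predictions (id, score) VALUES (?, ?)", results)
    cur.execute("DELETE FROM pending_features")
    conn.commit()
    conn.close()
```

A scheduler (for example, cron or an orchestration tool) would run run_batch_inference every hour or every day, matching the periodic nature of the pattern.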
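Next, a minimal sketch of the real-time inference API pattern, here as a REST endpoint built with FastAPI. The route name, request fields, and the stand-in predict() function are assumptions for illustration, not part of the course material.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Features(BaseModel):
    # Illustrative feature names; a real service would define its own schema.
    f1: float
    f2: float
    f3: float

# Hypothetical stand-in for the trained model.
def predict(features):
    return sum(features)

@app.post("/predict")
def predict_endpoint(payload: Features):
    # The client passes input features and receives the prediction synchronously.
    score = predict([payload.f1, payload.f2, payload.f3])
    return {"score": score}
```

Served with a web server such as uvicorn (`uvicorn app:app`), each client request is answered in real time; under heavy concurrent load this single endpoint is where the load issues mentioned above would show up.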
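Finally, a minimal sketch of the real-time stream processor pattern using the kafka-python client. The topic names, broker address, message format, and stand-in predict() function are assumptions for illustration.

```python
import json
from kafka import KafkaConsumer, KafkaProducer  # kafka-python package

# Hypothetical stand-in for the trained model.
def predict(features):
    return sum(features)

# Assumed topic names and broker address.
consumer = KafkaConsumer(
    "prediction-requests",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Each request is assumed to carry an id (so clients can match results) and the features.
for message in consumer:
    request = message.value
    score = predict(request["features"])
    producer.send("prediction-results", {"id": request["id"], "score": score})
```

Because requests wait in the queue until a processor picks them up, spikes are absorbed as back pressure, and running several copies of this consumer in the same consumer group scales the processing out in parallel.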
