From the course: MLOps Essentials: Model Deployment and Monitoring


Scaling model serving

- [Instructor] Scaling ML in a cost-effective manner is a critical success factor for MLOps. Let's discuss some scaling options for ML in this video. Scaling for batch inference is different from scaling for real-time inference. The table here shows several considerations where batch and real-time differ when it comes to scaling. To begin with, the goal of batch inference is throughput: the total number of predictions completed in a given time period. Real-time inference, on the other hand, focuses on concurrency and latency. The number of concurrent requests that a given compute unit can process, and the time the client waits to obtain the results, are the key measures. When it comes to resource provisioning, average loads are used as the capacity benchmark for batch processing, while peak loads are used for real-time inference. Auto-scaling is used in real time to optimize resource allocation. How does…
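To make the provisioning contrast concrete, here is a minimal Python sketch of the capacity math just described: batch capacity sized against average load and throughput, real-time capacity sized against peak load and concurrency. The function names and the per-replica figures are hypothetical illustrations, not values from the course.

```python
import math

def batch_replicas(avg_predictions_per_sec: float,
                   per_replica_throughput: float) -> int:
    """Batch inference is sized for throughput: provision enough
    replicas to handle the AVERAGE load over the processing window."""
    return math.ceil(avg_predictions_per_sec / per_replica_throughput)

def realtime_replicas(peak_concurrent_requests: int,
                      per_replica_concurrency: int) -> int:
    """Real-time inference is sized for concurrency and latency:
    provision for PEAK load, then let auto-scaling trim the excess
    during quieter periods."""
    return math.ceil(peak_concurrent_requests / per_replica_concurrency)

# Hypothetical numbers for illustration only.
print(batch_replicas(avg_predictions_per_sec=500,
                     per_replica_throughput=120))      # 500/120 -> 5 replicas
print(realtime_replicas(peak_concurrent_requests=800,
                        per_replica_concurrency=64))   # 800/64 -> 13 replicas
```

The asymmetry is the point: a batch fleet sized for average load may queue work briefly without harm, while a real-time fleet sized the same way would violate latency targets whenever traffic spikes, which is why peak load plus auto-scaling is the usual pattern there.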
