From the course: MLOps Essentials: Model Deployment and Monitoring
Scaling model serving
- [Instructor] Scaling ML in a cost-effective manner is a critical success factor for MLOps. Let's discuss some scaling options for ML in this video. Scaling for batch inference is different from scaling for real-time inference. The table here shows several considerations where batch and real-time inference differ when it comes to scaling. To begin with, the goal of batch inference is throughput: the total number of predictions completed in a given time period. Real-time inference, on the other hand, focuses on concurrency and latency. The number of concurrent requests that a given compute unit can process, and the time the client waits to obtain results, are the key measures. When it comes to resource provisioning, average loads are used as the capacity benchmark for batch processing, while peak loads are used for real-time inference. Auto scaling is used in real-time serving to optimize resource allocation. How does…
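To make the contrast between the two sizing goals concrete, here is a minimal Python sketch of capacity estimation. It is an illustration only, not from the course: the function names and all numbers are hypothetical assumptions. Batch capacity is sized from average volume and a processing window (throughput), while real-time capacity is sized from peak request rate and a latency budget (concurrency).

# Hypothetical capacity-sizing sketch: batch vs. real-time inference.
# All figures below are illustrative assumptions, not from the course.
import math

def batch_workers_needed(predictions_per_day: int,
                         preds_per_worker_per_sec: float,
                         window_hours: float) -> int:
    """Size batch capacity for throughput: finish the average daily
    volume within the allotted processing window."""
    window_secs = window_hours * 3600
    required_rate = predictions_per_day / window_secs
    return math.ceil(required_rate / preds_per_worker_per_sec)

def realtime_replicas_needed(peak_requests_per_sec: float,
                             latency_per_request_sec: float,
                             concurrency_per_replica: int) -> int:
    """Size real-time capacity for peak load: each replica serves up to
    concurrency_per_replica requests at a time, and each request holds
    a slot for roughly latency_per_request_sec."""
    # Little's law: requests in flight = arrival rate * latency.
    in_flight = peak_requests_per_sec * latency_per_request_sec
    return math.ceil(in_flight / concurrency_per_replica)

# Batch: 10M predictions/day, 200 preds/sec per worker, 4-hour window.
print(batch_workers_needed(10_000_000, 200, 4))    # -> 4 workers

# Real time: 500 req/sec at peak, 80 ms latency, 8 slots per replica.
print(realtime_replicas_needed(500, 0.08, 8))      # -> 5 replicas

Note how the batch estimate depends only on average volume, while the real-time estimate is driven by peak rate and latency; an autoscaler in real-time serving effectively recomputes the second estimate continuously as the observed request rate changes.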