From the course: MLOps Essentials: Model Deployment and Monitoring


Scaling model serving

- [Instructor] Scaling ML in a cost-effective manner is a critical success factor for MLOps. Let's discuss some scaling options for ML in this video. Scaling for batch inference is different from scaling for real-time inference. The table here shows several considerations where batch and real-time differ when it comes to scaling. To begin with, the goal of batch inference is throughput: the total number of predictions completed in a given time period. Real-time inference, on the other hand, focuses on concurrency and latency. The number of concurrent requests that a given compute unit can process, and the time the client waits to obtain the results, are the key measures. When it comes to resource provisioning, average loads are used as the capacity benchmark for batch processing, while peak loads are used for real-time inference. Auto-scaling is used in real time to optimize resource allocation. How does…
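To make the provisioning contrast concrete, here is a minimal Python sketch of the capacity math just described: batch capacity sized against average load and throughput, real-time capacity sized against peak load and concurrency. The function names and the per-replica figures are hypothetical illustrations, not values from the course.

```python
import math

def batch_replicas(avg_predictions_per_sec: float,
                   per_replica_throughput: float) -> int:
    """Batch inference is sized for throughput: provision enough
    replicas to handle the AVERAGE load over the processing window."""
    return math.ceil(avg_predictions_per_sec / per_replica_throughput)

def realtime_replicas(peak_concurrent_requests: int,
                      per_replica_concurrency: int) -> int:
    """Real-time inference is sized for concurrency and latency:
    provision for PEAK load, then let auto-scaling trim the excess
    during quieter periods."""
    return math.ceil(peak_concurrent_requests / per_replica_concurrency)

# Hypothetical numbers for illustration only.
print(batch_replicas(avg_predictions_per_sec=500,
                     per_replica_throughput=120))      # 500/120 -> 5 replicas
print(realtime_replicas(peak_concurrent_requests=800,
                        per_replica_concurrency=64))   # 800/64 -> 13 replicas
```

The asymmetry is the point: a batch fleet sized for average load may queue work briefly without harm, while a real-time fleet sized the same way would violate latency targets whenever traffic spikes, which is why peak load plus auto-scaling is the usual pattern there.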
