Efficient Request Queueing – Optimizing LLM Performance

By Benjamin Merkel

Serving LLMs to many applications and users in parallel is challenging because they compete for limited GPU resources. This article is the first in a series on LLM performance, based on our experience serving self-hosted LLMs at TNG Technology Consulting GmbH. In this first part, we focus on the impact of request queueing and discuss different scheduling strategies.
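To make the difference between scheduling strategies concrete, here is a minimal sketch (not the article's or TNG's actual implementation, and all names are hypothetical) contrasting plain FIFO queueing with priority-based scheduling, where latency-sensitive requests can overtake earlier batch requests:

```python
import heapq
from collections import deque
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    """An LLM request with a scheduling priority (lower = more urgent)."""
    priority: int
    arrival: int                       # tie-breaker: earlier arrivals first
    prompt: str = field(compare=False)  # payload, excluded from ordering

arrivals = [(2, "batch job"), (0, "chat user"), (1, "report")]

# FIFO: requests are served strictly in arrival order.
fifo = deque(Request(p, i, text) for i, (p, text) in enumerate(arrivals))
fifo_order = [fifo.popleft().prompt for _ in range(len(arrivals))]

# Priority scheduling: the most urgent pending request is served first.
heap = []
for i, (p, text) in enumerate(arrivals):
    heapq.heappush(heap, Request(p, i, text))
prio_order = [heapq.heappop(heap).prompt for _ in range(len(arrivals))]

print(fifo_order)  # ['batch job', 'chat user', 'report']
print(prio_order)  # ['chat user', 'report', 'batch job']
```

Under FIFO, the interactive chat request waits behind the earlier batch job; under priority scheduling it is served first, which is exactly the trade-off between fairness and latency that scheduling strategies must navigate.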

