Efficient Request Queueing – Optimizing LLM Performance
Serving LLMs to many applications and users in parallel is challenging because they compete for limited GPU resources. This article is the first in a series on LLM performance, based on our experience serving self-hosted LLMs at TNG Technology Consulting GmbH. In this first part, we focus on the impact of request queueing and discuss different scheduling strategies.