How Long Prompts Block Other Requests – Optimizing LLM Performance

By Benjamin Merkel
At TNG, we self-host numerous Large Language Models on our cluster of 24 H100 GPUs. Serving LLMs to more than 50 applications, which together consume more than 100 million tokens and generate over 10 million tokens per day, requires us to carefully tune our request processing.

In the previous part of our series on LLM performance, we looked into
