Optimization story: Bloom inference

Nicolas Patry

This article gives you the behind-the-scenes of how we made an efficient inference server that powers BLOOM (https://huggingface.co/bigscience/bloom).

We achieved a 5x latency reduction over several weeks (and 50x more throughput). We wanted to share all the struggles and epic wins we went through to achieve such speed improvements.

A lot of different people were involved.
