Optimization story: Bloom inference
This article gives you the behind-the-scenes of how we made an efficient inference server that powers https://huggingface.co/bigscience/bloom.
Over several weeks we achieved a 5x latency reduction (and 50x more throughput). We wanted to share all the struggles and epic wins we went through to achieve such speed improvements.
A lot of different people were involved