Overview of natively supported quantization schemes in 🤗 Transformers

We aim to give a clear overview of the pros and cons of each quantization scheme supported in transformers to help you decide which one you should go for.

Currently, quantized models are used for two main purposes:

  • Running inference with a large model on a smaller device
  • Fine-tuning adapters on top of a quantized model

So far, two integration efforts have been made and are natively supported in transformers: bitsandbytes and auto-gptq.
Note that some additional quantization schemes are also supported in the 🤗 optimum library, but this is out of scope for this blog post.
