Universal Assisted Generation: Faster Decoding with Any Assistant Model
TL;DR: Many LLMs, such as gemma-2-9b and Mixtral-8x22B-Instruct-v0.1, lack a much smaller version to use for assisted generation. In this blog post, we present Universal Assisted Generation: a method developed by Intel Labs and Hugging Face that extends assisted generation to work with a small language model from **any** model family 🤯. As a result, it is now possible to accelerate inference for any decoder or Mixture of Experts model by 1.5x-2.0x with almost zero overhead 🔥🔥🔥. Let's dive in!