A failed experiment: Infini-Attention, and why we should keep trying

TLDR: Infini-attention’s performance degrades as we increase the number of times the memory is compressed, and to the best of our knowledge, ring attention, YaRN, and RoPE scaling are still the best ways to extend a pretrained model to a longer context length.
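Since the TLDR points to RoPE scaling as one of the stronger baselines, here is a minimal sketch of what linear RoPE scaling (position interpolation) looks like, assuming a standalone PyTorch RoPE implementation; the function names and the `scale` parameter are illustrative, not taken from any particular library.

```python
import torch

def rope_angles(dim: int, positions: torch.Tensor,
                base: float = 10000.0, scale: float = 1.0) -> torch.Tensor:
    """Rotation angles for RoPE; scale > 1 squeezes new, longer positions
    back into the range the model saw during pretraining (linear interpolation)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return (positions.float() / scale)[:, None] * inv_freq[None, :]

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate each consecutive channel pair of x (seq_len, dim) by `angles`."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Illustrative: extending a model pretrained on 4k positions to 16k (scale = 16384 / 4096)
q = torch.randn(16384, 64)
q_rot = apply_rope(q, rope_angles(64, torch.arange(16384), scale=4.0))
```

YaRN builds on the same idea but interpolates different frequency bands by different amounts and adjusts the attention temperature; the sketch above only shows the plain linear variant.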



Section 0: Introduction

The context length of a language model is one of its central attributes, alongside the model’s raw performance. Since the emergence of in-context learning, adding relevant information to the model’s input has become increasingly important.
