How to train a new language model from scratch using Transformers and Tokenizers

Over the past few months, we made several improvements to our transformers and tokenizers libraries, with the goal of making it easier than ever to train a new language model from scratch. In this post we’ll demo how to train a “small” model (~84M parameters: 6 layers, 768 hidden size, 12 attention heads) – the same number of layers & heads as DistilBERT – on Esperanto. We’ll then fine-tune the model on a downstream task of part-of-speech […]
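As a sanity check on the “~84M parameters” figure, here is a back-of-the-envelope count in plain Python. This is a sketch, not the library’s own accounting: it assumes a RoBERTa-style layout and a vocabulary of roughly 52,000 subword tokens, which the excerpt does not state.

```python
# Rough parameter count for a 6-layer, 768-hidden, 12-head model.
# The 52k vocabulary and 514 max positions are assumptions for illustration.

def transformer_param_count(layers=6, hidden=768, vocab=52_000, max_pos=514, ffn_mult=4):
    embeddings = vocab * hidden + max_pos * hidden + 2 * hidden  # token + position emb., LayerNorm
    per_layer = (
        4 * (hidden * hidden + hidden)       # Q, K, V, output projections (+ biases)
        + 2 * (hidden * ffn_mult * hidden)   # feed-forward up/down projections
        + ffn_mult * hidden + hidden         # feed-forward biases
        + 2 * 2 * hidden                     # two LayerNorms (weight + bias each)
    )
    return embeddings + layers * per_layer

print(f"{transformer_param_count() / 1e6:.1f}M parameters")
```

Under these assumptions the count lands near 83M, in the right neighborhood of the ~84M quoted once the LM head and other small matrices are included.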

Read more

How to generate text: using different decoding methods for language generation with Transformers

Note: Edited in July 2023 with up-to-date references and examples. Introduction In recent years, there has been increasing interest in open-ended language generation thanks to the rise of large transformer-based language models trained on millions of webpages, including OpenAI’s ChatGPT and Meta’s LLaMA. The results on conditioned open-ended language generation are impressive: these models have been shown to generalize to new tasks, handle code, and take non-text data as input. Besides the improved transformer architecture and massive unsupervised training data, better decoding […]
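The decoding methods the post walks through can be sketched on a toy next-token distribution. This is an illustration of the idea in plain Python, not the transformers `generate()` API; the vocabulary and scores are made up.

```python
import math, random

# Made-up next-token logits for illustration only.
logits = {"the": 3.2, "a": 2.9, "dog": 0.5, "ran": 0.1}

def greedy(logits):
    # Greedy search: always pick the highest-scoring token.
    return max(logits, key=logits.get)

def sample(logits, temperature=1.0, top_k=None, rng=random.Random(0)):
    # Top-k sampling: optionally keep only the k highest-scoring tokens,
    # then sample from the temperature-scaled softmax over what remains.
    items = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)
    if top_k is not None:
        items = items[:top_k]
    weights = [math.exp(score / temperature) for _, score in items]
    r, acc = rng.random() * sum(weights), 0.0
    for (token, _), w in zip(items, weights):
        acc += w
        if r <= acc:
            return token
    return items[-1][0]

print(greedy(logits))              # deterministic: highest logit wins
print(sample(logits, top_k=2))     # stochastic: one of the top-2 tokens
```

Greedy search is deterministic; sampling trades that determinism for diversity, with temperature and top-k controlling how much of the tail can be drawn.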

Read more

The Reformer – Pushing the limits of language modeling

How the Reformer uses less than 8GB of RAM to train on sequences of half a million tokens

The Reformer model, as introduced by Kitaev, Kaiser et al. (2020), is one of the most memory-efficient transformer models for long-sequence modeling as of today. Recently, long-sequence modeling has experienced a surge of interest, as can be seen from the many submissions this year alone – Beltagy et al. (2020), Roy et al. (2020), Tay et al., Wang et […]
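To see why half a million tokens is out of reach for standard full attention, a rough calculation helps: the attention score matrix alone is seq_len × seq_len per head. The arithmetic below is my own illustration, not from the post.

```python
# Memory for one full attention score matrix, assuming fp16 (2 bytes/entry).
def full_attention_matrix_gb(seq_len, bytes_per_entry=2):
    return seq_len * seq_len * bytes_per_entry / 1e9

print(f"{full_attention_matrix_gb(500_000):,.0f} GB")  # per head, per layer
```

At 500k tokens that single matrix is 500 GB per head per layer, which is why techniques that avoid materializing it (LSH attention, chunking, reversible layers) are needed to fit in 8GB.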

Read more

Block Sparse Matrices for Smaller and Faster Language Models

In previous blog posts we introduced sparse matrices and what they can do to improve neural networks. The basic assumption is that full dense layers are often overkill and can be pruned without a significant loss in precision. In some cases, sparse linear layers can even improve precision and/or generalization. The main issue is that currently available code supporting sparse algebra computation is severely lacking in efficiency. We are also still waiting for official PyTorch support. That’s why we ran […]
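The block-sparse idea can be sketched in a few lines of plain Python: store only the nonzero blocks, and multiply against those, so memory and compute scale with the number of kept blocks rather than the full dense size. The block layout below is illustrative, not the library’s actual kernel.

```python
# Minimal block-sparse matrix-vector product. `blocks` maps
# (block_row, block_col) -> a block_size x block_size list of lists;
# absent blocks are implicitly zero and cost nothing.

def block_sparse_matvec(blocks, x, block_size, n_rows):
    y = [0.0] * n_rows
    for (br, bc), block in blocks.items():
        for i in range(block_size):
            row = br * block_size + i
            for j in range(block_size):
                y[row] += block[i][j] * x[bc * block_size + j]
    return y

# A 4x4 matrix where only the top-left and bottom-right 2x2 blocks survive pruning:
blocks = {(0, 0): [[1, 2], [3, 4]], (1, 1): [[5, 6], [7, 8]]}
print(block_sparse_matvec(blocks, [1, 1, 1, 1], block_size=2, n_rows=4))
```

Here half the blocks are zero, so half the storage and multiply-adds disappear; efficient GPU kernels for exactly this pattern are what the post is about.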

Read more

Transformers-based Encoder-Decoder Models

!pip install transformers==4.2.1
!pip install sentencepiece==0.1.95

The transformer-based encoder-decoder model was introduced by Vaswani et al. in the famous Attention is all you need paper and is today the de-facto standard encoder-decoder architecture in natural language processing (NLP). Recently, there has been a lot of research on different pre-training objectives for transformer-based encoder-decoder models, e.g. T5, Bart, Pegasus, ProphetNet, Marge, etc., but the model architecture has stayed largely the same. The goal of this blog post is to give an […]

Read more

Hyperparameter Search with Transformers and Ray Tune

A guest blog post by Richard Liaw from the Anyscale team. With cutting-edge research implementations and thousands of trained models easily accessible, the Hugging Face transformers library has become critical to the success and growth of natural language processing today. For any machine learning model to achieve good performance, users often need to implement some form of hyperparameter tuning. Yet nearly everyone (1, 2) either ends up disregarding hyperparameter tuning or opts for a simplistic grid search with a […]
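The alternative the post argues for over a fixed grid can be sketched in plain Python: sample each hyperparameter independently per trial (random search) instead of walking a coarse grid. The objective function below is a made-up stand-in for validation loss; Ray Tune’s actual API is covered in the full post.

```python
import random

# Toy stand-in for validation loss as a function of two hyperparameters.
# The formula is invented for illustration; its minimum is at lr=3e-4, dropout=0.1.
def val_loss(lr, dropout):
    return (lr - 3e-4) ** 2 * 1e6 + (dropout - 0.1) ** 2

def random_search(n_trials=50, rng=random.Random(0)):
    # Each trial draws fresh values: lr log-uniform, dropout uniform.
    best = None
    for _ in range(n_trials):
        lr = 10 ** rng.uniform(-5, -2)      # log-uniform over [1e-5, 1e-2]
        dropout = rng.uniform(0.0, 0.3)
        loss = val_loss(lr, dropout)
        if best is None or loss < best[0]:
            best = (loss, {"lr": lr, "dropout": dropout})
    return best

loss, params = random_search()
print(loss, params)
```

Because every trial tests a distinct value of each hyperparameter, random search explores sensitive dimensions (like learning rate) far more finely than a grid with the same budget; schedulers like those in Ray Tune then prune bad trials early on top of this.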

Read more

Transformers.js v4: Now Available on NPM!

We’re excited to announce that Transformers.js v4 is now available on NPM! After a year of development (we started in March 2025 🤯), we’re finally ready for you to use it.

npm i @huggingface/transformers

Performance & Runtime Improvements

The biggest change is undoubtedly the adoption of a new WebGPU Runtime, completely rewritten in C++. We’ve worked closely with the ONNX Runtime team to thoroughly test this runtime across our ~200 supported model architectures, as well as many new v4-exclusive architectures. […]

Read more

TRL v1.0: Post-Training Library Built to Move with the Field

We’re releasing TRL v1.0, and it marks a real shift in what TRL is. What started as a research codebase has become a dependable library people build on, with clearer expectations around stability. This isn’t just a version bump. It reflects the reality that TRL now powers production systems, and embraces that responsibility. TRL now implements more than 75 post-training methods. But coverage isn’t the goal by itself. What matters is making these methods easy to try, compare, and actually […]

Read more

Granite 4.0 3B Vision: Compact Multimodal Intelligence for Enterprise Documents

Today we’re excited to announce Granite 4.0 3B Vision, a compact vision-language model (VLM) designed for enterprise document understanding. It’s purpose-built for reliable information extraction from complex documents, forms, and structured visuals. Granite 4.0 3B Vision excels at the following capabilities:

Table Extraction: accurately parsing complex table structures (e.g., multi-row, multi-column) from document images
Chart Understanding: converting charts and figures into structured machine-readable formats, summaries, or executable code
Semantic Key-Value Pair (KVP) Extraction: identifying and grounding semantically meaningful key-value […]

Read more

Falcon Perception

TL;DR — Falcon Perception is a 0.6B-parameter early-fusion Transformer for open-vocabulary grounding and segmentation from natural language prompts. The model processes image patches + text in one sequence using a hybrid attention mask, and produces variable numbers of instances with a small, structured token interface and lightweight output heads. On SA-Co, Falcon Perception reaches 68.0 Macro-F1 (vs. 62.3 for SAM 3) with the main remaining gap being presence calibration (MCC 0.64 vs. 0.82). We also introduce PBench, a diagnostic benchmark […]

Read more