March 13, 2026 huggingface

Prefill and Decode for Concurrent Requests – Optimizing LLM Performance

Handling load from multiple users in parallel is crucial for the performance of LLM applications. In the previous part of our series on LLM performance, we discussed queueing strategies for the prioritization of different users. In this second part, we will now focus on the concurrent processing of requests, and how it impacts relevant metrics such as latency

March 13, 2026 huggingface

Finetuning olmOCR to be a faithful OCR-Engine

At TNG, we created a fine-tune of an Optical Character Recognition model based on olmOCR to help us automate our internal document processing workflows. Recently, the Allen Institute for Artificial Intelligence

March 13, 2026 huggingface

Tiny Agents: an MCP-powered agent in 50 lines of code

New! (May 23, ’25) If you prefer Python, check out the companion post Tiny Agents in Python. Over the past few weeks, I’ve been diving into MCP (Model Context Protocol) to understand what the hype around it was all about. My TL;DR is that it’s fairly simple, but still quite powerful: MCP is a standard API

March 13, 2026 huggingface

PipelineRL

We are excited to open-source PipelineRL, an experimental RL implementation that tackles a fundamental challenge in large-scale Reinforcement Learning with LLMs: the trade-off between inference throughput and on-policy data collection. PipelineRL’s key innovation is inflight weight updates during RL training (see Figure 1 below). This allows PipelineRL to achieve constantly high inference throughput and minimize the lag between the weights used for rollouts and the most recently updated model weights. The result: fast and stable RL training for large language […]

March 13, 2026 huggingface

What is AutoRound?

As large language models (LLMs) and vision-language models (VLMs) continue to grow in size and complexity, deploying them efficiently becomes increasingly challenging. Quantization offers a solution by reducing model size and inference latency. Intel’s AutoRound emerges as a cutting-edge quantization tool that balances accuracy, efficiency, and compatibility. AutoRound is a weight-only post-training quantization (PTQ) method developed by Intel. It uses signed gradient descent to jointly optimize weight rounding and clipping ranges, enabling accurate low-bit quantization (e.g., INT2 – INT8) with […]

March 13, 2026 huggingface

The 4 Things Qwen-3’s Chat Template Teaches Us

What a boring Jinja snippet tells us about the new Qwen-3 model. The new Qwen-3 model by Qwen ships with a much more sophisticated chat template than its predecessors Qwen-2.5 and QwQ. By taking a look at the differences in the Jinja template, we can find interesting insights into the new model. Chat Templates

March 13, 2026 huggingface

How to Build an MCP Server in 5 Lines of Python

Updated! (September 2025) This post has been updated with the latest Gradio MCP features including Resources, Prompts, enhanced authentication, and many more. Gradio is a Python library used by more than 1 million developers

March 13, 2026 huggingface

The Transformers Library: standardizing model definitions

TLDR: Going forward, we’re aiming for Transformers to be the pivot across frameworks: if a model architecture is supported by transformers, you can expect it to be supported in the rest of the ecosystem. Transformers was created in 2019, shortly following the release of the BERT Transformer model. Since then, we’ve continuously aimed to add state-of-the-art architectures, initially focused on NLP, then growing to Audio and computer vision. Today, transformers is the default library for LLMs and VLMs in the […]

March 13, 2026 huggingface

Falcon-Edge: A series of powerful, universal, fine-tunable 1.58bit language models.

In this blogpost, we present the key highlights and rationales about the Falcon-Edge series – a collection of powerful, universal, and fine-tunable language models available in ternary format, based on the BitNet architecture. Drawing from our experience with BitNet, Falcon-Edge introduces and validates a new pre-training paradigm that delivers a full-scope output from a single training process, simultaneously yielding both non-quantized and quantized model variants. This comprehensive approach produces a non-BitNet model in bfloat16 format, the native BitNet model, and […]

March 13, 2026 huggingface

Microsoft and Hugging Face expand collaboration to make open models easy to use on Azure

Today at the Microsoft Build conference, Satya Nadella announced an expanded collaboration with Hugging Face to make its wide diversity of open models easy to deploy on Azure secure infrastructure. If you head over to Azure AI Foundry today, you will find a vastly expanded collection of 10,000+ Hugging Face models you can deploy in a couple of clicks to power AI applications working with text, audio and images. And we’re just getting started!
