Building Blocks for Foundation Model Training and Inference on AWS

For a long time, “scaling” in foundation models mostly meant one thing: spend more compute on pre-training and capabilities rise. That intuition was supported by empirical work such as Kaplan et al. (2020), which reported predictable power-law trends in loss as you scale model parameters, dataset size, and training compute. In practice, these trends justified sustained investment in large-scale accelerator capacity and the surrounding distributed infrastructure needed to keep it efficiently utilized. But the frontier has evolved—and scaling is no […]

Unlocking asynchronicity in continuous batching

TL;DR: we explain how to separate CPU and GPU workloads to get a massive performance boost for inference. This is the second post in a series on efficient LLM inference. The first post covered continuous batching from first principles. It introduces some concepts we build upon: KV cache, FlashAttention, attention masks, etc. An H200 costs around $5 an hour on Inference Endpoints. That’s cheap for an hour, but use it for a day and you are already paying $120. If […]

Granite Embedding Multilingual R2: Open Apache 2.0 Multilingual Embeddings with 32K Context — Best Sub-100M Retrieval Quality

TL;DR: Two new Apache 2.0 multilingual embedding models built on ModernBERT — a 97M-parameter compact model that beats every open sub-100M multilingual embedder on MTEB Multilingual Retrieval (60.3), and a 311M full-size model that scores 65.2 on MTEB Multilingual Retrieval (#2 among open models under 500M parameters) with Matryoshka support. Both cover 200+ languages, are tuned on 52 languages, handle 32K-token context (64x R1), and add code retrieval across 9 programming languages. In this post: Enterprise-Ready by Design · A […]

Adding Benchmaxxer Repellant to the Open ASR Leaderboard

“When a measure becomes a target, it ceases to be a good measure.” (Goodhart’s Law) TL;DR: Appen Inc. and DataoceanAI have provided high-quality English ASR datasets covering scripted and conversational speech over multiple accents. To prevent potential risks of benchmaxxing or test-set contamination, we will keep these datasets private for a high-quality measure of performance on multiple tasks. We’re not updating the average WER at this time: by default, the leaderboard’s Average WER remains computed on public datasets only. You […]

vLLM V0 to V1: Correctness Before Corrections in RL

PipelineRL uses vLLM as the inference engine for rollout generation. The inference engine samples tokens and returns token logprobs; the trainer uses those logprobs to compute policy ratios, KL, clip rate, entropy, and reward. Any discrepancy in how those logprobs are computed can change the training dynamics. This is the train-inference mismatch we needed to eliminate during the vLLM V0 to V1 migration. TL;DR. vLLM V1 matched our vLLM V0 reference after we fixed four things: processed rollout logprobs, V1-specific […]

EMO: Pretraining mixture of experts for emergent modularity

🧠 Models: https://huggingface.co/collections/allenai/emo | 📄 Tech report: https://allenai.org/papers/emo | 💻 Code: https://github.com/allenai/EMO | 📊 Visualization: https://emovisualization.netlify.app/ Today we’re releasing EMO, a new mixture-of-experts (MoE) model pretrained end-to-end so that modular structure emerges directly from the data without relying on human-defined priors. EMO lets you use a small subset of its experts – just 12.5% of the total – for a given task while keeping near full-model performance, and still works as a strong general-purpose model when all experts are used […]

MachinaCheck: Building a Multi-Agent CNC Manufacturability System on AMD MI300X

Built at the AMD Developer Hackathon on lablab.ai, May 2026. The Problem We Solved: Walk into any small CNC machine shop and ask the manager how they decide whether to accept a customer job. The answer is almost always the same: they print the drawing, read every dimension by hand, walk around the shop checking which tools are available, estimate whether their machines can hold the required tolerances, and write notes on a clipboard. The whole process takes 30 […]

How to build scalable web apps with OpenAI’s Privacy Filter

OpenAI released Privacy Filter on the Hub this week: an open-source personally-identifiable information (PII) detector that labels text across eight categories in a single forward pass over a 128k context. We spent a few hours building with it and landed on three apps that each reveal a different slice of what it can do. Document Privacy Explorer: drop in a PDF or DOCX, read the document back with every PII span highlighted in place. Image Anonymizer: upload an […]

Introducing NVIDIA Nemotron 3 Nano Omni: Long-Context Multimodal Intelligence for Documents, Audio and Video Agents

NVIDIA Nemotron 3 Nano Omni is a new omni-modal understanding model built for real-world document analysis, multi-image reasoning, automatic speech recognition, long audio-video understanding, agentic computer use, and general reasoning. It extends the Nemotron multimodal line from a strong vision-language system to a broader text + image + video + audio model. Nemotron 3 Nano Omni delivers best-in-class accuracy on complex document intelligence leaderboards such as MMLongBench-Doc and OCRBenchV2, while also leading on video and audio leaderboards like WorldSense and […]

DeepInfra on Hugging Face Inference Providers 🔥

We’re thrilled to share that DeepInfra is now a supported Inference Provider on the Hugging Face Hub! DeepInfra joins our growing ecosystem, enhancing the breadth and capabilities of serverless inference directly on the Hub’s model pages. Inference Providers are also seamlessly integrated into our client SDKs (for both JS and Python), making it super easy to use a wide variety of models with your preferred providers. DeepInfra is a serverless AI inference platform offering one of the most cost-effective pricing […]
