AssetOpsBench: Bridging the Gap Between AI Agent Benchmarks and Industrial Reality

AssetOpsBench is a comprehensive benchmark and evaluation system with six qualitative dimensions that bridges the gap for agentic AI in domain-specific settings, starting with industrial Asset Lifecycle Management. While existing AI benchmarks excel at isolated tasks such as coding or web navigation, they often fail to capture the complexity of real-world industrial operations. To bridge this gap, we introduce AssetOpsBench, a framework specifically designed to evaluate agent […]

Read more

Unlocking Agentic RL Training for GPT-OSS: A Practical Retrospective

Agentic reinforcement learning (RL) extends traditional LLM training by optimizing not just a single-turn response, but an entire decision-making process learned through direct interaction with an environment during training. Unlike traditional single-turn reinforcement learning or offline preference-based methods that rely on static datasets, agentic RL trains policies by actively collecting on-policy data as the agent plans actions, invokes tools, observes outcomes, and adapts its behavior over multi-step trajectories in either simulated or real environments. This interaction-driven optimization assigns credit across […]

Read more

Alyah ⭐️: Toward Robust Evaluation of Emirati Dialect Capabilities in Arabic LLMs

Arabic is one of the most widely spoken languages in the world, with hundreds of millions of speakers across more than twenty countries. Despite this global reach, Arabic is not a monolithic language. Modern Standard Arabic coexists with a rich landscape of regional dialects that differ significantly in vocabulary, syntax, phonology, and cultural grounding. These dialects are the primary medium of daily communication, oral storytelling, poetry, and social interaction. However, most existing benchmarks for Arabic large language models focus almost […]

Read more

We got Claude to teach open models how to write CUDA kernels!

The best thing about agent skills is upskilling your agents on hard problems. There are two ways to look at that: you can take Opus 4.5 or other SOTA models and tackle the hardest problems out there, or you can take models that run on your laptop and upskill them to handle harder problems. In this blog post, we’ll show you how to take on the latter, walking through the process of using a new tool, upskill, to generate […]

Read more

Introducing Daggr: Chain apps programmatically, inspect visually

TL;DR: Daggr is a new, open-source Python library for building AI workflows that connect Gradio apps, ML models, and custom functions. It automatically generates a visual canvas where you can inspect intermediate outputs, rerun individual steps, and manage state for complex pipelines, all in a few lines of Python code! […]

Read more

Training Design for Text-to-Image Models: Lessons from Ablations

Welcome back! This is the second part of our series on training efficient text-to-image models from scratch. In the first post of this series, we introduced our goal: training a competitive text-to-image foundation model entirely from scratch, in the open, and at scale. We focused primarily on architectural choices and motivated the core design decisions behind our model PRX. We also released an early, small (1.2B parameters) version of the model as a preview of what we are building (go […]

Read more