Back to The Future: Evaluating AI Agents on Predicting Future Events

Most current AI benchmarks focus on answering questions about the past, either by testing models on existing knowledge (statically, as in HLE or GPQA, or augmented with browsing, as in BrowseComp or GAIA) or on previously solved problems (like PaperBench, DABStep, or most coding evaluations). However, we believe that more valuable AI, and ultimately AGI, will be distinguished by its ability to use the past to forecast interesting aspects of the future, rather than merely reciting old facts. Forecasting future events […]

Read more

Consilium: When Multiple LLMs Collaborate

Picture this: four AI experts sitting around a poker table, debating your toughest decisions in real time. That’s exactly what Consilium, the multi-LLM platform I built during the Gradio Agents & MCP Hackathon, does. It lets AI models discuss complex questions and reach consensus through structured debate. The platform works both as a visual Gradio interface and as an MCP (Model Context Protocol) server […]
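The teaser doesn't show Consilium's actual API, but the debate-then-consensus loop it describes can be sketched with stub "experts" standing in for the LLMs. Everything below is hypothetical illustration, not Consilium's implementation:

```python
from collections import Counter

def run_debate(models, question, rounds=2):
    """Structured debate: each model answers, then repeatedly sees the
    others' current answers and may revise; the group settles on the
    majority answer."""
    answers = {name: model(question, context=None) for name, model in models.items()}
    for _ in range(rounds):
        for name, model in models.items():
            others = {n: a for n, a in answers.items() if n != name}
            answers[name] = model(question, context=others)
    # Consensus = most common final answer (majority vote).
    return Counter(answers.values()).most_common(1)[0][0]

# Stub "experts": one stubborn model and two that defer to the majority they see.
def stubborn(question, context):
    return "A"

def conformist(question, context):
    if context:
        return Counter(context.values()).most_common(1)[0][0]
    return "B"

models = {"m1": stubborn, "m2": conformist, "m3": conformist}
print(run_debate(models, "Which option?"))  # → A
```

In a real deployment each stub would be an LLM call, and the consensus step could be an explicit moderator model rather than a simple vote.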

Read more

Accelerate a World of LLMs on Hugging Face with NVIDIA NIM

AI builders want a choice of the latest large language model (LLM) architectures and specialized variants for use in AI agents and other apps, but handling all this diversity can slow testing and deployment pipelines. In particular, managing and optimizing different inference software frameworks to achieve the best performance across varied LLMs and serving requirements is a time-consuming bottleneck […]

Read more

TimeScope: How Long Can Your Video Large Multimodal Model Go?

TimeScope is an open-source benchmark designed to measure how well vision-language models understand long videos. By inserting short “needle” clips into videos ranging from 1 minute to 8 hours, it evaluates three skills: localized retrieval, information synthesis, and fine-grained temporal perception. TimeScope reveals that many state-of-the-art models still struggle with true temporal comprehension. Recent advances in multimodal AI have produced models claiming to understand hour-long videos. This trend mirrors progress in long-context language models […]
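The needle-insertion idea behind the localized-retrieval skill can be sketched with a toy harness. This is not TimeScope's actual data format or scoring code; the video is modeled as a list of per-minute segment labels, and all names are illustrative:

```python
def make_haystack(minutes, needle, position):
    """Simulate a long video as per-minute segment labels, with a short
    "needle" clip spliced in at a known minute."""
    segments = [f"background_{i}" for i in range(minutes)]
    segments.insert(position, needle)
    return segments

def localized_retrieval_score(predicted, truth, tolerance=1):
    """1 if the model places the needle within `tolerance` minutes, else 0."""
    return int(abs(predicted - truth) <= tolerance)

# An ~8-hour video (480 minutes) with the needle at minute 137.
video = make_haystack(minutes=480, needle="needle_clip", position=137)

# A trivial "model" that actually scans every segment finds it exactly.
predicted = video.index("needle_clip")
print(localized_retrieval_score(predicted, 137))  # → 1
```

A real evaluation would feed the full video to a multimodal model and ask for the needle's timestamp; the point of the long haystack is that scanning every segment is precisely what current models fail to do reliably.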

Read more

Parquet Content-Defined Chunking

Reduce Parquet file upload and download times on the Hugging Face Hub by leveraging the new Xet storage layer and Apache Arrow’s Parquet Content-Defined Chunking (CDC) feature, enabling more efficient and scalable data workflows. TL;DR: Parquet Content-Defined Chunking (CDC) is now available in PyArrow and Pandas, enabling efficient deduplication of Parquet files on content-addressable storage systems like Hugging Face’s Xet storage layer. CDC […]
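The deduplication win comes from where chunk boundaries are placed. A minimal sketch of the general CDC idea (not PyArrow's actual implementation, and with hypothetical window/mask parameters) shows why content-defined boundaries survive an insertion that would shift every fixed-offset chunk:

```python
import hashlib
import random

def cdc_chunks(data: bytes, window=8, mask_bits=6):
    """Content-defined chunking sketch: cut wherever a hash of the trailing
    `window` bytes has its low `mask_bits` bits all zero, so boundaries are
    determined by content rather than byte offsets (~64-byte chunks here)."""
    chunks, start = [], 0
    for i in range(window, len(data)):
        h = hashlib.blake2b(data[i - window:i], digest_size=4).digest()
        if int.from_bytes(h, "big") & ((1 << mask_bits) - 1) == 0:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks

rng = random.Random(0)
data = bytes(rng.getrandbits(8) for _ in range(4096))
edited = data[:1000] + b"<<edit>>" + data[1000:]  # insert bytes mid-file

a = {hashlib.sha256(c).hexdigest() for c in cdc_chunks(data)}
b = {hashlib.sha256(c).hexdigest() for c in cdc_chunks(edited)}
print(f"chunks shared after the edit: {len(a & b)} of {len(a)}")
```

With fixed-offset chunking, the 8-byte insertion would shift every later boundary and almost no chunks would match; with content-defined boundaries, only the chunks touching the edited region change, which is what lets Xet-backed storage upload just the new bytes.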

Read more

Introducing Trackio: A Lightweight Experiment Tracking Library from Hugging Face

TL;DR: Trackio is a new, open-source, and free experiment tracking Python library that provides a local dashboard and seamless integration with Hugging Face Spaces for easy sharing and collaboration. Since Trackio is a drop-in replacement for wandb, you can get started with the syntax you already know! Background: if you have trained your own machine learning model, you know how important it is to track metrics, parameters, and hyperparameters during training and visualize […]
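Since the headline feature is wandb-compatible syntax, here is a hypothetical in-memory stand-in sketching that `init` / `log` / `finish` call pattern. The class below is illustrative only, not Trackio's implementation; with Trackio installed, the stand-in would simply be `import trackio as wandb`:

```python
class MiniTracker:
    """Hypothetical stand-in for the wandb-style interface that a drop-in
    replacement must mirror: init a run, log metric dicts, finish."""
    def __init__(self):
        self.runs = []
        self.run = None

    def init(self, project, config=None):
        self.run = {"project": project, "config": config or {}, "history": []}
        self.runs.append(self.run)
        return self.run

    def log(self, metrics):
        self.run["history"].append(dict(metrics))

    def finish(self):
        self.run = None

wandb = MiniTracker()  # with Trackio: `import trackio as wandb`
wandb.init(project="demo", config={"lr": 1e-3})
for step in range(3):
    wandb.log({"step": step, "loss": 1.0 / (step + 1)})
wandb.finish()
print(len(wandb.runs[0]["history"]))  # → 3
```

Because the training loop only ever calls `init`, `log`, and `finish`, swapping the import line is the entire migration, which is exactly the "syntax you already know" promise.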

Read more

📚 3LM: A Benchmark for Arabic LLMs in STEM and Code

Why 3LM? Arabic Large Language Models (LLMs) have made notable progress in recent years, yet existing benchmarks fall short when it comes to evaluating performance in high-value technical domains. Most evaluations to date have focused on general-purpose tasks like summarization, sentiment analysis, or generic question answering. However, scientific reasoning and programming are essential for a broad range of real-world applications, from education to technical problem solving. To address this gap, we introduce 3LM (علم) […]

Read more