IBM and UC Berkeley Diagnose Why Enterprise Agents Fail Using IT-Bench and MAST

Ayhan Sebin
Saurabh Jha
Rohan Arora
Daby Sow
Mert Cemri
Melissa Pan
Ion Stoica

ITBench HF Space
ITBench HF Dataset
MAST HF Dataset
ITBench GitHub
MAST GitHub

IBM Research and UC Berkeley collaborated to study how agentic LLM systems break in real-world IT automation, on tasks such as incident triage, log and metric queries, and Kubernetes actions carried out in long-horizon tool loops.

Benchmarks typically reduce performance to a single number, telling you whether an agent failed but never why. To solve this black-box
