IBM and UC Berkeley Diagnose Why Enterprise Agents Fail Using IT-Bench and MAST
Ayhan Sebin
Saurabh Jha
Rohan Arora
Daby Sow
Mert Cemri
Melissa Pan
Ion Stoica
ITBench HF Space
ITBench HF Dataset
MAST HF Dataset
ITBench GitHub
MAST GitHub
IBM Research and UC Berkeley collaborated to study how agentic LLM systems break in real-world IT automation: tasks that involve incident triage, log and metric queries, and Kubernetes actions in long-horizon tool loops.
Benchmarks typically reduce performance to a single number, telling you whether an agent failed but never why. To solve this black-box