TimeScope: How Long Can Your Video Large Multimodal Model Go?
TimeScope is an open-source benchmark designed to measure how well vision-language models understand long videos. By inserting short "needle" clips into base videos ranging from 1 minute to 8 hours, it evaluates three skills:
- localized retrieval,
- information synthesis,
- fine-grained temporal perception.

TimeScope reveals that many state-of-the-art models still struggle with true temporal comprehension.
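The needle-in-a-haystack setup above can be sketched in a few lines. This is a minimal illustration, not TimeScope's actual implementation: `insert_needle` and its frame-list representation are hypothetical stand-ins for splicing a short clip into a long video and recording where it landed, so that a model's answer can later be scored against the known location.

```python
import random

def insert_needle(haystack_frames, needle_frames, seed=None):
    """Splice a short 'needle' clip into a long 'haystack' frame sequence
    at a random position. Returns the merged sequence and the (start, end)
    frame indices of the needle, kept as ground truth for scoring."""
    rng = random.Random(seed)
    start = rng.randrange(len(haystack_frames) + 1)
    merged = haystack_frames[:start] + needle_frames + haystack_frames[start:]
    return merged, (start, start + len(needle_frames))

# Toy example: a 10-"frame" video with a 2-frame needle.
video, span = insert_needle(list("hhhhhhhhhh"), list("NN"), seed=0)
```

In the real benchmark the haystack is a long video and the needle a short clip, but the principle is the same: the inserted span's position is known, so localized-retrieval questions have an unambiguous ground truth.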
Recent advances in multimodal AI have produced models claiming to understand hour-long videos. This trend mirrors progress in long-context language models,