MMCTAgent: Enabling multimodal reasoning over large video and image collections

Modern multimodal AI models can recognize objects, describe scenes, and answer questions about images and short video clips, but they struggle with long-form and large-scale visual data, which demands more than object recognition and short-clip analysis.
Real-world reasoning increasingly involves analyzing long-form video content, where context spans minutes or hours, far beyond the context limits of most models. It