Researchers are addressing a critical limitation in current multimodal foundation models by introducing a new benchmark designed to test situated awareness, the ability to understand one’s surroundings and potential actions within them. Chuhan Li and Ruilin Han from Yale University, and Joy Hsu from Stanford University, working with Yongyuan Liang from University of Maryland, College Park, Rajiv Dhawan from Amazon, Jiajun Wu, Ming-Hsuan Yang from University of California, Merced, and Xin Eric Wang from University of California, Santa Barbara, present SAW-Bench, a dataset of 786 real-world videos captured using smart glasses and comprising 2,071 annotated question-answer pairs. This work is significant because existing benchmarks primarily focus on object relationships, neglecting the crucial observer-centric perspective needed for true spatial understanding. Their evaluation reveals a substantial performance gap between humans and even the most advanced models like Gemini 3 Flash, highlighting the need for improved algorithms capable of inferring coherent camera geometry and grounded, observer-centric dynamics.
Nearly thirty-eight per cent separates current artificial intelligence from human understanding of everyday surroundings. This gap, measured through real-world video analysis, highlights a key limitation in how machines perceive space relative to themselves. Closing it will be essential for building truly perceptive robots and virtual assistants. Scientists have introduced SAW-Bench, a new benchmark designed to evaluate how well artificial intelligence understands spatial awareness from a first-person perspective.
Current methods for assessing multimodal foundation models (MFMs) largely concentrate on understanding relationships between objects within a scene, neglecting the important element of an observer’s viewpoint and movement. This new benchmark aims to address this oversight by focusing on ‘situated awareness’, the ability to understand one’s surroundings in relation to one’s own position and motion.
SAW-Bench utilizes real-world videos captured using Ray-Ban Meta smart glasses, presenting a more realistic and active evaluation environment for these models. Assessing an agent’s understanding of space requires more than simply identifying objects; it demands comprehension of how those objects relate to the agent itself. Unlike existing benchmarks that treat models as detached observers, SAW-Bench challenges them to reason about space from an embodied perspective, mirroring how humans perceive and interact with the world.
Tasks within SAW-Bench require models to determine relative directions, plan routes, and assess spatial affordances, the possibilities for action within an environment. These tasks necessitate an understanding of the observer’s location, orientation, and trajectory. Initial evaluations reveal a performance disparity of 37.66% between humans and Gemini 3 Flash, currently the best-performing MFM tested on SAW-Bench.
By accurately gauging an agent’s position and orientation, systems can interact with the physical world more effectively and create more immersive experiences for users. Improving situated awareness is vital for building reliable and intelligent systems, as failures in spatial understanding can lead to cascading errors.
Detailed video annotation builds a benchmark for spatial reasoning and contextual understanding
Initially, 786 first-person videos, captured using Ray-Ban Meta (Gen 2) smart glasses, formed the core dataset for evaluating situated awareness. These videos, recorded in a variety of both indoor and outdoor settings, provided realistic egocentric perspectives. Each video was then subjected to detailed annotation, resulting in 2,071 question-answer pairs designed to probe a model’s understanding of spatial relationships and contextual awareness.
This extensive annotation process was undertaken by human evaluators to establish a ground truth for performance assessment. Researchers defined six distinct awareness tasks, each targeting a specific aspect of observer-centric understanding, to ensure the benchmark accurately assessed situated awareness. These tasks required models to reason about the agent’s viewpoint, pose, and motion relative to the surrounding environment.
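To make this setup concrete, the sketch below shows one way such annotated entries and per-task accuracy might be represented and scored. The field names, example questions, and scoring loop are illustrative assumptions for this article, not the benchmark’s actual format or code.

```python
from collections import defaultdict

# Hypothetical representation of SAW-Bench-style annotated entries.
# Field names, questions, and choices are assumptions for illustration only.
example_entries = [
    {"video_id": "clip_0001", "task": "Self-Localization",
     "question": "After turning left at the doorway, which side is the kitchen on?",
     "choices": ["left", "right", "behind", "ahead"], "answer": "right"},
    {"video_id": "clip_0002", "task": "Relative Direction",
     "question": "Relative to the wearer's current heading, where is the exit sign?",
     "choices": ["left", "right", "behind", "ahead"], "answer": "behind"},
]

def score_per_task(entries, predictions):
    """Compute accuracy (%) for each awareness task.

    `predictions` maps (video_id, question) -> the model's predicted choice.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for e in entries:
        key = (e["video_id"], e["question"])
        total[e["task"]] += 1
        if predictions.get(key) == e["answer"]:
            correct[e["task"]] += 1
    return {task: 100.0 * correct[task] / total[task] for task in total}

# Example usage with dummy predictions:
preds = {("clip_0001", example_entries[0]["question"]): "right",
         ("clip_0002", example_entries[1]["question"]): "left"}
print(score_per_task(example_entries, preds))
# -> {'Self-Localization': 100.0, 'Relative Direction': 0.0}
```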
The experimental design involved the careful selection of real-world videos. While datasets composed of synthetic or staged scenes are common, the use of naturally captured footage presented challenges related to variability in lighting, occlusion, and camera motion. This realism was considered essential for accurately gauging a model’s ability to generalise to real-world scenarios.
The smart glasses provided a unique data source, mirroring human visual experience more closely than traditional camera setups. Accurately assessing spatial reasoning is complex, so the research team focused on observer-centric relationships, a dimension often overlooked in existing multimodal benchmarks. The work prioritized understanding how a model interprets the environment from the perspective of an agent, rather than solely evaluating a model’s ability to identify objects and their relationships. This emphasis on egocentric awareness necessitated a novel benchmark design, leading to the creation of SAW-Bench.
Human spatial awareness exceeds leading AI on SAW-Bench benchmark tasks
Researchers established a performance gap of 37.66% between human observers and the best-performing multimodal foundation model, Gemini 3 Flash, when assessed on the SAW-Bench benchmark. This measurement, derived from evaluating observer-centric spatial awareness using real-world videos, highlights a considerable disparity in how effectively humans and artificial intelligence perceive and reason about environments from a first-person perspective.
SAW-Bench comprises 786 self-recorded videos and 2,071 human-annotated question-answer pairs, providing a detailed assessment across six distinct awareness tasks. Human baseline performance reached 91.55% overall, with peak accuracy of 94.00% in the Self-Localization task, demonstrating a high capacity for understanding one’s own position within a scene.
The lowest human score was 79.01% in the Reverse Route Plan task, indicating this presents the greatest challenge even for human observers. Gemini 3 Flash achieved an overall score of 53.89%, with 66.00% on the Spatial Affordance task and 64.84% on the Relative Direction task. Qwen3-VL 235B-A22B achieved 41.40%, while smaller models like Qwen3-VL 8B reached only 36.12%.
Qwen2.5-VL 32B achieved 36.46% and LLaVA OneVision 72B scored 33.70%. These results demonstrate a significant performance range among different models, and highlight the challenges in developing AI systems that can match human-level spatial reasoning capabilities in active, real-world environments.
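The headline gap follows directly from the reported overall scores: 91.55% for humans minus 53.89% for Gemini 3 Flash gives the 37.66-point difference. A minimal check using only the figures quoted above:

```python
# Overall scores reported above (percent); the headline gap is simple subtraction.
human_overall = 91.55
model_overall = {
    "Gemini 3 Flash": 53.89,
    "Qwen3-VL 235B-A22B": 41.40,
    "Qwen2.5-VL 32B": 36.46,
    "Qwen3-VL 8B": 36.12,
    "LLaVA OneVision 72B": 33.70,
}

best_model, best_score = max(model_overall.items(), key=lambda kv: kv[1])
gap = round(human_overall - best_score, 2)
print(f"Best model: {best_model} ({best_score}%), human-model gap: {gap} points")
# -> Best model: Gemini 3 Flash (53.89%), human-model gap: 37.66 points
```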
Evaluating artificial intelligence through first-person spatial reasoning and action assessment
Scientists have created a new benchmark to test how well artificial intelligence understands the world from a human perspective. Progress in artificial intelligence has focused on identifying objects and their relationships within a scene, but less attention has been given to how an agent perceives those objects relative to itself. This new test, called SAW-Bench, uses videos recorded from wearable cameras to assess whether AI can accurately reason about space and actions from an observer’s viewpoint, something humans do effortlessly.
Current models still struggle with this kind of ‘situated awareness’, exhibiting a considerable performance difference compared to human capabilities. The numbers reveal a gap of more than thirty-seven per cent, demonstrating that even the most advanced systems fall short of replicating basic human spatial understanding. The significance extends beyond simply achieving higher scores on a test; it speaks to the limitations of current AI in truly interacting with the physical world.
A robot navigating a home or assisting someone with a task requires more than just object recognition; it needs to understand where it is in relation to those objects and how its actions will affect the environment. Unlike previous benchmarks, SAW-Bench forces AI to grapple with these observer-centric challenges, exposing weaknesses in spatial reasoning that might not surface in more static scenarios.
Addressing these shortcomings could unlock more natural and effective human-machine collaboration. The benchmark highlights that models often rely on superficial cues rather than building a genuine understanding of camera geometry. A key question remains: can AI truly ‘see’ the world as we do, or will it forever be limited to processing visual data without grasping the underlying spatial relationships? Future efforts might explore how AI can learn from active environments and adapt to changing viewpoints, bringing us closer to genuinely intelligent systems.