Human vs. Machine Minds: Ego-Centric Action Recognition Compared

* These authors contributed equally
[1] University of Surrey [2] University of Newcastle
IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025 Workshop on Multimodal Algorithmic Reasoning (MAR'25)

Our research pipeline outlines our approach to comparing human and AI performance in ego-centric video action recognition. We began by employing a classifier to pre-select Easy and Hard video sets. To compare how humans and AI models recognise activities in video, we artificially and systematically reduced each video's spatial resolution. We then evaluated human participants and an AI model as classifiers on these spatially reduced videos, quantifying the difference in recognition between the two.
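To make the per-level comparison concrete, below is a minimal sketch in Python, assuming a hypothetical classify() function that returns per-class probabilities; the function names and the use of ground-truth confidence as the model's recognition score are illustrative assumptions, not the paper's exact protocol.

import numpy as np

def gt_confidence(classify, clip: np.ndarray, gt_label: int) -> float:
    """Probability the model assigns to the ground-truth action class."""
    probs = classify(clip)  # assumed: returns a (num_classes,) probability vector
    return float(probs[gt_label])

def compare_at_level(classify, clips: list[np.ndarray], gt_label: int,
                     human_accuracy: float) -> dict[str, float]:
    """Contrast mean model confidence with human accuracy at one reduction level."""
    model_conf = float(np.mean([gt_confidence(classify, c, gt_label) for c in clips]))
    return {"model": model_conf, "human": human_accuracy,
            "difference": human_accuracy - model_conf}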

Abstract

Humans reliably surpass the most advanced AI models in action recognition, especially in real-world scenarios with low resolution, occlusions, and visual clutter. Although these models resemble humans in using architectures that extract features hierarchically, they prioritise different features, leading to notable differences in what they recognise. This study investigates these differences by introducing Epic ReduAct, a dataset derived from Epic-Kitchens-100 consisting of Easy and Hard ego-centric videos across various action classes. Critically, our dataset incorporates the concepts of Minimal Recognisable Configuration (MIRC) and sub-MIRC, derived by progressively reducing the spatial content of the action videos across multiple stages. This enables a controlled evaluation of recognition difficulty for humans and AI models, and reveals fundamental differences between their recognition processes. While humans, unlike AI models, recognise Hard videos proficiently, their recognition ability declines sharply as visual information is reduced, ultimately reaching a threshold beyond which recognition is no longer possible. In contrast, the AI models examined in this study appeared more resilient in this specific context, with recognition confidence decreasing gradually or, in some cases, even increasing at later reduction stages. These findings suggest that the limitations observed in human recognition do not directly translate to AI models, highlighting the distinct nature of their processing mechanisms.

Epic ReduAct Dataset

Epic ReduAct Dataset download

To enable our investigation, we first created Easy and Hard subsets of the Epic-Kitchens dataset that represent different levels of activity-recognition difficulty for AI models. Each set comprises 18 Epic-Kitchens videos with a mean duration of 2.35 s (standard deviation (SD) = 1.11 s), enabling comparisons between human and AI model performance at distinct difficulty levels. Next, we conducted online experiments that systematically reduced the spatial information of the 18 Easy and 18 Hard videos (36 total) across eight hierarchical levels to identify MIRCs. The process is illustrated below for a video with the ground-truth (GT) label "close". At Level 0, we spatially cropped the region that best encompassed the action in each video. At Level 1, frames from each parent video were cropped at the four corners, generating four child sub-videos per original. Levels 2 through 7 recursively applied this corner cropping to each subsequent generation of parent videos, as shown in the sketch after this paragraph.
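The hierarchy itself can be generated mechanically. Below is a minimal sketch of the recursive corner cropping, assuming frames stored as a (T, H, W, C) NumPy array; the 50% crop ratio per side is an illustrative assumption, not necessarily the paper's exact setting.

import numpy as np

def corner_crops(frames: np.ndarray, keep: float = 0.5) -> list[np.ndarray]:
    """Crop a (T, H, W, C) clip at its four corners, keeping `keep` of each side."""
    t, h, w, c = frames.shape
    ch, cw = int(h * keep), int(w * keep)
    return [frames[:, :ch, :cw],         # top-left
            frames[:, :ch, w - cw:],     # top-right
            frames[:, h - ch:, :cw],     # bottom-left
            frames[:, h - ch:, w - cw:]] # bottom-right

def build_hierarchy(level0: np.ndarray, max_level: int = 7) -> dict[int, list[np.ndarray]]:
    """Recursively corner-crop each parent; level k then holds 4**k sub-videos."""
    levels = {0: [level0]}
    for lvl in range(1, max_level + 1):
        levels[lvl] = [child for parent in levels[lvl - 1]
                       for child in corner_crops(parent)]
    return levels

Each level quarters the retained area of every crop, so by Level 7 a single Level-0 video has 4**7 = 16,384 descendants.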

Recognition Gap

This figure presents the recognition-gap frequency distributions for the Easy, Hard, and combined sets (a, b, c), allowing comparison between humans and the AI model. Our results show a distribution pattern similar to previous work with images (d): there, too, AI models show some improvement under reduction while human accuracy consistently declines, and humans experience a sharper drop in recognition performance than AI models. Our results further show that humans suffer substantial losses in recognition confidence, whereas spatial reductions can even enhance the AI model's ability to detect actions, as evidenced by negative recognition gaps. The frequency distributions are also broader for humans than for the AI model, reflecting more varied recognition gaps, while the AI model's gaps are more gradual. These findings indicate that, despite advances in AI models, the gap between human and machine recognition capabilities persists.
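For reference, here is a minimal sketch of one plausible recognition-gap computation, following the MIRC definition from prior work on images (the drop in recognition between a parent and its best-recognised child); the aggregation choice is an assumption, not necessarily the paper's exact protocol.

def recognition_gap(parent_score: float, child_scores: list[float]) -> float:
    """Drop from a parent clip's recognition score to its best-recognised child."""
    return parent_score - max(child_scores)

# A parent recognised at 0.90 whose best corner crop drops to 0.35 yields a
# gap of 0.55; a negative gap means a reduction helped, as seen for the AI model.
print(recognition_gap(0.90, [0.35, 0.20, 0.10, 0.05]))  # 0.55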

BibTeX

@inproceedings{Rahmani:HumanvsMachine:CVPRWS:2025,
        AUTHOR = "Rahmani, Sadegh and Rybansky, Filip and Vuong, Quoc and Guerin, Frank and Gilbert, Andrew",
        TITLE = "Human vs. Machine Minds: Ego-Centric Action Recognition Compared",
        BOOKTITLE = "IEEE/CVF Conference on Computer Vision and Pattern Recognition - Workshop on Multimodal Algorithmic Reasoning (MAR'25)",
        YEAR = "2025",
        }