
Seeing Just Enough: The Contribution of Hands, Objects and Visual Features to Egocentric Action Recognition

* These authors contributed equally
1Newcastle University 2University of Surrey
bioRxiv preprint, doi: 10.64898/2026.02.15.705896 (2026)

What visual information is just enough for humans to recognise everyday actions from egocentric videos?

Pipeline for identifying critical hands, objects and visual features for egocentric action recognition

We combine a language-model-based semantic labelling framework with a recursive video reduction paradigm to identify minimal recognisable configurations (MIRCs) for egocentric actions. From these MIRCs, we quantify the contributions of hands, manipulated objects, scene background and mid-level visual features, such as orientation and motion signals, to human action recognition.
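As an illustration, one natural way to quantify recognition consistency is the mean pairwise cosine similarity between participants' label embeddings. The sketch below assumes this formulation; the function name and the toy embedding vectors are hypothetical stand-ins for embeddings that the paper's framework would derive from a language model.

```python
import numpy as np

def recognition_consistency(embeddings: np.ndarray) -> float:
    """Mean pairwise cosine similarity between participants' label
    embeddings; higher values indicate more consistent recognition."""
    # Normalise each embedding to unit length.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T                   # all pairwise cosine similarities
    n = len(unit)
    upper = sims[np.triu_indices(n, k=1)]  # each unordered pair counted once
    return float(upper.mean())

# Hypothetical embeddings of three participants' labels for one video:
# the first two agree, the third gave a different label.
labels = np.array([[0.9, 0.1, 0.0],
                   [0.8, 0.2, 0.1],
                   [0.1, 0.9, 0.2]])
score = recognition_consistency(labels)
```

A video whose labels all embed to nearly the same direction scores close to 1; disagreement among participants pulls the score down.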

Abstract

Humans recognise everyday actions without conscious effort despite challenges such as poor viewing conditions and visual similarity between actions. Yet the visual features contributing to action recognition remain unclear. To address this, we combined semantic modelling and feature reduction methods to identify critical features for recognising actions from challenging egocentric perspectives. We first identified egocentric action videos from home environments that a motion-focused action classification network could correctly classify (Easy videos) or not (Hard videos). In Experiment 1, participants labelled the action and object in the videos. Using a language model framework, we derived human ground truth labels for each video and quantified its recognition consistency based on semantic similarity.

In Experiment 2, run as a large-scale online study, we recursively reduced the Easy and Hard videos with high recognition consistency to extract minimal recognisable configurations (MIRCs), in which any further spatial or temporal reduction disrupted recognition. From the 474 MIRCs, we extracted information related to the hands, objects, scene background and visual features (e.g. orientation or motion signals). Binary classification showed that recognition was disrupted when regions containing the manipulated object and strong orientation signals were removed, while temporal reduction by frame scrambling disrupted recognition in most MIRCs. The active hand made only a marginal contribution. Our results highlight the importance of both mid- and high-level information for egocentric action recognition and link hierarchical feature theories with naturalistic human perception.

Want to explore the details?

Download the full preprint or browse the analysis code and data used to derive minimal recognisable configurations and feature importance.

Download Paper
Code & Data

Egocentric Minimal Video Stimuli

Overview of egocentric action videos and minimal recognisable configurations

We start from 237 egocentric kitchen videos from the EPIC-KITCHENS-100 dataset and derive robust human ground truth labels using a semantic similarity framework. From consistently recognised Easy and Hard videos, we recursively crop and temporally scramble the videos to obtain 474 minimal recognisable configurations (MIRCs) and their spatial and spatiotemporal variants. These stimuli allow us to probe which regions and features are indispensable for human action recognition.
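The recursive spatial reduction can be sketched as a search that keeps splitting a clip into quadrants until no quadrant remains recognisable: a clip is a MIRC if it is recognised but every further reduction fails. The sketch below is a toy version of this idea; `is_recognised` stands in for the human recognition judgements collected in the study, and the function names and threshold are hypothetical (the actual paradigm also includes temporal scrambling, omitted here).

```python
import numpy as np

def quadrant_crops(frames: np.ndarray):
    """Yield the four spatial quadrants of a (T, H, W) video array."""
    t, h, w = frames.shape
    for rows in (slice(0, h // 2), slice(h // 2, h)):
        for cols in (slice(0, w // 2), slice(w // 2, w)):
            yield frames[:, rows, cols]

def find_mircs(frames, is_recognised, min_size=2):
    """Recursively reduce a clip; collect clips that are recognised
    while none of their quadrants are (i.e. minimal configurations)."""
    mircs = []
    if not is_recognised(frames):
        return mircs
    children = [q for q in quadrant_crops(frames)
                if q.shape[1] >= min_size and q.shape[2] >= min_size]
    recognised_children = [q for q in children if is_recognised(q)]
    if not recognised_children:
        mircs.append(frames)  # minimal: every further reduction fails
    for q in recognised_children:
        mircs.extend(find_mircs(q, is_recognised, min_size))
    return mircs

# Toy stand-in for human judgements: "recognition" succeeds as long as
# the bright patch in the top-left corner survives the cropping.
video = np.zeros((4, 8, 8))
video[:, :2, :2] = 1.0
recognised = lambda clip: clip.max() > 0.5
mircs = find_mircs(video, recognised)
```

With human observers in the loop, each `is_recognised` call corresponds to collecting and scoring labels for one reduced clip, which is why the full study required large-scale online data collection.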

Key Findings

Summary of recognition decline and recognition gap results

The mean reduction rate and the distribution of the recognition gap for Easy and Hard videos in Experiment 2. A) The mean reduction rate at each reduction level for Easy and Hard videos. Reduction level 7 only involved quadrants with assumed unrecognizability, so the reduction rate could not be calculated (see Table 5). Error bars reflect 95% confidence intervals. B) The probability distribution of the reduction rate for Easy and Hard videos. A negative value means that recognition improved after the quadrant was reduced. C) The probability distribution of the recognition gap for MIRCs and their spatial subMIRCs. D) The probability distribution of the recognition gap for spatiotemporal MIRCs and their spatiotemporal subMIRCs.

Importance of Features

Summary of feature importance

Feature importance for significant binary classifications in Experiment 2. A) Mean feature importance for the classification of MIRC vs unrecognizable quadrants from Easy videos. B) Mean feature importance for the classification of MIRC vs unrecognizable quadrants from Hard videos. C) Mean feature importance for the classification of Easy vs Hard MIRCs. Features with mean importance above the Boruta threshold (purple) were significantly informative for the classifier. Features are ordered by their SHAP summaries in Figure S6.1 of the Supplementary Material. Error bars reflect 95% confidence intervals.
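The Boruta threshold mentioned above comes from comparing each real feature against "shadow" features: shuffled copies whose link to the label has been destroyed. A feature counts as informative only if it reliably beats the best-performing shadow. The numpy-only sketch below illustrates that principle; it uses |correlation with the binary label| as a simple proxy importance, whereas full Boruta (and the paper's analysis, presumably) uses tree-based importances. All names and the toy data are hypothetical.

```python
import numpy as np

def shadow_importance_test(X, y, n_rounds=200, seed=0):
    """Boruta-style check: for each round, shuffle all features to make
    'shadow' copies, and record whether each real feature's importance
    beats the best shadow importance. Returns each feature's win rate."""
    rng = np.random.default_rng(seed)
    real = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    hits = np.zeros(X.shape[1])
    for _ in range(n_rounds):
        shadow = rng.permuted(X, axis=0)   # break each feature-label link
        shadow_imp = np.abs([np.corrcoef(shadow[:, j], y)[0, 1]
                             for j in range(X.shape[1])])
        hits += real > shadow_imp.max()    # did it beat the best shadow?
    return hits / n_rounds

# Toy data: feature 0 tracks the class, feature 1 is pure noise.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 50)
X = np.column_stack([y + 0.3 * rng.normal(size=100),
                     rng.normal(size=100)])
win_rate = shadow_importance_test(X, y)
```

A feature whose win rate stays near chance is indistinguishable from its own shuffled shadow, which is exactly the "below threshold" outcome in the figure.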

BibTeX

@article{Rybansky:SeeingJustEnough:bioRxiv:2026,
  author  = {Rybansky, Filip and Rahmaniboldaji, Sadegh and Gilbert, Andrew and Guerin, Frank and Hurlbert, Anya C. and Vuong, Quoc C.},
  title   = {Seeing Just Enough: The Contribution of Hands, Objects and Visual Features to Egocentric Action Recognition},
  journal = {bioRxiv},
  year    = {2026},
  doi     = {10.64898/2026.02.15.705896},
  url     = {https://doi.org/10.64898/2026.02.15.705896}
}

Frequently Asked Questions

What problem does this work address?

We ask which visual features are minimally sufficient for humans to recognise everyday hand–object actions from egocentric videos, and how these features relate to hierarchical theories of visual processing.

How is this different from the Epic ReduAct work?

Epic ReduAct compared human and model performance under spatial reductions. Here we focus solely on human observers and use MIRCs to dissect the contribution of hands, objects, background and mid-level features such as orientation and motion signals.

Where can I find the stimuli and code?

All code and data used to derive MIRCs, compute semantic similarity and run the feature importance analyses are available on our GitHub repository.