Humans recognise everyday actions without conscious effort, despite challenges such as poor viewing conditions and visual similarity between actions. Yet the visual features that support action recognition remain unclear. To address this, we combined semantic modelling with feature-reduction methods to identify features critical for recognising actions from challenging egocentric perspectives. We first identified egocentric action videos from home environments that a motion-focused action classification network classified correctly (Easy videos) or incorrectly (Hard videos). In Experiment 1, participants labelled the action and object in each video. Using a language-model framework, we derived human ground-truth labels for each video and quantified its recognition consistency based on semantic similarity.
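The abstract does not specify how recognition consistency was computed from participants' labels. A minimal sketch of one plausible approach, assuming consistency is the mean pairwise similarity across labels: here a toy bag-of-words cosine similarity stands in for the language-model embeddings, purely for illustration.

```python
import math
from collections import Counter

def cosine_similarity(a: Counter, b: Counter) -> float:
    # Dot product over shared tokens, normalised by the two vector magnitudes.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def recognition_consistency(labels: list[str]) -> float:
    """Mean pairwise semantic similarity across participants' labels for one video.

    Bag-of-words vectors are an illustrative assumption; the study used a
    language-model framework to measure semantic similarity.
    """
    vecs = [Counter(label.lower().split()) for label in labels]
    pairs = [(i, j) for i in range(len(vecs)) for j in range(i + 1, len(vecs))]
    return sum(cosine_similarity(vecs[i], vecs[j]) for i, j in pairs) / len(pairs)
```

Under this scheme, a video whose labels all agree scores near 1, while divergent labels pull the score towards 0, giving a graded consistency measure per video.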
In Experiment 2, conducted as a large-scale online study, we recursively reduced the Easy and Hard videos with high recognition consistency to extract minimal recognisable configurations (MIRCs), clips in which any further spatial or temporal reduction disrupted recognition. From the 474 MIRCs, we extracted information about the hand, objects, scene background and visual features (e.g. orientation or motion signals). Binary classification showed that recognition was disrupted when regions containing the manipulated object or strong orientation signals were removed, and temporal reduction by frame-scrambling disrupted recognition in most MIRCs; the active hand made only a marginal contribution. Our results highlight the importance of both mid- and high-level information for egocentric action recognition and link hierarchical feature theories with naturalistic human perception.
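The recursive reduction to a MIRC can be sketched as a greedy descent: repeatedly apply one-step spatial or temporal reductions, keep descending while some reduction is still recognised, and stop when none is. The specific reduction steps below (halving the duration, cropping 20% off one edge) and the `is_recognised` oracle are illustrative assumptions, not the study's exact protocol.

```python
def candidate_reductions(clip):
    """Yield one-step reductions of a clip, given as (frames, box).

    frames is a list of frame indices; box is a spatial crop (x0, y0, x1, y1).
    Step sizes (halving, 20% edge crops) are assumptions for this sketch.
    """
    frames, (x0, y0, x1, y1) = clip
    if len(frames) > 1:
        yield (frames[: len(frames) // 2], (x0, y0, x1, y1))   # keep first half
        yield (frames[len(frames) // 2 :], (x0, y0, x1, y1))   # keep second half
    dx, dy = int(0.2 * (x1 - x0)), int(0.2 * (y1 - y0))
    if dx and dy:
        yield (frames, (x0 + dx, y0, x1, y1))  # crop left edge
        yield (frames, (x0, y0, x1 - dx, y1))  # crop right edge
        yield (frames, (x0, y0 + dy, x1, y1))  # crop top edge
        yield (frames, (x0, y0, x1, y1 - dy))  # crop bottom edge

def find_mirc(clip, is_recognised):
    """Greedily descend until no further reduction is recognised.

    is_recognised stands in for the crowd-sourced recognition judgement.
    """
    assert is_recognised(clip), "start from a recognised clip"
    while True:
        nxt = next((c for c in candidate_reductions(clip) if is_recognised(c)), None)
        if nxt is None:
            return clip  # a MIRC: recognised, but no reduction of it is
        clip = nxt
```

In the study the oracle is human recognition measured online; in a simulation it can be any predicate on the clip, and the returned clip satisfies the MIRC property by construction.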