Humans recognise everyday actions without conscious effort, despite challenges such as poor viewing conditions and visual similarity between actions. Yet the visual features that support action recognition remain unclear. To address this, we combined semantic modelling with feature-reduction methods to identify features critical for recognising actions from challenging egocentric perspectives. We first identified egocentric action videos from home environments that a motion-focused action classification network classified correctly (Easy videos) or incorrectly (Hard videos). In Experiment 1, participants labelled the action and object in each video. Using a language-model framework, we derived human ground-truth labels for each video and quantified its recognition consistency based on semantic similarity.
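The abstract does not specify how recognition consistency was computed from participants' labels. A minimal sketch of one plausible approach, assuming consistency is the mean pairwise similarity across labels: here a toy bag-of-words cosine similarity stands in for the language-model embeddings, purely for illustration.

```python
import math
from collections import Counter

def cosine_similarity(a: Counter, b: Counter) -> float:
    # Dot product over shared tokens, normalised by the two vector magnitudes.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def recognition_consistency(labels: list[str]) -> float:
    """Mean pairwise semantic similarity across participants' labels for one video.

    Bag-of-words vectors are an illustrative assumption; the study used a
    language-model framework to measure semantic similarity.
    """
    vecs = [Counter(label.lower().split()) for label in labels]
    pairs = [(i, j) for i in range(len(vecs)) for j in range(i + 1, len(vecs))]
    return sum(cosine_similarity(vecs[i], vecs[j]) for i, j in pairs) / len(pairs)
```

Under this scheme, a video whose labels all agree scores near 1, while divergent labels pull the score towards 0, giving a graded consistency measure per video.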
In Experiment 2, conducted as a large-scale online study, we recursively reduced the Easy and Hard videos with high recognition consistency to extract minimal recognisable configurations (MIRCs), clips in which any further spatial or temporal reduction disrupted recognition. From the 474 MIRCs, we extracted information about the hand, objects, scene background and visual features (e.g. orientation or motion signals). Binary classification showed that recognition was disrupted when regions containing the manipulated object or strong orientation signals were removed, and temporal reduction by frame-scrambling disrupted recognition in most MIRCs; the active hand made only a marginal contribution. Our results highlight the importance of both mid- and high-level information for egocentric action recognition and link hierarchical feature theories with naturalistic human perception.
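The recursive reduction to a MIRC can be sketched as a greedy descent: repeatedly apply one-step spatial or temporal reductions, keep descending while some reduction is still recognised, and stop when none is. The specific reduction steps below (halving the duration, cropping 20% off one edge) and the `is_recognised` oracle are illustrative assumptions, not the study's exact protocol.

```python
def candidate_reductions(clip):
    """Yield one-step reductions of a clip, given as (frames, box).

    frames is a list of frame indices; box is a spatial crop (x0, y0, x1, y1).
    Step sizes (halving, 20% edge crops) are assumptions for this sketch.
    """
    frames, (x0, y0, x1, y1) = clip
    if len(frames) > 1:
        yield (frames[: len(frames) // 2], (x0, y0, x1, y1))   # keep first half
        yield (frames[len(frames) // 2 :], (x0, y0, x1, y1))   # keep second half
    dx, dy = int(0.2 * (x1 - x0)), int(0.2 * (y1 - y0))
    if dx and dy:
        yield (frames, (x0 + dx, y0, x1, y1))  # crop left edge
        yield (frames, (x0, y0, x1 - dx, y1))  # crop right edge
        yield (frames, (x0, y0 + dy, x1, y1))  # crop top edge
        yield (frames, (x0, y0, x1, y1 - dy))  # crop bottom edge

def find_mirc(clip, is_recognised):
    """Greedily descend until no further reduction is recognised.

    is_recognised stands in for the crowd-sourced recognition judgement.
    """
    assert is_recognised(clip), "start from a recognised clip"
    while True:
        nxt = next((c for c in candidate_reductions(clip) if is_recognised(c)), None)
        if nxt is None:
            return clip  # a MIRC: recognised, but no reduction of it is
        clip = nxt
```

In the study the oracle is human recognition measured online; in a simulation it can be any predicate on the clip, and the returned clip satisfies the MIRC property by construction.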