LoCATe-GAT: Modeling Multi-Scale Local Context and Action Relationships
for Zero-Shot Action Recognition
Abstract
The increasing number of actions in the real world makes it difficult
for traditional deep-learning models to recognize unseen actions.
Recently, pretrained contrastive image-based visual-language (I-VL)
models have been adapted for efficient â\euroœzero-shotâ\euro? scene
understanding, with transformers for temporal modeling. However, the
significance of modeling the local spatial context of objects and action
environments remains unexplored. In this work, we propose a framework
called LoCATe-GAT, comprising a novel Local Context-Aggregating Temporal
transformer (LoCATe) and a Graph Attention Network (GAT) that take image
and text encodings from a pretrained I-VL model as inputs. Motivated by
the observation that object-centric
and environmental contexts drive both distinguishability and functional
similarity between actions, LoCATe captures multi-scale local context
using dilated convolutional layers during temporal modeling.
Furthermore, the proposed GAT models semantic relationships between
classes and achieves a strong
synergy with the video embeddings produced by LoCATe. Extensive
experiments on two widely used benchmarks, UCF101 and HMDB51,
show that we achieve state-of-the-art results. Specifically, we
obtain absolute gains of 2.8% and 2.3% on these datasets in the
conventional setting, and an 8.6% gain on UCF101 in the generalized
zero-shot action recognition setting. Additionally, we gain 18.6% and
5.8% on UCF101 and HMDB51, respectively, under the recent "TruZe"
evaluation protocol.
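
For illustration only, the sketch below shows one way the two components named in the abstract could be realized: dilated temporal convolutions that aggregate multi-scale local context over per-frame I-VL image embeddings, and a graph attention layer over class text embeddings. The module names, layer sizes, single-head attention, and fully connected class graph are our own assumptions, not the authors' exact implementation.

```python
# Minimal, hypothetical sketch (PyTorch) of the two ideas above; all
# dimensions, names, and the graph structure are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedLocalContext(nn.Module):
    """Aggregate multi-scale local temporal context with dilated 1D convs."""
    def __init__(self, dim=512, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        )
        self.proj = nn.Linear(dim * len(dilations), dim)

    def forward(self, frames):            # frames: (B, T, dim)
        x = frames.transpose(1, 2)         # (B, dim, T) for Conv1d
        ctx = torch.cat([b(x) for b in self.branches], dim=1)
        ctx = ctx.transpose(1, 2)          # (B, T, dim * num_branches)
        return self.proj(ctx).mean(dim=1)  # temporal pooling -> (B, dim)

class SimpleGATLayer(nn.Module):
    """Single-head graph attention over class text embeddings (assumed form)."""
    def __init__(self, dim=512):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)
        self.a = nn.Linear(2 * dim, 1, bias=False)

    def forward(self, cls_emb, adj):       # cls_emb: (C, dim), adj: (C, C)
        h = self.W(cls_emb)
        C = h.size(0)
        pairs = torch.cat([h.unsqueeze(1).expand(-1, C, -1),
                           h.unsqueeze(0).expand(C, -1, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs).squeeze(-1))
        e = e.masked_fill(adj == 0, float('-inf'))
        attn = torch.softmax(e, dim=-1)
        return attn @ h                     # refined class embeddings (C, dim)

# Zero-shot scoring: cosine similarity between the video embedding and the
# GAT-refined class embeddings of unseen classes.
video_emb = DilatedLocalContext()(torch.randn(2, 16, 512))             # (2, 512)
class_emb = SimpleGATLayer()(torch.randn(10, 512), torch.ones(10, 10))  # (10, 512)
scores = F.normalize(video_emb, dim=-1) @ F.normalize(class_emb, dim=-1).T
```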