Multimodal attention-based transformer for video captioning
Date Issued
2023
Author(s)
Munusamy, H
Sekhar, CC
Abstract
Video captioning is a computer vision task that generates a natural language description for a video. In this paper, we propose a multimodal attention-based transformer using the keyframe features, object features, and semantic keyword embedding features of a video. The Structural Similarity Index Measure (SSIM) is used to extract keyframes from a video. We also detect the unique objects in the extracted keyframes. Features are extracted from the keyframes and the detected objects using a pretrained Convolutional Neural Network (CNN). In the encoder, we use a bimodal attention block to apply two-way cross-attention between the keyframe features and the object features. In the decoder, we combine the features of the words generated up to the previous time step, the semantic keyword embedding features, and the encoder features using a tri-modal attention block. This allows the decoder to dynamically choose among the multimodal features when generating the next word in the description. We evaluated the proposed approach on the MSVD, MSR-VTT, and Charades datasets and observed that it outperforms other state-of-the-art models.
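As context for the SSIM-based keyframe extraction step, here is a minimal Python sketch assuming OpenCV for frame decoding and scikit-image's structural_similarity. The selection rule (comparing each frame against the last kept keyframe) and the 0.8 threshold are illustrative assumptions, not the paper's exact procedure.

```python
# Illustrative SSIM-based keyframe selection; the threshold and the
# "compare against the last keyframe" rule are assumptions, not the
# authors' exact criterion.
import cv2
from skimage.metrics import structural_similarity as ssim


def extract_keyframes(video_path, threshold=0.8):
    """Keep a frame when its SSIM to the previous keyframe drops below
    the threshold, i.e. the visual content has changed enough."""
    cap = cv2.VideoCapture(video_path)
    keyframes, last_gray = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if last_gray is None or ssim(last_gray, gray) < threshold:
            keyframes.append(frame)
            last_gray = gray
    cap.release()
    return keyframes
```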
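The bimodal encoder block and tri-modal decoder block can be sketched in PyTorch as below. The model dimension, head count, and the sigmoid gate used to fuse the two cross-attended streams are illustrative assumptions rather than the authors' exact configuration.

```python
# A minimal sketch of the bimodal and tri-modal attention blocks described
# in the abstract; dimensions and the gating scheme are assumptions.
import torch
import torch.nn as nn


class BimodalAttentionBlock(nn.Module):
    """Encoder block: two-way cross-attention between keyframe and object features."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.kf_to_obj = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.obj_to_kf = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_kf = nn.LayerNorm(d_model)
        self.norm_obj = nn.LayerNorm(d_model)

    def forward(self, kf, obj):
        # Keyframe features attend to object features, and vice versa.
        kf_attn, _ = self.kf_to_obj(query=kf, key=obj, value=obj)
        obj_attn, _ = self.obj_to_kf(query=obj, key=kf, value=kf)
        return self.norm_kf(kf + kf_attn), self.norm_obj(obj + obj_attn)


class TrimodalAttentionBlock(nn.Module):
    """Decoder block: combines word, semantic-keyword, and encoder features."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.enc_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.kw_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Assumed: a learned sigmoid gate lets the decoder weight the two
        # cross-attended streams dynamically when predicting the next word.
        self.gate = nn.Linear(2 * d_model, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, words, keywords, enc, causal_mask=None):
        # Masked self-attention over the words generated so far.
        w_attn, _ = self.self_attn(words, words, words, attn_mask=causal_mask)
        words = self.norm1(words + w_attn)
        # Cross-attention to encoder features and to keyword embeddings.
        e_attn, _ = self.enc_attn(words, enc, enc)
        k_attn, _ = self.kw_attn(words, keywords, keywords)
        g = torch.sigmoid(self.gate(torch.cat([e_attn, k_attn], dim=-1)))
        return self.norm2(words + g * e_attn + (1 - g) * k_attn)


if __name__ == "__main__":
    kf = torch.randn(2, 16, 512)     # 16 keyframe features per video
    obj = torch.randn(2, 40, 512)    # 40 detected-object features
    kw = torch.randn(2, 10, 512)     # 10 semantic keyword embeddings
    words = torch.randn(2, 12, 512)  # embeddings of words generated so far
    enc_kf, enc_obj = BimodalAttentionBlock()(kf, obj)
    # Assumed: the two encoder streams are concatenated along the sequence axis.
    enc = torch.cat([enc_kf, enc_obj], dim=1)
    out = TrimodalAttentionBlock()(words, kw, enc)
    print(out.shape)  # torch.Size([2, 12, 512])
```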