Published in

World Scientific Publishing, Journal of Circuits, Systems, and Computers, 33(04), 2023

DOI: 10.1142/s0218126624500749


End-to-End Dual-Stream Transformer with a Parallel Encoder for Video Captioning

Journal article published in 2023 by Yuting Ran, Bin Fang, Lei Chen, Xuekai Wei, Weizhi Xian, Mingliang Zhou
This paper was not found in any repository, but could be made available legally by the author.

Full text: Unavailable

Preprint: archiving allowed
Postprint: archiving allowed
Published version: archiving forbidden
Data provided by SHERPA/RoMEO

Abstract

In this paper, we propose an end-to-end dual-stream transformer with a parallel encoder (DST-PE) for video captioning, which combines multimodal features and global–local representations to generate coherent captions. First, we design a parallel encoder, comprising a local visual encoder and a bridge module, that simultaneously produces refined local and global visual features. Second, we devise a multimodal encoder to strengthen the representational capacity of our model. Finally, we adopt a transformer decoder that takes the multimodal features as input and fuses the local visual features with the textual features via a cross-attention block. Extensive experimental results demonstrate that our model achieves state-of-the-art performance with low training costs on several widely used datasets.
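To make the decoder's fusion step concrete, the following is a minimal NumPy sketch of a cross-attention operation, in which textual features (queries) attend over local visual features (keys/values). This is an illustrative simplification under stated assumptions, not the authors' implementation: the learned query/key/value projection matrices, multi-head splitting, and residual/normalization layers of a full transformer block are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, d_k):
    """Scaled dot-product cross-attention (projections omitted).

    queries: (n_text, d)  -- e.g. textual features from the decoder stream
    context: (n_vis, d)   -- e.g. local visual features from the encoder
    Returns one context-weighted vector per query, shape (n_text, d).
    """
    scores = queries @ context.T / np.sqrt(d_k)   # (n_text, n_vis)
    weights = softmax(scores, axis=-1)            # rows sum to 1
    return weights @ context                      # (n_text, d)

# Toy usage: 5 textual tokens attend over 10 local visual features.
rng = np.random.default_rng(0)
text_feats = rng.random((5, 8))
visual_feats = rng.random((10, 8))
fused = cross_attention(text_feats, visual_feats, d_k=8)
```

Each row of `fused` is a convex combination of the visual features, weighted by similarity to the corresponding textual token; in the full model this fused representation feeds the caption-generation head.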