Published in

World Scientific Publishing, Journal of Circuits, Systems, and Computers, 33(04), 2023

DOI: 10.1142/s0218126624500749


End-to-End Dual-Stream Transformer with a Parallel Encoder for Video Captioning

Journal article published in 2023 by Yuting Ran, Bin Fang, Lei Chen, Xuekai Wei, Weizhi Xian, Mingliang Zhou
This paper was not found in any repository, but could be made available legally by the author.

Full text: Unavailable

Preprint: archiving allowed
Postprint: archiving allowed
Published version: archiving forbidden
Data provided by SHERPA/RoMEO

Abstract

In this paper, we propose an end-to-end dual-stream transformer with a parallel encoder (DST-PE) for video captioning, which combines multimodal features and global–local representations to generate coherent captions. First, we design a parallel encoder, comprising a local visual encoder and a bridge module, that simultaneously produces refined local and global visual features. Second, we devise a multimodal encoder to strengthen the representational capacity of our model. Finally, we adopt a transformer decoder that takes the multimodal features as input and fuses the local visual features with the textual features via a cross-attention block. Extensive experimental results demonstrate that our model achieves state-of-the-art performance with low training costs on several widely used datasets.
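To make the decoder's fusion step concrete, the following is a minimal NumPy sketch of a cross-attention operation, in which textual features (queries) attend over local visual features (keys/values). This is an illustrative simplification under stated assumptions, not the authors' implementation: the learned query/key/value projection matrices, multi-head splitting, and residual/normalization layers of a full transformer block are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, d_k):
    """Scaled dot-product cross-attention (projections omitted).

    queries: (n_text, d)  -- e.g. textual features from the decoder stream
    context: (n_vis, d)   -- e.g. local visual features from the encoder
    Returns one context-weighted vector per query, shape (n_text, d).
    """
    scores = queries @ context.T / np.sqrt(d_k)   # (n_text, n_vis)
    weights = softmax(scores, axis=-1)            # rows sum to 1
    return weights @ context                      # (n_text, d)

# Toy usage: 5 textual tokens attend over 10 local visual features.
rng = np.random.default_rng(0)
text_feats = rng.random((5, 8))
visual_feats = rng.random((10, 8))
fused = cross_attention(text_feats, visual_feats, d_k=8)
```

Each row of `fused` is a convex combination of the visual features, weighted by similarity to the corresponding textual token; in the full model this fused representation feeds the caption-generation head.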