3 October 2024
Chinese image captioning via double decoding based on visual prompts
Weiguo Tan, Yannan Xiao, Yayuan Wen, Zhenrong Deng, Wenming Huang
Proceedings Volume 13272, Fifth International Conference on Computer Vision and Data Mining (ICCVDM 2024); 1327210 (2024) https://doi.org/10.1117/12.3048381
Event: 5th International Conference on Computer Vision and Data Mining (ICCVDM 2024), 2024, Changchun, China
Abstract
The significant progress of vision-language pre-trained models (VLMs) and large language models (LLMs) has provided a feasible new paradigm for image captioning: VLMs process the image and LLMs then generate the caption, which simplifies caption generation and makes it more lightweight. Building on this paradigm, we address three issues in previous Chinese image captioning models: deviation between the generated captions and the image content, incomplete descriptive information, and heavy resource consumption. We propose DDVP, a Chinese image captioning model via double decoding based on Visual Prompts. The model employs CLIP as the encoder and GPT2 as the decoder, and introduces Visual Prompts, which are keywords related to the image content. It adopts a double decoding approach: the first decoding pass generates the Visual Prompts, and the second generates the final caption based on them. Our evaluation shows that the model achieves competitive results on the AIC-ICC dataset and that, while maintaining fluency, the captions generated by DDVP cover the information in the image more comprehensively and accurately.
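The double decoding flow described in the abstract can be illustrated with a minimal control-flow sketch. This is not the authors' implementation: `double_decode`, `toy_decode`, and the `Decoder` type are hypothetical names, the toy decoder stands in for GPT2, and the plain list of floats stands in for CLIP image features. The sketch also assumes that the same decoder is used for both passes and that the second pass still receives the image features, neither of which the abstract confirms.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-ins: image features as a float vector, and a decoder that
# maps (image features, text prefix) to a list of generated tokens.
Features = Sequence[float]
Decoder = Callable[[Features, List[str]], List[str]]


def double_decode(image_features: Features, decode: Decoder) -> Tuple[List[str], List[str]]:
    """Run two decoding passes: keywords first, then a caption conditioned on them."""
    # Pass 1: decode with an empty prefix to obtain keyword-like Visual Prompts.
    visual_prompts = decode(image_features, [])
    # Pass 2: decode again, this time supplying the Visual Prompts as the prefix.
    caption = decode(image_features, visual_prompts)
    return visual_prompts, caption


def toy_decode(image_features: Features, prefix: List[str]) -> List[str]:
    """Toy decoder (assumption): fixed keywords, then a template built from them."""
    if not prefix:
        return ["dog", "grass"]
    return ["a", prefix[0], "runs", "on", prefix[1]]


prompts, caption = double_decode([0.1, 0.2], toy_decode)
print(prompts)
print(" ".join(caption))
```

In a real system, the toy decoder would be replaced by a language model that generates tokens autoregressively from the encoded image and the prefix.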
(2024) Published by SPIE. Downloading of the abstract is permitted for personal use only.
Weiguo Tan, Yannan Xiao, Yayuan Wen, Zhenrong Deng, and Wenming Huang "Chinese image captioning via double decoding based on visual prompts", Proc. SPIE 13272, Fifth International Conference on Computer Vision and Data Mining (ICCVDM 2024), 1327210 (3 October 2024); https://doi.org/10.1117/12.3048381
KEYWORDS
Visual process modeling
Visualization
Image processing
Data modeling
Information visualization
Digital image processing
Image understanding