Bartosiewicz, M.; Iwanowski, M. The Optimal Choice of the Encoder–Decoder Model Components for Image Captioning. Information 2024, 15, 504.
Abstract
Image captioning aims to generate meaningful verbal descriptions of a digital image. Our paper focuses on the classic encoder–decoder deep learning model, which consists of several component sub-networks, each performing a separate task; combined, they form an effective caption generator. We investigate image feature extractors, recurrent neural networks, word embedding models, and word generation layers, and discuss how each component influences the captioning model's overall performance. Our experiments are performed on the MS COCO 2014 dataset. The results help in designing efficient models with optimal combinations of their components.
Computer Science and Mathematics, Computer Vision and Graphics
Copyright:
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.