After the training captions have been cleaned and the image and word features precomputed, the caption network can be trained. In this example, image captioning is modelled as an iterative process that predicts the caption word by word: the network receives an image and a partial caption as input and predicts the next word of the caption. The task therefore becomes word classification, with the words of our vocabulary as the possible classes. Before training can start, the data has to be brought into this iterative format. For each image/caption pair, we create several training examples from all possible partial sentences, so that every word in the caption is used as the target word exactly once. We train a simple network with two input branches. The first branch contains a few dense layers that further process the image feature vector. The second branch contains an embedding layer that maps the encoded caption to GloVe vectors. This is achieved by initializing the weights of the embedding layer from the previously created Python dictionary and making the layer untrainable. As output, the workflow writes the trained model to disk.
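
The expansion of one image/caption pair into several training examples can be sketched in plain Python. This is only an illustration, not the workflow's actual code: the function name `make_training_examples`, the start token at index 1, the padding index 0, and the left-padding to a fixed length are all assumptions made for the example.

```python
def make_training_examples(image_id, encoded_caption, max_len, pad_id=0):
    """Expand one image/caption pair into (image_id, partial, target) triples.

    encoded_caption is a list of vocabulary indices. It is assumed to begin
    with a start token so that the first real word also has a non-empty
    partial caption in front of it.
    """
    examples = []
    for i in range(1, len(encoded_caption)):
        # Everything before position i is the partial caption,
        # the word at position i is the class to predict.
        partial = encoded_caption[:i][-max_len:]
        target = encoded_caption[i]
        # Left-pad so that every partial caption has the same length.
        padded = [pad_id] * (max_len - len(partial)) + partial
        examples.append((image_id, padded, target))
    return examples


# Hypothetical indices: 1 = start token, 2 = "a", 3 = "dog", 4 = "runs"
caption = [1, 2, 3, 4]
examples = make_training_examples("img_001", caption, max_len=5)
for example in examples:
    print(example)
```

Each word after the start token appears as a target exactly once, so a caption of n words yields n training examples that all share the same image features.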

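Turning a word-to-vector dictionary into fixed weights for the embedding layer can be sketched as follows. The names `build_embedding_matrix`, `word_index`, and `glove`, the toy three-dimensional vectors, and the convention that index 0 is reserved for padding are assumptions for illustration. In Keras, a matrix like this would typically be used as the initial weights of an `Embedding` layer created with `trainable=False`.

```python
def build_embedding_matrix(word_index, glove, dim):
    """Build a (vocabulary size + 1) x dim matrix of pretrained vectors.

    Row i holds the vector of the word with index i. Row 0 (padding) and
    rows of words without a pretrained vector stay all zeros.
    """
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        vector = glove.get(word)
        if vector is not None:
            matrix[idx] = list(vector)
    return matrix


# Toy data: real GloVe vectors have 50 to 300 dimensions.
word_index = {"a": 1, "dog": 2, "zyx": 3}  # hypothetical vocabulary
glove = {"a": [0.1, 0.2, 0.3], "dog": [0.4, 0.5, 0.6]}
matrix = build_embedding_matrix(word_index, glove, dim=3)
```

Because the layer is frozen, these rows are not updated during training, so only the dense layers and the remaining parts of the network are learned.
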
This is a companion discussion topic for the original entry at https://kni.me/w/iVse1ZWwmf0z9ozX