Author: Himanshu Bobade
Automatically describing an image in natural language is a tough task. This blog discusses the image captioning task: how it works, the models used for extracting features, and how captions are generated. Model training, as well as testing and generating captions with the trained model, is also demonstrated.
Humans can describe almost everything around them. Machines can also describe images using NLP, deep learning, and computer vision techniques. This process is called Image Captioning. In this way, we can generate captions with the help of a trained model for a given image or set of images. The algorithm we will execute uses InceptionV3 and GloVe, applying transfer learning for both. InceptionV3 is used for identifying objects, i.e., extracting features from the image. GloVe stands for Global Vectors for Word Representation. It is an unsupervised learning algorithm that produces vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
ILLUSTRATION
A simple illustration of the execution is shown in Fig. 1: we extract features/objects from the image and combine them to form a sentence. The model detects a woman walking with a green field around her.
![](https://static.wixstatic.com/media/6e3b57_ee6844e90466465282eb308fd4e1d47f~mv2.jpeg/v1/fill/w_980,h_548,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/6e3b57_ee6844e90466465282eb308fd4e1d47f~mv2.jpeg)
FIG:1
Caption generated: a woman walking on a lush green field.

The necessary modules that need to be imported are as follows:
![](https://static.wixstatic.com/media/6e3b57_2860c6f5d4b3433a8edcdf255422a37d~mv2.jpeg/v1/fill/w_980,h_410,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/6e3b57_2860c6f5d4b3433a8edcdf255422a37d~mv2.jpeg)
FIG:2
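The exact import block is shown in Fig. 2; as a rough sketch (assuming a TensorFlow/Keras setup, which is an assumption on my part), the typical imports for this kind of pipeline look like this:

```python
# A minimal sketch of the imports this pipeline relies on; the exact list
# in Fig. 2 may differ slightly.
import os
import string
import pickle

import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, LSTM, Embedding, Dropout, add
```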
Now, we will use a variable called root_captioning to store the folder location where we keep our dataset files. We need to download the GloVe embeddings and the Flickr8k dataset.
DATA PREPROCESSING
We will load the dataset, convert the caption text to lowercase, remove punctuation, and so on. Each image has 4-5 captions. We will build a dictionary with image names as keys and the captions associated with each image as values.
![](https://static.wixstatic.com/media/6e3b57_680f3a265d4c427b8d50e7304a20de66~mv2.jpeg/v1/fill/w_980,h_622,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/6e3b57_680f3a265d4c427b8d50e7304a20de66~mv2.jpeg)
FIG:3
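As a hedged sketch of this step (assuming the Flickr8k token file format `image.jpg#idx<TAB>caption`, which may differ from the exact file used here):

```python
import string

def load_captions(token_file):
    """Build a dict mapping each image name to its list of cleaned captions."""
    table = str.maketrans('', '', string.punctuation)
    captions = {}
    with open(token_file) as f:
        for line in f:
            if '\t' not in line:
                continue
            img_id, caption = line.strip().split('\t', 1)  # "image.jpg#0<TAB>caption"
            img_name = img_id.split('#')[0]
            words = caption.lower().translate(table).split()
            words = [w for w in words if w.isalpha()]       # drop numbers and stray tokens
            captions.setdefault(img_name, []).append(' '.join(words))
    return captions
```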
Now we need to load the image dataset and then split it into train and test sets.
![](https://static.wixstatic.com/media/b670ab_66a8477f9af44da5938ba7c5c2f3b3c4~mv2.jpg/v1/fill/w_980,h_189,al_c,q_80,usm_0.66_1.00_0.01,enc_auto/b670ab_66a8477f9af44da5938ba7c5c2f3b3c4~mv2.jpg)
FIG:4
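A sketch of the split, assuming the standard Flickr8k split files (the filenames here are placeholders):

```python
def load_image_set(split_file):
    """Read one of the Flickr8k split files into a set of image names."""
    with open(split_file) as f:
        return {line.strip() for line in f if line.strip()}

# Placeholder paths; adjust to wherever root_captioning points.
train_images = load_image_set('Flickr_8k.trainImages.txt')
test_images = load_image_set('Flickr_8k.testImages.txt')
```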
We will later use the start token to begin the process of generating a caption. Encountering the stop token in the generated text will let us know the process is complete.
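One common way to do this (a sketch, assuming the literal tokens `startseq` and `endseq`) is to wrap every training caption:

```python
START, STOP = 'startseq', 'endseq'

# Wrap every caption of every training image with the start and stop tokens.
train_captions = {
    img: [f'{START} {cap} {STOP}' for cap in caps]
    for img, caps in captions.items()
    if img in train_images
}
```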
TRAINING
Now we are going to load the InceptionV3 model. We use its penultimate layer as the image feature, so output_dim is 2048, which is relatively compact. Other backbones, such as MobileNet, can be swapped in, but changing the feature size also changes how long processing and training take.
![](https://static.wixstatic.com/media/b670ab_23b69f64ce874c439cd4331773d0e61a~mv2.jpg/v1/fill/w_980,h_220,al_c,q_80,usm_0.66_1.00_0.01,enc_auto/b670ab_23b69f64ce874c439cd4331773d0e61a~mv2.jpg)
FIG:5
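A minimal sketch of this step: take InceptionV3 pretrained on ImageNet, drop the classification head, and keep the 2048-dimensional penultimate layer as the image encoder.

```python
base = InceptionV3(weights='imagenet')
# Re-wire the network so it outputs the 2048-dim penultimate layer
# instead of the 1000 ImageNet class scores.
encoder = Model(inputs=base.input, outputs=base.layers[-2].output)

OUTPUT_DIM = 2048
WIDTH, HEIGHT = 299, 299   # input size InceptionV3 expects
```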
After that, we need to encode the images to create training sets:
![](https://static.wixstatic.com/media/b670ab_7464b14c5a424671aca2bc834e8dfb7a~mv2.jpg/v1/fill/w_980,h_197,al_c,q_80,usm_0.66_1.00_0.01,enc_auto/b670ab_7464b14c5a424671aca2bc834e8dfb7a~mv2.jpg)
FIG:6
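A sketch of the encoding step, reusing the `encoder` defined above (the helper name is mine, not necessarily the one used in the notebook):

```python
def encode_image(path):
    """Run one image through the Inception encoder and return a 2048-dim vector."""
    img = load_img(path, target_size=(HEIGHT, WIDTH))
    x = img_to_array(img)
    x = np.expand_dims(x, axis=0)    # add the batch dimension
    x = preprocess_input(x)          # InceptionV3-specific pixel scaling
    return encoder.predict(x, verbose=0).reshape(OUTPUT_DIM)
```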
The loaded image files now need to be encoded into 2048-dimensional vectors and pickled. We do the same with the test set. Next, we need to create a word vocabulary for our captions:
![](https://static.wixstatic.com/media/b670ab_09ed7ebf399942e69dc88d175ce6fc72~mv2.jpg/v1/fill/w_980,h_198,al_c,q_80,usm_0.66_1.00_0.01,enc_auto/b670ab_09ed7ebf399942e69dc88d175ce6fc72~mv2.jpg)
FIG:7
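A sketch of caching the encodings and building the vocabulary; the image directory name and the word-frequency threshold of 10 are assumptions, not necessarily the values in Fig. 7.

```python
# Encode every training image once and cache the 2048-dim vectors to disk.
image_dir = os.path.join(root_captioning, 'Flicker8k_Dataset')   # placeholder path
encoding_train = {img: encode_image(os.path.join(image_dir, img)) for img in train_images}
with open('encoded_train_images.pkl', 'wb') as f:
    pickle.dump(encoding_train, f)

# Count word frequencies and keep only words that occur often enough to learn.
word_counts = {}
for caps in train_captions.values():
    for cap in caps:
        for w in cap.split():
            word_counts[w] = word_counts.get(w, 0) + 1
vocab = [w for w, c in word_counts.items() if c >= 10]   # threshold is an assumption
```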
The lookup table idxtoword converts index numbers back to actual words, and wordtoidx converts words to index values.
The way our model works is that it generates the caption one word at a time. For example,
A woman
A woman walking
A woman walking in lush green fields.
![](https://static.wixstatic.com/media/b670ab_959dae50607c405f9a5f7e6dd5623242~mv2.jpg/v1/fill/w_980,h_198,al_c,q_80,usm_0.66_1.00_0.01,enc_auto/b670ab_959dae50607c405f9a5f7e6dd5623242~mv2.jpg)
FIG:8
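A sketch of the index tables and of expanding a single caption into (image, partial caption) → next-word training pairs; the helper name is mine:

```python
# Map words to indices and back; index 0 is reserved for padding.
wordtoidx = {w: i + 1 for i, w in enumerate(vocab)}
idxtoword = {i: w for w, i in wordtoidx.items()}
vocab_size = len(vocab) + 1
max_length = max(len(c.split()) for caps in train_captions.values() for c in caps)

def make_pairs(photo, caption):
    """Expand one caption into (image, partial caption) -> next word samples."""
    seq = [wordtoidx[w] for w in caption.split() if w in wordtoidx]
    X1, X2, y = [], [], []
    for i in range(1, len(seq)):
        in_seq = pad_sequences([seq[:i]], maxlen=max_length)[0]
        out_word = to_categorical([seq[i]], num_classes=vocab_size)[0]
        X1.append(photo)
        X2.append(in_seq)
        y.append(out_word)
    return np.array(X1), np.array(X2), np.array(y)
```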
The words are added one by one. This can be achieved as follows:
Neural Network building:
![](https://static.wixstatic.com/media/b670ab_efab806e6cea478eb7edfbda16f3872f~mv2.jpg/v1/fill/w_973,h_231,al_c,q_80,enc_auto/b670ab_efab806e6cea478eb7edfbda16f3872f~mv2.jpg)
FIG:9
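A sketch of a typical merge architecture for this task; the layer sizes (256 units, a 200-dimensional embedding that would be initialised from the GloVe vectors) are assumptions, not necessarily the exact values in Fig. 9.

```python
# Image branch: 2048-dim Inception features -> 256-dim representation.
inputs1 = Input(shape=(OUTPUT_DIM,))
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)

# Text branch: partial caption -> embedding -> LSTM -> 256-dim representation.
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, 200, mask_zero=True)(inputs2)   # weights can be set from GloVe
se2 = Dropout(0.5)(se1)
se3 = LSTM(256)(se2)

# Merge the two branches and predict the next word over the whole vocabulary.
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)

caption_model = Model(inputs=[inputs1, inputs2], outputs=outputs)
```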
Defining layers and functions:
The optimizer used here is Adam. Training will take a couple of hours.
![](https://static.wixstatic.com/media/b670ab_d11402b209f6491b9fcc8f85d184bdba~mv2.jpg/v1/fill/w_980,h_983,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/b670ab_d11402b209f6491b9fcc8f85d184bdba~mv2.jpg)
FIG:10
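A sketch of compiling and fitting the model with Adam; the epoch count, batch size, and weights filename are placeholders:

```python
caption_model.compile(loss='categorical_crossentropy', optimizer='adam')

# X1_train, X2_train, y_train would be stacked from make_pairs() over all
# training captions (or streamed from a generator to save memory).
caption_model.fit([X1_train, X2_train], y_train, epochs=10, batch_size=64)
caption_model.save_weights('caption_model.hdf5')
```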
Testing and evaluating on the dataset images:
Our model is ready; now we need to test it. We will define a generateCaption function for this purpose.
![](https://static.wixstatic.com/media/b670ab_cca5573127e44615b91a52ac85ac64ea~mv2.jpg/v1/fill/w_976,h_220,al_c,q_80,enc_auto/b670ab_cca5573127e44615b91a52ac85ac64ea~mv2.jpg)
FIG:11
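A sketch of such a generateCaption function using greedy decoding (taking the most likely word at every step); beam search is a common alternative:

```python
def generateCaption(photo):
    """Start from the start token and repeatedly append the most likely next word
    until the stop token is produced or max_length is reached."""
    in_text = START
    for _ in range(max_length):
        seq = [wordtoidx[w] for w in in_text.split() if w in wordtoidx]
        seq = pad_sequences([seq], maxlen=max_length)
        yhat = caption_model.predict([photo.reshape(1, OUTPUT_DIM), seq], verbose=0)
        word = idxtoword.get(int(np.argmax(yhat)))
        if word is None or word == STOP:   # padding index or stop token ends decoding
            break
        in_text += ' ' + word
    # Strip the start token before returning the caption.
    return ' '.join(in_text.split()[1:])

# Example usage on an encoded test image (encoding_test is a hypothetical
# dict built the same way as encoding_train):
# print(generateCaption(encoding_test[some_test_image]))
```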
Input image from the test dataset, shown in Fig. 12:
![](https://static.wixstatic.com/media/6e3b57_d4e23060a5834ec3b83043d909ccaa47~mv2.jpeg/v1/fill/w_590,h_336,al_c,q_80,enc_auto/6e3b57_d4e23060a5834ec3b83043d909ccaa47~mv2.jpeg)
FIG:12
Output caption: A dog is chasing balls.
Thus, we learnt how to effectively detect objects in an image and generate a suitable caption to describe it.
REFERENCES
GitHub repository: https://github.com/himanshubobade/Image_captioning
YouTube channel: https://www.youtube.com/channel/UCR1-GEpyOPzT2AO4D_eifdw