Author: Himanshu Bobade
Automatically describing an image in natural language is a tough task. This blog discusses the image captioning task: how it works, the models used for extracting features, and how captions are generated. Model training, as well as testing and generating captions with the trained model, is also demonstrated.
Humans can describe almost everything around them. Machines can also describe images using NLP, deep learning, and computer vision techniques. This process is called Image Captioning. In this way, we can generate captions with the help of a trained model for a given image or set of images. The algorithm we will execute uses InceptionV3 and GloVe, applying transfer learning for both. InceptionV3 is used for identifying objects, i.e., extracting features from the image. GloVe stands for Global Vectors for Word Representation. It is an unsupervised learning algorithm that produces vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
ILLUSTRATION
A simple illustration of the execution is shown in Fig. 1: we extract features/objects from the image and combine them to form a sentence. The model detects a woman walking with a green field around her.
![](https://static.wixstatic.com/media/6e3b57_ee6844e90466465282eb308fd4e1d47f~mv2.jpeg/v1/fill/w_980,h_548,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/6e3b57_ee6844e90466465282eb308fd4e1d47f~mv2.jpeg)
FIG:1
Caption generated: a woman walking on a lush green field.

The necessary modules that need to be imported are as follows:
![](https://static.wixstatic.com/media/6e3b57_2860c6f5d4b3433a8edcdf255422a37d~mv2.jpeg/v1/fill/w_980,h_410,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/6e3b57_2860c6f5d4b3433a8edcdf255422a37d~mv2.jpeg)
FIG:2
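The exact import block is shown in Fig. 2; as a rough sketch (assuming a TensorFlow/Keras setup, which is an assumption on my part), the typical imports for this kind of pipeline look like this:

```python
# A minimal sketch of the imports this pipeline relies on; the exact list
# in Fig. 2 may differ slightly.
import os
import string
import pickle

import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, LSTM, Embedding, Dropout, add
```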
Now, we will use a variable called root_captioning to store the folder location where we keep our dataset files. We need to download the GloVe embeddings and the Flickr8k dataset.
DATA PREPROCESSING
We will load the dataset, convert the caption text to lowercase, remove punctuation, and so on. Each image has 4-5 captions. We will build a dictionary with image names as keys and the captions associated with each image as values.
![](https://static.wixstatic.com/media/6e3b57_680f3a265d4c427b8d50e7304a20de66~mv2.jpeg/v1/fill/w_980,h_622,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/6e3b57_680f3a265d4c427b8d50e7304a20de66~mv2.jpeg)
FIG:3
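As a hedged sketch of this step (assuming the Flickr8k token file format `image.jpg#idx<TAB>caption`, which may differ from the exact file used here):

```python
import string

def load_captions(token_file):
    """Build a dict mapping each image name to its list of cleaned captions."""
    table = str.maketrans('', '', string.punctuation)
    captions = {}
    with open(token_file) as f:
        for line in f:
            if '\t' not in line:
                continue
            img_id, caption = line.strip().split('\t', 1)  # "image.jpg#0<TAB>caption"
            img_name = img_id.split('#')[0]
            words = caption.lower().translate(table).split()
            words = [w for w in words if w.isalpha()]       # drop numbers and stray tokens
            captions.setdefault(img_name, []).append(' '.join(words))
    return captions
```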
Now we need to load the image dataset and then split it into train and test sets.
![](https://static.wixstatic.com/media/b670ab_66a8477f9af44da5938ba7c5c2f3b3c4~mv2.jpg/v1/fill/w_980,h_189,al_c,q_80,usm_0.66_1.00_0.01,enc_auto/b670ab_66a8477f9af44da5938ba7c5c2f3b3c4~mv2.jpg)
FIG:4
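A sketch of the split, assuming the standard Flickr8k split files (the filenames here are placeholders):

```python
def load_image_set(split_file):
    """Read one of the Flickr8k split files into a set of image names."""
    with open(split_file) as f:
        return {line.strip() for line in f if line.strip()}

# Placeholder paths; adjust to wherever root_captioning points.
train_images = load_image_set('Flickr_8k.trainImages.txt')
test_images = load_image_set('Flickr_8k.testImages.txt')
```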
We will later use the start token to begin the process of generating a caption. Encountering the stop token in the generated text will let us know the process is complete.
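One common way to do this (a sketch, assuming the literal tokens `startseq` and `endseq`) is to wrap every training caption:

```python
START, STOP = 'startseq', 'endseq'

# Wrap every caption of every training image with the start and stop tokens.
train_captions = {
    img: [f'{START} {cap} {STOP}' for cap in caps]
    for img, caps in captions.items()
    if img in train_images
}
```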
TRAINING
Now we are going to load the InceptionV3 model. We use its penultimate layer as the image feature, so output_dim is 2048, which is relatively compact. Other backbones, such as MobileNet, can be swapped in, but changing the feature size also changes how long processing and training take.
![](https://static.wixstatic.com/media/b670ab_23b69f64ce874c439cd4331773d0e61a~mv2.jpg/v1/fill/w_980,h_220,al_c,q_80,usm_0.66_1.00_0.01,enc_auto/b670ab_23b69f64ce874c439cd4331773d0e61a~mv2.jpg)
FIG:5
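A minimal sketch of this step: take InceptionV3 pretrained on ImageNet, drop the classification head, and keep the 2048-dimensional penultimate layer as the image encoder.

```python
base = InceptionV3(weights='imagenet')
# Re-wire the network so it outputs the 2048-dim penultimate layer
# instead of the 1000 ImageNet class scores.
encoder = Model(inputs=base.input, outputs=base.layers[-2].output)

OUTPUT_DIM = 2048
WIDTH, HEIGHT = 299, 299   # input size InceptionV3 expects
```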
After that, we need to encode the images to create training sets:
![](https://static.wixstatic.com/media/b670ab_7464b14c5a424671aca2bc834e8dfb7a~mv2.jpg/v1/fill/w_980,h_197,al_c,q_80,usm_0.66_1.00_0.01,enc_auto/b670ab_7464b14c5a424671aca2bc834e8dfb7a~mv2.jpg)
FIG:6
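A sketch of the encoding step, reusing the `encoder` defined above (the helper name is mine, not necessarily the one used in the notebook):

```python
def encode_image(path):
    """Run one image through the Inception encoder and return a 2048-dim vector."""
    img = load_img(path, target_size=(HEIGHT, WIDTH))
    x = img_to_array(img)
    x = np.expand_dims(x, axis=0)    # add the batch dimension
    x = preprocess_input(x)          # InceptionV3-specific pixel scaling
    return encoder.predict(x, verbose=0).reshape(OUTPUT_DIM)
```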
The loaded image files now need to be encoded into 2048-dimensional vectors and pickled. We do the same with the test set. Next, we need to create a word vocabulary for our captions:
![](https://static.wixstatic.com/media/b670ab_09ed7ebf399942e69dc88d175ce6fc72~mv2.jpg/v1/fill/w_980,h_198,al_c,q_80,usm_0.66_1.00_0.01,enc_auto/b670ab_09ed7ebf399942e69dc88d175ce6fc72~mv2.jpg)
FIG:7
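A sketch of caching the encodings and building the vocabulary; the image directory name and the word-frequency threshold of 10 are assumptions, not necessarily the values in Fig. 7.

```python
# Encode every training image once and cache the 2048-dim vectors to disk.
image_dir = os.path.join(root_captioning, 'Flicker8k_Dataset')   # placeholder path
encoding_train = {img: encode_image(os.path.join(image_dir, img)) for img in train_images}
with open('encoded_train_images.pkl', 'wb') as f:
    pickle.dump(encoding_train, f)

# Count word frequencies and keep only words that occur often enough to learn.
word_counts = {}
for caps in train_captions.values():
    for cap in caps:
        for w in cap.split():
            word_counts[w] = word_counts.get(w, 0) + 1
vocab = [w for w, c in word_counts.items() if c >= 10]   # threshold is an assumption
```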
The lookup table idxtoword converts index numbers back to actual words, and wordtoidx converts words to index values.
The way our model works is that it generates the caption one word at a time. For example,
A woman
A woman walking
A woman walking in lush green fields.
![](https://static.wixstatic.com/media/b670ab_959dae50607c405f9a5f7e6dd5623242~mv2.jpg/v1/fill/w_980,h_198,al_c,q_80,usm_0.66_1.00_0.01,enc_auto/b670ab_959dae50607c405f9a5f7e6dd5623242~mv2.jpg)
FIG:8
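A sketch of the index tables and of expanding a single caption into (image, partial caption) → next-word training pairs; the helper name is mine:

```python
# Map words to indices and back; index 0 is reserved for padding.
wordtoidx = {w: i + 1 for i, w in enumerate(vocab)}
idxtoword = {i: w for w, i in wordtoidx.items()}
vocab_size = len(vocab) + 1
max_length = max(len(c.split()) for caps in train_captions.values() for c in caps)

def make_pairs(photo, caption):
    """Expand one caption into (image, partial caption) -> next word samples."""
    seq = [wordtoidx[w] for w in caption.split() if w in wordtoidx]
    X1, X2, y = [], [], []
    for i in range(1, len(seq)):
        in_seq = pad_sequences([seq[:i]], maxlen=max_length)[0]
        out_word = to_categorical([seq[i]], num_classes=vocab_size)[0]
        X1.append(photo)
        X2.append(in_seq)
        y.append(out_word)
    return np.array(X1), np.array(X2), np.array(y)
```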
The words are added one by one. This can be achieved as follows:
Neural Network building:
![](https://static.wixstatic.com/media/b670ab_efab806e6cea478eb7edfbda16f3872f~mv2.jpg/v1/fill/w_973,h_231,al_c,q_80,enc_auto/b670ab_efab806e6cea478eb7edfbda16f3872f~mv2.jpg)
FIG:9
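A sketch of a typical merge architecture for this task; the layer sizes (256 units, a 200-dimensional embedding that would be initialised from the GloVe vectors) are assumptions, not necessarily the exact values in Fig. 9.

```python
# Image branch: 2048-dim Inception features -> 256-dim representation.
inputs1 = Input(shape=(OUTPUT_DIM,))
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)

# Text branch: partial caption -> embedding -> LSTM -> 256-dim representation.
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, 200, mask_zero=True)(inputs2)   # weights can be set from GloVe
se2 = Dropout(0.5)(se1)
se3 = LSTM(256)(se2)

# Merge the two branches and predict the next word over the whole vocabulary.
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)

caption_model = Model(inputs=[inputs1, inputs2], outputs=outputs)
```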
Defining layers and functions:
The optimizer used here is Adam. Training will take a couple of hours.
![](https://static.wixstatic.com/media/b670ab_d11402b209f6491b9fcc8f85d184bdba~mv2.jpg/v1/fill/w_980,h_983,al_c,q_85,usm_0.66_1.00_0.01,enc_auto/b670ab_d11402b209f6491b9fcc8f85d184bdba~mv2.jpg)
FIG:10
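A sketch of compiling and fitting the model with Adam; the epoch count, batch size, and weights filename are placeholders:

```python
caption_model.compile(loss='categorical_crossentropy', optimizer='adam')

# X1_train, X2_train, y_train would be stacked from make_pairs() over all
# training captions (or streamed from a generator to save memory).
caption_model.fit([X1_train, X2_train], y_train, epochs=10, batch_size=64)
caption_model.save_weights('caption_model.hdf5')
```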
Testing and evaluating on the dataset images:
Our model is ready; now we need to test it. We will define a generateCaption function for this purpose.
![](https://static.wixstatic.com/media/b670ab_cca5573127e44615b91a52ac85ac64ea~mv2.jpg/v1/fill/w_976,h_220,al_c,q_80,enc_auto/b670ab_cca5573127e44615b91a52ac85ac64ea~mv2.jpg)
FIG:11
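A sketch of such a generateCaption function using greedy decoding (taking the most likely word at every step); beam search is a common alternative:

```python
def generateCaption(photo):
    """Start from the start token and repeatedly append the most likely next word
    until the stop token is produced or max_length is reached."""
    in_text = START
    for _ in range(max_length):
        seq = [wordtoidx[w] for w in in_text.split() if w in wordtoidx]
        seq = pad_sequences([seq], maxlen=max_length)
        yhat = caption_model.predict([photo.reshape(1, OUTPUT_DIM), seq], verbose=0)
        word = idxtoword.get(int(np.argmax(yhat)))
        if word is None or word == STOP:   # padding index or stop token ends decoding
            break
        in_text += ' ' + word
    # Strip the start token before returning the caption.
    return ' '.join(in_text.split()[1:])

# Example usage on an encoded test image (encoding_test is a hypothetical
# dict built the same way as encoding_train):
# print(generateCaption(encoding_test[some_test_image]))
```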
Input image from the test dataset, shown in Fig. 12:
![](https://static.wixstatic.com/media/6e3b57_d4e23060a5834ec3b83043d909ccaa47~mv2.jpeg/v1/fill/w_590,h_336,al_c,q_80,enc_auto/6e3b57_d4e23060a5834ec3b83043d909ccaa47~mv2.jpeg)
FIG:12
Output caption: A dog is chasing balls.
Thus, we learnt how to effectively detect objects in an image and generate a suitable caption to describe it.
REFERENCES
GitHub repository: https://github.com/himanshubobade/Image_captioning
YouTube channel: https://www.youtube.com/channel/UCR1-GEpyOPzT2AO4D_eifdw