I have implemented image captioning with two approaches:
V1). Implemented a CNN-RNN (LSTM) architecture to convert an image into a caption, which achieved a loss of 2.689 on the Flickr8K dataset.
V2). Implemented a CNN-RNN architecture with an attention mechanism to achieve better accuracy. Used the larger MSCOCO dataset of 327,437 sample images, which achieved a loss of 1.625.
Local Attention: Because global attention attends to all source-side words for every target word, it is computationally expensive and impractical when translating long sentences. To overcome this deficiency, local attention focuses only on a small subset of the encoder's hidden states per target word.
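To make the difference concrete, here is a minimal PyTorch sketch (with assumed tensor shapes, window radius D, and window center p_t; none of these come from the project code) showing that global attention scores every encoder state while local attention scores only a small window around a chosen position:

```python
import torch

# Hedged sketch: global vs. (monotonic) local attention over encoder states.
S, H = 50, 512                  # source length and hidden size (assumed)
enc_states = torch.randn(S, H)  # encoder hidden states h_bar_s
dec_state  = torch.randn(H)     # current decoder hidden state h_t

# Global attention: score every source position.
global_scores  = enc_states @ dec_state            # (S,)
global_weights = torch.softmax(global_scores, 0)   # attends to all S positions

# Local attention: score only a window of width 2D+1 around position p_t.
D, p_t = 5, 20                                     # assumed window radius / center
window        = enc_states[p_t - D : p_t + D + 1]  # (2D+1, H) subset of states
local_scores  = window @ dec_state
local_weights = torch.softmax(local_scores, 0)     # attends to 2D+1 positions only
```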
Every location of a convolution layer's output corresponds to some region of the image, as shown below.
For example, the output of the 5th convolution layer of Inception is a 14 × 14 × 512 feature map. This layer has 14 × 14 pixel locations, each corresponding to a certain portion of the image, which gives us 196 such locations. We can therefore treat these 196 locations (each with a 512-dimensional representation) as the features the decoder attends over.
The model then learns an attention distribution over these 196 locations (which in turn correspond to actual regions of the image).
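The sketch below shows one way such soft attention over the flattened 14 × 14 feature map could look. The layer sizes (512-dimensional features, 256-dimensional decoder state), module names, and the additive scoring function are illustrative assumptions, not the project's actual code:

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Soft attention over the 196 = 14*14 feature-map locations (sketch)."""
    def __init__(self, feature_dim=512, hidden_dim=256, attn_dim=256):
        super().__init__()
        self.feat_proj   = nn.Linear(feature_dim, attn_dim)  # project image features
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)   # project decoder state
        self.score       = nn.Linear(attn_dim, 1)            # scalar score per location

    def forward(self, features, hidden):
        # features: (batch, 196, 512) -- flattened 14x14 feature map
        # hidden:   (batch, 256)      -- current decoder hidden state
        e = self.score(torch.tanh(
            self.feat_proj(features) + self.hidden_proj(hidden).unsqueeze(1)
        )).squeeze(-1)                           # (batch, 196) unnormalized scores
        alpha = torch.softmax(e, dim=1)          # attention over the 196 locations
        context = (alpha.unsqueeze(-1) * features).sum(dim=1)  # (batch, 512)
        return context, alpha

# Usage: flatten the CNN feature map before attending.
feat_map = torch.randn(4, 14, 14, 512)           # conv output for a batch of 4
features = feat_map.view(4, 196, 512)
attn = SoftAttention()
context, alpha = attn(features, torch.randn(4, 256))
```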
Let's discuss the equations for local attention and global attention with the general score:
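As a reference, the standard formulation from Luong et al. (2015) with the general score is sketched below; the notation (h_t for the decoder hidden state, h̄_s for the encoder hidden state at source position s, S for the source length, D for the local window radius) is an assumption about the variant used here:

```latex
% Global attention with the general score
\mathrm{score}(h_t, \bar{h}_s) = h_t^{\top} W_a \bar{h}_s
\qquad
a_t(s) = \frac{\exp\big(\mathrm{score}(h_t, \bar{h}_s)\big)}
              {\sum_{s'} \exp\big(\mathrm{score}(h_t, \bar{h}_{s'})\big)}
\qquad
c_t = \sum_{s} a_t(s)\, \bar{h}_s

% Local attention: predict a window center p_t, then reweight the
% alignment scores with a Gaussian centered at p_t (window [p_t - D, p_t + D])
p_t = S \cdot \sigma\!\left(v_p^{\top} \tanh(W_p h_t)\right)
\qquad
a_t(s) = \mathrm{align}(h_t, \bar{h}_s)\,
         \exp\!\left(-\frac{(s - p_t)^2}{2\sigma^2}\right),
\quad \sigma = \frac{D}{2}
```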
Sample captions generated by the model:

- pizza with pier and paper on it
- plane bear with sized zebras on it
- zebra standing next to car in batter
- woman in lamb group is holding skis
- display case filled with lots of different kinds of donuts
- wall mens truck is parked in grass