I have implemented image captioning with two approaches:
V1). Implemented a CNN-RNN (LSTM) architecture to convert an image into a caption, which achieved a loss of 2.689 on the Flickr8K dataset.
V2). Implemented a CNN-RNN architecture with an attention mechanism to achieve better accuracy. Used the larger MSCOCO dataset of 327,437 sample images, which achieved a loss of 1.625.
Local Attention: Because global attention attends to all source-side words for every target word, it is computationally expensive and impractical when translating long sentences. To overcome this deficiency, local attention focuses only on a small subset of the encoder's hidden states per target word.
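To make the difference concrete, here is a minimal PyTorch sketch (with assumed tensor shapes, window radius D, and window center p_t; none of these come from the project code) showing that global attention scores every encoder state while local attention scores only a small window around a chosen position:

```python
import torch

# Hedged sketch: global vs. (monotonic) local attention over encoder states.
S, H = 50, 512                  # source length and hidden size (assumed)
enc_states = torch.randn(S, H)  # encoder hidden states h_bar_s
dec_state  = torch.randn(H)     # current decoder hidden state h_t

# Global attention: score every source position.
global_scores  = enc_states @ dec_state            # (S,)
global_weights = torch.softmax(global_scores, 0)   # attends to all S positions

# Local attention: score only a window of width 2D+1 around position p_t.
D, p_t = 5, 20                                     # assumed window radius / center
window        = enc_states[p_t - D : p_t + D + 1]  # (2D+1, H) subset of states
local_scores  = window @ dec_state
local_weights = torch.softmax(local_scores, 0)     # attends to 2D+1 positions only
```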
Every location of a convolution layer's output corresponds to some region of the image, as shown below.
For example, the output of the 5th convolution layer of Inception is a 14 × 14 × 512 feature map. This layer has 14 × 14 pixel locations, each corresponding to a certain portion of the image, which gives us 196 such locations. We can therefore treat these 196 locations (each with a 512-dimensional representation) as the features the decoder attends over.
The model then learns an attention distribution over these 196 locations (which in turn correspond to actual regions of the image).
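The sketch below shows one way such soft attention over the flattened 14 × 14 feature map could look. The layer sizes (512-dimensional features, 256-dimensional decoder state), module names, and the additive scoring function are illustrative assumptions, not the project's actual code:

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Soft attention over the 196 = 14*14 feature-map locations (sketch)."""
    def __init__(self, feature_dim=512, hidden_dim=256, attn_dim=256):
        super().__init__()
        self.feat_proj   = nn.Linear(feature_dim, attn_dim)  # project image features
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)   # project decoder state
        self.score       = nn.Linear(attn_dim, 1)            # scalar score per location

    def forward(self, features, hidden):
        # features: (batch, 196, 512) -- flattened 14x14 feature map
        # hidden:   (batch, 256)      -- current decoder hidden state
        e = self.score(torch.tanh(
            self.feat_proj(features) + self.hidden_proj(hidden).unsqueeze(1)
        )).squeeze(-1)                           # (batch, 196) unnormalized scores
        alpha = torch.softmax(e, dim=1)          # attention over the 196 locations
        context = (alpha.unsqueeze(-1) * features).sum(dim=1)  # (batch, 512)
        return context, alpha

# Usage: flatten the CNN feature map before attending.
feat_map = torch.randn(4, 14, 14, 512)           # conv output for a batch of 4
features = feat_map.view(4, 196, 512)
attn = SoftAttention()
context, alpha = attn(features, torch.randn(4, 256))
```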
Let's discuss the equations for local attention and global attention with the general score:
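As a reference, the standard formulation from Luong et al. (2015) with the general score is sketched below; the notation (h_t for the decoder hidden state, h̄_s for the encoder hidden state at source position s, S for the source length, D for the local window radius) is an assumption about the variant used here:

```latex
% Global attention with the general score
\mathrm{score}(h_t, \bar{h}_s) = h_t^{\top} W_a \bar{h}_s
\qquad
a_t(s) = \frac{\exp\big(\mathrm{score}(h_t, \bar{h}_s)\big)}
              {\sum_{s'} \exp\big(\mathrm{score}(h_t, \bar{h}_{s'})\big)}
\qquad
c_t = \sum_{s} a_t(s)\, \bar{h}_s

% Local attention: predict a window center p_t, then reweight the
% alignment scores with a Gaussian centered at p_t (window [p_t - D, p_t + D])
p_t = S \cdot \sigma\!\left(v_p^{\top} \tanh(W_p h_t)\right)
\qquad
a_t(s) = \mathrm{align}(h_t, \bar{h}_s)\,
         \exp\!\left(-\frac{(s - p_t)^2}{2\sigma^2}\right),
\quad \sigma = \frac{D}{2}
```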
Sample captions generated by the model:

- pizza with pier and paper on it
- plane bear with sized zebras on it
- zebra standing next to car in batter
- woman in lamb group is holding skis
- display case filled with lots of different kinds of donuts
- wall mens truck is parked in grass