Merge pull request #2531 from kxk302/RNN
Added slides and minor revision to tutorial
shiltemann authored May 31, 2021
2 parents cac9ffd + 967e644 commit 7b17c5e
Showing 4 changed files with 523 additions and 13 deletions.
178 changes: 178 additions & 0 deletions topics/statistics/tutorials/RNN/slides.html
@@ -0,0 +1,178 @@
---
layout: tutorial_slides
logo: GTN
title: "Recurrent neural networks (RNN) \n Deep Learning - Part 2"
zenodo_link: "https://zenodo.org/record/4477881"
questions:
- "What is a recurrent neural network (RNN)?"
- "What are some applications of RNN?"
objectives:
- "Understand the difference between feedforward neural networks (FNN) and RNN"
- "Learn various RNN types and architectures"
- "Learn how to create a neural network using Galaxy’s deep learning tools"
- "Solve a sentiment analysis problem on IMDB movie review dataset using RNN in Galaxy"
key_points:
requirements:
  -
    type: internal
    topic_name: statistics
    tutorials:
      - intro_deep_learning
      - FNN
contributors:
- kxk302
---

# What is a recurrent neural network (RNN)?

???

What is a recurrent neural network (RNN)?

---

# Recurrent Neural Network (RNN)

- RNN models sequential data (temporal/ordinal)
- In RNN, a training example is a sequence, which is presented to the RNN one element at a time
  - E.g., a sequence of English words is passed to the RNN, one at a time
  - And the RNN generates a sequence of Persian words, one at a time
- In RNN, the output of the network at time *t* is used as input at time *t+1*
- RNN is applied to image captioning, machine translation, sentiment analysis, etc.
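
A minimal NumPy sketch of this recurrence (toy sizes and weight names below are illustrative assumptions; a real layer such as Keras' `SimpleRNN` handles this internally):

```python
import numpy as np

# Toy dimensions; all names and sizes here are assumptions for illustration
n_steps, n_input, n_hidden = 4, 3, 5

rng = np.random.default_rng(0)
x = rng.normal(size=(n_steps, n_input))      # one input sequence (one training example)
W_x = rng.normal(size=(n_hidden, n_input))   # input-to-hidden weights
W_h = rng.normal(size=(n_hidden, n_hidden))  # hidden-to-hidden (recurrent) weights
b = np.zeros(n_hidden)

h = np.zeros(n_hidden)                       # initial hidden state
for t in range(n_steps):
    # the state computed at time t is fed back in at time t+1
    h = np.tanh(W_x @ x[t] + W_h @ h + b)

print(h)  # final state, e.g. passed to a classifier in a many-to-one RNN
```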

---

# One-to-many RNN

![Neurons forming a one-to-many recurrent neural network]({{site.baseurl}}/topics/statistics/images/RNN_1_to_n.png) <!-- https://pixy.org/3013900/ CC0 license-->

---

# Many-to-one RNN

![Neurons forming a many-to-one recurrent neural network]({{site.baseurl}}/topics/statistics/images/RNN_n_to_1.png) <!-- https://pixy.org/3013900/ CC0 license-->

---

# Many-to-many RNN

![Neurons forming a many-to-many recurrent neural network]({{site.baseurl}}/topics/statistics/images/RNN_n_to_m.png) <!-- https://pixy.org/3013900/ CC0 license-->

---

# RNN architectures

- Vanilla RNN
  - Suffers from the *vanishing gradient* problem
- LSTM and GRU
  - Use *gates* to avoid the vanishing gradient problem
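
A hedged Keras sketch of these architectures (layer and vocabulary sizes are arbitrary assumptions); swapping the vanilla RNN for a gated layer is a one-line change:

```python
from tensorflow.keras import layers, models

def build_model(rnn_layer):
    # Assumed sizes: 10,000-word vocabulary, 32-dimensional embeddings, binary output
    model = models.Sequential([
        layers.Embedding(input_dim=10000, output_dim=32),
        rnn_layer,                              # the only part that changes
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

vanilla = build_model(layers.SimpleRNN(64))  # vanilla RNN; prone to vanishing gradients
lstm = build_model(layers.LSTM(64))          # gated; mitigates vanishing gradients
gru = build_model(layers.GRU(64))            # gated; fewer parameters than LSTM
```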

---

# Sentiment analysis

- We perform sentiment analysis on the IMDB movie reviews dataset
- Train RNN on the training dataset (25,000 positive/negative movie reviews)
- Test RNN on the test set (25,000 positive/negative movie reviews)
- Training and test sets have no overlap
- Since we are dealing with text data, it is good to review mechanisms for representing text

---

# Text preprocessing

- Tokenize a document, i.e., break it down into words
- Remove punctuation, URLs, and stop words (‘a’, ‘of’, ‘the’, etc.)
- Normalize the text, e.g., replace ‘brb’ with ‘Be right back’, etc.
- Run the spell checker to fix typos
- Make all words lowercase
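
A rough sketch of these steps in plain Python (the stop-word list, slang map, and regular expressions are illustrative assumptions; spell checking is omitted, and libraries such as NLTK or spaCy offer more complete implementations):

```python
import re

STOP_WORDS = {"a", "an", "of", "the", "to", "was"}  # tiny illustrative list
NORMALIZE = {"brb": "be right back"}                # illustrative slang map

def preprocess(text):
    text = text.lower()                                   # lowercase
    text = re.sub(r"https?://\S+", " ", text)             # remove URLs
    for slang, expansion in NORMALIZE.items():            # normalize slang
        text = re.sub(rf"\b{slang}\b", expansion, text)
    text = re.sub(r"[^a-z\s]", " ", text)                 # remove punctuation and digits
    tokens = text.split()                                 # tokenize
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(preprocess("BRB, check https://example.org -- the movie was great!"))
# e.g. ['be', 'right', 'back', 'check', 'movie', 'great']
```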

---

# Text preprocessing

- Perform stemming/lemmatization
- If we have words like ‘organizer’, ‘organize’, and ‘organized’
  - We want to reduce all of them to a single root
- Stemming cuts off the ends of these words to obtain a single root
  - E.g., ‘organiz’, which may not be an actual word
- Lemmatization reduces them to a root that is an actual word
  - E.g., ‘organize’
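
A sketch using NLTK (assumes the `nltk` package is installed and the WordNet data has been downloaded; the exact roots returned may differ slightly from the examples above):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # the lemmatizer needs the WordNet corpus

words = ["organizer", "organize", "organized"]

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming: crude suffix stripping; the root may not be an actual word
print([stemmer.stem(w) for w in words])

# Lemmatization: dictionary-based; the root is an actual word
print([lemmatizer.lemmatize(w, pos="v") for w in words])
```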

---

# Bag of words (BoW)

- If you don’t care about the order of the words in a document
- 2D array. Rows represent documents. Columns represent words in the vocabulary
  - The vocabulary is all unique words in all documents
- If a word is not present in a document, we have a zero at the corresponding row and column entry
- If a word is present in a document, we have a one at the corresponding row and column entry
- Or, we could use the word count or frequency

---

# Bag of words (BoW)

- Document 1: Magic passed the basketball to Kareem
- Document 2: Lebron stole the basketball from Curry

![Table showing a bag-of-words representation of sample documents]({{site.baseurl}}/topics/statistics/images/BoW.png) <!-- https://pixy.org/3013900/ CC0 license-->

- BoW is simple, but does not consider the rarity of words across documents
  - Rare words, unlike common words, are important for document classification
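
A sketch of this BoW table with scikit-learn (assuming scikit-learn ≥ 1.0, where the vocabulary accessor is `get_feature_names_out`; by default `CountVectorizer` stores word counts rather than 0/1 flags):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Magic passed the basketball to Kareem",
    "Lebron stole the basketball from Curry",
]

vectorizer = CountVectorizer()            # pass binary=True for 0/1 entries instead of counts
bow = vectorizer.fit_transform(docs)      # 2 rows (documents) x vocabulary-size columns

print(vectorizer.get_feature_names_out()) # the vocabulary: all unique (lowercased) words
print(bow.toarray())                      # one row per document, one column per word
```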

---

# Term frequency inverse document frequency (TF-IDF)

- If you don’t care about the order of the words in a document
- Similar to BoW, we have an entry for each document-word pair
- The entry is the product of
  - Term frequency (TF): the frequency of a word in a document, and
  - Inverse document frequency (IDF): the total number of documents divided by the number of documents that contain the word
- We usually use the logarithm of the IDF
- TF-IDF takes into account the rarity of a word across documents
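
The same two documents with scikit-learn's `TfidfVectorizer`, as a sketch (scikit-learn uses a smoothed, logarithmic IDF by default, so the exact values differ slightly from the plain formula above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Magic passed the basketball to Kareem",
    "Lebron stole the basketball from Curry",
]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)       # one TF-IDF weight per document-word pair

# Words shared by both documents ('the', 'basketball') get lower weights
# than words that appear in only one document
print(tfidf.get_feature_names_out())
print(weights.toarray().round(2))
```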

---

# One-hot encoding (OHE)

- A technique to convert categorical variables, such as words, into vectors
- Suppose our vocabulary has 3 words: orange, apple, banana
- Each word is represented by a vector of size 3

![Mathematical vectors representing one-hot-encoding representation of words orange, apple, and banana]({{site.baseurl}}/topics/statistics/images/OHE.gif) <!-- https://pixy.org/3013900/ CC0 license-->

- OHE problems
  - Very large vocabulary sizes require a tremendous amount of storage
  - Also, no concept of word similarity
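
A minimal sketch of one-hot encoding this three-word vocabulary in plain NumPy (in practice, utilities such as Keras' `to_categorical` or scikit-learn's `OneHotEncoder` do the same job):

```python
import numpy as np

vocabulary = ["orange", "apple", "banana"]
index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    vec = np.zeros(len(vocabulary), dtype=int)  # one dimension per vocabulary word
    vec[index[word]] = 1                        # a single 1 at the word's position
    return vec

for w in vocabulary:
    print(w, one_hot(w))

# Every pair of one-hot vectors is orthogonal, so OHE carries no notion of
# word similarity, and the vector length grows with the vocabulary size
```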

---

# Word2Vec

- Each word is represented as an *n*-dimensional vector
  - *n* is much smaller than the vocabulary size
- Words that have similar meanings are close in the vector space
  - Words are considered similar if they often co-occur in documents
- Two Word2Vec architectures
  - Continuous BoW
    - Predicts the probability of a word given the surrounding words
  - Continuous skip-gram
    - Given a word, predicts the probability of the surrounding words
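
A sketch with gensim (assuming gensim 4.x, where the embedding-size parameter is `vector_size`; the toy corpus below is an illustrative assumption and far too small to yield meaningful vectors):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (real training needs far more text)
sentences = [
    ["magic", "passed", "the", "basketball", "to", "kareem"],
    ["lebron", "stole", "the", "basketball", "from", "curry"],
]

# sg=0 trains a continuous BoW model; sg=1 trains a continuous skip-gram model
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

print(model.wv["basketball"])              # a 50-dimensional vector for one word
print(model.wv.most_similar("basketball")) # nearest words in the vector space
```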

---

# Sentiment analysis

- Sentiment classification of IMDB movie reviews with RNN
- Train RNN using IMDB movie reviews
- Goal is to learn a model such that, given a review, we predict whether the review is positive or negative
- We evaluate the trained RNN on the test dataset and plot the confusion matrix
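
For orientation, here is a rough Keras sketch of the same workflow outside Galaxy (the 10,000-word vocabulary, 500-word review limit, layer sizes, and epoch count are assumptions for illustration; the Galaxy tools used in the tutorial may use different settings):

```python
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocab_size, max_len = 10000, 500          # assumed limits, matching the tutorial text

# 25,000 labeled reviews each for training and testing, with no overlap
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
x_train = pad_sequences(x_train, maxlen=max_len)
x_test = pad_sequences(x_test, maxlen=max_len)

model = models.Sequential([
    layers.Embedding(vocab_size, 32),      # learned word vectors
    layers.LSTM(64),                       # gated RNN; many-to-one
    layers.Dense(1, activation="sigmoid"), # positive vs. negative sentiment
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=2, batch_size=128, validation_split=0.1)

# Confusion matrix: rows are true labels, columns are predicted labels
pred = (model.predict(x_test) > 0.5).astype(int).ravel()
confusion = np.zeros((2, 2), dtype=int)
for true, p in zip(y_test, pred):
    confusion[true, p] += 1
print(confusion)
```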

---

# For references, please see the tutorial's References section

---
25 changes: 13 additions & 12 deletions topics/statistics/tutorials/RNN/tutorial.md
@@ -62,8 +62,8 @@ describe various RNN architectures and solve a sentiment analysis problem using

In feedforward neural networks (FNN) a single training example is presented to the network,
after which the the network generates an output. For example, a lung X-ray image is passed
to a FNN, and the network predicts tumor or no tumor. By contrast, in RNN a sequence of
training examples are presented to the network one at a time. For example, a sequence of
to a FNN, and the network predicts tumor or no tumor. By contrast, in RNN a training example
is a sequence, which is presented to the network one at a time. For example, a sequence of
English words is passed to a RNN, one at a time, and the network generates a sequence of
Persian words, one at a time. RNN handle sequential data, whether its temporal or ordinal.

@@ -117,7 +117,7 @@ Unlike FNN, in RNN the output of the network at time t is used as network input
## Possible RNN inputs/outputs

There are 4 possible input/output combinations for RNN and each have a specific application. One-to-one is basically a FNN. One-to-many,
where we have one input and a variable number of output. One example application is image captioning, where a single image is provided
where we have one input and a variable number of outputs. One example application is image captioning, where a single image is provided
as input and a variable number of words (which caption the image) is returned as output (See Figure 7).

![Neurons forming a one-to-many recurrent neural network](../../images/RNN_1_to_n.png "One-to-many RNN")
@@ -138,14 +138,15 @@ we pass in n words in English and get m words in Italian (See Figure 9).

Mainly, there are three types of RNN: 1) Vanilla RNN, 2) LSTM ({% cite hochreiter1997long %}), and 3) GRU ({% cite cho-etal-2014-learning %}).
A Vanilla RNN, simply combines the state information from the previous timestamp with the input from the current timestamp to generate the
state information for current timestamp. The problem with Vanilla RNN is that training deep RNN networks is impossible due to the
**vanishing gradient** problem. Basically, starting from the output layer, in order to determine weights/biases updates, we need to calculate
the derivative of the loss function relative to the layers input, which is usually a small number. This is not a problem for the output layer,
but for the previous layers, this process must be repeated recursively, resulting in very small updates to weights/biases of the initial layers
of the RNN, halting the learning process.
state information and output for the current timestamp. The problem with Vanilla RNN is that training deep RNN networks is impossible due to the
**vanishing gradient** problem. Basically, weights/biases are updated according to the gradient of the loss function relative to
the weights/biases. The gradients are calculated recursively from the output layer towards the input layer (hence the name *backpropagation*).
The gradient of the input layer is the product of the gradients of the subsequent layers. If those gradients are small, the gradient of the input
layer (which is the product of multiple small values) will be very small, resulting in very small updates to the weights/biases of the initial layers
of the RNN, effectively halting the learning process.

LSTM and GRU are two RNN architectures that address vanishing gradient problem. Full description of LSTM/GRU is beyond the scope of this
tutorial (Please refer to ref1 and ref2), but in a nutshell both LSTM and GRU use **gates** such that the weights/biases updates in previous
tutorial (Please refer to {% cite hochreiter1997long %} and {% cite cho-etal-2014-learning %}), but in a nutshell both LSTM and GRU use **gates** such that the weights/biases updates in previous
layers are calculated via a series of additions (not multiplications). Hence, these architectures can learn even when the RNN has hundreds or
thousands of layers.

Expand All @@ -170,7 +171,7 @@ the next 10,000 words in our dataset. Reviews are limited to 500 words. They are

## Bag of words and TF-IDF

If you don't care about the order of the words in a document, you can use bag of words (BoW) or text frequency inverse document frequency (TF-IDF).
If you don't care about the order of the words in a document, you can use bag of words (BoW) or term frequency inverse document frequency (TF-IDF).
In these models we have a 2 dimensional array. The rows represent the documents (in our example, the movie reviews) and the columns
represent the words in our vocabulary (all the unique words in all the documents). If a word is not present in a document, we have a zero
at the corresponding row and column as the entry. If a word is present in the document, we have a one as the entry -- Alternatively, we could use
@@ -184,7 +185,7 @@ representation of these documents is given in Figure 10.
BoW's advantage is its simplicity, yet it does not take into account the rarity of a word across documents, which unlike common words are
important for document classification.

In TF-IDF, similar to BoW we have an entry for each document-word pair. In TD-IDF, the entry is the product of 1) Text frequency, the
In TF-IDF, similar to BoW we have an entry for each document-word pair. In TF-IDF, the entry is the product of 1) Term frequency, the
frequency of a word in a document, and 2) Inverse document frequency, the inverse of the number of documents that have the word divided
by the total number of documents (we usually use logarithm of the IDF).

@@ -380,7 +381,7 @@ Figure 12 is the resultant confusion matrix for our sentiment analysis problem.
class labels (we have 10,397 + 2,103 = 12,500 reviews with negative sentiment). The second row represents the *true* 1 (or positive sentiment) class labels
(Again, we have 1,281 + 11,219 = 12,500 reviews with positive sentiment). The left column represents the *predicted* negative sentiment class labels (Our RNN
predicted 10,397 + 1,281 = 11,678 reviews as having a negative sentiment). The right column represents the *predicted* positive class labels (Our RNN
predicted 11,219 + 2,103 = 13,322 reviews as having a positive sentiment).Looking at the bottom right cell, we seethat our RNN has correctly predicted 11,219
predicted 11,219 + 2,103 = 13,322 reviews as having a positive sentiment). Looking at the bottom right cell, we see that our RNN has correctly predicted 11,219
reviews as having a positive sentiment (True positives). Looking at the top right cell, we see that our RNN has incorrectly predicted 2,103 reviews as having
a positive (False positives). Similarly, looking at the top left cell, we see that our RNN has correctly predicted 10,397 reviews as having negative sentiment
(True negative). Finally, looking at the bottom left cell, we see that our RNN has incorrectly predicted 1,281 reviews as negative (False negative). Given
