Merge pull request #2531 from kxk302/RNN
Added slides and minor revision to tutorial
shiltemann authored May 31, 2021
2 parents cac9ffd + 967e644 commit 7b17c5e
Showing 4 changed files with 523 additions and 13 deletions.
178 changes: 178 additions & 0 deletions topics/statistics/tutorials/RNN/slides.html
@@ -0,0 +1,178 @@
---
layout: tutorial_slides
logo: GTN
title: "Recurrent neural networks (RNN) \n Deep Learning - Part 2"
zenodo_link: "https://zenodo.org/record/4477881"
questions:
- "What is a recurrent neural network (RNN)?"
- "What are some applications of RNN?"
objectives:
- "Understand the difference between feedforward neural networks (FNN) and RNN"
- "Learn various RNN types and architectures"
- "Learn how to create a neural network using Galaxy’s deep learning tools"
- "Solve a sentiment analysis problem on IMDB movie review dataset using RNN in Galaxy"
key_points:
requirements:
  -
    type: internal
    topic_name: statistics
    tutorials:
      - intro_deep_learning
      - FNN
contributors:
- kxk302
---

# What is a recurrent neural network (RNN)?

???

What is a recurrent neural network (RNN)?

---

# Recurrent Neural Network (RNN)

- RNN models sequential data (temporal/ordinal)
- In RNN, a training example is a sequence, which is presented to the RNN one element at a time
  - E.g., a sequence of English words is passed to the RNN, one at a time
  - And the RNN generates a sequence of Persian words, one at a time
- In RNN, the output of the network at time *t* is used as input at time *t+1*
- RNN is applied to image captioning, machine translation, sentiment analysis, etc.
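
A minimal NumPy sketch of this recurrence (toy sizes and weight names below are illustrative assumptions; a real layer such as Keras' `SimpleRNN` handles this internally):

```python
import numpy as np

# Toy dimensions; all names and sizes here are assumptions for illustration
n_steps, n_input, n_hidden = 4, 3, 5

rng = np.random.default_rng(0)
x = rng.normal(size=(n_steps, n_input))      # one input sequence (one training example)
W_x = rng.normal(size=(n_hidden, n_input))   # input-to-hidden weights
W_h = rng.normal(size=(n_hidden, n_hidden))  # hidden-to-hidden (recurrent) weights
b = np.zeros(n_hidden)

h = np.zeros(n_hidden)                       # initial hidden state
for t in range(n_steps):
    # the state computed at time t is fed back in at time t+1
    h = np.tanh(W_x @ x[t] + W_h @ h + b)

print(h)  # final state, e.g. passed to a classifier in a many-to-one RNN
```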

---

# One-to-many RNN

![Neurons forming a one-to-many recurrent neural network]({{site.baseurl}}/topics/statistics/images/RNN_1_to_n.png) <!-- https://pixy.org/3013900/ CC0 license-->

---

# Many-to-one RNN

![Neurons forming a many-to-one recurrent neural network]({{site.baseurl}}/topics/statistics/images/RNN_n_to_1.png) <!-- https://pixy.org/3013900/ CC0 license-->

---

# Many-to-many RNN

![Neurons forming a many-to-many recurrent neural network]({{site.baseurl}}/topics/statistics/images/RNN_n_to_m.png) <!-- https://pixy.org/3013900/ CC0 license-->

---

# RNN architectures

- Vanilla RNN
  - Suffers from the *vanishing gradient* problem
- LSTM and GRU
  - Use *gates* to avoid the vanishing gradient problem
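
A hedged Keras sketch of these architectures (layer and vocabulary sizes are arbitrary assumptions); swapping the vanilla RNN for a gated layer is a one-line change:

```python
from tensorflow.keras import layers, models

def build_model(rnn_layer):
    # Assumed sizes: 10,000-word vocabulary, 32-dimensional embeddings, binary output
    model = models.Sequential([
        layers.Embedding(input_dim=10000, output_dim=32),
        rnn_layer,                              # the only part that changes
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

vanilla = build_model(layers.SimpleRNN(64))  # vanilla RNN; prone to vanishing gradients
lstm = build_model(layers.LSTM(64))          # gated; mitigates vanishing gradients
gru = build_model(layers.GRU(64))            # gated; fewer parameters than LSTM
```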

---

# Sentiment analysis

- We perform sentiment analysis on the IMDB movie reviews dataset
- Train RNN on the training dataset (25,000 positive/negative movie reviews)
- Test RNN on the test set (25,000 positive/negative movie reviews)
- Training and test sets have no overlap
- Since we are dealing with text data, it is good to review mechanisms for representing text

---

# Text preprocessing

- Tokenize a document, i.e., break it down into words
- Remove punctuation, URLs, and stop words (‘a’, ‘of’, ‘the’, etc.)
- Normalize the text, e.g., replace ‘brb’ with ‘Be right back’, etc.
- Run the spell checker to fix typos
- Make all words lowercase
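
A rough sketch of these steps in plain Python (the stop-word list, slang map, and regular expressions are illustrative assumptions; spell checking is omitted, and libraries such as NLTK or spaCy offer more complete implementations):

```python
import re

STOP_WORDS = {"a", "an", "of", "the", "to", "was"}  # tiny illustrative list
NORMALIZE = {"brb": "be right back"}                # illustrative slang map

def preprocess(text):
    text = text.lower()                                   # lowercase
    text = re.sub(r"https?://\S+", " ", text)             # remove URLs
    for slang, expansion in NORMALIZE.items():            # normalize slang
        text = re.sub(rf"\b{slang}\b", expansion, text)
    text = re.sub(r"[^a-z\s]", " ", text)                 # remove punctuation and digits
    tokens = text.split()                                 # tokenize
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(preprocess("BRB, check https://example.org -- the movie was great!"))
# e.g. ['be', 'right', 'back', 'check', 'movie', 'great']
```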

---

# Text preprocessing

- Perform stemming/lemmatization
- If we have words like ‘organizer’, ‘organize’, and ‘organized’
  - We want to reduce all of them to a single root
- Stemming cuts off the ends of these words to obtain a single root
  - E.g., ‘organiz’, which may not be an actual word
- Lemmatization reduces them to a root that is an actual word
  - E.g., ‘organize’
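
A sketch using NLTK (assumes the `nltk` package is installed and the WordNet data has been downloaded; the exact roots returned may differ slightly from the examples above):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # the lemmatizer needs the WordNet corpus

words = ["organizer", "organize", "organized"]

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming: crude suffix stripping; the root may not be an actual word
print([stemmer.stem(w) for w in words])

# Lemmatization: dictionary-based; the root is an actual word
print([lemmatizer.lemmatize(w, pos="v") for w in words])
```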

---

# Bag of words (BoW)

- If you don’t care about the order of the words in a document
- 2D array. Rows represent documents. Columns represent words in the vocabulary
  - The vocabulary is all unique words in all documents
- If a word is not present in a document, we have a zero at the corresponding row and column entry
- If a word is present in a document, we have a one at the corresponding row and column entry
- Or, we could use the word count or frequency

---

# Bag of words (BoW)

- Document 1: Magic passed the basketball to Kareem
- Document 2: Lebron stole the basketball from Curry

![Table showing a bag-of-words representation of sample documents]({{site.baseurl}}/topics/statistics/images/BoW.png) <!-- https://pixy.org/3013900/ CC0 license-->

- BoW is simple, but does not consider the rarity of words across documents
  - Rare words, unlike common words, are important for document classification
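
A sketch of this BoW table with scikit-learn (assuming scikit-learn ≥ 1.0, where the vocabulary accessor is `get_feature_names_out`; by default `CountVectorizer` stores word counts rather than 0/1 flags):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Magic passed the basketball to Kareem",
    "Lebron stole the basketball from Curry",
]

vectorizer = CountVectorizer()            # pass binary=True for 0/1 entries instead of counts
bow = vectorizer.fit_transform(docs)      # 2 rows (documents) x vocabulary-size columns

print(vectorizer.get_feature_names_out()) # the vocabulary: all unique (lowercased) words
print(bow.toarray())                      # one row per document, one column per word
```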

---

# Term frequency inverse document frequency (TF-IDF)

- If you don’t care about the order of the words in a document
- Similar to BoW, we have an entry for each document-word pair
- The entry is the product of
  - Term frequency (TF): the frequency of a word in a document, and
  - Inverse document frequency (IDF): the total number of documents divided by the number of documents that contain the word
- We usually use the logarithm of the IDF
- TF-IDF takes into account the rarity of a word across documents
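
The same two documents with scikit-learn's `TfidfVectorizer`, as a sketch (scikit-learn uses a smoothed, logarithmic IDF by default, so the exact values differ slightly from the plain formula above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Magic passed the basketball to Kareem",
    "Lebron stole the basketball from Curry",
]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)       # one TF-IDF weight per document-word pair

# Words shared by both documents ('the', 'basketball') get lower weights
# than words that appear in only one document
print(tfidf.get_feature_names_out())
print(weights.toarray().round(2))
```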

---

# One-hot encoding (OHE)

- A technique to convert categorical variables, such as words, into vectors
- Suppose our vocabulary has 3 words: orange, apple, banana
- Each word is represented by a vector of size 3

![Mathematical vectors representing one-hot-encoding representation of words orange, apple, and banana]({{site.baseurl}}/topics/statistics/images/OHE.gif) <!-- https://pixy.org/3013900/ CC0 license-->

- OHE problems
  - Very large vocabulary sizes require a tremendous amount of storage
  - Also, no concept of word similarity
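
A minimal sketch of one-hot encoding this three-word vocabulary in plain NumPy (in practice, utilities such as Keras' `to_categorical` or scikit-learn's `OneHotEncoder` do the same job):

```python
import numpy as np

vocabulary = ["orange", "apple", "banana"]
index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    vec = np.zeros(len(vocabulary), dtype=int)  # one dimension per vocabulary word
    vec[index[word]] = 1                        # a single 1 at the word's position
    return vec

for w in vocabulary:
    print(w, one_hot(w))

# Every pair of one-hot vectors is orthogonal, so OHE carries no notion of
# word similarity, and the vector length grows with the vocabulary size
```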

---

# Word2Vec

- Each word is represented as an *n*-dimensional vector
  - *n* is much smaller than the vocabulary size
- Words that have similar meanings are close in the vector space
  - Words are considered similar if they often co-occur in documents
- Two Word2Vec architectures
  - Continuous BoW
    - Predicts the probability of a word given the surrounding words
  - Continuous skip-gram
    - Given a word, predicts the probability of the surrounding words
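
A sketch with gensim (assuming gensim 4.x, where the embedding-size parameter is `vector_size`; the toy corpus below is an illustrative assumption and far too small to yield meaningful vectors):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (real training needs far more text)
sentences = [
    ["magic", "passed", "the", "basketball", "to", "kareem"],
    ["lebron", "stole", "the", "basketball", "from", "curry"],
]

# sg=0 trains a continuous BoW model; sg=1 trains a continuous skip-gram model
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

print(model.wv["basketball"])              # a 50-dimensional vector for one word
print(model.wv.most_similar("basketball")) # nearest words in the vector space
```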

---

# Sentiment analysis

- Sentiment classification of IMDB movie reviews with RNN
- Train RNN using IMDB movie reviews
- Goal is to learn a model such that, given a review, we predict whether the review is positive or negative
- We evaluate the trained RNN on the test dataset and plot the confusion matrix
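
For orientation, here is a rough Keras sketch of the same workflow outside Galaxy (the 10,000-word vocabulary, 500-word review limit, layer sizes, and epoch count are assumptions for illustration; the Galaxy tools used in the tutorial may use different settings):

```python
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocab_size, max_len = 10000, 500          # assumed limits, matching the tutorial text

# 25,000 labeled reviews each for training and testing, with no overlap
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
x_train = pad_sequences(x_train, maxlen=max_len)
x_test = pad_sequences(x_test, maxlen=max_len)

model = models.Sequential([
    layers.Embedding(vocab_size, 32),      # learned word vectors
    layers.LSTM(64),                       # gated RNN; many-to-one
    layers.Dense(1, activation="sigmoid"), # positive vs. negative sentiment
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=2, batch_size=128, validation_split=0.1)

# Confusion matrix: rows are true labels, columns are predicted labels
pred = (model.predict(x_test) > 0.5).astype(int).ravel()
confusion = np.zeros((2, 2), dtype=int)
for true, p in zip(y_test, pred):
    confusion[true, p] += 1
print(confusion)
```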

---

# For references, please see the tutorial's References section

---
25 changes: 13 additions & 12 deletions topics/statistics/tutorials/RNN/tutorial.md
@@ -62,8 +62,8 @@ describe various RNN architectures and solve a sentiment analysis problem using

In feedforward neural networks (FNN) a single training example is presented to the network,
after which the the network generates an output. For example, a lung X-ray image is passed
to a FNN, and the network predicts tumor or no tumor. By contrast, in RNN a sequence of
training examples are presented to the network one at a time. For example, a sequence of
to a FNN, and the network predicts tumor or no tumor. By contrast, in RNN a training example
is a sequence, which is presented to the network one at a time. For example, a sequence of
English words is passed to a RNN, one at a time, and the network generates a sequence of
Persian words, one at a time. RNN handle sequential data, whether its temporal or ordinal.

@@ -117,7 +117,7 @@ Unlike FNN, in RNN the output of the network at time t is used as network input
## Possible RNN inputs/outputs

There are 4 possible input/output combinations for RNN and each have a specific application. One-to-one is basically a FNN. One-to-many,
where we have one input and a variable number of output. One example application is image captioning, where a single image is provided
where we have one input and a variable number of outputs. One example application is image captioning, where a single image is provided
as input and a variable number of words (which caption the image) is returned as output (See Figure 7).

![Neurons forming a one-to-many recurrent neural network](../../images/RNN_1_to_n.png "One-to-many RNN")
@@ -138,14 +138,15 @@ we pass in n words in English and get m words in Italian (See Figure 9).

Mainly, there are three types of RNN: 1) Vanilla RNN, 2) LSTM ({% cite hochreiter1997long %}), and 3) GRU ({% cite cho-etal-2014-learning %}).
A Vanilla RNN, simply combines the state information from the previous timestamp with the input from the current timestamp to generate the
state information for current timestamp. The problem with Vanilla RNN is that training deep RNN networks is impossible due to the
**vanishing gradient** problem. Basically, starting from the output layer, in order to determine weights/biases updates, we need to calculate
the derivative of the loss function relative to the layers input, which is usually a small number. This is not a problem for the output layer,
but for the previous layers, this process must be repeated recursively, resulting in very small updates to weights/biases of the initial layers
of the RNN, halting the learning process.
state information and output for the current timestamp. The problem with Vanilla RNN is that training deep RNN networks is impossible due to the
**vanishing gradient** problem. Basically, weights/biases are updated according to the gradient of the loss function relative to
the weights/biases. The gradients are calculated recursively from the output layer towards the input layer (hence the name *backpropagation*).
The gradient of the input layer is the product of the gradients of the subsequent layers. If those gradients are small, the gradient of the input
layer (which is the product of multiple small values) will be very small, resulting in very small updates to the weights/biases of the initial layers
of the RNN, effectively halting the learning process.

LSTM and GRU are two RNN architectures that address vanishing gradient problem. Full description of LSTM/GRU is beyond the scope of this
tutorial (Please refer to ref1 and ref2), but in a nutshell both LSTM and GRU use **gates** such that the weights/biases updates in previous
tutorial (Please refer to {% cite hochreiter1997long %} and {% cite cho-etal-2014-learning %}), but in a nutshell both LSTM and GRU use **gates** such that the weights/biases updates in previous
layers are calculated via a series of additions (not multiplications). Hence, these architectures can learn even when the RNN has hundreds or
thousands of layers.

Expand All @@ -170,7 +171,7 @@ the next 10,000 words in our dataset. Reviews are limited to 500 words. They are

## Bag of words and TF-IDF

If you don't care about the order of the words in a document, you can use bag of words (BoW) or text frequency inverse document frequency (TF-IDF).
If you don't care about the order of the words in a document, you can use bag of words (BoW) or term frequency inverse document frequency (TF-IDF).
In these models we have a 2 dimensional array. The rows represent the documents (in our example, the movie reviews) and the columns
represent the words in our vocabulary (all the unique words in all the documents). If a word is not present in a document, we have a zero
at the corresponding row and column as the entry. If a word is present in the document, we have a one as the entry -- Alternatively, we could use
@@ -184,7 +185,7 @@ representation of these documents is given in Figure 10.
BoW's advantage is its simplicity, yet it does not take into account the rarity of a word across documents, which unlike common words are
important for document classification.

In TF-IDF, similar to BoW we have an entry for each document-word pair. In TD-IDF, the entry is the product of 1) Text frequency, the
In TF-IDF, similar to BoW we have an entry for each document-word pair. In TF-IDF, the entry is the product of 1) Term frequency, the
frequency of a word in a document, and 2) Inverse document frequency, the inverse of the number of documents that have the word divided
by the total number of documents (we usually use logarithm of the IDF).

@@ -380,7 +381,7 @@ Figure 12 is the resultant confusion matrix for our sentiment analysis problem.
class labels (we have 10,397 + 2,103 = 12,500 reviews with negative sentiment). The second row represents the *true* 1 (or positive sentiment) class labels
(Again, we have 1,281 + 11,219 = 12,500 reviews with positive sentiment). The left column represents the *predicted* negative sentiment class labels (Our RNN
predicted 10,397 + 1,281 = 11,678 reviews as having a negative sentiment). The right column represents the *predicted* positive class labels (Our RNN
predicted 11,219 + 2,103 = 13,322 reviews as having a positive sentiment).Looking at the bottom right cell, we seethat our RNN has correctly predicted 11,219
predicted 11,219 + 2,103 = 13,322 reviews as having a positive sentiment). Looking at the bottom right cell, we see that our RNN has correctly predicted 11,219
reviews as having a positive sentiment (True positives). Looking at the top right cell, we see that our RNN has incorrectly predicted 2,103 reviews as having
a positive (False positives). Similarly, looking at the top left cell, we see that our RNN has correctly predicted 10,397 reviews as having negative sentiment
(True negative). Finally, looking at the bottom left cell, we see that our RNN has incorrectly predicted 1,281 reviews as negative (False negative). Given
