The Hundred-Page Machine Learning Book by Andriy Burkov



6.2.2 Recurrent Neural Network

Recurrent neural networks (RNNs) are used to label, classify, or generate sequences. A sequence is a matrix, each row of which is a feature vector, and the order of rows matters. To label a sequence is to predict a class for each feature vector in the sequence. To classify a sequence is to predict a class for the entire sequence. To generate a sequence is to output another sequence (of a possibly different length) somehow relevant to the input sequence. For example, part-of-speech tagging labels a sequence, sentiment analysis classifies one, and machine translation generates one.

RNNs are often used in text processing because sentences and texts are naturally either sequences of words/punctuation marks or sequences of characters. For the same reason, recurrent neural networks are also used in speech processing.

A recurrent neural network is not feed-forward: it contains loops. The idea is that each unit $u$ of a recurrent layer $l$ has a real-valued state $h_{l,u}$. The state can be seen as the memory of the unit. In an RNN, each unit $u$ in each layer $l$ receives two inputs: a vector of states from the previous layer $l-1$ and the vector of states from this same layer from the previous timestep.

To illustrate the idea, let’s consider the first and the second recurrent layers of an RNN. The first (leftmost) layer receives a feature vector as input. The second layer receives the output of the first layer as input.

This situation is schematically depicted in fig. 30 below.

Figure 30: The first two layers of an RNN. The input feature vector is two-dimensional; each layer has two units.

As I said above, each training example is a matrix in which each row is a feature vector. For simplicity, let’s illustrate this matrix as a sequence of vectors $\mathbf{X} = [\mathbf{x}^1, \mathbf{x}^2, \ldots, \mathbf{x}^{t-1}, \mathbf{x}^t, \mathbf{x}^{t+1}, \ldots, \mathbf{x}^{\text{length}_X}]$, where $\text{length}_X$ is the length of the input sequence. If our input example is a text sentence, then feature vector $\mathbf{x}^t$ for each $t = 1, \ldots, \text{length}_X$ represents a word in the sentence at position $t$.
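To make this representation concrete, here is a tiny NumPy sketch, not from the book: the three words and their 3-dimensional vectors are made-up stand-ins for a real embedding or feature extractor.

```python
import numpy as np

# Hypothetical word vectors; in practice these would come from a trained
# embedding layer or from hand-crafted features such as one-hot codes.
embedding = {
    "the": np.array([0.1, 0.3, -0.2]),
    "cat": np.array([0.7, -0.1, 0.4]),
    "sat": np.array([-0.3, 0.5, 0.2]),
}

sentence = ["the", "cat", "sat"]

# The input example X is a matrix: row t is the feature vector x^t of the
# word at position t, and the order of rows matters.
X = np.stack([embedding[word] for word in sentence])
print(X.shape)  # (3, 3): a sequence of length 3 with 3-dimensional features
```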

As depicted in fig. 30, in an RNN, the feature vectors from an input example are “read” by the neural network sequentially in the order of the timesteps. The index $t$ denotes a timestep. To update the state $h_{l,u}^t$ at each timestep $t$ in each unit $u$ of each layer $l$, we first calculate a linear combination of the input feature vector with the state vector $\mathbf{h}_l^{t-1}$ of this same layer from the previous timestep, $t-1$. The linear combination of two vectors is calculated using two parameter vectors $\mathbf{w}_{l,u}$, $\mathbf{u}_{l,u}$ and a parameter $b_{l,u}$. The value of $h_{l,u}^t$ is then obtained by applying activation function $g_1$ to the result of the linear combination. A typical choice for function $g_1$ is $\tanh$. The output $\mathbf{y}_l^t$ is typically a vector calculated for the whole layer $l$ at once. To obtain $\mathbf{y}_l^t$, we use activation function $g_2$ that takes a vector as input and returns a different vector of the same dimensionality. The function $g_2$ is applied to a linear combination of the state vector values $\mathbf{h}_l^t$ calculated using a parameter matrix $\mathbf{V}_l$ and a parameter vector $\mathbf{c}_{l,u}$. In classification, a typical choice for $g_2$ is the softmax function:

$$\boldsymbol{\sigma}(\mathbf{z}) \stackrel{\text{def}}{=} \left[\sigma^{(1)}, \ldots, \sigma^{(D)}\right], \quad \text{where} \quad \sigma^{(j)} \stackrel{\text{def}}{=} \frac{\exp\left(z^{(j)}\right)}{\sum_{k=1}^{D} \exp\left(z^{(k)}\right)}.$$

The softmax function is a generalization of the sigmoid function to multidimensional outputs. It has the property that $\sum_{j=1}^{D} \sigma^{(j)} = 1$ and $\sigma^{(j)} > 0$ for all $j$.
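To see how these equations fit together, here is a minimal NumPy sketch of a single recurrent layer, written for this summary rather than taken from the book. The function name rnn_step, the toy dimensions, and the random parameters are all illustrative assumptions; the per-unit vectors $\mathbf{w}_{l,u}$ and $\mathbf{u}_{l,u}$ are stacked into matrices W and U, $g_1$ is tanh, and $g_2$ is a numerically stable softmax.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax: subtracting max(z) leaves the result
    # unchanged but prevents overflow in exp.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_step(x_t, h_prev, W, U, b, V, c):
    # One timestep of one recurrent layer (hypothetical names):
    #   h_t = g1(W @ x_t + U @ h_prev + b)  with g1 = tanh
    #   y_t = g2(V @ h_t + c)               with g2 = softmax
    # Row u of W and U plays the role of w_{l,u} and u_{l,u} in the text;
    # b stacks the per-unit scalars b_{l,u}.
    h_t = np.tanh(W @ x_t + U @ h_prev + b)
    y_t = softmax(V @ h_t + c)
    return h_t, y_t

# Toy dimensions: 3-dimensional inputs, 4 recurrent units, 2 classes.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
U = rng.normal(size=(4, 4))
b = np.zeros(4)
V = rng.normal(size=(2, 4))
c = np.zeros(2)

# "Read" the sequence one feature vector per timestep, in order,
# carrying the state h across timesteps.
X = rng.normal(size=(5, 3))  # an input sequence of length 5
h = np.zeros(4)              # the initial state is all zeros
for x_t in X:
    h, y = rnn_step(x_t, h, W, U, b, V, c)

print(y, y.sum())  # y is a probability vector: positive entries summing to 1
```

The loop makes the recurrence explicit: the state returned at timestep $t-1$ is fed back in at timestep $t$, which is exactly the loop that makes the network non-feed-forward.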


