If you practice machine learning, few things will surprise you more than recurrent nets. Recurrent nets are among the most powerful and successful neural networks, and arguably the luckiest. Today’s research in deep learning relies heavily on recurrent nets, although they are not always recognized as deep learning techniques.
The history of recurrent nets goes back to the 1980s, but they only saw their renaissance with the rise of deep learning and deep neural networks.
Before looking at how a recurrent neural network works, you should know that recurrent nets (and their variations) are, in theory, Turing complete: they are capable of simulating a Turing machine and can therefore compute anything a Turing machine can compute. In simple terms, a recurrent net has a form of memory that behaves somewhat like a Turing machine's input/output tape.
Imagine a piece of paper that extends infinitely; call it the input space. Each point in this space represents an input to a neural network. Now imagine that these points are connected along an axis, forming either horizontal or vertical lines. A recurrent net works by applying an affine transformation followed by a non-linear activation function (as in a regular feed-forward network) to each point on a line, one after another, until the line ends. Note that the points are located at discrete distances from each other. At each point, the recurrent net applies the non-linear transformation to the input and computes a hidden activation. For the network to know the relation between this point and the next one, it needs to remember either the hidden activation or the output from the current point. If the inputs in this space are word vectors, the network maps both the current word and the previous one to a new hidden space that combines them.
Now you can see that the axis the recurrent net moves along is time. The recurrent net can process inputs that arrive at discrete time steps (for example, temperature readings over a certain period). In principle, a recurrent net can process inputs that lie on an infinite time axis, but that does not happen in practice. We are only concerned with the network behaving the same way at each time step, and even that does not hold in practice for long sequences.
Now that you understand how a recurrent neural network works, here is where the magic happens. A recurrent net can process sequences using a single set of weights: each time you advance the recurrent net one step forward, you use the same weights. This is called parameter sharing, and it is a very important concept in machine learning. It is also the difference between a feed-forward network and a recurrent net: a plain feed-forward network would treat each input point as a different input that requires its own set of weights. We can use parameter sharing in feed-forward networks too, but a feed-forward network would not keep its previous hidden activation.
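To make parameter sharing concrete, here is a small counting sketch in Python. The function names and layer sizes are made up for illustration, and the "per-position" network is a hypothetical comparison (a layer with separate weights for every position in the sequence), not an architecture taken from any library.

```python
def rnn_param_count(input_size, hidden_size):
    """Weights of a simple recurrent layer, reused at every time step."""
    input_to_hidden = hidden_size * input_size
    hidden_to_hidden = hidden_size * hidden_size
    bias = hidden_size
    return input_to_hidden + hidden_to_hidden + bias


def per_position_param_count(input_size, hidden_size, seq_len):
    """Hypothetical layer that learns separate weights and bias for each position."""
    return seq_len * (hidden_size * input_size + hidden_size)


for seq_len in (10, 100):
    print(
        seq_len,
        rnn_param_count(50, 64),
        per_position_param_count(50, 64, seq_len),
    )
```

The recurrent count stays the same no matter how long the sequence is, while the per-position count grows linearly with the sequence length.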
You can think of a recurrent net as a feed-forward network working in a loop: at each step it takes two input vectors, the input at the current time step and the previous hidden activation, and it uses the same weights every time.
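The loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the names (`rnn_step`, `run_rnn`, `W_xh`, `W_hh`, `b_h`), the tanh activation, the zero initial hidden state and the random untrained weights are all assumptions chosen for the example, and plain lists are used instead of a numerical library to keep it dependency-free.

```python
import math
import random


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rnn_step(x, h_prev, W_xh, W_hh, b_h):
    """One step: affine map of the input and previous hidden state, then tanh."""
    pre = [p + q + b for p, q, b in zip(matvec(W_xh, x), matvec(W_hh, h_prev), b_h)]
    return [math.tanh(v) for v in pre]


def run_rnn(inputs, W_xh, W_hh, b_h):
    """Apply the same weights at every time step, carrying the hidden state forward."""
    h = [0.0] * len(b_h)  # assumed initial hidden state: all zeros
    states = []
    for x in inputs:
        h = rnn_step(x, h, W_xh, W_hh, b_h)
        states.append(h)
    return states


def random_matrix(rows, cols, rng, scale=0.5):
    """Random untrained weights, only for demonstration."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
input_size, hidden_size = 3, 4
W_xh = random_matrix(hidden_size, input_size, rng)
W_hh = random_matrix(hidden_size, hidden_size, rng)
b_h = [0.0] * hidden_size

# A toy sequence of 5 input vectors
sequence = [[rng.uniform(-1, 1) for _ in range(input_size)] for _ in range(5)]
states = run_rnn(sequence, W_xh, W_hh, b_h)
print(len(states), len(states[0]))
```

Note that `run_rnn` never creates new weights inside the loop; the only thing that changes from one step to the next is the hidden state `h`.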
How it works
A recurrent neural network operates over time-separated inputs, or simply a time series.
A recurrent net's operation is described by mathematical equations that relate the input, the hidden activation and the output at each time step.
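One common textbook way to write such equations is shown below. The symbol names are assumptions chosen here: x_t is the input at time step t, h_t the hidden activation, y_t the output, W_xh the input-to-hidden weights, W_hh the hidden-to-hidden weights, W_hy the hidden-to-output weights, and b_h, b_y the biases. The tanh activation is also an assumption; other non-linearities are used as well.

```latex
\begin{aligned}
h_t &= \tanh\left(W_{xh}\, x_t + W_{hh}\, h_{t-1} + b_h\right) \\
y_t &= W_{hy}\, h_t + b_y
\end{aligned}
```

The same matrices appear at every time step t, which is the parameter sharing described earlier.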
The hidden-to-hidden weight matrix is the interesting part: when it is learned by the backpropagation algorithm, it expresses the memory of the network. It controls what should pass on from the previous time step's hidden activation. A more sophisticated architecture called “Long short-term memory” (LSTM) divides this role among several parameters that control how much of the input or hidden activation to remember in the present or to pass on to the next time step.
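For reference, a commonly used formulation of the LSTM cell looks like this. The symbol names are assumptions chosen here (W and U for input and recurrent weights, b for biases, σ for the logistic sigmoid, ⊙ for element-wise multiplication), and several variants of the architecture exist.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) && \text{(input gate)} \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) && \text{(output gate)} \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The forget and input gates decide how much of the old cell state and of the new input to keep, and the output gate decides how much of the cell state is exposed as the hidden activation passed to the next time step.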
More interestingly, a recurrent net can forget the hidden activation from an earlier time step by the time a new input is presented at a later time step, because the hidden-to-hidden weight matrix is fixed across time steps. The problem can be addressed by a slight modification to backpropagation and by modifications to the architecture of the recurrent net's neurons.
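One way to see this forgetting is to run the same net on the same inputs from two very different starting hidden states and watch how quickly the difference disappears. The sketch below assumes a tanh recurrent net with small, fixed, untrained recurrent weights; the names (`step`, `gap`, `W_xh`, `W_hh`) and sizes are invented for the example. The recurrent weights are deliberately kept small so that each row's absolute sum is at most 0.8, which guarantees the shrinking effect here; with larger weights the behavior can differ.

```python
import math
import random


def step(x, h_prev, W_xh, W_hh):
    """One tanh recurrent step with fixed weights (bias omitted for brevity)."""
    out = []
    for row_x, row_h in zip(W_xh, W_hh):
        pre = sum(w * v for w, v in zip(row_x, x))
        pre += sum(w * v for w, v in zip(row_h, h_prev))
        out.append(math.tanh(pre))
    return out


def gap(u, v):
    """Largest element-wise difference between two hidden states."""
    return max(abs(p - q) for p, q in zip(u, v))


rng = random.Random(1)
n_in, n_hid, n_steps = 2, 4, 20
W_xh = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hid)]
# Entries in [-0.2, 0.2] keep each row's absolute sum at or below 0.8
W_hh = [[rng.uniform(-0.2, 0.2) for _ in range(n_hid)] for _ in range(n_hid)]

# Two very different "memories" at the start of the sequence
h_a = [0.9, -0.9, 0.9, -0.9]
h_b = [-0.9, 0.9, -0.9, 0.9]
inputs = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_steps)]

gaps = [gap(h_a, h_b)]
for x in inputs:
    h_a = step(x, h_a, W_xh, W_hh)
    h_b = step(x, h_b, W_xh, W_hh)
    gaps.append(gap(h_a, h_b))

print("initial gap:", gaps[0])
print("gap after", n_steps, "steps:", gaps[-1])
```

Because tanh never stretches differences and the fixed recurrent matrix shrinks them by at least a factor of 0.8 per step in this setup, the two runs become nearly indistinguishable: whatever the net "remembered" at the start has been overwritten by the newer inputs.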
A recurrent neural network can learn to classify sequences just as a feed-forward neural network classifies input vectors. Think of a natural-language sentence, which is a sequence of word vectors: these word vectors are fed one by one to the recurrent net, which produces as its final output a sentiment class (positive, negative or neutral). Of course, you can also classify each input vector instead of classifying the whole sequence. A recurrent net can output sequences too, as in machine translation, where for each word vector in a source language the recurrent net outputs a word in a target language. You can see that recurrent neural networks are a natural choice for natural language, because there we are dealing with sequences of words.
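The two modes, one label for the whole sequence and one label per input, can be sketched as follows. This is only an illustration under stated assumptions: the weights are random and untrained, the "sentence" is made of random vectors standing in for real word vectors, and all names (`rnn_states`, `classify_sequence`, `classify_each_step`, `W_hy`, `CLASSES`) are invented for the example.

```python
import math
import random


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(z):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def rnn_states(inputs, W_xh, W_hh):
    """Run a tanh recurrent net over the sequence and return every hidden state."""
    h = [0.0] * len(W_hh)
    states = []
    for x in inputs:
        pre = [p + q for p, q in zip(matvec(W_xh, x), matvec(W_hh, h))]
        h = [math.tanh(v) for v in pre]
        states.append(h)
    return states


def classify_sequence(inputs, W_xh, W_hh, W_hy):
    """One class distribution for the whole sequence, read from the final hidden state."""
    final_state = rnn_states(inputs, W_xh, W_hh)[-1]
    return softmax(matvec(W_hy, final_state))


def classify_each_step(inputs, W_xh, W_hh, W_hy):
    """One class distribution per time step, read from each hidden state."""
    return [softmax(matvec(W_hy, h)) for h in rnn_states(inputs, W_xh, W_hh)]


rng = random.Random(2)


def rand_matrix(rows, cols):
    """Random untrained weights, only for demonstration."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


CLASSES = ["positive", "negative", "neutral"]
embed_size, hidden_size = 5, 8
W_xh = rand_matrix(hidden_size, embed_size)
W_hh = rand_matrix(hidden_size, hidden_size)
W_hy = rand_matrix(len(CLASSES), hidden_size)

# Stand-in for a 4-word sentence: random vectors instead of real word vectors
sentence = [[rng.uniform(-1, 1) for _ in range(embed_size)] for _ in range(4)]

sentence_probs = classify_sequence(sentence, W_xh, W_hh, W_hy)
step_probs = classify_each_step(sentence, W_xh, W_hh, W_hy)
print(dict(zip(CLASSES, sentence_probs)))
print(len(step_probs), "per-step predictions")
```

With trained weights, `classify_sequence` would correspond to the sentiment example, while the per-step variant has the shape of tasks that emit one output for every input word.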