This is a list of summaries of papers in the field of deep learning I have been reading recently. The summaries are meant to help me internalize and keep a record of the techniques I have read about, and not as full analyses of the papers.

Emergence of Invariance and Disentanglement in Deep Representations (Part 1)

Paper: https://arxiv.org/pdf/1706.01350.pdf

This paper looks at the broad question of “Why do heavily over-parametrized deep nets generalize well?” from the perspective of Information Theory. This was the first paper I’ve read fully that takes an information theory approach, and was quite a long read.

First, it’s useful to know some terms related to representations. \(z\) is a representation of \(x\) if “the distribution of \(z\) is fully described by the conditional \(p(z|x)\)”, giving rise to the Markov chain \(y \rightarrow x \rightarrow z\), \(y\) being the task. \(z\) is sufficient for \(y\) if \(I(z;y) = I(x;y)\) (remember that mutual information, informally, is a measure of how much information \(x\) captures about \(y\)). \(z\) is minimal when \(I(x;z)\) is smallest among sufficient representations (we don’t want to memorize too much). A nuisance \(n\) is something that affects \(x\) but is “not informative to the task we’re trying to solve”, i.e., \(I(y;n) = 0\). A representation \(z\) that minimizes \(I(z;n)\) among all sufficient representations is said to be maximally insensitive to \(n\). The Total Correlation (TC) of a distribution is defined as \(TC(z) = KL(p(z) || \Pi_{i} p(z_i))\) - note that TC can also be regarded as the amount of (redundant) shared information among variables in a set. A useful relation to remember is \(I(x;y) = KL(p(x, y) || p(x)p(y))\) - the expected extra (redundant) number of bits needed to identify \(x\) and \(y\) if they are transmitted using their marginal distributions instead of their joint distribution.
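
Since all of the later results are phrased in terms of these quantities, here is a minimal numerical sketch (my own, not from the paper) of the two KL-based definitions above, computing \(I(x;y)\) and \(TC(z)\) for small discrete joint distributions:

```python
# Minimal sketch: I(x;y) = KL(p(x,y) || p(x)p(y)) and TC(z) = KL(p(z) || prod_i p(z_i))
# for discrete distributions given as numpy arrays (values in nats).
import numpy as np

def mutual_information(p_xy):
    """I(x;y) as the KL divergence between the joint and the product of marginals."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x * p_y)[mask]))

def total_correlation(p_z):
    """TC for a joint over two components; for two variables, TC(z) == I(z_1; z_2)."""
    return mutual_information(p_z)

# Two perfectly correlated bits share exactly one bit (log 2 nats) of information.
p = np.array([[0.5, 0.0],
              [0.0, 0.5]])
print(mutual_information(p))   # ~0.693 = log 2
print(total_correlation(p))    # same value in the two-variable case
```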

One of the first interesting facts in the paper is that invariants (think nuisance-free representations) can be constructed by reducing the mutual information between \(x\) and \(z\), i.e., by promoting minimality. The authors state that if \(n \rightarrow x \rightarrow z\), then \(I(z;n) \leq I(z;x) - I(x;y)\). The second term is a constant, so the lower the value of \(I(z;x)\), the more invariant the representation \(z\). Bottlenecks, such as dimensionality reduction between successive layers of a network, also promote invariance - if \(x \rightarrow z_1 \rightarrow z_2\), and there is a communication bottleneck between \(z_1\) and \(z_2\), then provided that \(z_2\) is still sufficient, \(z_2\) is more nuisance-invariant than \(z_1\). This also implies that stacking layers (“deep nets”) promotes invariance, although this does not simply mean that more layers means better generalization, since it assumes that the last layer is still a sufficient representation for the task \(y\) (meaning the network has been trained properly, which is increasingly difficult for large networks).

The next part of the paper states that the amount of information in the weights can act as a useful regularizer, allowing us to control when a network will overfit or underfit. The paper decomposes the standard cross-entropy loss into the following form: \[H_{p, q}(\mathbf{x}, y) = \mathrm{(sum\ of\ positive\ terms)} - I(y;w| \mathbf{x}, \theta)\] The only negative quantity here is the last term, which can be thought of as how much information the weights have “memorized” about the labels (since we are already conditioning on the true state of nature and the dataset). Thus, a network could minimize this loss simply by increasing this last term, leading to overfitting, i.e., memorizing the dataset. We would thus want to add a term back into the loss to account for this, but \(I(y;w|\mathbf{x}, \theta)\) is intractable. We can still upper bound this term by noticing that \(I(y;w|\mathbf{x}, \theta) \leq I(w;\mathcal{D}|\theta) \leq I(w; \mathcal{D})\) (since \(\theta \rightarrow \mathcal{D}\), and conditioning reduces mutual information for a Markov chain). Thus, we can write the new loss as \(L = H_{p, q}(\mathbf{x}, y) + \beta I(w;\mathcal{D})\). Apparently, this was suggested as a regularizer as far back as 1993 by Hinton, but no efficient way to optimize it was known until Kingma’s paper in 2015. The authors also further upper bound this term, showing that \(I(w; \mathcal{D}) \leq KL(q(w|\mathcal{D}) || \Pi_{i} q(w_i))\), where \(q(w)\) is a distribution over the weights over all possible trainings and datasets. This is then used in the local reparametrization trick.
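
To make the regularized objective concrete, here is a hedged PyTorch sketch of training with \(L = H_{p, q}(\mathbf{x}, y) + \beta I(w;\mathcal{D})\), where the intractable \(I(w;\mathcal{D})\) is replaced by a tractable KL upper bound. For simplicity I use a factorized Gaussian posterior over the weights and a standard normal prior; the paper itself uses an improper log-uniform prior (see Part 2 below), so treat this only as an illustration of the structure of the loss:

```python
# Hedged sketch: a "noisy" linear layer whose weights are sampled from q(w|D) = N(mu, sigma^2)
# via the reparametrization trick, trained with cross-entropy plus beta * KL(q(w|D) || N(0, 1)).
import torch
import torch.nn.functional as F

class NoisyLinear(torch.nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.mu = torch.nn.Parameter(0.01 * torch.randn(d_out, d_in))
        self.log_sigma = torch.nn.Parameter(torch.full((d_out, d_in), -5.0))

    def forward(self, x):
        sigma = self.log_sigma.exp()
        w = self.mu + sigma * torch.randn_like(sigma)   # w = mu + sigma * eps
        return F.linear(x, w)

    def kl(self):
        # KL(N(mu, sigma^2) || N(0, 1)), summed over all weights.
        sigma2 = (2 * self.log_sigma).exp()
        return 0.5 * (sigma2 + self.mu ** 2 - 1 - 2 * self.log_sigma).sum()

def loss_fn(layer, x, y, beta):
    return F.cross_entropy(layer(x), y) + beta * layer.kl()

layer = NoisyLinear(10, 3)
x, y = torch.randn(4, 10), torch.randint(0, 3, (4,))
print(loss_fn(layer, x, y, beta=1e-3))
```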

Interestingly, the \(\beta\) term can also be used to predict precisely when overfitting or underfitting will occur for random labels, as shown below (figures from the paper). By changing the value of \(\beta\), which controls the amount of information in the weights, the authors also obtain a graph that closely resembles the classic bias-variance tradeoff curve, suggesting that \(\beta\) (and thus the information in the weights) correlates well with generalization.

Train and test accuracies for different values of beta
Test error vs beta - resembles bias-variance tradeoff curve

The paper also mentions that under certain conditions, Stochastic Gradient Descent, without this regularizer, “introduces an entropic bias of a very similar form to the information in the weights” that was described above. The authors also note that some forms of SGD bias the optimization towards “flat minima”, which require lower \(I(w;\mathcal{D})\). This could explain why, even without this regularizer, networks can often be trained to generalize well. Note that it is commonly believed that SGD implicitly acts as a regularizer, due to the noise introduced by the stochasticity.

The second part of the writeup will deal with the remaining results in the paper.

Emergence of Invariance and Disentanglement in Deep Representations (Part 2)

Paper: https://arxiv.org/pdf/1706.01350.pdf

Sections 4 and 5 of this paper derive more interesting results about the relationship between information in the weights, flat minima, and sufficiency. Using modeling assumptions on the weights, the authors upper bound the information in the weights similarly to Kingma et al. (2015). They use an improper log-uniform prior on \(w\) (\(\tilde{q}(w_i) = c/|w_i|\)), and parametrize the weight distribution during training as \(w_i | \mathcal{D} \sim \epsilon_i\hat{w}_i\), where \(\hat{w}_i\) is a learned mean and \(\epsilon_i \sim \log \mathcal{N}(-\alpha_i/2, \alpha_i)\) is “i.i.d. multiplicative log-normal noise with mean 1 and variance \(\exp(\alpha_i)-1\)”. Using this parametrization, they show an upper bound on the information in the weights \(I(w; \mathcal{D})\), which is used with the local reparametrization trick as a regularizer.
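
As a quick sanity check of the noise model (my own numerical check, not from the paper), the stated mean and variance of \(\epsilon_i \sim \log \mathcal{N}(-\alpha_i/2, \alpha_i)\) can be verified by sampling:

```python
# eps ~ logNormal(mu=-alpha/2, sigma^2=alpha) should have mean 1 and variance exp(alpha) - 1,
# so that w_i = eps_i * w_hat_i is an unbiased (but noisy) version of the learned mean w_hat_i.
import numpy as np

alpha = 0.5
eps = np.exp(np.random.normal(loc=-alpha / 2, scale=np.sqrt(alpha), size=1_000_000))
print(eps.mean(), eps.var())       # ~1.0 and ~0.6487
print(np.exp(alpha) - 1)           # 0.6487...
```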

The authors also formally show a relationship between \(I(w; \mathcal{D})\) and the “flatness” of the minimum. Flatness is a vague term, and is formalized here as the nuclear norm of the Hessian of the loss function. While I only have a “visual intuition” of why this would work (smaller nuclear norm \(\rightarrow\) smaller singular values of the Hessian \(\rightarrow\) less rapidly changing gradients?), I have yet to find a more rigorous justification of this. The authors show that \[I(w;\mathcal{D}) \leq \frac{1}{2}K[\log ||\hat{w}||_2^2 + \log||\mathcal{H}||_* - \mathrm{(some\ constant)}]\] where \(\hat{w}\) is the minimizer of the cross-entropy loss, \(\mathcal{H}\) is the Hessian at that minimum, and \(K = \dim(w)\). Thus, the flatter the minimum, the lower the information in the weights!

In the next section, the authors prove that \(I(x;z) + TC(z)\) is tightly bounded by \(\tilde{I}(w;\mathcal{D})\) (this is itself a sharp upper bound for \(I(w;\mathcal{D})\)). This means that lowering the information in the weights (explicitly, or implicitly via SGD) automatically improves the minimality and hence, by earlier propositions (see part 1 above), the invariance and disentanglement of the learned representation. Using the Markov property of the layers, the authors extend this to multi-layer networks. Taken together with the facts stated earlier, this implies that SGD is “biased toward learning invariant and disentangled representations of the data”.

Speech Recognition with RNNs

Paper: http://www.cs.toronto.edu/~fritz/absps/RNN13.pdf

This paper deals with applying deep RNNs end-to-end for speech recognition - transcribing a sequence of acoustic data into a sequence of phonemes. The deep RNN architecture consists of many layers both across time and space. One major issue in speech recognition is aligning the acoustic input with the phoneme outputs, and the paper shows how to handle this using [CTC (Connectionist Temporal Classification)](http://www.machinelearning.org/proceedings/icml2006/047_Connectionist_Tempo.pdf) or RNN transducers.

The architecture consists of LSTM (Long Short-Term Memory) cells and is bidirectional. A bidirectional RNN simply consists of two separate RNNs running in opposite directions along the sequence, and the output at each time step is a (learned) weighted combination of the outputs from the two directions. The network is also deep, meaning that there are several stacked hidden layers of LSTMs at every time step. The input to each hidden layer comes from the output of the layer below at the same time step (for the first hidden layer, this is simply the acoustic input), as well as from the output of the same layer at the previous time step.
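
Here is a hedged PyTorch sketch of this kind of deep bidirectional LSTM stack (my own minimal reconstruction, not the authors’ code; the layer sizes and feature/phoneme counts are illustrative). Note that PyTorch concatenates the two directions’ outputs rather than combining them with separate learned weights per layer, so this is close to, but not identical to, the paper’s formulation:

```python
import torch

class DeepBiLSTM(torch.nn.Module):
    def __init__(self, n_features, n_hidden, n_layers, n_phonemes):
        super().__init__()
        self.rnn = torch.nn.LSTM(
            input_size=n_features, hidden_size=n_hidden,
            num_layers=n_layers, bidirectional=True, batch_first=True)
        # One extra output class for the CTC blank symbol.
        self.out = torch.nn.Linear(2 * n_hidden, n_phonemes + 1)

    def forward(self, x):                     # x: (batch, time, n_features)
        h, _ = self.rnn(x)                    # (batch, time, 2 * n_hidden)
        return self.out(h).log_softmax(-1)    # per-frame log-probabilities

model = DeepBiLSTM(n_features=123, n_hidden=256, n_layers=3, n_phonemes=61)
frames = torch.randn(8, 200, 123)             # a dummy batch of acoustic frames
print(model(frames).shape)                    # torch.Size([8, 200, 62])
```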

An issue that remains is aligning the input to the output data - the input data is not segmented by hand. A CTC layer is added on top of the RNN, and at every time step it emits softmax probabilities over \(K + 1\) symbols, where \(K\) is the number of phonemes and the 1 comes from a special blank symbol. Note that the CTC model does not look at the outputs from the previous time step - it only uses the output of the last hidden layer for the current time step. The probability of an output sequence is then the sum over all alignments that are equivalent to this sequence. For example, “He is” in the audio data can be transcribed as “[hi] _ [Iz]” or “_ _ [hiz]” (blanks denoting spaces), and both should be correct. This sum can be computed using a variant of the forward-backward algorithm for HMMs.
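
A minimal sketch (mine, not the paper’s code) of the CTC idea: many frame-level alignments collapse to the same phoneme sequence once repeats and then blanks are removed, and the loss sums the probability of all such alignments, which PyTorch’s built-in `CTCLoss` computes with the forward-backward recursion:

```python
import torch

BLANK = 0

def collapse(alignment):
    """Remove repeated symbols, then blanks: e.g. [5, 5, 0, 0, 3, 3] -> [5, 3]."""
    out, prev = [], None
    for s in alignment:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return out

print(collapse([5, 5, 0, 0, 3, 3]))   # [5, 3]
print(collapse([0, 5, 3, 3, 0, 0]))   # [5, 3] - a different alignment, same labelling

# CTC loss over dummy per-frame log-probabilities (200 frames, batch of 1, 62 classes):
log_probs = torch.randn(200, 1, 62).log_softmax(-1)
targets = torch.tensor([[12, 7, 30]])
loss = torch.nn.CTCLoss(blank=BLANK)(
    log_probs, targets,
    input_lengths=torch.tensor([200]), target_lengths=torch.tensor([3]))
print(loss.item())
```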

The RNN transducer method seems to augment CTC by adding a separate RNN at the prediction layer, so that the prediction at every time step can also depend on the predictions made so far. So for a given label, it obtains one probability from the CTC (transcription) network and one from the prediction network; these two values are multiplied together and normalized.

RNN Encoder-Decoders in Statistical Machine Translation

Paper: https://arxiv.org/pdf/1406.1078.pdf

In this paper, the authors describe an RNN-based approach for encoding a variable-length sequence into a fixed-size vector (encoder), and then decoding a variable-length sequence from that fixed-size vector (decoder). They claim the RNN encoder-decoder learns a continuous-space representation of phrases that preserves both semantic and syntactic content.

The output of the encoder is the hidden state of the RNN at the last time step; call this summary vector \(\mathbf{c}\). For the decoder RNN, the hidden state at every time step is a function of the previous hidden state, the previous output, and \(\mathbf{c}\). That is, \(h_{t} = f(h_{t-1}, y_{t-1}, \mathbf{c})\). The two components are then jointly trained to maximize the normalized sum of the conditional log-likelihoods, \(\frac{1}{N}\sum_{n=1}^{N} \log p_{\theta}(\mathbf{y}_n | \mathbf{x}_n)\) (so the log-likelihood of a single sequence-to-sequence translation is \(\log p_{\theta}(\mathbf{y}_n | \mathbf{x}_n)\)). The following image from the paper shows the broad overview of the architecture:

RNN encoder-decoder

The authors also introduce a new hidden unit similar to the LSTM - the Gated Recurrent Unit (GRU). This consists of a reset gate \(r\) and an update gate \(z\). The reset gate decides how much of the previous hidden state to include in computing a temporary (candidate) new hidden state \(\tilde{h}\), which also depends on the input \(x\), while the update gate decides how much information from the previous hidden state will carry over to the current hidden state. So, roughly: \[r_t = \sigma(W_r x_t + U_r h_{t-1}), \quad z_t = \sigma(W_z x_t + U_z h_{t-1})\] \[\tilde{h}_t = \tanh(W x_t + U(r_t \odot h_{t-1})), \quad h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t\]
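
Written out as code, a minimal numpy sketch of one GRU step under the equations above (my own rendering; biases are omitted for brevity):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, params):
    Wr, Ur, Wz, Uz, W, U = params
    r = sigmoid(Wr @ x + Ur @ h_prev)             # reset gate
    z = sigmoid(Wz @ x + Uz @ h_prev)             # update gate
    h_tilde = np.tanh(W @ x + U @ (r * h_prev))   # candidate hidden state
    return z * h_prev + (1.0 - z) * h_tilde       # interpolate old and candidate states

d_in, d_h = 4, 8
params = [np.random.randn(d_h, d_in) if i % 2 == 0 else np.random.randn(d_h, d_h)
          for i in range(6)]                      # Wr, Ur, Wz, Uz, W, U

h = np.zeros(d_h)
for x in np.random.randn(10, d_in):               # run over a dummy sequence
    h = gru_step(x, h, params)
print(h.shape)                                    # (8,)
```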

The RNN encoder-decoder is applied to the scoring of phrase pairs in statistical machine translation. The statistical model of translation tries to find the translation \(\mathbf{f}\) that maximizes \(p(\mathbf{f} | \mathbf{e}) \propto p(\mathbf{e} | \mathbf{f})\,p(\mathbf{f})\) (the translation model and language model terms, respectively), given an input sentence \(\mathbf{e}\). Phrase pairs from the two languages can be fed into the system, and the score is simply the probability the trained encoder-decoder assigns to the target phrase given the source phrase of the pair. This score can then be added as an additional feature in the (log-linear) model.

The authors also mention the use of CSLM in their models, which uses NNs for the language model. It appears that the contributions of the RNN encoder-decoder and CSLM are independent. The authors claim that the embeddings generated by the encoder also capture both syntactic and semantic content of the phrases.

Sequence to Sequence Learning with Deep RNNs

Paper: Sutskever et al. (http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf)

This paper is very similar to the paper on RNN encoder-decoders by Cho et al. The authors use deep RNNs with LSTMs as the hidden layer for the task of machine translation, essentially a sequence-to-sequence learning task.

The authors first use a deep (four-layer) RNN with LSTMs to convert the input sequence into a fixed-dimensional vector (the hidden state at the final time step of this RNN), and then connect a second, similar RNN to generate the output sequence (given the previous hidden state and the previously emitted symbol). The “previous hidden state” for the first time step in the second RNN is the final hidden state of the first RNN.

An important strategy used by the authors is that they reverse the order of words in the input sequence. The intuitive reason why this improves results is as follows: consider an input sequence \(a, b, c\) and the corresponding output \(\alpha, \beta, \gamma\). If the input sequence is reversed so that the input-to-output mapping is now \(c, b, a \rightarrow \alpha, \beta, \gamma\), then \(a\) is in close proximity to \(\alpha\), \(b\) is fairly close to \(\beta\), and so on. Thus, it is easier for SGD to “establish communication” between the input and the output. The authors say that they do not have a complete explanation for this phenomenon, however.
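
As a tiny illustration (mine), only the source side of each training pair is reversed; the target is left in its natural order:

```python
def reverse_source(pairs):
    # Reverse each source sentence, keep the target untouched.
    return [(list(reversed(src)), tgt) for src, tgt in pairs]

pairs = [(["a", "b", "c"], ["alpha", "beta", "gamma"])]
print(reverse_source(pairs))   # [(['c', 'b', 'a'], ['alpha', 'beta', 'gamma'])]
```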

Decoding the output sequence is done in the paper using beam search. The authors use a form of gradient clipping to address exploding gradients. They also made sure that all sentences within a minibatch were roughly of the same length, so that the more frequent shorter sentences do not suppress the learning of longer sentences within a minibatch.
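
A hedged PyTorch sketch of these two training details, gradient-norm clipping and grouping sentences of similar length into the same minibatch (the clipping threshold and batch size here are illustrative, not the paper’s exact settings):

```python
import torch

def train_step(model, optimizer, loss, max_norm=5.0):
    optimizer.zero_grad()
    loss.backward()
    # Rescale the gradient if its global norm exceeds max_norm.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()

def bucket_by_length(sentences, batch_size):
    # Sorting by length puts sentences of similar length into the same minibatch.
    ordered = sorted(sentences, key=len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
```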

The main way this paper differs from the paper by Cho et al. is that the RNNs are used directly for machine translation in this paper. In the other paper, the RNNs were used to obtain scores for phrase pairs, which made up the translation model of a Statistical Machine Translation system.

Attention-based Neural Machine Translation

Paper: https://arxiv.org/pdf/1508.04025.pdf

This paper deals with techniques for applying “attention”-based mechanisms to RNNs in machine translation. The basic architecture used here is similar to that in the paper by Sutskever et al. (2014), with a stacking (deep) RNN with LSTMs used for encoding the source sentence, and then another connected deep RNN with LSTMs for producing the target sentence. The authors introduce two types of attention - global attention (all source words are attended) and local attention (a specific window of source words is attended).

In both types of attention approaches, at each time step of the decoding phase, a context vector \(c_t\) is built; \(c_t\) encodes the source-side “attentional” information. This is concatenated with the current hidden state of the decoder, \(h_t\), and passed through a layer to produce an attentional hidden state, \(\tilde{h}_t = \tanh(W_c[c_t; h_t])\).

The picture from the paper helps to clarify this:

Global attention model
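
Concretely, here is a hedged PyTorch sketch of the global variant with a simple dot-product score (my own minimal version; the paper also considers other score functions and the local windowed variant):

```python
import torch

def global_attention(h_t, encoder_states, W_c):
    # h_t: (batch, d) decoder state; encoder_states: (batch, src_len, d); W_c: Linear(2d -> d).
    scores = torch.bmm(encoder_states, h_t.unsqueeze(-1)).squeeze(-1)  # (batch, src_len)
    a_t = torch.softmax(scores, dim=-1)                                # alignment weights
    c_t = torch.bmm(a_t.unsqueeze(1), encoder_states).squeeze(1)       # context vector
    h_tilde = torch.tanh(W_c(torch.cat([c_t, h_t], dim=-1)))           # attentional hidden state
    return h_tilde, a_t

batch, src_len, d = 2, 7, 16
W_c = torch.nn.Linear(2 * d, d, bias=False)
h_tilde, a_t = global_attention(torch.randn(batch, d), torch.randn(batch, src_len, d), W_c)
print(h_tilde.shape, a_t.shape)   # torch.Size([2, 16]) torch.Size([2, 7])
```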

So far, the model does not take into consideration previous alignment information when generating target words. To address this, the authors use an input feeding approach, where the attentional vector \(\tilde{h}\) from the previous time step in the decoder is concatenated with the input at the next time step. This makes the model aware of previous alignment choices.

During training, the authors pre-process the corpus by replacing words outside the 50K most frequent with `<unk>` tags. The source sentence is reversed, gradient clipping is used, and dropout is employed between the stacked layers. For the local attention models, the window size \(D\) is set to 10.

Dynamic Memory Networks for NLP

Paper: https://arxiv.org/pdf/1506.07285v1.pdf

This paper discusses a framework for question answering given some previous inputs. The dataset consists of triples: a list of sentences (inputs), a question using facts found in the sentences, and an answer. Facebook’s bAbI dataset is an example. The framework is divided into a series of modules: input module, semantic memory module, question module, episodic memory module, and an answer module.

The model is trained with backpropagation and Adagrad. The authors use \(L_2\) regularization and ‘word dropout’, where each word vector is set to zero with some probability \(p\). The model can then perform tasks like question answering, POS tagging, sentiment analysis and even machine translation better than many existing models.
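
A small sketch (mine, with an illustrative drop probability) of the word dropout described above, zeroing out whole word vectors rather than individual units:

```python
import torch

def word_dropout(embeddings, p=0.1, training=True):
    # embeddings: (batch, seq_len, dim); each word vector is zeroed with probability p.
    if not training or p == 0.0:
        return embeddings
    keep = (torch.rand(embeddings.shape[:2], device=embeddings.device) >= p).float()
    return embeddings * keep.unsqueeze(-1)

x = torch.randn(2, 5, 8)
print(word_dropout(x, p=0.3).abs().sum(dim=-1))   # some positions become exactly zero
```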

Spatial Pyramid Pooling in Deep CNNs

Paper: https://arxiv.org/pdf/1406.4729.pdf

In this paper, the authors address the issue of allowing images in the input to have variable sizes (dimensions). Typically in CNNs, all input images are pre-processed (e.g., by cropping or warping) so that they are all of the same size. This is needed not because of the convolutional layers (which operate in a sliding-window fashion), but for the fully-connected layers. The authors add a spatial pyramid pooling (SPP) layer after the final convolutional layer to allow inputs of arbitrary size.

As mentioned above, convolutional layers do not need fixed-size inputs, because the filter can be slid across the image/feature maps with the appropriate stride until the entire input is covered. This also applies to the max-pooling layers that might be placed after the convolutional layers. In the paper, the spatial pyramid pooling layer is placed after the final convolutional layer. Suppose that the final convolutional layer has dimensions \(a \times a \times k\), where \(k\) is the number of feature maps (the two spatial dimensions are taken to be the same to make this example easier). Spatial pyramid pooling is analogous to creating a set of bins of varying size. The pyramid consists of levels, where each level is like a grid laid out over each of the \(k\) feature maps, and each bin is a cell of that grid. For instance, an \(n \times n\) level gives us \(n^2\) bins, and the size of each bin will be roughly \(\lceil a/n \rceil \times \lceil a/n \rceil\) (with some bins possibly extending outside the feature map). Within each bin, we use a pooling operation (the paper uses max-pooling). This is similar to creating a max-pooling layer with window size \(\lceil a/n \rceil\) and stride \(\lfloor a/n \rfloor\). Each level is applied to each of the \(k\) feature maps. If we create \(M\) bins in total across all the levels, the output of this layer will thus be a \(kM\)-dimensional vector. This is illustrated in the following figure from the paper:

SPP layer

In this figure, there are 3 levels: \(4 \times 4\), \(2 \times 2\), and \(1 \times 1\), giving \(16 + 4 + 1 = 21\) bins in total.
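
A hedged PyTorch sketch of such an SPP layer with levels \(4 \times 4\), \(2 \times 2\), and \(1 \times 1\) (my own version; I use adaptive max-pooling, which produces the same fixed number of bins per level and so approximates the paper’s explicit ceil/floor window and stride computation):

```python
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(feature_maps, levels=(4, 2, 1)):
    # feature_maps: (batch, k, h, w) output of the final convolutional layer.
    batch = feature_maps.size(0)
    pooled = [F.adaptive_max_pool2d(feature_maps, output_size=n).view(batch, -1)
              for n in levels]              # each entry: (batch, k * n * n)
    return torch.cat(pooled, dim=1)         # (batch, k * M) with M = sum of n^2

# Different input sizes give the same output size: 256 * (16 + 4 + 1) = 5376.
print(spatial_pyramid_pool(torch.randn(1, 256, 13, 13)).shape)   # torch.Size([1, 5376])
print(spatial_pyramid_pool(torch.randn(1, 256, 10, 17)).shape)   # torch.Size([1, 5376])
```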

The authors state that current GPU implementations are preferably run on fixed input sizes, so at training time, they consider a set of predefined sizes. For each of these predefined sizes, a different network is used (all the networks share the same parameters, however, and the number of parameters is the same because the output of the SPP layer is the same size for all networks). In other words, during training they implement the varying-input-size SPP-net by two fixed-size networks that share parameters.

Neural Turing Machines

Paper: https://arxiv.org/pdf/1410.5401.pdf

The authors in this paper augment neural networks by adding the analog of an “addressable memory” to the system. The network consists of a controller (think CPU) that interacts with the external inputs and produces outputs, and a “memory matrix” (think RAM) that the controller can read from or write to via read/write “heads”. Since neural networks like RNNs with LSTMs only have a limited amount of memory, the idea is that having a dedicated memory matrix will make recalling information easier. The basic structure is as shown in the diagram below:

NTM layout

There are 2 main parts that are new to this network: the reading mechanism and the writing mechanism.

The controller can use either a feed-forward neural net or a recurrent neural net. If an RNN with LSTMs is used, the memory of the LSTMs can be likened to the registers of a CPU, allowing data from multiple time steps to be stored and used together.
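
As an illustration of the reading mechanism, here is a hedged sketch (my own, simplified) of a purely content-based read: the head emits a key, attention weights come from a softmax over similarities with every memory row, and the read vector is the weighted sum of those rows. The full NTM additionally interpolates with the previous weighting and applies shift and sharpening steps for location-based addressing, which are omitted here:

```python
import torch
import torch.nn.functional as F

def content_read(memory, key, beta=1.0):
    # memory: (N, M) matrix of N slots; key: (M,) emitted by the controller; beta: key strength.
    sims = F.cosine_similarity(memory, key.unsqueeze(0), dim=1)   # (N,) similarities
    w = torch.softmax(beta * sims, dim=0)                         # read weights over slots
    return w @ memory                                             # (M,) read vector

memory = torch.randn(128, 20)
print(content_read(memory, torch.randn(20)).shape)   # torch.Size([20])
```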

The NTM is evaluated on algorithmic tasks such as copying, repeat-copying, associative recall, dynamic N-grams and priority sort.

N.B.: What are Hopfield networks?