# Lecture 13: Neural Networks with External Memory

CMSC 35246: Deep Learning

Shubhendu Trivedi & Risi Kondor

University of Chicago

May 10, 2017


#### Neural Networks with Explicit Memory

- We have looked at a number of supervised neural network models
- These models (e.g. for object recognition or machine translation) slowly absorb the training examples into their weights, learning the concept over successive gradient descent iterations

- Knowledge about the concepts is held *implicitly* in the weights
- This leaves networks with only a very limited short-term memory
- In many real tasks of interest we want to store such information explicitly, and to be able to access and manipulate it
- Traditional architectures struggle to memorize facts and to manipulate stored information for tasks such as question answering or programming, which need longer-term memory and out-of-sequence access to information
- Solution: Endow a neural network with an external memory that it can read from and write to

- A memory unit has pieces of information stored at different *locations*
- We want to be able to:
  - Access locations
  - Read/write information from/to these locations
- Attention mechanisms give us a way to point to specific chunks (we saw two examples: machine translation and caption generation)
- Let us now see how we can use these intuitions to construct networks with an explicit external memory
- Slightly anachronistic: We will first look at Neural Turing Machines and then Memory Networks


#### A Primitive Computer Model











- Goal: Turn a Neural Network into a Differentiable Computer
- We roughly want the Neural Network to be the CPU
- Give NN read-write access to an external memory unit
- We want the whole system to be trainable by backpropagation

#### A Turing Machine





- A Turing Machine consists of:
  - A Tape:
    - Consists of cells next to each other
    - Each cell holds a symbol from some finite alphabet (which includes a special blank symbol)
    - Is unbounded: The TM is always given as much tape as it needs for its computation
  - A Head: Can read/write symbols on the tape and move left/right one cell at a time
  - A State Register: Stores the state of the TM (analogous to the "state of mind" of a person performing the computation)
  - A Finite Table of Instructions: Given the current state and the symbol being read on the tape, tells the machine to:
    - Erase or write a symbol in the current cell
    - Move the head L or R
    - Then assume the same or a new state
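
The four components above can be sketched as a small simulator. This is only an illustration under stated assumptions: the function name `run_turing_machine`, the blank symbol `_`, the explicit halting state, and the bit-flipping instruction table `flip` are all hypothetical choices made for the example, not part of the lecture.

```python
from collections import defaultdict

BLANK = "_"  # hypothetical choice of blank symbol


def run_turing_machine(table, tape, state, halt_state, max_steps=1000):
    """Run a one-tape Turing machine and return the non-blank tape contents.

    `table` maps (state, symbol) -> (symbol_to_write, move, next_state),
    where move is "L" or "R".
    """
    # The tape is unbounded: unseen cells read as blank.
    cells = defaultdict(lambda: BLANK, enumerate(tape))
    head = 0
    for _ in range(max_steps):
        if state == halt_state:
            break
        # Look up the instruction for the current state and symbol.
        write, move, state = table[(state, cells[head])]
        cells[head] = write               # erase or write a symbol
        head += 1 if move == "R" else -1  # move the head one cell
    used = [i for i, s in cells.items() if s != BLANK]
    if not used:
        return ""
    return "".join(cells[i] for i in range(min(used), max(used) + 1))


# Example instruction table: flip every bit, halt at the first blank.
flip = {
    ("scan", "0"): ("1", "R", "scan"),
    ("scan", "1"): ("0", "R", "scan"),
    ("scan", BLANK): (BLANK, "L", "done"),
}

print(run_turing_machine(flip, "0110", "scan", "done"))  # prints 1001
```

Every step here is a hard, discrete choice (which symbol, which direction, which state), which is exactly what has to be softened to make the machine trainable by gradient descent.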

Goal: We want to mimic the workings of a Turing Machine in a differentiable manner



- Controller: A neural network, either recurrent (an RNN) or feedforward (e.g. a CNN)
- It receives input vectors and emits output vectors, just like a normal neural network

- The controller is connected to a real-valued memory matrix that it can read from and write to
- The controller interacts with this memory matrix through attentional processes that mimic the heads of a TM
- Main idea: Keep everything differentiable

- As hinted, we don't want to read from or write to the whole memory at once, but want to focus on selected parts
- We do this with an attentional model
- The controller outputs a weight vector: a distribution over the rows of the memory matrix
- We can produce this weighting in two ways:
  - Based on location
  - Based on content
- Let's see both

- The controller emits a key $\mathbf{k}$, which is compared to the content of each memory location $M[i]$
- This comparison is done using some similarity measure $S(\cdot, \cdot)$ (e.g. cosine similarity)
- The similarities are then normalized using a softmax (this should be reminiscent of an earlier lecture)
- Additionally: We define a parameter $\beta \geq 0$ that controls the sharpness of focus

$$\mathbf{w}[i] = \frac{\exp(\beta S(\mathbf{k}, M[i]))}{\sum_{j} \exp(\beta S(\mathbf{k}, M[j]))}$$

- Intuition: Find the memories closest to the key
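
As a rough illustration of the formula above, the following sketch computes a content-based weighting in plain Python. The function names, the toy memory, and the small constant added to the denominator of the cosine similarity are assumptions made for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity S(u, v) between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-8)  # small constant guards against zero vectors


def content_weighting(key, memory, beta):
    """Softmax over beta-scaled similarities between the key and each row."""
    scores = [beta * cosine_similarity(key, row) for row in memory]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy memory with three rows; the key is closest to row 0.
memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_sharp = content_weighting([1.0, 0.1], memory, beta=10.0)
w_flat = content_weighting([1.0, 0.1], memory, beta=0.0)
```

With a large `beta` most of the weight lands on the best-matching row, while `beta = 0` gives a uniform weighting regardless of the key.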

- Content-based addressing is a form of associative lookup: it only cares about what the vectors are, not where they are
- Sometimes we do not care about the contents, only about the actual location
- Idea: The controller outputs a shift kernel $\mathbf{s}$ (usually a softmax over the allowed shifts, between $-1$ and $+1$)
- This is convolved with a weighting $\mathbf{w}$ to produce a shifted weighting $\tilde{\mathbf{w}}$

$$\tilde{\mathbf{w}}[i] = \sum_{j} \mathbf{w}[j] \, \mathbf{s}[i-j]$$

- This gives a way to take a weighting that has already been generated and push it up or down
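
A minimal sketch of the shift step follows. It assumes the convolution is circular (indices wrap around modulo the number of memory rows) and represents the shift kernel as a dictionary from integer shift to probability; both are choices made for this example rather than details stated on the slide.

```python
def shift_weighting(w, s):
    """Circular convolution: w_tilde[i] = sum_j w[j] * s[(i - j) mod N].

    `s` maps an integer shift to its probability, e.g. {-1: 0.0, 0: 0.1, 1: 0.9}.
    """
    n = len(w)
    w_tilde = [0.0] * n
    for j, w_j in enumerate(w):
        for shift, p in s.items():
            # Mass at location j moves to location j + shift (wrapping around).
            w_tilde[(j + shift) % n] += w_j * p
    return w_tilde


# A weighting focused on location 1, shifted forward by exactly one step.
moved = shift_weighting([0.0, 1.0, 0.0, 0.0], {-1: 0.0, 0: 0.0, 1: 1.0})
print(moved)  # prints [0.0, 0.0, 1.0, 0.0]
```

A kernel that is not one-hot blurs the focus over neighbouring locations, so the shifted weighting is still a distribution but may be less sharp than the input.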

#### Motivation

- Why do we need these different addressing mechanisms?
- The idea is to mimic various data structures and accessors in programming languages
- Content key only: Associative map
- Content and location: The key $\mathbf{k}$ finds an array in memory, and the shift indexes into it
- Location only: The shift iterates from the last focus

#### **Reading from Memory**

- We've defined weightings: How do we read from the memory?
- Reading is simple: The read head returns a read vector $\mathbf{r}$ to the controller

$$\mathbf{r} = \sum_{i} \mathbf{w}[i] M[i]$$
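
The read is just a weighted average of memory rows. A small sketch, with a hypothetical function name and toy values:

```python
def read_memory(w, memory):
    """Return r = sum_i w[i] * M[i], a weighted average of the memory rows."""
    width = len(memory[0])
    return [
        sum(w_i * row[c] for w_i, row in zip(w, memory))
        for c in range(width)
    ]


memory = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
r = read_memory([0.5, 0.5, 0.0], memory)
print(r)  # prints [2.0, 3.0]
```

Because the result is a smooth function of both the weighting and the memory contents, gradients can flow back into whatever produced them.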


#### Writing to Memory

- Slightly more complicated
- Decompose it into an erase and an add
- The controller generates an erase vector $\mathbf{e}$ and an add vector $\mathbf{a}$ (both with entries between 0 and 1) and sends them to the write head
- The write head then erases and adds to the memory

$$M[i] \leftarrow M[i] \odot (\mathbf{1} - \mathbf{w}[i]\,\mathbf{e}) + \mathbf{w}[i]\,\mathbf{a}$$
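
The erase-then-add update can be sketched as follows. The function name and the toy values are assumptions for illustration; the function returns a new memory rather than modifying its argument, which is a design choice of this sketch.

```python
def write_memory(memory, w, erase, add):
    """Apply M[i] <- M[i] * (1 - w[i] * e) + w[i] * a to every row i."""
    return [
        [m * (1.0 - w_i * e) + w_i * a for m, e, a in zip(row, erase, add)]
        for w_i, row in zip(w, memory)
    ]


memory = [[1.0, 1.0], [2.0, 2.0]]
# Focus fully on row 0; erase only its first component, then add 0.5 to both.
updated = write_memory(memory, w=[1.0, 0.0], erase=[1.0, 0.0], add=[0.5, 0.5])
print(updated)  # prints [[0.5, 1.5], [2.0, 2.0]]
```

Rows with zero weight are left untouched, and a row is fully overwritten only where both its weight and the corresponding erase entry are 1.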

- Basic motivation: How to turn Neural Networks into differentiable computers?
- Roughly: Want our NN to operate as a CPU that can read/write to an external memory
- Idea: The two together should be able to learn to program from input and output examples using backpropagation
- Separate computation from memory

- Task: Read a sequence of vectors and then reproduce the whole sequence
- The network implements a simple algorithm. Time goes from left to right. The left column shows the write weightings, the right column shows the read weightings



- Interesting part: Trained on sequences of length 10, it generalizes to sequences of length 120



#### Task 2: Copy N Times

- For loop: Given a sequence and a number N, reproduce the sequence N times



Figure: Neural Turing Machine, Graves et al.



#### Task 3: Priority Sort










#### NTM v2.0: Differentiable Neural Computers





- Remember: The whole architecture is recurrent even if the controller is not
- Let us see how far we can push this paradigm of modifying these real numbers, i.e. the memory

- We now have three attentional processes:
  - Based on content
  - Based on memory allocation
  - Based on temporal order
- The controller interpolates between these three using scalar gates

- The NTM could only allocate memory in contiguous blocks
- This leads to memory management issues (blocks start to overlap)
- We can define a differentiable free list that keeps track of the usage $\mathbf{u}_t$ of each memory location
- Usage is increased after a write with weighting $\mathbf{w}_t^w$, and may be decreased after each read with weighting $\mathbf{w}_t^{r,i}$ (one per read head $i = 1, \dots, R$), controlled by a free gate $f_t^i$

$$\mathbf{u}_{t} = (\mathbf{u}_{t-1} + \mathbf{w}_{t-1}^{w} - \mathbf{u}_{t-1} \odot \mathbf{w}_{t-1}^{w}) \odot \prod_{i=1}^{R} (\mathbf{1} - f_{t}^{i} \mathbf{w}_{t-1}^{r,i})$$
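
The usage update can be sketched directly from the formula. The function and argument names are hypothetical, and the toy example uses a single read head.

```python
def update_usage(u_prev, w_write_prev, read_weightings_prev, free_gates):
    """Compute u_t from the previous usage, write weighting, and read weightings.

    `read_weightings_prev` holds one weighting per read head and
    `free_gates` holds the matching scalar gate for each head.
    """
    usage = []
    for loc, (u, ww) in enumerate(zip(u_prev, w_write_prev)):
        # Product over read heads: how much of this location is retained.
        retention = 1.0
        for f, w_read in zip(free_gates, read_weightings_prev):
            retention *= 1.0 - f * w_read[loc]
        # Writing raises usage towards 1; freeing scales it back down.
        usage.append((u + ww - u * ww) * retention)
    return usage


# Location 0 was just written; location 1 was just read with the free gate open.
u = update_usage(
    u_prev=[0.0, 0.5],
    w_write_prev=[1.0, 0.0],
    read_weightings_prev=[[0.0, 1.0]],
    free_gates=[1.0],
)
print(u)  # prints [1.0, 0.0]
```

Locations with low usage are the candidates for the next allocation, so memory can be reused without requiring contiguous blocks.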


#### Allocating Memory: Test

- The system was given a series of random sequences and asked to reproduce them without resetting the memory:



Figure: Hybrid Computing using a Neural Network with Dynamic External Memory, Graves et al.

- A Neural Turing Machine was able to search by content and by index location, but not by the order in which memories were written
- This is essential for tasks in which a sequence of sub-tasks has to be remembered in a certain order
- We can iterate over memory locations in the order in which they were written by making use of a precedence weighting

$$\mathbf{p}_t = \left(1 - \sum_i \mathbf{w}_t^w[i]\right) \mathbf{p}_{t-1} + \mathbf{w}_t^w$$

- $\mathbf{p}_t$ is used to update a matrix $L_t$ called the Temporal Link Matrix

$$L_t[i,j] = (1 - \mathbf{w}_t^w[i] - \mathbf{w}_t^w[j])L_{t-1}[i,j] + \mathbf{w}_t^w[i]\mathbf{p}_{t-1}[j]$$

- The controller can use $L_t$ to retrieve the row that was written immediately before or after the row that was last read
- This allows the controller to iterate in time
- Three-way gates are used to interpolate between the forward and backward iterations as well as content
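
The two updates can be sketched together. The function names are hypothetical. Two further details are assumptions taken from the DNC paper rather than from the slides: self-links are fixed at zero, and the "forward" weighting is obtained by multiplying the link matrix with the last read weighting. Note that the link update must use the precedence from before the current write.

```python
def update_precedence(p_prev, w_write):
    """p_t = (1 - sum_i w[i]) * p_{t-1} + w."""
    keep = 1.0 - sum(w_write)
    return [keep * p + w for p, w in zip(p_prev, w_write)]


def update_links(L_prev, p_prev, w_write):
    """L_t[i][j] = (1 - w[i] - w[j]) * L_{t-1}[i][j] + w[i] * p_{t-1}[j]."""
    n = len(w_write)
    return [
        [
            0.0 if i == j  # assumption: self-links are excluded
            else (1.0 - w_write[i] - w_write[j]) * L_prev[i][j]
            + w_write[i] * p_prev[j]
            for j in range(n)
        ]
        for i in range(n)
    ]


n = 3
p = [0.0] * n
L = [[0.0] * n for _ in range(n)]
# Write to location 0, then to location 1.
for w_write in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]):
    L = update_links(L, p, w_write)   # uses p from before this write
    p = update_precedence(p, w_write)

# L[1][0] == 1.0 records that location 1 was written right after location 0.
last_read = [1.0, 0.0, 0.0]
forward = [sum(L[i][j] * last_read[j] for j in range(n)) for i in range(n)]
print(forward)  # prints [0.0, 1.0, 0.0]
```

Starting from a read focused on location 0, the forward weighting points at location 1, the row written next, which is how the controller can step through memories in write order.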


#### Architecture



Figure: Hybrid Computing using a Neural Network with Dynamic External Memory, Graves et al.



#### Shortest Paths



Figure: Hybrid Computing using a Neural Network with Dynamic External Memory, Graves et al.




#### Family Tree



Figure: Hybrid Computing using a Neural Network with Dynamic External Memory, Graves et al. https://www.youtube.com/watch?v=B9U8sI7TcMY



Sample:

- Sheep are afraid of wolves
- Cats are afraid of dogs
- Mice are afraid of cats
- Gertrude is a sheep

Question: What is Gertrude afraid of?

Answer: Wolves

#### Motivation

- Sentences are accessed out of order
- There can be many sentences in between: long term dependencies

- Neural network model with an external memory
- Soft attention mechanisms are used to read from memory
- Depending on the task, we can do multiple hops over the memory
- The goal again is to keep the system differentiable end-to-end
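
One way to picture soft attention with multiple hops is the sketch below. It is a simplification under several assumptions that are not in the slides: sentences and the question are taken to be already embedded as vectors, the same memory vectors are used both for matching and for reading, the match score is a dot product, and each hop adds the read vector to the query. The names `attend` and `multi_hop_read` are hypothetical.

```python
import math


def softmax(xs):
    top = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, memory):
    """One soft read: attention weights from dot products, then a weighted sum."""
    scores = [sum(q * m for q, m in zip(query, row)) for row in memory]
    p = softmax(scores)
    width = len(memory[0])
    return [sum(p_i * row[c] for p_i, row in zip(p, memory)) for c in range(width)]


def multi_hop_read(query, memory, hops):
    """Repeat the soft read, folding each result back into the query."""
    u = list(query)
    for _ in range(hops):
        o = attend(u, memory)
        u = [a + b for a, b in zip(u, o)]  # assumed update rule: u <- u + o
    return u


memory = [[1.0, 0.0], [0.0, 1.0]]
u2 = multi_hop_read([2.0, 0.0], memory, hops=2)
```

Each hop lets the result of the previous read change what is looked up next, which is what a question such as the Gertrude example needs (first find what Gertrude is, then find what that animal fears). Every step is a smooth function, so the whole read is differentiable end-to-end.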
