CS224n assignment 2

2018-12-31

#NLP #Deep Learning

學校課程/ 自然語言處理

CS224n assignment 2

這次的作業主要目的是讓我們實作 Dependency Parsing 以及熟悉 Tensorflow 的運作原理。

1. Tensorflow Softmax

In this question, we will implement a linear classifier with loss function \[ J(\mathbf{W}) = CE(\mathbf{y},softmax(\mathbf{xW} + \mathbf{b})) \] Where \(\mathbf{x}\) is a vector of features, \(\mathbf{W}\) is the model’s weight matrix, and \(\mathbf{b}\) is a bias term. We will use TensorFlow’s automatic differentiation capability to fit this model to provided data.

(a) 使用 Tensorflow 實作 Softmax

Implement the softmax function using TensorFlow in q1_softmax.py. Remember that \[ softmax(x)_i = \dfrac{e^{x_i}}{\sum_j{e^{x_j}}} \] Note that you may not use tf.nn.softmax or related built-in functions. You can run basic (nonexhaustive tests) by running python q1_softmax.py.

1.(a) solution

1
2
3

def softmax(x):
    out = tf.exp(x) / tf.reduce_sum(tf.exp(x), axis=1, keepdims=True)
    return out

(b) 使用 Tensorflow 實作 Cross Entropy Loss

Implement the cross-entropy loss using TensorFlow in q1_softmax.py. Remember that \[ CE(\mathbf{y},\mathbf{\hat{y}})=-\sum\limits_{i=1}^{N_c} y_i log(\hat{y_i})\] where \(\mathbf{y} \in \mathbb{R}^{N_c}\) is a one-hot label vector and \(Nc\) is the number of classes. This loss is summed over all examples in a minibatch. Note that you may not use TensorFlow’s built-in cross-entropy functions for this question. You can run basic (non-exhaustive tests) by running python q1_softmax.py.

1.(b) solution

1
2
3

def cross_entropy_loss(y, yhat):
    out = -tf.reduce_sum(tf.multiply(tf.to_float(y), tf.log(yhat)))
    return out

(c) Placeholder Variable 與 Feed Dictionary

Carefully study the Model class in model.py. Briefly explain the purpose of placeholder variables and feed dictionaries in TensorFlow computations. Fill in the implementations for add_placeholders and create_feed_dict in q1_classifier.py.

1.(c) solution

Hint: Note that configuration variables are stored in the Config class. You will need to use these configuration variables in the code.

1
2
3

def add_placeholders(self):
    self.input_placeholder = tf.placeholder(dtype=tf.float32, shape=(self.config.batch_size, self.config.n_features))
    self.labels_placeholder = tf.placeholder(dtype=tf.float32, shape=(self.config.batch_size, self.config.n_classes))

def create_feed_dict(self, inputs_batch, labels_batch=None):
    feed_dict = {self.input_placeholder: inputs_batch,
                    self.labels_placeholder: labels_batch}

    return feed_dict

(d) 使用 Tensorflow 建立網路架構

Implement the transformation for a softmax classifier in the function add_prediction_op in q1_classifier.py. Add cross-entropy loss in the function add_loss_op in the same file. Use the implementations from the earlier parts of the problem (already imported for you), not TensorFlow built-ins.

1.(d) solution

def add_prediction_op(self):
    with tf.variable_scope('transformation'):
        b = tf.Variable(tf.zeros(shape=[self.config.n_classes]))
        W = tf.Variable(tf.zeros(shape=[self.config.n_features, self.config.n_classes]))
        z = tf.matmul(self.input_placeholder, W) + b
    pred = softmax(z)
    return pred

1
2
3

def add_loss_op(self, pred):
    loss = cross_entropy_loss(self.labels_placeholder, pred)
    return loss

(e) 使用 Tensorflow 加入 Optimizer

Fill in the implementation for add_training_op in q1_classifier.py. Explain in a few sentences what happens when the model’s train_op is called (what gets computed during forward propagation, what gets computed during backpropagation, and what will have changed after the op has been run?). Verify that your model is able to fit to synthetic data by running python q1_classifier.py and making sure that the tests pass.

Hint: Make sure to use the learning rate specified in Config.

1.(e) solution

1
2
3

def add_training_op(self, loss):
    train_op = tf.train.GradientDescentOptimizer(learning_rate=self.config.lr).minimize(loss)
    return train_op

2. Neural Transition-Based Dependency Parsing

In this section, you’ll be implementing a neural-network based dependency parser. A dependency parser analyzes the grammatical structure of a sentence, establishing relationships between “head” words and words which modify those heads. Your implementation will be a transition-based parser, which incrementally builds up a parse one step at a time. At every step it maintains a partial parse, which is represented as follows:

A stack of words that are currently being processed.
A buffer of words yet to be processed.
A list of dependencies predicted by the parser.

Initially, the stack only contains ROOT, the dependencies lists is empty, and the buffer contains all words of the sentence in order. At each step, the parse applies a transition to the partial parse until its buffer is empty and the stack is size 1. The following transitions can be applied:

SHIFT: removes the first word from the buffer and pushes it onto the stack.
LEFT-ARC: marks the second (second most recently added) item on the stack as a dependent of the first item and removes the second item from the stack.
RIGHT-ARC: marks the first (most recently added) item on the stack as a dependent of the second item and removes the first item from the stack.

Your parser will decide among transitions at each state using a neural network classifier. First, you will implement the partial parse representation and transition functions.

(a) 試試看

Go through the sequence of transitions needed for parsing the sentence “I parsed this sentence correctly”. The dependency tree for the sentence is shown below. At each step, give the configuration of the stack and buffer, as well as what transition was applied this step and what new dependency was added (if any). The first three steps are provided below as an example.

2.(a) solution

stack	buffer	new dependency	transition
[ROOT]	[I, parsed, this, setence, correctly]		Initial Configuration
[ROOT, I]	[parsed, this, setence, correctly]		SHIFT
[ROOT, I, parsed]	[this, setence, correctly]		SHIFT
[ROOT, parsed]	[this, setence, correctly]	parsed -> I	LEFT-ARC
[ROOT, parsed, this]	[setence, correctly]		SHIFT
[ROOT, parsed, this]	[setence, correctly]		SHIFT
[ROOT, parsed, this, sentence]	[correctly]		SHIFT
[ROOT, parsed, setence]	[correctly]	sentence -> this	LEFT-ARC
[ROOT, parsed]	[correctly]	parse -> sentence	RIGHT-ARC
[ROOT, parsed, correctly]	[]		SHFIT
[ROOT, parsed]	[]	parse -> correctly	RIGHT-ARC
[ROOT]	[]	ROOT -> parsed	RIGHT-ARC

(b) 時間複雜度

A sentence containing n words will be parsed in how many steps (in terms of n)? Briefly explain why.

2.(b) solution

每個詞都會進行一次 SHIFT 及 LEFT/RIGHT-ARC，因此共 2n 次。

(c) 實作 Dependency Parsing

Implement the __init__ and parse_step functions in the PartialParse class in q2_parser_transitions.py. This implements the transition mechanics your parser will use. You can run basic (not-exhaustive) tests by running python q2_parser_transitions.py.

2.(c) solution

def __init__(self, sentence):
    self.sentence = sentence
    self.stack = ['ROOT']
    self.buffer = sentence[:]
    self.dependencies = []

def parse_step(self, transition):
    if transition == "S":
        self.stack.append(self.buffer[0])
        self.buffer.pop(0)
    elif transition == "LA":
        self.dependencies.append((self.stack[-1], self.stack[-2]))
        self.stack.pop(-2)
    else:
        self.dependencies.append((self.stack[-2], self.stack[-1]))
        self.stack.pop(-1)

(d) Minibatch Dependency Parsing

Our network will predict which transition should be applied next to a partial parse. We could use it to parse a single sentence by applying predicted transitions until the parse is complete. However, neural networks run much more efficiently when making predictions about batches of data at a time (i.e., predicting the next transition for a many different partial parses simultaneously). We can parse sentences in minibatches with the following algorithm.

Implement this algorithm in the minibatch_parse function in q2_ parser_transitions.py. You can run basic (not-exhaustive) tests by running python q2_parser_transitions.py.

Note: You will need minibatch_parse to be correctly implemented to evaluate the model you will build in part (h). However, you do not need it to train the model, so you should be able to complete most of part (h) even if minibatch parse is not implemented yet.

2.(d) solution

def minibatch_parse(sentences, model, batch_size):
    partial_parse = [PartialParse(sentence) for sentence in sentences]
    dependencies = []

    while len(partial_parse) > 0:
        mini_batch = partial_parse[:batch_size]
        while len(mini_batch) > 0:
            transitions = model.predict(mini_batch)
            for i, action in enumerate(transitions):
                mini_batch[i].parse_step(action)
            mini_batch = [parse for parse in mini_batch if len(parse.stack) > 1 or len(parse.buffer) > 0]
        dependencies.extend(p.dependencies for p in partial_parse[:batch_size])
        partial_parse = partial_parse[batch_size:]

    return dependencies

(e) Xavier initialization

In order to avoid neurons becoming too correlated and ending up in poor local minimina, it is often helpful to randomly initialize parameters. One of the most frequent initializations used is called Xavier initialization.

Given a matrix \(A\) of dimension \(m × n\), Xavier initialization selects values \(A_{ij}\) uniformly from \([-\epsilon,\epsilon]\), where

\[\epsilon=\dfrac{\sqrt{6}}{\sqrt{m+n}}\]

Implement the initialization in xavier weight init in q2 _initialization.py. You can run basic (nonexhaustive tests) by running python q2_initialization.py. This function will be used to initialize \(W\) and \(U\).

2.(e) solution

def xavier_weight_init():
    def _xavier_initializer(shape, **kwargs):
        epsilon = tf.sqrt(6 / tf.cast(np.sum(shape), tf.float32))
        out = tf.random_uniform(shape=shape, minval=-epsilon, maxval=epsilon)

        return out
    # Returns defined initializer function.
    return _xavier_initializer

(f) Dropout

We will regularize our network by applying Dropout. During training this randomly sets units in the hidden layer \(h\) to zero with probability \(p_{drop}\) and then multiplies \(h\) by a constant \(\gamma\) (dropping different units each minibatch). We can write this as

\[h_{drop}=\gamma d \circ h\]

where \(d ∈ \{0, 1\}^{D_h}\) (\(D_h\) is the size of \(h\)) is a mask vector where each entry is 0 with probability \(p_{drop}\) and 1 with probability (1 − \(p_{drop}\)). \(\gamma\) is chosen such that the value of hdrop in expectation equals \(h\):

\[E_{p_{drop}}[h_{drop}]_i=h_i\]

for all \(0 < i < D_h\). What must \(\gamma\) equal in terms of \(p_{drop}\)? Briefly justify your answer.

2.(f) solution

\(E_p{h_p}_i=E_p[\gamma d_i h_i] = p \times 0 + (1-p) \times \gamma h_i = (1-p)\gamma h_i = h_i\)

\(\Rightarrow r = \dfrac{1}{1-p}\)

(g) Adam Optmizer

We will train our model using the Adam optimizer. Recall that standard SGD uses the update rule

\[\theta \leftarrow \theta - \alpha \nabla_\theta J_{minibatch}(\theta)\]

where \(\theta\) is a vector containing all of the model parameters, \(J\) is the loss function, \(\nabla_\theta J_{minibatch}(\theta)\) is the gradient of the loss function with respect to the parameters on a minibatch of data, and \(\alpha\) is the learning rate. Adam uses a more sophisticated update rule with two additional steps.

First, Adam uses a trick called momentum by keeping track of m, a rolling average of the gradients:

\[m \leftarrow \beta_1 m + (1-\beta_1) \nabla_\theta J_{minibatch}(\theta)\]

\[\theta \leftarrow \theta - \alpha m\]

where \(\beta_1\) is a hyperparameter between 0 and 1 (often set to 0.9). Briefly explain (you don’t need to prove mathematically, just give an intuition) how using \(m\) stops the updates from varying as much. Why might this help with learning?

Adam also uses adaptive learning rates by keeping track of v, a rolling average of the magnitudes of the gradients:

\[m \leftarrow \beta_1 m + (1-\beta_1) \nabla_\theta J_{minibatch}(\theta)\]

\[m \leftarrow \beta_2 v + (1-\beta_2) \nabla_\theta J_{minibatch}(\theta) \circ \nabla_\theta J_{minibatch}(\theta)\]

\[\theta \leftarrow \theta - \alpha \circ m / \sqrt{v}\]

where \(\circ\) and \(/\) denote elementwise multiplication and division (so \(z \circ z\) is elementwise squaring) and \(\beta_2\) is a hyperparameter between 0 and 1 (often set to 0.99). Since Adam divides the update by \(\sqrt{v}\), which of the model parameters will get larger updates? Why might this help with learning?

(h) Parser 模型實作

In q2_parser_model.py implement the neural network classifier governing the dependency parser by filling in the appropriate sections. We will train and evaluate our model on the Penn Treebank (annotated with Universal Dependencies).Run python q2_parser_model.py to train your model and compute predictions on the test data (make sure to turn off debug settings when doing final evaluation).

Hints:

When debugging, pass the keyword argument debug=True to the main method (it is set to true by default). This will cause the code to run over a small subset of the data, so the training the model won’t take as long.
This code should run within 1 hour on a CPU.
When running with debug=True, you should be able to get a loss smaller than 0.2 and a UAS larger than 65 on the dev set (although in rare cases your results may be lower, there is some randomness when training). When running with debug=False, you should be able to get a loss smaller than 0.08 on the train set and an Unlabeled Attachment Score larger than 87 on the dev set. For comparison, the model in the original neural dependency parsing paper gets 92.5 UAS. If you want, you can tweak the hyperparameters for your model (hidden layer size, hyperparameters for Adam, number of epochs, etc.) to improve the performance (but you are not required to do so).

2.(h) solution

import cPickle
import os
import time
import tensorflow as tf

from model import Model
from q2_initialization import xavier_weight_init
from utils.parser_utils import minibatches, load_and_preprocess_data


class Config(object):
    n_features = 36
    n_classes = 3
    dropout = 0.5  # (p_drop in the handout)
    embed_size = 50
    hidden_size = 200
    batch_size = 1024
    n_epochs = 10
    lr = 0.0005


class ParserModel(Model):

    def add_placeholders(self):
        self.input_placeholder = tf.placeholder(dtype=tf.int32, shape=(None, self.config.n_features))
        self.labels_placeholder = tf.placeholder(dtype=tf.float32, shape=(None, self.config.n_classes))
        self.dropout_placeholder = tf.placeholder(dtype=tf.float32)

    def create_feed_dict(self, inputs_batch, labels_batch=None, dropout=0):
        feed_dict = {self.input_placeholder: inputs_batch,
                     self.dropout_placeholder: dropout}

        if labels_batch is not None:
            feed_dict[self.labels_placeholder] = labels_batch

        return feed_dict

    def add_embedding(self):
        embedding = tf.Variable(self.pretrained_embeddings)
        embeddings = tf.reshape(tf.nn.embedding_lookup(embedding, self.input_placeholder), [-1, self.config.n_features * self.config.embed_size])
        return embeddings

    def add_prediction_op(self):
        x = self.add_embedding()
        xavier_initializer = xavier_weight_init()
        W = tf.Variable(xavier_initializer((self.config.n_features * self.config.embed_size, self.config.hidden_size)))
        b1 = tf.Variable(tf.zeros(self.config.hidden_size))
        U = tf.Variable(xavier_initializer((self.config.hidden_size, self.config.n_classes)))
        b2 = tf.Variable(tf.zeros(self.config.n_classes))
        h = tf.nn.relu(tf.matmul(x, W) + b1)
        h_drop = tf.nn.dropout(h, 1 - self.dropout_placeholder)
        pred = tf.matmul(h_drop, U) + b2

        return pred

    def add_loss_op(self, pred):
        loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=pred, labels=self.labels_placeholder))
        return loss

    def add_training_op(self, loss):
        train_op = tf.train.AdamOptimizer(learning_rate=self.config.lr).minimize(loss)
        return train_op

    def train_on_batch(self, sess, inputs_batch, labels_batch):
        feed = self.create_feed_dict(inputs_batch, labels_batch=labels_batch,
                                     dropout=self.config.dropout)
        _, loss = sess.run([self.train_op, self.loss], feed_dict=feed)
        return loss

    def run_epoch(self, sess, parser, train_examples, dev_set):
        n_minibatches = 1 + len(train_examples) / self.config.batch_size
        prog = tf.keras.utils.Progbar(target=n_minibatches)
        for i, (train_x, train_y) in enumerate(minibatches(train_examples, self.config.batch_size)):
            loss = self.train_on_batch(sess, train_x, train_y)
            prog.update(i + 1, [("train loss", loss)])

        print "Evaluating on dev set",
        dev_UAS, _ = parser.parse(dev_set)
        print "- dev UAS: {:.2f}".format(dev_UAS * 100.0)
        return dev_UAS

    def fit(self, sess, saver, parser, train_examples, dev_set):
        best_dev_UAS = 0
        for epoch in range(self.config.n_epochs):
            print "Epoch {:} out of {:}".format(epoch + 1, self.config.n_epochs)
            dev_UAS = self.run_epoch(sess, parser, train_examples, dev_set)
            if dev_UAS > best_dev_UAS:
                best_dev_UAS = dev_UAS
                if saver:
                    print "New best dev UAS! Saving model in ./data/weights/parser.weights"
                    saver.save(sess, './data/weights/parser.weights')
            print

    def __init__(self, config, pretrained_embeddings):
        self.pretrained_embeddings = pretrained_embeddings
        self.config = config
        self.build()


def main(debug=True):
    print 80 * "="
    print "INITIALIZING"
    print 80 * "="
    config = Config()
    parser, embeddings, train_examples, dev_set, test_set = load_and_preprocess_data(debug)
    if not os.path.exists('./data/weights/'):
        os.makedirs('./data/weights/')

    with tf.Graph().as_default() as graph:
        print "Building model...",
        start = time.time()
        model = ParserModel(config, embeddings)
        parser.model = model
        init_op = tf.global_variables_initializer()
        saver = None if debug else tf.train.Saver()
        print "took {:.2f} seconds\n".format(time.time() - start)
    graph.finalize()

    with tf.Session(graph=graph) as session:
        parser.session = session
        session.run(init_op)

        print 80 * "="
        print "TRAINING"
        print 80 * "="
        model.fit(session, saver, parser, train_examples, dev_set)

        if not debug:
            print 80 * "="
            print "TESTING"
            print 80 * "="
            print "Restoring the best model weights found on the dev set"
            saver.restore(session, './data/weights/parser.weights')
            print "Final evaluation on test set",
            UAS, dependencies = parser.parse(test_set)
            print "- test UAS: {:.2f}".format(UAS * 100.0)
            print "Writing predictions"
            with open('q2_test.predicted.pkl', 'w') as f:
                cPickle.dump(dependencies, f, -1)
            print "Done!"


if __name__ == '__main__':
    main(False)

(i) 加分題

3. Recurrent Neural Networks: Language Modeling