CS224n assignment 2
這次的作業主要目的是讓我們實作 Dependency Parsing 以及熟悉 Tensorflow 的運作原理。
1. Tensorflow Softmax
In this question, we will implement a linear classifier with loss function \[ J(\mathbf{W}) = CE(\mathbf{y},softmax(\mathbf{xW} + \mathbf{b})) \] Where \(\mathbf{x}\) is a vector of features, \(\mathbf{W}\) is the model’s weight matrix, and \(\mathbf{b}\) is a bias term. We will use TensorFlow’s automatic differentiation capability to fit this model to provided data.
(a) 使用 Tensorflow 實作 Softmax
Implement the softmax function using TensorFlow in q1_softmax.py. Remember that \[ softmax(x)_i = \dfrac{e^{x_i}}{\sum_j{e^{x_j}}} \] Note that you may not use tf.nn.softmax or related built-in functions. You can run basic (nonexhaustive tests) by running python q1_softmax.py.
1.(a) solution
| 1 | def softmax(x): | 
(b) 使用 Tensorflow 實作 Cross Entropy Loss
Implement the cross-entropy loss using TensorFlow in q1_softmax.py. Remember that \[ CE(\mathbf{y},\mathbf{\hat{y}})=-\sum\limits_{i=1}^{N_c} y_i log(\hat{y_i})\] where \(\mathbf{y} \in \mathbb{R}^{N_c}\) is a one-hot label vector and \(Nc\) is the number of classes. This loss is summed over all examples in a minibatch. Note that you may not use TensorFlow’s built-in cross-entropy functions for this question. You can run basic (non-exhaustive tests) by running python q1_softmax.py.
1.(b) solution
| 1 | def cross_entropy_loss(y, yhat): | 
(c) Placeholder Variable 與 Feed Dictionary
Carefully study the Model class in model.py. Briefly explain the purpose of placeholder variables and feed dictionaries in TensorFlow computations. Fill in the implementations for add_placeholders and create_feed_dict in q1_classifier.py.
1.(c) solution
Hint: Note that configuration variables are stored in the Config class. You will need to use these configuration variables in the code.
| 1 | def add_placeholders(self): | 
| 1 | def create_feed_dict(self, inputs_batch, labels_batch=None): | 
(d) 使用 Tensorflow 建立網路架構
Implement the transformation for a softmax classifier in the function add_prediction_op in q1_classifier.py. Add cross-entropy loss in the function add_loss_op in the same file. Use the implementations from the earlier parts of the problem (already imported for you), not TensorFlow built-ins.
1.(d) solution
| 1 | def add_prediction_op(self): | 
| 1 | def add_loss_op(self, pred): | 
(e) 使用 Tensorflow 加入 Optimizer
Fill in the implementation for add_training_op in q1_classifier.py. Explain in a few sentences what happens when the model’s train_op is called (what gets computed during forward propagation, what gets computed during backpropagation, and what will have changed after the op has been run?). Verify that your model is able to fit to synthetic data by running python q1_classifier.py and making sure that the tests pass.
Hint: Make sure to use the learning rate specified in Config.
1.(e) solution
| 1 | def add_training_op(self, loss): | 
2. Neural Transition-Based Dependency Parsing
In this section, you’ll be implementing a neural-network based dependency parser. A dependency parser analyzes the grammatical structure of a sentence, establishing relationships between “head” words and words which modify those heads. Your implementation will be a transition-based parser, which incrementally builds up a parse one step at a time. At every step it maintains a partial parse, which is represented as follows:
- A stack of words that are currently being processed.
- A buffer of words yet to be processed.
- A list of dependencies predicted by the parser.
Initially, the stack only contains ROOT, the dependencies lists is empty, and the buffer contains all words of the sentence in order. At each step, the parse applies a transition to the partial parse until its buffer is empty and the stack is size 1. The following transitions can be applied:
- SHIFT: removes the first word from the buffer and pushes it onto the stack.
- LEFT-ARC: marks the second (second most recently added) item on the stack as a dependent of the first item and removes the second item from the stack.
- RIGHT-ARC: marks the first (most recently added) item on the stack as a dependent of the second item and removes the first item from the stack.
Your parser will decide among transitions at each state using a neural network classifier. First, you will implement the partial parse representation and transition functions.
(a) 試試看
Go through the sequence of transitions needed for parsing the sentence “I parsed this sentence correctly”. The dependency tree for the sentence is shown below. At each step, give the configuration of the stack and buffer, as well as what transition was applied this step and what new dependency was added (if any). The first three steps are provided below as an example.

2.(a) solution
| stack | buffer | new dependency | transition | 
|---|---|---|---|
| [ROOT] | [I, parsed, this, setence, correctly] | Initial Configuration | |
| [ROOT, I] | [parsed, this, setence, correctly] | SHIFT | |
| [ROOT, I, parsed] | [this, setence, correctly] | SHIFT | |
| [ROOT, parsed] | [this, setence, correctly] | parsed -> I | LEFT-ARC | 
| [ROOT, parsed, this] | [setence, correctly] | SHIFT | |
| [ROOT, parsed, this] | [setence, correctly] | SHIFT | |
| [ROOT, parsed, this, sentence] | [correctly] | SHIFT | |
| [ROOT, parsed, setence] | [correctly] | sentence -> this | LEFT-ARC | 
| [ROOT, parsed] | [correctly] | parse -> sentence | RIGHT-ARC | 
| [ROOT, parsed, correctly] | [] | SHFIT | |
| [ROOT, parsed] | [] | parse -> correctly | RIGHT-ARC | 
| [ROOT] | [] | ROOT -> parsed | RIGHT-ARC | 
(b) 時間複雜度
A sentence containing n words will be parsed in how many steps (in terms of n)? Briefly explain why.
2.(b) solution
每個詞都會進行一次 SHIFT 及 LEFT/RIGHT-ARC,因此共 2n 次。
(c) 實作 Dependency Parsing
Implement the __init__ and parse_step functions in the PartialParse class in q2_parser_transitions.py. This implements the transition mechanics your parser will use. You can run basic (not-exhaustive) tests by running python q2_parser_transitions.py.
2.(c) solution
| 1 | def __init__(self, sentence): | 
(d) Minibatch Dependency Parsing
Our network will predict which transition should be applied next to a partial parse. We could use it to parse a single sentence by applying predicted transitions until the parse is complete. However, neural networks run much more efficiently when making predictions about batches of data at a time (i.e., predicting the next transition for a many different partial parses simultaneously). We can parse sentences in minibatches with the following algorithm.

Implement this algorithm in the minibatch_parse function in q2_ parser_transitions.py. You can run basic (not-exhaustive) tests by running python q2_parser_transitions.py.
Note: You will need minibatch_parse to be correctly implemented to evaluate the model you will build in part (h). However, you do not need it to train the model, so you should be able to complete most of part (h) even if minibatch parse is not implemented yet.
2.(d) solution
| 1 | def minibatch_parse(sentences, model, batch_size): | 
(e) Xavier initialization
In order to avoid neurons becoming too correlated and ending up in poor local minimina, it is often helpful to randomly initialize parameters. One of the most frequent initializations used is called Xavier initialization.
Given a matrix \(A\) of dimension \(m × n\), Xavier initialization selects values \(A_{ij}\) uniformly from \([-\epsilon,\epsilon]\), where
\[\epsilon=\dfrac{\sqrt{6}}{\sqrt{m+n}}\]
Implement the initialization in xavier weight init in q2 _initialization.py. You can run basic (nonexhaustive tests) by running python q2_initialization.py. This function will be used to initialize \(W\) and \(U\).
2.(e) solution
| 1 | def xavier_weight_init(): | 
(f) Dropout
We will regularize our network by applying Dropout. During training this randomly sets units in the hidden layer \(h\) to zero with probability \(p_{drop}\) and then multiplies \(h\) by a constant \(\gamma\) (dropping different units each minibatch). We can write this as
\[h_{drop}=\gamma d \circ h\]
where \(d ∈ \{0, 1\}^{D_h}\) (\(D_h\) is the size of \(h\)) is a mask vector where each entry is 0 with probability \(p_{drop}\) and 1 with probability (1 − \(p_{drop}\)). \(\gamma\) is chosen such that the value of hdrop in expectation equals \(h\):
\[E_{p_{drop}}[h_{drop}]_i=h_i\]
for all \(0 < i < D_h\). What must \(\gamma\) equal in terms of \(p_{drop}\)? Briefly justify your answer.
2.(f) solution
\(E_p{h_p}_i=E_p[\gamma d_i h_i] = p \times 0 + (1-p) \times \gamma h_i = (1-p)\gamma h_i = h_i\)
\(\Rightarrow r = \dfrac{1}{1-p}\)
(g) Adam Optmizer
We will train our model using the Adam optimizer. Recall that standard SGD uses the update rule
\[\theta \leftarrow \theta - \alpha \nabla_\theta J_{minibatch}(\theta)\]
where \(\theta\) is a vector containing all of the model parameters, \(J\) is the loss function, \(\nabla_\theta J_{minibatch}(\theta)\) is the gradient of the loss function with respect to the parameters on a minibatch of data, and \(\alpha\) is the learning rate. Adam uses a more sophisticated update rule with two additional steps.
First, Adam uses a trick called momentum by keeping track of m, a rolling average of the gradients:
\[m \leftarrow \beta_1 m + (1-\beta_1) \nabla_\theta J_{minibatch}(\theta)\]
\[\theta \leftarrow \theta - \alpha m\]
where \(\beta_1\) is a hyperparameter between 0 and 1 (often set to 0.9). Briefly explain (you don’t need to prove mathematically, just give an intuition) how using \(m\) stops the updates from varying as much. Why might this help with learning?
Adam also uses adaptive learning rates by keeping track of v, a rolling average of the magnitudes of the gradients:
\[m \leftarrow \beta_1 m + (1-\beta_1) \nabla_\theta J_{minibatch}(\theta)\]
\[m \leftarrow \beta_2 v + (1-\beta_2) \nabla_\theta J_{minibatch}(\theta) \circ \nabla_\theta J_{minibatch}(\theta)\]
\[\theta \leftarrow \theta - \alpha \circ m / \sqrt{v}\]
where \(\circ\) and \(/\) denote elementwise multiplication and division (so \(z \circ z\) is elementwise squaring) and \(\beta_2\) is a hyperparameter between 0 and 1 (often set to 0.99). Since Adam divides the update by \(\sqrt{v}\), which of the model parameters will get larger updates? Why might this help with learning?
(h) Parser 模型實作
In q2_parser_model.py implement the neural network classifier governing the dependency parser by filling in the appropriate sections. We will train and evaluate our model on the Penn Treebank (annotated with Universal Dependencies).Run python q2_parser_model.py to train your model and compute predictions on the test data (make sure to turn off debug settings when doing final evaluation).
Hints:
- When debugging, pass the keyword argument - debug=Trueto the main method (it is set to- trueby default). This will cause the code to run over a small subset of the data, so the training the model won’t take as long.
- This code should run within 1 hour on a CPU. 
- When running with - debug=True, you should be able to get a loss smaller than 0.2 and a UAS larger than 65 on the dev set (although in rare cases your results may be lower, there is some randomness when training). When running with- debug=False, you should be able to get a loss smaller than 0.08 on the train set and an Unlabeled Attachment Score larger than 87 on the dev set. For comparison, the model in the original neural dependency parsing paper gets 92.5 UAS. If you want, you can tweak the hyperparameters for your model (hidden layer size, hyperparameters for Adam, number of epochs, etc.) to improve the performance (but you are not required to do so).
2.(h) solution
| 1 | import cPickle |