CS224n assignment 1

2018-12-05
Coursework / Natural Language Processing

This is the first assignment of Natural Language Processing with Deep Learning, an open course offered by Stanford University. By deriving the computations of neural networks mathematically and implementing word2vec for sentiment analysis, the assignment lays a foundation for beginners in natural language processing.

The assignment is divided into 4 parts:

  1. Softmax
  2. Neural Network Basics
  3. word2vec
  4. Sentiment Analysis

1. Softmax

(a) Properties of the softmax function

Prove that softmax is invariant to constant offsets in the input, that is, for any input vector \(x\) and any constant \(c\), \[softmax(x) = softmax(x + c)\] where \(x + c\) means adding the constant \(c\) to every dimension of \(x\). Remember that \[softmax(x)_i=\dfrac{e^{x_i}}{\sum_j{e^{x_j}}}\] Note: In practice, we make use of this property and choose \(c = − max_i x_i\) when computing softmax probabilities for numerical stability (i.e., subtracting its maximum element from all elements of \(x\)).

solution

Prove that adding a constant to every element of the softmax input \(x\) leaves the result unchanged.

\(softmax(x+c)_i=\dfrac{e^{x_i+c}}{\sum_j{e^{x_j+c}}}=\dfrac{e^{x_i}e^c}{\sum_j{e^{x_j}e^c}}=\dfrac{e^ce^{x_i}}{e^c\sum_j{e^{x_j}}}=\dfrac{e^{x_i}}{\sum_j{e^{x_j}}}=softmax(x)\)

This property is commonly used when computing softmax in practice: subtracting the maximum element of \(x\) from every element keeps the exponentials from overflowing, making the computation numerically stable without changing the result.
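A quick illustration of why the shift matters (a standalone sketch, not part of the assignment code):

import numpy as np

x = np.array([1000.0, 1001.0, 1002.0])

naive = np.exp(x) / np.sum(np.exp(x))                          # exp(1000) overflows to inf -> nan
stable = np.exp(x - np.max(x)) / np.sum(np.exp(x - np.max(x)))

print(naive)    # [nan nan nan] (with overflow warnings)
print(stable)   # [0.09003057 0.24472847 0.66524096]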

(b) Implementing the softmax function

Given an input matrix of N rows and D columns, compute the softmax prediction for each row using the optimization in part (a). Write your implementation in q1_softmax.py. You may test by executing python q1_softmax.py. Note: The provided tests are not exhaustive. Later parts of the assignment will reference this code so it is important to have a correct implementation. Your implementation should also be efficient and vectorized whenever possible (i.e., use numpy matrix operations rather than for loops). A non-vectorized implementation will not receive full credit!

solution

Input: x -- an N x D numpy matrix

Output: x -- softmax(x); the value of x may be modified in place.

Note that the 1 x D (vector) case and the N x D with N ≥ 2 (matrix) case are handled differently.

def softmax(x):
    orig_shape = x.shape

    if len(x.shape) > 1:
        # Matrix
        x = np.apply_along_axis(softmax, 1, x)
    else:
        # Vector
        x -= np.amax(x)
        x = np.exp(x) / np.sum(np.exp(x))

    assert x.shape == orig_shape
    return x
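A quick sanity check of the implementation (illustrative usage; assumes numpy is imported as in q1_softmax.py):

print(softmax(np.array([1.0, 2.0])))
# [0.26894142 0.73105858]

print(softmax(np.array([[1001.0, 1002.0],
                        [3.0, 4.0]])))
# [[0.26894142 0.73105858]
#  [0.26894142 0.73105858]]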

2. Neural Network Basics

(a) Deriving the gradient of the sigmoid function

Derive the gradients of the sigmoid function and show that it can be rewritten as a function of the function value (i.e., in some expression where only \(\sigma(x)\), but not \(x\), is present). Assume that the input \(x\) is a scalar for this question. Recall, the sigmoid function is \[\sigma(x)=\dfrac{1}{1+e^{-x}}\]

solution

\(\dfrac{\partial\sigma(x)}{\partial x}=\dfrac{-1}{(1+e^{-x})^2}\times e^{-x}\times (-1)=\dfrac{e^{-x}}{(1+e^{-x})^2}=\dfrac{1}{1+e^{-x}}\times{\dfrac{1+e^{-x}-1}{1+e^{-x}}}=\sigma(x)(1-\sigma(x))\)

(b) Deriving the gradient of the cross-entropy loss

Derive the gradient with regard to the inputs of a softmax function when cross entropy loss is used for evaluation, i.e., find the gradients with respect to the softmax input vector \(\theta\), when the prediction is made by \(\hat y = softmax(\theta)\). Remember the cross entropy function is \[CE(y,\hat{y})=-\sum_i y_i log(\hat{y_i})\] where \(y\) is the one-hot label vector, and \(\hat{y}\) is the predicted probability vector for all classes. (Hint: you might want to consider the fact many elements of \(y\) are zeros, and assume that only the k-th dimension of \(y\) is one.)

solution

Compute \(\dfrac{\partial CE(y,\hat{y})}{\partial \theta}\).

Split into two cases, where \(n\) indexes the component of \(\theta\) we differentiate with respect to and \(m\) indexes the terms of the loss:

1. m = n

\(\dfrac{\partial (-y_n log\hat{y_n})}{\partial \theta_n}=-y_n \times \dfrac{1}{\hat{y_n}}\times \dfrac{\partial}{\partial \theta_n}\dfrac{e^{\theta_n}}{\sum_j{e^{\theta_j}}}=-y_n \times \dfrac{1}{\hat{y_n}}\times\dfrac{e^{\theta_n}\sum_j{e^{\theta_j}}-e^{\theta_n}e^{\theta_n}}{(\sum_j{e^{\theta_j}})^2}\) \(=-y_n \times \dfrac{1}{\hat{y_n}} \times \dfrac{e^{\theta_n}}{\sum_j{e^{\theta_j}}}\times\dfrac{\sum_j{e^{\theta_j}}-e^{\theta_n}}{\sum_j{e^{\theta_j}}}=-y_n \times \dfrac{1}{\hat{y_n}} \times \hat{y_n} \times (1-\hat{y_n}) =-y_n(1-\hat{y_n})\)

2. m ≠ n

\(\dfrac{\partial (-\sum_{m \neq n}{y_m log\hat{y_m}})}{\partial\theta_n}=-\sum_{m \neq n} y_m \times \dfrac{1}{\hat{y_m}} \times \dfrac{\partial}{\partial \theta_n}\dfrac{e^{\theta_m}}{\sum_j{e^{\theta_j}}}=-\sum_{m \neq n} y_m \times \dfrac{1}{\hat{y_m}} \times \dfrac{0-e^{\theta_n}e^{\theta_m}}{(\sum_j{e^{\theta_j}})^2}\)

\(=-\sum_{m \neq n} y_m \times \dfrac{1}{\hat{y_m}} \times \dfrac{-e^{\theta_n}}{\sum_j{e^{\theta_j}}} \times \dfrac{e^{\theta_m}}{\sum_j{e^{\theta_j}}}=-\sum_{m \neq n} y_m \times \dfrac{1}{\hat{y_m}} \times (-\hat{y_n}) \times \hat{y_m}=\sum_{m \neq n}y_m \hat{y_n}\)

Therefore \(\dfrac{\partial CE}{\partial \theta_i}=-y_i(1-\hat{y_i}) + \sum_{k \neq i}y_k \hat{y_i}=-y_i+y_i\hat{y_i} + \sum_{k \neq i}y_k \hat{y_i}=\sum_k{y_k \hat{y_i}}-y_i=\hat{y_i}\sum_k{y_k}-y_i=\hat{y_i}-y_i\)
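This result is easy to verify numerically with a central difference (a standalone sketch, not part of the assignment files):

import numpy as np

def softmax_vec(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

theta = np.random.randn(5)
y = np.zeros(5)
y[2] = 1.0                                        # one-hot label
ce = lambda t: -np.sum(y * np.log(softmax_vec(t)))

analytic = softmax_vec(theta) - y                 # the derived gradient y_hat - y
h = 1e-6
numeric = np.array([(ce(theta + h * np.eye(5)[i]) - ce(theta - h * np.eye(5)[i])) / (2 * h)
                    for i in range(5)])
print(np.max(np.abs(analytic - numeric)))         # should be very small (~1e-9 or less)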

(c) Backpropagation

Derive the gradients with respect to the inputs \(x\) to an one-hidden-layer neural network (that is, find \(\frac{\partial J}{\partial x}\) where \(J = CE(y, \hat{y})\) is the cost function for the neural network). The neural network employs sigmoid activation function for the hidden layer, and softmax for the output layer. Assume the one-hot label vector is \(y\), and cross entropy cost is used. (Feel free to use \(\sigma'(x)\) as the shorthand for sigmoid gradient, and feel free to define any variables whenever you see fit.)

(Figure: the one-hidden-layer neural network)

Recall that the forward propagation is as follows \[h = sigmoid(xW_1 + b_1), \hat{y} = softmax(hW_2 + b_2)\]

Note that here we’re assuming that the input vector (thus the hidden variables and output probabilities) is a row vector to be consistent with the programming assignment. When we apply the sigmoid function to a vector, we are applying it to each of the elements of that vector. \(W_i\) and \(b_i (i = 1, 2)\) are the weights and biases, respectively, of the two layers.

solution

Let \(z_1=xW_1+b_1,\ z_2=hW_2+b_2\).

Then \(\dfrac{\partial J}{\partial x}=\dfrac{\partial J}{\partial z_2}\dfrac{\partial z_2}{\partial h}\dfrac{\partial h}{\partial z_1}\dfrac{\partial z_1}{\partial x}=\left((\hat{y}-y) W_2^\top \circ h \circ (1-h)\right) W_1^\top\), where \(\circ\) denotes element-wise multiplication and \(h \circ (1-h)=\sigma'(z_1)\).

(d) Number of parameters in the neural network

How many parameters are there in this neural network, assuming the input is \(Dx\)-dimensional, the output is \(Dy\)-dimensional, and there are \(H\) hidden units?

solution

Based on the figure in the previous part, this network has three layers \(x,h,\hat{y}\) and four parameter groups \(W_1,b_1,W_2,b_2\) with the following shapes:

\(W_1: D_x \times H\)

\(b_1: H\)

\(W_2: H \times D_y\)

\(b_2: D_y\)

In total there are \((D_x \times H)+H+(H \times D_y)+D_y=(D_x+1)H+(H+1)D_y\) parameters.
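For example, with illustrative sizes \(D_x = 10\), \(H = 5\), and \(D_y = 3\), this gives \((10+1)\times 5+(5+1)\times 3=55+18=73\) parameters.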

(e) Implementing the sigmoid function

Fill in the implementation for the sigmoid activation function and its gradient in q2_sigmoid.py. Test your implementation using python q2_sigmoid.py. Again, thoroughly test your code as the provided tests may not be exhaustive.

solution

Implement the two functions sigmoid and sigmoid_grad; using the result from part (a), both are one-liners.

def sigmoid(x):
    s = 1 / (1 + np.exp(-x))
    return s

def sigmoid_grad(s):
    ds = s * (1 - s)
    return ds

(f) Gradient Check

To make debugging easier, we will now implement a gradient checker. Fill in the implementation for gradcheck naive in q2_gradcheck.py. Test your code using python q2_gradcheck.py.

solution

The comments in the starter code ask for the centered difference method for the gradient check, because its error is smaller than that of the forward/backward difference methods.

Forward difference approximation: \[f'(x)\approx \dfrac{f(x+h)-f(x)}{h}\] Central difference approximation: \[f'(x)\approx \dfrac{f(x+h)-f(x-h)}{2h}\] Backward difference approximation: \[f'(x)\approx \dfrac{f(x)-f(x-h)}{h}\]
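The accuracy difference is easy to see numerically: the central difference has error \(O(h^2)\) while the one-sided formulas have error \(O(h)\). A small standalone comparison (illustrative, not part of the assignment):

import numpy as np

f = np.sin                      # true derivative at x is cos(x)
x, h = 1.0, 1e-4
exact = np.cos(x)

forward  = (f(x + h) - f(x)) / h
backward = (f(x) - f(x - h)) / h
central  = (f(x + h) - f(x - h)) / (2 * h)

print(abs(forward - exact))     # ~4e-5
print(abs(backward - exact))    # ~4e-5
print(abs(central - exact))     # ~1e-9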

def gradcheck_naive(f, x):

    rndstate = random.getstate()
    random.setstate(rndstate)
    fx, grad = f(x)  # Evaluate function value at original point
    h = 1e-4         # Do not change this!

    # Iterate over all indexes ix in x to check the gradient.
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        ix = it.multi_index

        ### YOUR CODE HERE:
        x_upper, x_lower = x.copy(), x.copy()

        x_upper[ix] += h
        random.setstate(rndstate)
        f_upper = f(x_upper)[0]

        x_lower[ix] -= h
        random.setstate(rndstate)
        f_lower = f(x_lower)[0]

        numgrad = (f_upper - f_lower) / (2 * h)
        ### END YOUR CODE

        # Compare gradients
        reldiff = abs(numgrad - grad[ix]) / max(1, abs(numgrad), abs(grad[ix]))
        if reldiff > 1e-5:
            print "Gradient check failed."
            print "First gradient error found at index %s" % str(ix)
            print "Your gradient: %f \t Numerical gradient: %f" % (
                grad[ix], numgrad)
            return

        it.iternext()  # Step to next dimension

    print "Gradient check passed!"

(g) Implementing the Neural Network's Forward and Backward Pass

Now, implement the forward and backward passes for a neural network with one sigmoid hidden layer. Fill in your implementation in q2_neural.py. Sanity check your implementation with python q2_neural.py.

solution

def forward_backward_prop(X, labels, params, dimensions):
    ### Unpack network parameters (do not modify)
    ofs = 0
    Dx, H, Dy = (dimensions[0], dimensions[1], dimensions[2])

    W1 = np.reshape(params[ofs:ofs + Dx * H], (Dx, H))
    ofs += Dx * H
    b1 = np.reshape(params[ofs:ofs + H], (1, H))
    ofs += H
    W2 = np.reshape(params[ofs:ofs + H * Dy], (H, Dy))
    ofs += H * Dy
    b2 = np.reshape(params[ofs:ofs + Dy], (1, Dy))

    # Note: compute cost based on `sum` not `mean`.
    ### YOUR CODE HERE: forward propagation
    h = sigmoid(np.dot(X, W1) + b1)
    y_hat = softmax(np.dot(h, W2) + b2)
    cost = -np.sum(labels * np.log(y_hat))
    ### END YOUR CODE

    ### YOUR CODE HERE: backward propagation
    d3 = y_hat - labels
    gradW2 = np.dot(h.T, d3)
    gradb2 = np.sum(d3, axis=0)
    dh = np.dot(d3, W2.T)
    grad_h = sigmoid_grad(h) * dh
    gradW1 = np.dot(X.T, grad_h)
    gradb1 = np.sum(grad_h, axis=0)
    ### END YOUR CODE

    ### Stack gradients (do not modify)
    grad = np.concatenate((gradW1.flatten(), gradb1.flatten(),
                           gradW2.flatten(), gradb2.flatten()))

    return cost, grad

3. word2vec

(a) Computing the gradient of the loss function with respect to \(v_c\)

Assume you are given a “predicted” word vector \(v_c\) corresponding to the center word \(c\) for Skip-Gram, and word prediction is made with the softmax function found in word2vec models

\[\hat{y_o}=p(o|c)=\dfrac{exp(u_o^\top v_c)}{\sum\limits_{w=1}^V{exp(u_w^\top v_c)}}\]

where \(u_w (w = 1, . . . , V )\) are the “output” word vectors for all words in the vocabulary. Assuming cross entropy cost is applied to this prediction and word o is the expected word (the o-th element of the one-hot label vector is one), derive the gradients with respect to \(v_c\). Hint: It will be helpful to use notation from question 2. For instance, letting \(\hat{y}\) be the vector of softmax predictions for every word, \(y\) as the expected word vector, and the loss function

\[J_{softmax−CE}(o, v_c, U) = CE(y,\hat{y})\]

where \(U = [u_1,u_2, · · · ,u_V ]\) is the matrix of all the output vectors. Make sure you state the orientation of your vectors and matrices.

solution

\(\dfrac{\partial J}{\partial v_c}\)

\(J=-\sum_i y_i log\hat{y_i}=-log \dfrac{exp(u_o^\top v_c)}{\sum\limits_{w=1}^V exp(u_w^\top v_c)}=-log(exp(u_o^\top v_c))+log(\sum\limits_{w=1}^V exp(u_w^\top v_c))\)

\(\dfrac{\partial J}{\partial v_c}=\dfrac{\partial -log(exp(u_o^\top v_c))}{\partial v_c}+\dfrac{\partial log(\sum_{w=1}^V exp(u_w^\top v_c))}{\partial v_c}\)

  1. \(\dfrac{-1}{exp(u_o^\top v_c)} \times exp(u_o^\top v_c) \times u_o=-u_o\)

  2. \(\dfrac{1}{\sum\limits_{w=1}^V exp(u_w^\top v_c)} \times \sum\limits_{x=1}^V exp(u_x^\top v_c) \times u_x=\sum\limits_{x=1}^V \dfrac{exp(u_x^\top v_c)}{\sum\limits_{w=1}^V exp(u_w^\top v_c)}\times u_x=\sum\limits_{x=1}^V \hat{y_x} u_x\)

\(\dfrac{\partial J}{\partial v_c}=-u_o+\sum\limits_{x=1}^V \hat{y_x} u_x=U(\hat{y}-y)\)

(b) Computing the gradient of the loss function with respect to \(u_k\)

As in the previous part, derive gradients for the “output” word vectors \(u_k\)’s (including \(u_o\)).

solution

\(\dfrac{\partial J}{\partial u_w}=\dfrac{\partial (-log(exp(u_o^\top v_c)))}{\partial u_w}+\dfrac{\partial log(\sum_{x=1}^V exp(u_x^\top v_c))}{\partial u_w}\)

1. \(w=o\)

\(\dfrac{\partial J}{\partial u_o}=-v_c+\dfrac{exp(u_o^\top v_c)}{\sum\limits_{x=1}^V exp(u_x^\top v_c)}\times v_c=\hat{y_o}v_c-v_c=(\hat{y_o}-1)v_c\)

2. \(w\neq o\)

\(\dfrac{\partial J}{\partial u_w}=0+\dfrac{exp(u_w^\top v_c)}{\sum\limits_{x=1}^V exp(u_x^\top v_c)}\times v_c=\hat{y_w}v_c\)

Stacking these over all \(w\), with the \(u_w\) as the columns of \(U\), gives \(\dfrac{\partial J}{\partial U}=v_c(\hat{y}-y)^\top\).

(c) Negative Sampling

Repeat part (a) and (b) assuming we are using the negative sampling loss for the predicted vector \(v_c\), and the expected output word index is \(o\). Assume that \(K\) negative samples (words) are drawn, and they are \(1, 2, ..., K\) respectively for simplicity of notation \((o \notin {1, . . . , K})\). Again, for a given word, \(o\), denote its output vector as \(u_o\). The negative sampling loss function in this case is

\[J_{neg-sample}(o,v_c,U)=-log(\sigma(u_o^\top v_c))-\sum\limits_{k=1}^K log(\sigma(-u_k^\top v_c))\]

where \(σ(·)\) is the sigmoid function. After you’ve done this, describe with one sentence why this cost function is much more efficient to compute than the softmax-CE loss (you could provide a speed-up ratio, i.e., the runtime of the softmaxCE loss divided by the runtime of the negative sampling loss).

Note: the cost function here is the negative of what Mikolov et al had in their original paper, because we are doing a minimization instead of maximization in our code.

solution

\(1. \mathbf{\dfrac{\partial J}{\partial v_c}}=-\dfrac{1}{\sigma(u_o^\top v_c)}\times \sigma(u_o^\top v_c) \times (1-\sigma(u_o^\top v_c)) \times u_o - \sum\limits_{k=1}^K \dfrac{1}{\sigma(-u_k^\top v_c)} \times \sigma(-u_k^\top v_c) \times (1-\sigma(-u_k^\top v_c)) \times (-u_k)\)

\(=-u_o(1-\sigma(u_o^\top v_c))+\sum\limits_{k=1}^K u_k(1-\sigma(-u_k^\top v_c))\)

\(2. \mathbf{\dfrac{\partial J}{\partial u_o}}=-\dfrac{1}{\sigma(u_o^\top v_c)}\times \sigma(u_o^\top v_c) \times (1-\sigma(u_o^\top v_c)) \times v_c\)

\(=-v_c(1-\sigma(u_o^\top v_c))=(\sigma(u_o^\top v_c)-1)v_c\)

\(3. \mathbf{\dfrac{\partial J}{\partial u_k}}=-\dfrac{1}{\sigma(-u_k^\top v_c)}\times \sigma(-u_k^\top v_c) \times (1-\sigma(-u_k^\top v_c)) \times (-v_c)\)

\(=(1-\sigma(-u_k^\top v_c))v_c,\ k=1,2,...,K\)

This loss is much cheaper to compute than the softmax-CE loss because each evaluation touches only the \(K+1\) sampled output vectors rather than all \(V\) vocabulary words, a speed-up of roughly \(V/(K+1)\).

(d) Computing the gradients for Skip-gram and CBOW

Suppose the center word is \(c = w_t\) and the context words are \([w_{t-m}, ..., w_{t-1}, w_t, w_{t+1}, ..., w_{t+m}]\), where \(m\) is the context size. Derive gradients for all of the word vectors for Skip-Gram and CBOW given the previous parts. Hint: feel free to use \(F(o, v_c)\) (where o is the expected word) as a placeholder for the \(J_{softmax−CE}(o, v_c, ...)\) or \(J_{neg−sample}(o, v_c, ...)\) cost functions in this part — you’ll see that this is a useful abstraction for the coding part. That is, your solution may contain terms of the form \(\frac{\partial F(o,v_c)}{\partial ...}\). Recall that for skip-gram, the cost for a context centered around \(c\) is

\[J_{skip-gram}(w_{t−m...t+m}) = \sum_{−m≤j≤m,j\neq 0}F(w_{t+j} , v_c)\]

where \(w_{t+j}\) refers to the word at the j-th index from the center. CBOW is slightly different. Instead of using \(v_c\) as the predicted vector, we use \(\hat{v}\) defined below. For (a simpler variant of) CBOW, we sum up the input word vectors in the context

\[\hat{v} = \sum_{−m≤j≤m,j\neq 0}v_{w_{t+j}}\]

then the CBOW cost is

\[J_{CBOW}(w_{c−m...c+m}) = F(w_t, \hat{v})\]

Note: To be consistent with the \(\hat{v}\) notation such as for the code portion, for skip-gram \(\hat{v} = v_c\).

solution

  • Skip-gram

\(\dfrac{\partial J_{skip-gram}(w_{t-m...t+m})}{\partial U}=\sum\limits_{-m\leq j\leq m,j\neq 0} \dfrac{\partial F(w_{t+j},v_c)}{\partial U}\)

\(\dfrac{\partial J_{skip-gram}(w_{t-m...t+m})}{\partial v_c}=\sum\limits_{-m\leq j\leq m,j\neq 0} \dfrac{\partial F(w_{t+j},v_c)}{\partial v_c}\)

\(\dfrac{\partial J_{skip-gram}(w_{t-m...t+m})}{\partial v_w}=0,\ \forall w \neq c\)

  • CBOW

\(\dfrac{\partial J_{CBOW}(w_{t-m...t+m})}{\partial U}=\dfrac{\partial F(w_t,\hat{v})}{\partial U}\)

\(\dfrac{\partial J_{CBOW}(w_{t-m...t+m})}{\partial v_{w_{t+j}}}=\dfrac{\partial F(w_t,\hat{v})}{\partial \hat{v}},\ \forall j \in \{-m,...,-1,1,...,m\}\)

\(\dfrac{\partial J_{CBOW}(w_{t-m...t+m})}{\partial v_{w_{t+j}}}=0,\ \forall j \notin \{-m,...,-1,1,...,m\}\)

(e) Implementing Skip-gram

In this part you will implement the word2vec models and train your own word vectors with stochastic gradient descent (SGD). First, write a helper function to normalize rows of a matrix in q3_word2vec.py. In the same file, fill in the implementation for the softmax and negative sampling cost and gradient functions. Then, fill in the implementation of the cost and gradient functions for the skipgram model. When you are done, test your implementation by running python q3_word2vec.py.

Note: If you choose not to implement CBOW (part h), simply remove the NotImplementedError so that your tests will complete.

solution

Implement the following functions in q3_word2vec.py:

  1. normalizeRows
  2. softmaxCostAndGradient
  3. negSamplingCostAndGradient
  4. skipgram
normalizeRows

Normalize each row vector: \(\hat{v}=\dfrac{v}{|v|}\)

def normalizeRows(x):
    length = np.sqrt(np.sum(np.power(x, 2), axis=1))
    x /= length[:, None]
    return x
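A quick usage example (a sketch; assumes numpy is imported as in q3_word2vec.py, and the values match the expected test output shown later):

print(normalizeRows(np.array([[3.0, 4.0],
                              [1.0, 2.0]])))
# [[ 0.6        0.8       ]
#  [ 0.4472136  0.89442719]]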
softmaxCostAndGradient

  • predicted: \(v_c\)
  • target: \(o\)
  • outputVectors: \(U^\top\)
  • gradz2: \(\hat{y}-y\)
  • gradPred: \(\dfrac{\partial J}{\partial v_c}\)
  • grad: the gradient with respect to all the output vectors

Using the results from parts (a) and (b):

\(\dfrac{\partial J}{\partial v_c}=U(\hat{y}-y)\)

\(\dfrac{\partial J}{\partial U}=v_c(\hat{y}-y)^\top\)

def softmaxCostAndGradient(predicted, target, outputVectors, dataset):

    ### YOUR CODE HERE
    y_hat = softmax(np.dot(outputVectors, predicted))
    cost = -np.log(y_hat[target])
    gradz2 = y_hat.copy()
    gradz2[target] -= 1.0
    gradPred = np.dot(outputVectors.T, gradz2)
    grad = np.outer(gradz2, predicted)
    ### END YOUR CODE

    return cost, gradPred, grad
negSamplingCostAndGradient

Using the results from part (c):

gradPred: \(\mathbf{\dfrac{\partial J}{\partial v_c}}=-u_o(1-\sigma(u_o^\top v_c))+\sum\limits_{k=1}^K u_k(1-\sigma(-u_k^\top v_c))\)

grad: \(\mathbf{\dfrac{\partial J}{\partial u_o}}=(\sigma(u_o^\top v_c)-1)v_c\)

grad: \(\mathbf{\dfrac{\partial J}{\partial u_k}}=(1-\sigma(-u_k^\top v_c))v_c,k=1,2,...,K\)

The cost here is the negative sampling loss \(J_{neg-sample}(o,v_c,U)\) from part (c).

def negSamplingCostAndGradient(predicted, target, outputVectors, dataset,
                               K=10):
    # Sampling of indices is done for you. Do not modify this if you
    # wish to match the autograder and receive points!
    indices = [target]
    indices.extend(getNegativeSamples(target, dataset, K))

    ### YOUR CODE HERE
    grad = np.zeros(outputVectors.shape)
    gradPred = np.zeros(predicted.shape)
    z = sigmoid(np.dot(outputVectors[target].T, predicted))
    cost = -np.log(z)
    grad[target] += (z - 1.0) * predicted
    gradPred += (z - 1.0) * outputVectors[target]
    for k in xrange(1, K + 1):
        index = indices[k]
        z = sigmoid(np.dot(-outputVectors[index].T, predicted))
        cost -= np.log(z)
        grad[index] -= (z - 1.0) * predicted
        gradPred -= (z - 1.0) * outputVectors[index]
    ### END YOUR CODE

    return cost, gradPred, grad
skipgram

Using the results from part (d):

  • Skip-gram

gradOut: \(\dfrac{\partial J_{skip-gram}(w_{t-m...t+m})}{\partial U}=\sum\limits_{-m\leq j\leq m,j\neq 0} \dfrac{\partial F(w_{t+j},v_c)}{\partial U}\)

gradIn: \(\dfrac{\partial J_{skip-gram}(w_{t-m...t+m})}{\partial v_c}=\sum\limits_{-m\leq j\leq m,j\neq 0} \dfrac{\partial F(w_{t+j},v_c)}{\partial v_c}\)

gradIn: \(\dfrac{\partial J_{skip-gram}(w_{t-m...t+m})}{\partial v_w}=0,\ \forall w \neq c\)

def skipgram(currentWord, C, contextWords, tokens, inputVectors, outputVectors,
             dataset, word2vecCostAndGradient=softmaxCostAndGradient):

    cost = 0.0
    gradIn = np.zeros(inputVectors.shape)
    gradOut = np.zeros(outputVectors.shape)

    ### YOUR CODE HERE
    current_word_index = tokens[currentWord]
    vc = inputVectors[current_word_index]

    for j in contextWords:
        target_index = tokens[j]
        c_cost, c_grad_in, c_grad_out = word2vecCostAndGradient(
            vc, target_index, outputVectors, dataset)
        cost += c_cost
        gradIn[current_word_index] += c_grad_in
        gradOut += c_grad_out
    ### END YOUR CODE

    return cost, gradIn, gradOut

Run python q3_word2vec.py; output like the following indicates success.

Testing normalizeRows...
[[ 0.6 0.8 ]
[ 0.4472136 0.89442719]]

==== Gradient check for skip-gram ====
Gradient check passed!
Gradient check passed!

(f) Stochastic Gradient Descent

Complete the implementation for your SGD optimizer in q3_sgd.py. Test your implementation by running python q3_sgd.py.

solution

step is the learning rate and is already provided by the starter code; simply update x by subtracting step times the gradient.

def sgd(f, x0, step, iterations, postprocessing=None, useSaved=False,
        PRINT_EVERY=10):

    # Anneal learning rate every several iterations
    ANNEAL_EVERY = 20000

    if useSaved:
        start_iter, oldx, state = load_saved_params()
        if start_iter > 0:
            x0 = oldx
            step *= 0.5 ** (start_iter / ANNEAL_EVERY)

        if state:
            random.setstate(state)
    else:
        start_iter = 0

    x = x0

    if not postprocessing:
        postprocessing = lambda x: x

    expcost = None

    for iter in xrange(start_iter + 1, iterations + 1):
        # Don't forget to apply the postprocessing after every iteration!
        # You might want to print the progress every few iterations.

        cost = None
        ### YOUR CODE HERE
        cost, grad = f(x)
        x -= step * grad
        postprocessing(x)
        ### END YOUR CODE

        if iter % PRINT_EVERY == 0:
            if not expcost:
                expcost = cost
            else:
                expcost = .95 * expcost + .05 * cost
            print "iter %d: %f" % (iter, expcost)

        if iter % SAVE_PARAMS_EVERY == 0 and useSaved:
            save_params(iter, x)

        if iter % ANNEAL_EVERY == 0:
            step *= 0.5

    return x

(g) Training word vectors

Show time! Now we are going to load some real data and train word vectors with everything you just implemented! We are going to use the Stanford Sentiment Treebank (SST) dataset to train word vectors, and later apply them to a simple sentiment analysis task. You will need to fetch the datasets first. To do this, run sh get_datasets.sh. There is no additional code to write for this part; just run python q3_run.py.

Note: The training process may take a long time depending on the efficiency of your implementation (an efficient implementation takes approximately an hour). Plan accordingly!

When the script finishes, a visualization for your word vectors will appear. It will also be saved as q3_word_vectors.png in your project directory. Include the plot in your homework write up. Briefly explain in at most three sentences what you see in the plot.

solution

Run sh get_datasets.sh to download the Stanford Sentiment Treebank (SST) dataset.

Then run python q3_run.py to start training. The default is iterations=40000, and the current cost is printed every 10 iterations; if everything is working, the cost should decrease over the course of training.

iter 39950: 9.435311
iter 39960: 9.492463
iter 39970: 9.520291
iter 39980: 9.524589
iter 39990: 9.550077
iter 40000: 9.577164
sanity check: cost at convergence should be around or below 10
training took 4038 seconds

For reference, training took about 67 minutes on an i7-8750H processor.

After training finishes, a plot q3_word_vectors.png is produced.

(Figure: q3_word_vectors.png)

Looking at q3_run.py, the starter code picks a set of words to visualize; their vectors are reduced with SVD so that the relative distances between these words can be inspected in a two-dimensional plane.
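A minimal sketch of that kind of visualization (hypothetical word list and random vectors standing in for the trained ones; not the actual q3_run.py code):

import numpy as np
import matplotlib.pyplot as plt

words = ["great", "good", "bad", "boring", "brilliant"]   # hypothetical selection
vectors = np.random.randn(len(words), 10)                 # stand-in for trained word vectors

# Center the vectors, then project onto the top two right-singular directions.
centered = vectors - vectors.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
coords = centered.dot(Vt[:2].T)                           # shape (num_words, 2)

for (px, py), w in zip(coords, words):
    plt.text(px, py, w)
plt.xlim(coords[:, 0].min() - 1, coords[:, 0].max() + 1)
plt.ylim(coords[:, 1].min() - 1, coords[:, 1].max() + 1)
plt.savefig("word_vectors_sketch.png")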

(h) Implementing CBOW

Implement the CBOW model in q3_word2vec.py. Note: This part is optional but the gradient derivations for CBOW in part (d) are not!.

solution

Again, using the results from part (d):

gradOut: \(\dfrac{\partial J_{CBOW}(w_{t-m...t+m})}{\partial U}=\dfrac{\partial F(w_t,\hat{v})}{\partial U}\)

gradIn: \(\dfrac{\partial J_{CBOW}(w_{t-m...t+m})}{\partial v_{w_{t+j}}}=\dfrac{\partial F(w_t,\hat{v})}{\partial \hat{v}},\ \forall j \in \{-m,...,-1,1,...,m\}\)

gradIn: \(\dfrac{\partial J_{CBOW}(w_{t-m...t+m})}{\partial v_{w_{t+j}}}=0,\ \forall j \notin \{-m,...,-1,1,...,m\}\)

def cbow(currentWord, C, contextWords, tokens, inputVectors, outputVectors,
         dataset, word2vecCostAndGradient=softmaxCostAndGradient):

    cost = 0.0
    gradIn = np.zeros(inputVectors.shape)
    gradOut = np.zeros(outputVectors.shape)

    ### YOUR CODE HERE
    indices = [tokens[contextWord] for contextWord in contextWords]
    v_hat = np.sum(inputVectors[indices], axis=0)
    target_index = tokens[currentWord]
    cost, grad_in, gradOut = word2vecCostAndGradient(v_hat, target_index, outputVectors, dataset)
    for j in indices:
        gradIn[j] += grad_in
    ### END YOUR CODE

    return cost, gradIn, gradOut

After completing parts (e) and (h), running python q3_word2vec.py should produce the following:

Testing normalizeRows...
[[ 0.6 0.8 ]
[ 0.4472136 0.89442719]]

==== Gradient check for skip-gram ====
Gradient check passed!
Gradient check passed!

==== Gradient check for CBOW ====
Gradient check passed!
Gradient check passed!

4. Sentiment Analysis

Now, with the word vectors you trained, we are going to perform a simple sentiment analysis. For each sentence in the Stanford Sentiment Treebank dataset, we are going to use the average of all the word vectors in that sentence as its feature, and try to predict the sentiment level of the said sentence. The sentiment level of the phrases are represented as real values in the original dataset, here we’ll just use five classes:

“very negative (−−)”, “negative (−)”, “neutral”, “positive (+)”, “very positive (++)”

which are represented by 0 to 4 in the code, respectively. For this part, you will learn to train a softmax classifier, and perform train/dev validation to improve generalization.

(a) Representing a sentence with a feature vector

Implement a sentence featurizer. A simple way of representing a sentence is taking the average of the vectors of the words in the sentence. Fill in the implementation in q4_sentiment.py.

solution

Average the word vectors of all the words in a sentence and use the resulting vector to represent the whole sentence.

Inputs:

tokens: a dictionary mapping each word to its index

wordVectors: the word vectors

sentence: a sentence, given as a list of words

Output:

sentVector: the feature vector of the sentence

def getSentenceFeatures(tokens, wordVectors, sentence):

    sentVector = np.zeros((wordVectors.shape[1],))

    ### YOUR CODE HERE
    for word in sentence:
        index = tokens[word]
        sentVector += wordVectors[index]
    sentVector /= float(len(sentence))
    ### END YOUR CODE

    assert sentVector.shape == (wordVectors.shape[1],)
    return sentVector

(b) Regularization

Explain in at most two sentences why we want to introduce regularization when doing classification (in fact, most machine learning tasks).

solution

Regularization helps prevent overfitting, so the model generalizes better to unseen data.
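As a concrete illustration, an L2 penalty is the most common choice; a minimal sketch of how such a term enters the cost and gradient (illustrative only, not the assignment's exact code):

import numpy as np

def regularized_cost_and_grad(data_cost, data_grad, W, reg):
    cost = data_cost + reg * np.sum(W ** 2)   # penalize large weights
    grad = data_grad + 2.0 * reg * W          # gradient of the penalty term
    return cost, grad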

(c) Hyperparameter search

Fill in the hyperparameter selection code in q4_sentiment.py to search for the “optimal” regularization parameter. You need to implement both getRegularizationValues and chooseBestModel. Attach your code for chooseBestModel to your written write-up. You should be able to attain at least 36.5% accuracy on the dev and test sets using the pretrained vectors in part (d).

solution

Following what others have done, generate an increasing sequence of logarithmically spaced values.

def getRegularizationValues():
    values = None  # Assign a list of floats in the block below
    ### YOUR CODE HERE
    values = np.logspace(-5, 2, num=100, base=10)
    ### END YOUR CODE
    return sorted(values)

def chooseBestModel(results):
    bestResult = None

    ### YOUR CODE HERE
    bestResult = max(results, key=lambda x: x["dev"])
    ### END YOUR CODE

    return bestResult

(d) Model comparison

Run python q4_sentiment.py --yourvectors to train a model using your word vectors from q3. Now, run python q4_sentiment.py --pretrained to train a model using pretrained GloVe vectors (on Wikipedia data). Compare and report the best train, dev, and test accuracies. Why do you think the pretrained vectors did better? Be specific and justify with 3 distinct reasons.

  • yourvectors
Reg Train Dev Test
1.00E-05 31.004 32.516 30.452
1.15E-04 31.063 32.516 30.362
1.12E-03 31.133 32.516 30.362
1.10E-02 30.922 32.334 29.955
1.07E-01 30.290 31.789 29.864
1.05E+00 28.816 29.609 27.059
Best regularization value: 2.26E-05
Test accuracy (%): 30.316742
  • pretrained GloVe vectors (on Wikipedia data)
Reg Train Dev Test
1.00E-05 39.923 36.421 37.059
1.15E-04 39.958 36.512 37.014
1.12E-03 39.899 36.331 37.014
1.10E-02 39.923 36.331 37.195
1.07E-01 39.782 36.240 37.149
1.05E+00 39.478 36.512 37.330
Best regularization value: 1.20E+01
Test accuracy (%): 37.556561

Comparing the two, the model using the pretrained vectors clearly performs better. Possible reasons:

  1. Higher-dimensional word vectors may carry more information
  2. GloVe was trained on a much larger corpus
  3. Differences between the word2vec and GloVe models (see reference)

(e) Train and dev set performance as the regularization term varies

Plot the classification accuracy on the train and dev set with respect to the regularization value for the pretrained GloVe vectors, using a logarithmic scale on the x-axis. This should have been done automatically. Include q4_reg acc.png in your homework write up. Briefly explain in at most three sentences what you see in the plot.

solution

(f) Confusion Matrix

We will now analyze errors that the model makes (with pretrained GloVe vectors). When you ran python q4_sentiment.py --pretrained, two files should have been generated. Take a look at q4_dev_conf.png and include it in your homework writeup. Interpret the confusion matrix in at most three sentences.

solution

The confusion matrix is built by counting the predicted labels against the ground-truth labels; the cells on the diagonal from top-left to bottom-right hold the number of correctly classified examples for each class.
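A minimal sketch of how such a matrix is assembled (illustrative only, with made-up labels):

import numpy as np

labels      = np.array([0, 1, 2, 2, 4, 3, 1])   # ground truth (classes 0-4)
predictions = np.array([0, 1, 1, 2, 3, 3, 1])   # classifier output

conf = np.zeros((5, 5), dtype=int)
for t, p in zip(labels, predictions):
    conf[t, p] += 1                             # row = true class, column = predicted class

print(conf)               # diagonal entries are the correctly classified counts
print(np.trace(conf))     # total number of correct predictions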

(g) Why were these examples misclassified?

Next, take a look at q4_dev_pred.txt. Choose 3 examples where your classifier made errors and briefly explain the error and what features the classifier would need to classify the example correctly (1 sentence per example). Try to pick examples with different reasons.

solution

Taking the absolute difference between each prediction and its ground-truth label gives the following counts:


Difference 0 1 2 3 4
Count 409 444 188 59 1

Most of the errors are within 1 point: if we relax the criterion so that a prediction within 1 point of the label counts as correct, accuracy jumps from the original 37% to about 77%, so the overall predictions are actually fairly reasonable.
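The relaxed accuracy can be computed along these lines (a sketch, with random arrays standing in for the dev-set labels and predictions):

import numpy as np

labels      = np.random.randint(0, 5, size=1101)   # stand-in for the true dev labels
predictions = np.random.randint(0, 5, size=1101)   # stand-in for the classifier output

diff = np.abs(predictions - labels)
exact_acc    = np.mean(diff == 0)     # the usual accuracy
within_1_acc = np.mean(diff <= 1)     # relaxed: off-by-one counts as correct
print(exact_acc, within_1_acc)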

Three misclassified examples (Answer = true label, Predicted = model output):

  1. Answer 0, Predicted 4: "a lackluster , unessential sequel to the classic disney adaptation of j.m. barrie 's peter pan ." This is the only example whose prediction is off by 4 points: lackluster and unessential are negative words, classic leans positive, and j.m. barrie 's peter pan is mostly noise while the remaining words are fairly neutral; removing the proper nouns might improve the predicted score.
  2. Answer 4, Predicted 1: "the draw -lrb- for `` big bad love '' -rrb- is a solid performance by arliss howard ." Too many meaningless symbols interfere with the prediction.
  3. Answer 4, Predicted 1: "it is amusing , and that 's all it needs to be ." amusing should lean positive, but most of the other words feel like filler that dilutes the averaged feature.