CS224n assignment 1
This is the first assignment of Natural Language Processing with Deep Learning, an open course from Stanford University. The assignment lays a foundation for learning natural language processing by deriving the computations of a neural network mathematically and by implementing word2vec and applying it to sentiment analysis.
The assignment consists of 4 main parts:
- Softmax
- Neural Network Basics
- word2vec
- Sentiment Analysis
1. Softmax
(a) Properties of softmax
Prove that softmax is invariant to constant offsets in the input, that is, for any input vector \(x\) and any constant \(c\), \[softmax(x) = softmax(x + c)\] where \(x + c\) means adding the constant \(c\) to every dimension of \(x\). Remember that \[softmax(x)_i=\dfrac{e^{x_i}}{\sum_j{e^{x_j}}}\] Note: In practice, we make use of this property and choose \(c = − max_i x_i\) when computing softmax probabilities for numerical stability (i.e., subtracting its maximum element from all elements of \(x\)).
solution
We prove that adding a constant to every component of the softmax input x leaves the result unchanged:
\(softmax(x+c)_i=\dfrac{e^{x_i+c}}{\sum_j{e^{x_j+c}}}=\dfrac{e^{x_i}e^c}{\sum_j{e^{x_j}e^c}}=\dfrac{e^ce^{x_i}}{e^c\sum_j{e^{x_j}}}=\dfrac{e^{x_i}}{\sum_j{e^{x_j}}}=softmax(x)\)
This property is used whenever softmax is computed in practice: subtracting the maximum element of x from every element keeps the exponentials from overflowing, i.e., it improves numerical stability.
(b) Implementing the softmax function
Given an input matrix of N rows and D columns, compute the softmax prediction for each row using the optimization in part (a). Write your implementation in q1_softmax.py. You may test by executing python q1_softmax.py. Note: The provided tests are not exhaustive. Later parts of the assignment will reference this code so it is important to have a correct implementation. Your implementation should also be efficient and vectorized whenever possible (i.e., use numpy matrix operations rather than for loops). A non-vectorized implementation will not receive full credit!
solution
Input: `x` -- an N x D numpy matrix
Output: `x` -- softmax(x); the value of `x` may be modified in place
Note that the 1 x D case and the N x D (N ≥ 2) case are handled differently.
```python
def softmax(x):
```
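A minimal sketch of one possible vectorized implementation using the max-subtraction trick; everything beyond the signature above is my own choice, not the course's reference solution:

```python
import numpy as np

def softmax(x):
    """Row-wise softmax using the max-subtraction trick from part (a)."""
    orig_shape = x.shape
    if len(x.shape) > 1:
        # N x D case: subtract each row's max, exponentiate, normalize each row
        x = x - np.max(x, axis=1, keepdims=True)
        x = np.exp(x)
        x = x / np.sum(x, axis=1, keepdims=True)
    else:
        # 1-D vector case
        x = x - np.max(x)
        x = np.exp(x)
        x = x / np.sum(x)
    assert x.shape == orig_shape
    return x
```

Because of part (a), softmax(np.array([1001.0, 1002.0])) returns the same values as softmax(np.array([1.0, 2.0])) without overflowing.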
2. Neural Network Basics
(a) Deriving the gradient of the sigmoid function
Derive the gradients of the sigmoid function and show that it can be rewritten as a function of the function value (i.e., in some expression where only \(\sigma(x)\), but not \(x\), is present). Assume that the input \(x\) is a scalar for this question. Recall, the sigmoid function is \[\sigma(x)=\dfrac{1}{1+e^{-x}}\]
solution
\(\dfrac{\partial\sigma(x)}{\partial x}=\dfrac{-1}{(1+e^{-x})^2}\times e^{-x}\times (-1)=\dfrac{e^{-x}}{(1+e^{-x})^2}=\dfrac{1}{1+e^{-x}}\times{\dfrac{1+e^{-x}-1}{1+e^{-x}}}=\sigma(x)(1-\sigma(x))\)
(b) Deriving the gradient of the cross entropy loss
Derive the gradient with regard to the inputs of a softmax function when cross entropy loss is used for evaluation, i.e., find the gradients with respect to the softmax input vector \(\theta\), when the prediction is made by \(\hat y = softmax(\theta)\). Remember the cross entropy function is \[CE(y,\hat{y})=-\sum_i y_i log(\hat{y_i})\] where \(y\) is the one-hot label vector, and \(\hat{y}\) is the predicted probability vector for all classes. (Hint: you might want to consider the fact many elements of \(y\) are zeros, and assume that only the k-th dimension of \(y\) is one.)
solution
We compute \(\dfrac{\partial CE(y,\hat{y})}{\partial \theta}\). Write \(CE(y,\hat{y})=-\sum_m y_m log\hat{y_m}\) and differentiate with respect to a single component \(\theta_n\); the terms of the sum split into two cases:
1. m = n
\(\dfrac{\partial (-y_n log\hat{y_n})}{\partial \theta_n}=-y_n \times \dfrac{1}{\hat{y_n}}\times \dfrac{\partial}{\partial \theta_n}\dfrac{e^{\theta_n}}{\sum_j{e^{\theta_j}}}=-y_n \times \dfrac{1}{\hat{y_n}}\times\dfrac{e^{\theta_n}\sum_j{e^{\theta_j}}-e^{\theta_n}e^{\theta_n}}{(\sum_j{e^{\theta_j}})^2}\) \(=-y_n \times \dfrac{1}{\hat{y_n}} \times \dfrac{e^{\theta_n}}{\sum_j{e^{\theta_j}}}\times\dfrac{\sum_j{e^{\theta_j}}-e^{\theta_n}}{\sum_j{e^{\theta_j}}}=-y_n \times \dfrac{1}{\hat{y_n}} \times \hat{y_n} \times (1-\hat{y_n}) =-y_n(1-\hat{y_n})\)
2. m ≠ n
\(\dfrac{\partial (-\sum_{m \neq n}{y_m log\hat{y_m}})}{\partial\theta_n}=-\sum_{m \neq n} y_m \times \dfrac{1}{\hat{y_m}} \times \dfrac{\partial}{\partial \theta_n}\dfrac{e^{\theta_m}}{\sum_j{e^{\theta_j}}}=-\sum_{m \neq n} y_m \times \dfrac{1}{\hat{y_m}} \times \dfrac{0-e^{\theta_n}e^{\theta_m}}{(\sum_j{e^{\theta_j}})^2}\)
\(=-\sum_{m \neq n} y_m \times \dfrac{1}{\hat{y_m}} \times \dfrac{-e^{\theta_n}}{\sum_j{e^{\theta_j}}} \times \dfrac{e^{\theta_m}}{\sum_j{e^{\theta_j}}}=-\sum_{m \neq n} y_m \times \dfrac{1}{\hat{y_m}} \times (-\hat{y_n}) \times \hat{y_m}=\sum_{m \neq n}y_m \hat{y_n}\)
Therefore \(\dfrac{\partial CE}{\partial \theta_i}=-y_i(1-\hat{y_i}) + \sum_{k \neq i}y_k \hat{y_i}=-y_i+y_i\hat{y_i} + \sum_{k \neq i}y_k \hat{y_i}=\sum_k{y_k \hat{y_i}}-y_i=\hat{y_i}\sum_k{y_k}-y_i=\hat{y_i}-y_i\)
(c) Backpropagation
Derive the gradients with respect to the inputs \(x\) to an one-hidden-layer neural network (that is, find \(\frac{\partial J}{\partial x}\) where \(J = CE(y, \hat{y})\) is the cost function for the neural network). The neural network employs sigmoid activation function for the hidden layer, and softmax for the output layer. Assume the one-hot label vector is \(y\), and cross entropy cost is used. (Feel free to use \(\sigma'(x)\) as the shorthand for sigmoid gradient, and feel free to define any variables whenever you see fit.)
Recall that the forward propagation is as follows \[h = sigmoid(xW_1 + b_1), \hat{y} = softmax(hW_2 + b_2)\]
Note that here we’re assuming that the input vector (thus the hidden variables and output probabilities) is a row vector to be consistent with the programming assignment. When we apply the sigmoid function to a vector, we are applying it to each of the elements of that vector. \(W_i\) and \(b_i (i = 1, 2)\) are the weights and biases, respectively, of the two layers.
solution
Let \(z_1=xW_1+b_1,\ h=\sigma(z_1),\ z_2=hW_2+b_2\). Then, using the chain rule with row vectors,
\(\dfrac{\partial J}{\partial x}=\dfrac{\partial J}{\partial z_2}\dfrac{\partial z_2}{\partial h}\dfrac{\partial h}{\partial z_1}\dfrac{\partial z_1}{\partial x}=\big((\hat{y}-y) W_2^\top \circ h\circ(1-h)\big) W_1^\top\)
where \(\circ\) denotes element-wise multiplication.
(d) Number of parameters in the neural network
How many parameters are there in this neural network, assuming the input is \(D_x\)-dimensional, the output is \(D_y\)-dimensional, and there are \(H\) hidden units?
solution
From the network defined in the previous part, there are three layers \(x,h,\hat{y}\) and four parameter tensors \(W_1,b_1,W_2,b_2\), with the following shapes:
\(W_1: D_x \times H\)
\(b_1: H\)
\(W_2: H \times D_y\)
\(b_2: D_y\)
In total there are \((D_x \times H)+H+(H \times D_y)+D_y=(D_x+1)H+(H+1)D_y\) parameters.
(e) Implementing the sigmoid function
Fill in the implementation for the sigmoid activation function and its gradient in q2_sigmoid.py. Test your implementation using python q2_sigmoid.py. Again, thoroughly test your code as the provided tests may not be exhaustive.
solution
Implement the two functions `sigmoid` and its gradient `sigmoid_grad`; both follow directly from the result of part (a).
```python
def sigmoid(x):
```
```python
def sigmoid_grad(s):
```
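One possible way to fill in both functions, directly applying the result of part (a) (a sketch, not the reference solution):

```python
import numpy as np

def sigmoid(x):
    # Element-wise logistic function 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(s):
    # s is assumed to already be sigmoid(x); by part (a),
    # sigma'(x) = sigma(x) * (1 - sigma(x))
    return s * (1.0 - s)
```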
(f) Gradient Check
To make debugging easier, we will now implement a gradient checker. Fill in the implementation for gradcheck_naive in q2_gradcheck.py. Test your code using python q2_gradcheck.py.
solution
The comments specify implementing the gradient check with the centered difference method, because its error is smaller than that of the forward/backward difference methods (truncation error \(O(h^2)\) versus \(O(h)\)).
Forward difference approximation: \[f'(x)\approx \dfrac{f(x+h)-f(x)}{h}\] Central difference approximation: \[f'(x)\approx \dfrac{f(x+h)-f(x-h)}{2h}\] Backward difference approximation: \[f'(x)\approx \dfrac{f(x)-f(x-h)}{h}\]
```python
def gradcheck_naive(f, x):
```
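A sketch of a centered-difference checker. It assumes f(x) returns a (cost, gradient) pair and that x is a numpy array that may be temporarily perturbed in place; the step size, tolerance, and messages are arbitrary choices of this sketch:

```python
import numpy as np
import random

def gradcheck_naive(f, x):
    """Check the analytic gradient of f at x with centered differences."""
    rndstate = random.getstate()   # reset RNG state so a stochastic f is reproducible
    random.setstate(rndstate)
    fx, grad = f(x)                # analytic cost and gradient
    h = 1e-4

    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        ix = it.multi_index
        old_value = x[ix]

        x[ix] = old_value + h
        random.setstate(rndstate)
        fxph, _ = f(x)             # f(x + h)

        x[ix] = old_value - h
        random.setstate(rndstate)
        fxmh, _ = f(x)             # f(x - h)

        x[ix] = old_value          # restore the original value

        numgrad = (fxph - fxmh) / (2.0 * h)   # centered difference
        reldiff = abs(numgrad - grad[ix]) / max(1.0, abs(numgrad), abs(grad[ix]))
        if reldiff > 1e-5:
            print("Gradient check failed at %s: analytic %f, numerical %f"
                  % (str(ix), grad[ix], numgrad))
            return
        it.iternext()

    print("Gradient check passed!")
```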
(g) Implementing the neural network's forward and backward pass
Now, implement the forward and backward passes for a neural network with one sigmoid hidden layer. Fill in your implementation in q2_neural.py. Sanity check your implementation with python q2_neural.py.
solution
```python
def forward_backward_prop(X, labels, params, dimensions):
```
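A sketch of the whole forward/backward pass, assuming the starter code's flat parameter layout (W1, b1, W2, b2 concatenated into `params`) and reusing the softmax and sigmoid helpers from the earlier parts:

```python
import numpy as np
from q1_softmax import softmax                  # helpers from earlier parts
from q2_sigmoid import sigmoid, sigmoid_grad

def forward_backward_prop(X, labels, params, dimensions):
    """Cross-entropy cost and gradient for a 1-hidden-layer network."""
    Dx, H, Dy = (dimensions[0], dimensions[1], dimensions[2])

    # Unpack the flat parameter vector (layout assumed from the starter code)
    ofs = 0
    W1 = np.reshape(params[ofs:ofs + Dx * H], (Dx, H))
    ofs += Dx * H
    b1 = np.reshape(params[ofs:ofs + H], (1, H))
    ofs += H
    W2 = np.reshape(params[ofs:ofs + H * Dy], (H, Dy))
    ofs += H * Dy
    b2 = np.reshape(params[ofs:ofs + Dy], (1, Dy))

    # Forward pass: h = sigmoid(x W1 + b1), yhat = softmax(h W2 + b2)
    h = sigmoid(X.dot(W1) + b1)
    yhat = softmax(h.dot(W2) + b2)
    cost = -np.sum(labels * np.log(yhat))

    # Backward pass, using parts (a)-(c)
    delta2 = yhat - labels                        # dJ/dz2
    gradW2 = h.T.dot(delta2)
    gradb2 = np.sum(delta2, axis=0, keepdims=True)
    delta1 = delta2.dot(W2.T) * sigmoid_grad(h)   # dJ/dz1
    gradW1 = X.T.dot(delta1)
    gradb1 = np.sum(delta1, axis=0, keepdims=True)

    grad = np.concatenate((gradW1.flatten(), gradb1.flatten(),
                           gradW2.flatten(), gradb2.flatten()))
    return cost, grad
```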
3. word2vec
(a) Gradient of the loss function with respect to \(v_c\)
Assume you are given a "predicted" word vector \(v_c\) corresponding to the center word \(c\) for Skip-Gram, and word prediction is made with the softmax function found in word2vec models
\[\hat{y_o}=p(o|c)=\dfrac{exp(u_o^\top v_c)}{\sum\limits_{w=1}^V{exp(u_w^\top v_c)}}\]
where \(u_w (w = 1, . . . , V )\) are the “output” word vectors for all words in the vocabulary. Assuming cross entropy cost is applied to this prediction and word o is the expected word (the o-th element of the one-hot label vector is one), derive the gradients with respect to \(v_c\). Hint: It will be helpful to use notation from question 2. For instance, letting \(\hat{y}\) be the vector of softmax predictions for every word, \(y\) as the expected word vector, and the loss function
\[J_{softmax−CE}(o, v_c, U) = CE(y,\hat{y})\]
where \(U = [u_1,u_2, · · · ,u_V ]\) is the matrix of all the output vectors. Make sure you state the orientation of your vectors and matrices.
solution
We want \(\dfrac{\partial J}{\partial v_c}\), where
\(J=-\sum_i y_i log\hat{y_i}=-log \dfrac{exp(u_o^\top v_c)}{\sum\limits_{w=1}^V exp(u_w^\top v_c)}=-log(exp(u_o^\top v_c))+log(\sum\limits_{w=1}^V exp(u_w^\top v_c))\)
\(\dfrac{\partial J}{\partial v_c}=\dfrac{\partial (-log(exp(u_o^\top v_c)))}{\partial v_c}+\dfrac{\partial log(\sum_{w=1}^V exp(u_w^\top v_c))}{\partial v_c}\)
The first term gives
\(\dfrac{-1}{exp(u_o^\top v_c)} \times exp(u_o^\top v_c) \times u_o=-u_o\)
and the second term gives
\(\dfrac{1}{\sum\limits_{w=1}^V exp(u_w^\top v_c)} \times \sum\limits_{x=1}^V exp(u_x^\top v_c) \times u_x=\sum\limits_{x=1}^V \dfrac{exp(u_x^\top v_c)}{\sum\limits_{w=1}^V exp(u_w^\top v_c)}\times u_x=\sum\limits_{x=1}^V \hat{y_x} u_x\)
so that
\(\dfrac{\partial J}{\partial v_c}=-u_o+\sum\limits_{x=1}^V \hat{y_x} u_x=U(\hat{y}-y)\)
(b) Gradient of the loss function with respect to \(u_k\)
As in the previous part, derive gradients for the “output” word vectors \(u_k\)’s (including \(u_o\)).
solution
\(\dfrac{\partial J}{\partial u_w}=\dfrac{\partial (-log(exp(u_o^\top v_c)))}{\partial u_w}+\dfrac{\partial log(\sum_{x=1}^V exp(u_x^\top v_c))}{\partial u_w}\)
\(1. \ w=o\)
\(=-v_c+\dfrac{exp(u_o^\top v_c)}{\sum\limits_{x=1}^V exp(u_x^\top v_c)} \times v_c=\hat{y_o}v_c-v_c=(\hat{y_o}-1)v_c\)
\(2. \ w\neq o\)
\(=0+\dfrac{exp(u_w^\top v_c)}{\sum\limits_{x=1}^V exp(u_x^\top v_c)} \times v_c=\hat{y_w}v_c\)
Stacking both cases over all \(w\) gives the matrix form \(\dfrac{\partial J}{\partial U}=v_c(\hat{y}-y)^\top\).
(c) Negative Sampling
Repeat part (a) and (b) assuming we are using the negative sampling loss for the predicted vector \(v_c\), and the expected output word index is \(o\). Assume that \(K\) negative samples (words) are drawn, and they are \(1, 2, ..., K\) respectively for simplicity of notation \((o \notin \{1, . . . , K\})\). Again, for a given word, \(o\), denote its output vector as \(u_o\). The negative sampling loss function in this case is
\[J_{neg-sample}(o,v_c,U)=-log(\sigma(u_o^\top v_c))-\sum\limits_{k=1}^K log(\sigma(-u_k^\top v_c))\]
where \(\sigma(\cdot)\) is the sigmoid function. After you've done this, describe with one sentence why this cost function is much more efficient to compute than the softmax-CE loss (you could provide a speed-up ratio, i.e., the runtime of the softmax-CE loss divided by the runtime of the negative sampling loss).
Note: the cost function here is the negative of what Mikolov et al had in their original paper, because we are doing a minimization instead of maximization in our code.
solution
\(1. \ \dfrac{\partial J}{\partial v_c}=-\dfrac{1}{\sigma(u_o^\top v_c)}\times \sigma(u_o^\top v_c) \times (1-\sigma(u_o^\top v_c)) \times u_o - \sum\limits_{k=1}^K \dfrac{1}{\sigma(-u_k^\top v_c)} \times \sigma(-u_k^\top v_c) \times (1-\sigma(-u_k^\top v_c)) \times (-u_k)\)
\(=-u_o(1-\sigma(u_o^\top v_c))+\sum\limits_{k=1}^K u_k(1-\sigma(-u_k^\top v_c))\)
\(2. \ \dfrac{\partial J}{\partial u_o}=-\dfrac{1}{\sigma(u_o^\top v_c)}\times \sigma(u_o^\top v_c) \times (1-\sigma(u_o^\top v_c)) \times v_c\)
\(=-v_c(1-\sigma(u_o^\top v_c))=(\sigma(u_o^\top v_c)-1)v_c\)
\(3. \ \dfrac{\partial J}{\partial u_k}=-\dfrac{1}{\sigma(-u_k^\top v_c)}\times \sigma(-u_k^\top v_c) \times (1-\sigma(-u_k^\top v_c)) \times (-v_c)\)
\(=(1-\sigma(-u_k^\top v_c))v_c,\quad k=1,2,...,K\)
This loss is much cheaper to compute than the softmax-CE loss because each update only touches the \(K+1\) output vectors \(u_o,u_1,...,u_K\) instead of all \(V\) of them, giving a speed-up of roughly \(V/(K+1)\).
(d) Gradients for Skip-gram and CBOW
Suppose the center word is \(c = w_t\) and the context words are \([w_{t-m}, . . ., w_{t-1}, w_t, w_{t+1},. . ., w_{t+m}]\), where \(m\) is the context size. Derive gradients for all of the word vectors for Skip-Gram and CBOW given the previous parts. Hint: feel free to use \(F(o, v_c)\) (where o is the expected word) as a placeholder for the \(J_{softmax-CE}(o, v_c, ...)\) or \(J_{neg-sample}(o, v_c, ...)\) cost functions in this part — you'll see that this is a useful abstraction for the coding part. That is, your solution may contain terms of the form \(\frac{\partial F(o,v_c)} {\partial ...}\). Recall that for skip-gram, the cost for a context centered around \(c\) is
\[J_{skip-gram}(w_{t-m...t+m}) = \sum_{-m\leq j\leq m,j\neq 0}F(w_{t+j} , v_c)\]
where \(w_{t+j}\) refers to the word at the j-th index from the center. CBOW is slightly different. Instead of using \(v_c\) as the predicted vector, we use \(\hat{v}\) defined below. For (a simpler variant of) CBOW, we sum up the input word vectors in the context
\[\hat{v} = \sum_{-m\leq j\leq m,j\neq 0}v_{w_{t+j}}\]
then the CBOW cost is
\[J_{CBOW}(w_{c-m...c+m}) = F(w_t, \hat{v})\]
Note: To be consistent with the \(\hat{v}\) notation such as for the code portion, for skip-gram \(\hat{v} = v_c\).
solution
- Skip-gram
\(\dfrac{\partial J_{skip-gram}(w_{t-m...t+m})}{\partial U}=\sum\limits_{-m\leq j\leq m,j\neq 0} \dfrac{\partial F(w_{t+j},v_c)}{\partial U}\)
\(\dfrac{\partial J_{skip-gram}(w_{t-m...t+m})}{\partial v_c}=\sum\limits_{-m\leq j\leq m,j\neq 0} \dfrac{\partial F(w_{t+j},v_c)}{\partial v_c}\)
\(\dfrac{\partial J_{skip-gram}(w_{t-m...t+m})}{\partial v_w}=0,\ \forall w \neq c\) (only the center word's input vector receives a gradient)
- CBOW
\(\dfrac{\partial J_{CBOW}(w_{t-m...t+m})}{\partial U}=\dfrac{\partial F(w_t,\hat{v})}{\partial U}\)
\(\dfrac{\partial J_{CBOW}(w_{t-m...t+m})}{\partial v_{w_{t+j}}}=\dfrac{\partial F(w_t,\hat{v})}{\partial \hat{v}},\ \forall j \in \{-m,...,-1,1,...,m\}\)
\(\dfrac{\partial J_{CBOW}(w_{t-m...t+m})}{\partial v_{w_{t+j}}}=0,\ \forall j \notin \{-m,...,-1,1,...,m\}\)
(e) Implementing Skip-gram
In this part you will implement the word2vec models and train your own word vectors with stochastic gradient descent (SGD). First, write a helper function to normalize rows of a matrix in q3_word2vec.py. In the same file, fill in the implementation for the softmax and negative sampling cost and gradient functions. Then, fill in the implementation of the cost and gradient functions for the skip-gram model. When you are done, test your implementation by running python q3_word2vec.py.
Note: If you choose not to implement CBOW (part h), simply remove the NotImplementedError so that your tests will complete.
solution
Implement the following functions in q3_word2vec.py:
- normalizeRows
- softmaxCostAndGradient
- negSamplingCostAndGradient
- skipgram

normalizeRows
Normalize each row vector to unit length: \(\hat{v}=\dfrac{v}{|v|}\).
```python
def normalizeRows(x):
```
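A minimal sketch of the row normalization:

```python
import numpy as np

def normalizeRows(x):
    """Scale each row of x to unit L2 norm (v / |v|)."""
    norms = np.sqrt(np.sum(x ** 2, axis=1, keepdims=True))
    return x / norms
```

For example, normalizeRows(np.array([[3.0, 4.0], [1.0, 2.0]])) gives [[0.6, 0.8], [0.4472..., 0.8944...]].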
softmaxCostAndGradient
Variables: `predicted` is \(v_c\), `target` is \(o\), `outputVectors` is \(U^\top\) (one output vector per row), `gradz2` is \(\hat{y}-y\), `gradPred` is \(\dfrac{\partial J}{\partial v_c}\), and `grad` is \(\dfrac{\partial J}{\partial U}\).
Using the results of parts (a) and (b):
\(\dfrac{\partial J}{\partial v_c}=U(\hat{y}-y)\)
\(\dfrac{\partial J}{\partial U}=v_c(\hat{y}-y)^\top\)
```python
def softmaxCostAndGradient(predicted, target, outputVectors, dataset):
```
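A sketch following the formulas above, assuming `outputVectors` holds one output vector \(u_w\) per row (so it plays the role of \(U^\top\)) and reusing `softmax` from part 1:

```python
import numpy as np
from q1_softmax import softmax  # assumed helper from part 1

def softmaxCostAndGradient(predicted, target, outputVectors, dataset):
    # dataset is unused here; it is only needed for negative sampling
    vhat = predicted                        # v_c
    z = outputVectors.dot(vhat)             # scores u_w^T v_c for every w
    yhat = softmax(z)                       # predicted distribution

    cost = -np.log(yhat[target])            # cross entropy with a one-hot y

    gradz2 = yhat.copy()
    gradz2[target] -= 1.0                   # yhat - y
    gradPred = outputVectors.T.dot(gradz2)  # dJ/dv_c = U (yhat - y)
    grad = np.outer(gradz2, vhat)           # dJ/dU, one row per u_w
    return cost, gradPred, grad
```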
negSamplingCostAndGradient
Using the results of part (c):
`gradPred`: \(\dfrac{\partial J}{\partial v_c}=-u_o(1-\sigma(u_o^\top v_c))+\sum\limits_{k=1}^K u_k(1-\sigma(-u_k^\top v_c))\)
`grad` (row \(o\)): \(\dfrac{\partial J}{\partial u_o}=(\sigma(u_o^\top v_c)-1)v_c\)
`grad` (rows \(k\)): \(\dfrac{\partial J}{\partial u_k}=(1-\sigma(-u_k^\top v_c))v_c,\ k=1,2,...,K\)
The cost is \(J_{neg-sample}(o,v_c,U)=-log(\sigma(u_o^\top v_c))-\sum\limits_{k=1}^K log(\sigma(-u_k^\top v_c))\) from part (c).
```python
def negSamplingCostAndGradient(predicted, target, outputVectors, dataset,
```
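A sketch of the negative-sampling version. The default K=10 and the dataset.sampleTokenIdx() sampler are assumptions about the starter code's interface:

```python
import numpy as np
from q2_sigmoid import sigmoid  # assumed helper from part 2

def negSamplingCostAndGradient(predicted, target, outputVectors, dataset, K=10):
    cost = 0.0
    gradPred = np.zeros_like(predicted)
    grad = np.zeros_like(outputVectors)

    # Observed (positive) word o
    u_o = outputVectors[target]
    s = sigmoid(u_o.dot(predicted))
    cost -= np.log(s)
    gradPred += (s - 1.0) * u_o            # -(1 - sigma(u_o^T v_c)) u_o
    grad[target] += (s - 1.0) * predicted  # (sigma(u_o^T v_c) - 1) v_c

    # K negative samples (assumed sampler interface)
    for _ in range(K):
        k = dataset.sampleTokenIdx()
        while k == target:
            k = dataset.sampleTokenIdx()
        u_k = outputVectors[k]
        s_neg = sigmoid(-u_k.dot(predicted))
        cost -= np.log(s_neg)
        gradPred += (1.0 - s_neg) * u_k        # + (1 - sigma(-u_k^T v_c)) u_k
        grad[k] += (1.0 - s_neg) * predicted   # (1 - sigma(-u_k^T v_c)) v_c

    return cost, gradPred, grad
```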
skipgram
Using the results of part (d):
`gradOut`: \(\dfrac{\partial J_{skip-gram}(w_{t-m...t+m})}{\partial U}=\sum\limits_{-m\leq j\leq m,j\neq 0} \dfrac{\partial F(w_{t+j},v_c)}{\partial U}\)
`gradIn` (center row): \(\dfrac{\partial J_{skip-gram}(w_{t-m...t+m})}{\partial v_c}=\sum\limits_{-m\leq j\leq m,j\neq 0} \dfrac{\partial F(w_{t+j},v_c)}{\partial v_c}\)
`gradIn` (all other rows): \(\dfrac{\partial J_{skip-gram}(w_{t-m...t+m})}{\partial v_w}=0,\ \forall w \neq c\)
```python
def skipgram(currentWord, C, contextWords, tokens, inputVectors, outputVectors,
```
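A sketch of the skip-gram cost. The trailing `dataset` and `word2vecCostAndGradient` parameters (cut off in the signature above) are assumptions, and `softmaxCostAndGradient` refers to the sketch earlier in this part; the key point is that only the center word's row of `gradIn` is updated:

```python
import numpy as np

def skipgram(currentWord, C, contextWords, tokens, inputVectors, outputVectors,
             dataset, word2vecCostAndGradient=softmaxCostAndGradient):
    # C (the context size) is not needed explicitly; contextWords already
    # contains the words inside the window.
    cost = 0.0
    gradIn = np.zeros_like(inputVectors)
    gradOut = np.zeros_like(outputVectors)

    c_idx = tokens[currentWord]      # index of the center word
    v_c = inputVectors[c_idx]        # its input vector

    for contextWord in contextWords:
        target = tokens[contextWord]
        c_cost, c_gradPred, c_gradOut = word2vecCostAndGradient(
            v_c, target, outputVectors, dataset)
        cost += c_cost
        gradIn[c_idx] += c_gradPred  # only the center word's row gets a gradient
        gradOut += c_gradOut

    return cost, gradIn, gradOut
```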
Running python q3_word2vec.py should produce output like the following if everything is correct:
```
Testing normalizeRows...
```
(f) Stochastic Gradient Descent
Complete the implementation for your SGD optimizer in q3_sgd.py. Test your implementation by running python q3_sgd.py.
solution
`step` is the learning rate and is already provided by the starter code; the update simply subtracts `step` times the gradient from the current parameters.
```python
def sgd(f, x0, step, iterations, postprocessing=None, useSaved=False,
```
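A simplified sketch of the optimizer loop. The PRINT_EVERY argument is an assumption about the cut-off part of the signature, and the checkpointing (useSaved) and learning-rate annealing of the real starter code are omitted:

```python
def sgd(f, x0, step, iterations, postprocessing=None, useSaved=False,
        PRINT_EVERY=10):
    """Minimal SGD loop: f(x) must return (cost, gradient)."""
    x = x0
    if not postprocessing:
        postprocessing = lambda y: y

    for it in range(1, iterations + 1):
        cost, grad = f(x)          # cost and gradient at the current point
        x = x - step * grad        # the actual SGD step
        x = postprocessing(x)      # e.g. re-normalize word-vector rows

        if it % PRINT_EVERY == 0:
            print("iter %d: %f" % (it, cost))

    return x
```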
(g) Training word vectors
Show time! Now we are going to load some real data and train word vectors with everything you just implemented! We are going to use the Stanford Sentiment Treebank (SST) dataset to train word vectors, and later apply them to a simple sentiment analysis task. You will need to fetch the datasets first. To do this, run sh get_datasets.sh. There is no additional code to write for this part; just run python q3_run.py.
Note: The training process may take a long time depending on the efficiency of your implementation (an efficient implementation takes approximately an hour). Plan accordingly!
When the script finishes, a visualization for your word vectors will appear. It will also be saved as q3_word_vectors.png in your project directory. Include the plot in your homework write up. Briefly explain in at most three sentences what you see in the plot.
solution
Run sh get_datasets.sh to download the Stanford Sentiment Treebank (SST) dataset, then run python q3_run.py to start training. The default is 40000 iterations, and the current cost is printed every 10 iterations; the cost should decrease as training progresses.
```
iter 39950: 9.435311
```
For reference, training took about 67 minutes on an i7-8750H processor.
After training completes, the plot q3_word_vectors.png is produced. In q3_run.py, the starter code picks a handful of words to visualize; after reducing the word vectors to two dimensions with SVD, the relative distances between these words can be inspected on a 2-D plane.
(h) Implementing CBOW
Implement the CBOW model in q3_word2vec.py. Note: This part is optional but the gradient derivations for CBOW in part (d) are not!
solution
Again using the results of part (d):
`gradOut`: \(\dfrac{\partial J_{CBOW}(w_{t-m...t+m})}{\partial U}=\dfrac{\partial F(w_t,\hat{v})}{\partial U}\)
`gradIn` (context-word rows): \(\dfrac{\partial J_{CBOW}(w_{t-m...t+m})}{\partial v_{w_{t+j}}}=\dfrac{\partial F(w_t,\hat{v})}{\partial \hat{v}},\ \forall j \in \{-m,...,-1,1,...,m\}\)
`gradIn` (all other rows): \(\dfrac{\partial J_{CBOW}(w_{t-m...t+m})}{\partial v_{w_{t+j}}}=0,\ \forall j \notin \{-m,...,-1,1,...,m\}\)
```python
def cbow(currentWord, C, contextWords, tokens, inputVectors, outputVectors,
```
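A sketch mirroring the skipgram function above: the context vectors are summed into \(\hat{v}\), and every context word receives the same input-vector gradient. The trailing parameters are again assumptions about the cut-off signature:

```python
import numpy as np

def cbow(currentWord, C, contextWords, tokens, inputVectors, outputVectors,
         dataset, word2vecCostAndGradient=softmaxCostAndGradient):
    gradIn = np.zeros_like(inputVectors)

    # v_hat is the sum of the context words' input vectors
    context_idx = [tokens[w] for w in contextWords]
    v_hat = np.sum(inputVectors[context_idx], axis=0)

    target = tokens[currentWord]
    cost, gradPred, gradOut = word2vecCostAndGradient(
        v_hat, target, outputVectors, dataset)

    # Every context word shares the same gradient dF/dv_hat
    for idx in context_idx:
        gradIn[idx] += gradPred

    return cost, gradIn, gradOut
```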
After completing parts (e) and (h), running python q3_word2vec.py should produce output like:
```
Testing normalizeRows...
```
4. Sentiment Analysis
Now, with the word vectors you trained, we are going to perform a simple sentiment analysis. For each sentence in the Stanford Sentiment Treebank dataset, we are going to use the average of all the word vectors in that sentence as its feature, and try to predict the sentiment level of the said sentence. The sentiment level of the phrases are represented as real values in the original dataset, here we’ll just use five classes:
“very negative (−−)”, “negative (−)”, “neutral”, “positive (+)”, “very positive (++)”
which are represented by 0 to 4 in the code, respectively. For this part, you will learn to train a softmax classifier, and perform train/dev validation to improve generalization.
(a) Representing a sentence as a feature vector
Implement a sentence featurizer. A simple way of representing a sentence is taking the average of the vectors of the words in the sentence. Fill in the implementation in q4_sentiment.py.
solution
Average the word vectors of all words in the sentence and use the resulting vector to represent the whole sentence.
Inputs:
- `tokens`: a dictionary mapping each word to its index
- `wordVectors`: the word vectors
- `sentence`: a sentence, given as a list of words
Output:
- `sentVector`: the feature vector of the sentence
```python
def getSentenceFeatures(tokens, wordVectors, sentence):
```
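A minimal sketch of the featurizer:

```python
import numpy as np

def getSentenceFeatures(tokens, wordVectors, sentence):
    """Average the word vectors of the words in `sentence`."""
    sentVector = np.zeros((wordVectors.shape[1],))
    for word in sentence:
        sentVector += wordVectors[tokens[word]]
    sentVector /= len(sentence)
    return sentVector
```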
(b) Regularization
Explain in at most two sentences why we want to introduce regularization when doing classification (in fact, most machine learning tasks).
solution
Regularization penalizes overly large weights, which helps prevent overfitting and makes the model generalize better to unseen data.
(c) Hyperparameter search
Fill in the hyperparameter selection code in q4_sentiment.py to search for the "optimal" regularization parameter. You need to implement both getRegularizationValues and chooseBestModel. Attach your code for chooseBestModel to your written write-up. You should be able to attain at least 36.5% accuracy on the dev and test sets using the pretrained vectors in part (d).
solution
Following what others have done, generate an increasing sequence of candidate regularization values on a logarithmic scale, and pick the model with the best dev accuracy.
```python
def getRegularizationValues():
```
```python
def chooseBestModel(results):
```
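A sketch of both helpers. The structure of `results` (a list of dicts with a "dev" accuracy entry, matching the train/dev/test numbers reported below) is an assumption about the starter code:

```python
import numpy as np

def getRegularizationValues():
    # Candidate regularization strengths, increasing on a log scale
    values = np.logspace(-5, 2, num=20)
    return sorted(values)

def chooseBestModel(results):
    # Pick the entry with the highest dev-set accuracy
    # (each entry is assumed to be a dict containing a "dev" field)
    return max(results, key=lambda r: r["dev"])
```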
(d) Model comparison
Run python q4_sentiment.py --yourvectors to train a model using your word vectors from q3. Now, run python q4_sentiment.py --pretrained to train a model using pretrained GloVe vectors (on Wikipedia data). Compare and report the best train, dev, and test accuracies. Why do you think the pretrained vectors did better? Be specific and justify with 3 distinct reasons.
- Results with --yourvectors (accuracy in %):
Reg | Train | Dev | Test |
---|---|---|---|
1.00E-05 | 31.004 | 32.516 | 30.452 |
1.15E-04 | 31.063 | 32.516 | 30.362 |
1.12E-03 | 31.133 | 32.516 | 30.362 |
1.10E-02 | 30.922 | 32.334 | 29.955 |
1.07E-01 | 30.290 | 31.789 | 29.864 |
1.05E+00 | 28.816 | 29.609 | 27.059 |
```
Best regularization value: 2.26E-05
```
- Results with pretrained GloVe vectors (on Wikipedia data, accuracy in %):
Reg | Train | Dev | Test |
---|---|---|---|
1.00E-05 | 39.923 | 36.421 | 37.059 |
1.15E-04 | 39.958 | 36.512 | 37.014 |
1.12E-03 | 39.899 | 36.331 | 37.014 |
1.10E-02 | 39.923 | 36.331 | 37.195 |
1.07E-01 | 39.782 | 36.240 | 37.149 |
1.05E+00 | 39.478 | 36.512 | 37.330 |
```
Best regularization value: 1.20E+01
```
Comparing the two, the model trained on the pretrained vectors clearly performs better. Possible reasons:
- The pretrained vectors have a higher dimensionality and may therefore carry more information
- GloVe was trained on a much larger corpus
- Differences between the word2vec and GloVe training objectives (see the linked reference)
(e) Train vs. dev accuracy as the regularization term varies
Plot the classification accuracy on the train and dev set with respect to the regularization value for the pretrained GloVe vectors, using a logarithmic scale on the x-axis. This should have been done automatically. Include q4_reg acc.png in your homework write up. Briefly explain in at most three sentences what you see in the plot.
solution
(f) Confusion Matrix
We will now analyze errors that the model makes (with pretrained GloVe vectors). When you ran python q4_sentiment.py --pretrained, two files should have been generated. Take a look at q4_dev_conf.png and include it in your homework writeup. Interpret the confusion matrix in at most three sentences.
solution
The confusion matrix compares the predicted labels against the ground truth; the cells on the diagonal from top-left to bottom-right count the correctly classified examples for each class, while the off-diagonal cells count the misclassifications.
(g) Why did the classifier make mistakes?
Next, take a look at q4_dev_pred.txt. Choose 3 examples where your classifier made errors and briefly explain the error and what features the classifier would need to classify the example correctly (1 sentence per example). Try to pick examples with different reasons.
solution
Taking the absolute difference between each predicted label and the ground-truth label gives the following distribution:
Distance | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
Count | 409 | 444 | 188 | 59 | 1 |
Most predictions are within 1 level of the true label; if we relaxed the criterion so that a prediction within 1 level counts as correct, accuracy would rise from about 37% to 77%, so overall the predictions are reasonably close.
Answer | Predicted | Sentence | Possible Issue |
---|---|---|---|
0 | 4 | a lackluster , unessential sequel to the classic disney adaptation of j.m. barrie 's peter pan . | The only example off by 4 levels: "lackluster" and "unessential" are negative, "classic" leans positive, and "j.m. barrie 's peter pan" is mostly noise while the rest is fairly neutral; removing the proper nouns might improve the prediction |
4 | 1 | the draw -lrb- for `` big bad love '' -rrb- is a solid performance by arliss howard . | Too many meaningless tokens (-lrb-, -rrb-, quotation marks) dilute the averaged feature vector |
4 | 1 | it is amusing , and that 's all it needs to be . | "amusing" leans positive, but most of the other words are filler that pull the average toward neutral |