Stanford cs224d: Deep Learning for Natural Language Processing, lecture notes


3 word2vec (40 points + 5 bonus points)

Published 2025-02-18 23:44:03

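(part a) (3 points)
Assume you are given a predicted word vector $v_c$ corresponding to the center word $c$ for skip-gram, and that word prediction is made with the softmax function found in word2vec models:

$$\hat{y}_o = p(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w=1}^{W} \exp(u_w^\top v_c)}$$

where $w$ denotes the $w$-th word and $u_w$ ($w = 1, \dots, W$) are the "output" word vectors for all words in the vocabulary. Assume the cross-entropy cost is applied to this prediction and word $o$ is the expected word (the $o$-th element of the one-hot label vector is 1). Derive the gradient with respect to the predicted word vector $v_c$.
Hint: the notation from question 2 will help here. For instance, let $\hat{y}$ be the vector of softmax predictions for every word and $y$ the expected (one-hot) word vector; the loss function can then be written as:

$$J_{\text{softmax-CE}}(o, v_c, U) = CE(y, \hat{y})$$

where $U = [u_1, u_2, \dots, u_W]$ is the matrix of all the output vectors. Make sure you state the orientation of your vectors and matrices.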

Aside: yes, I have run out of things to write in these asides, so let me just thank the Party and the motherland.

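Solution: let $\hat{y}$ be the column vector of softmax predictions over the words and $y$ the one-hot label, also a column vector. Taking the output vectors $u_w$ as the columns of $U$, we have:

$$\frac{\partial J}{\partial v_c} = U(\hat{y} - y)$$

or, equivalently:

$$\frac{\partial J}{\partial v_c} = -u_o + \sum_{w=1}^{W} \hat{y}_w u_w$$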
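
As a sanity check on the gradient with respect to the predicted vector, the sketch below compares the closed form (the prediction-weighted sum of output vectors minus the target's output vector) with centered finite differences. It is dependency-free Python rather than the assignment's NumPy code, and the helper names and toy vectors are made up for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def word_scores(v_c, output_vectors):
    """Dot product of every output vector u_w with v_c."""
    return [sum(a * b for a, b in zip(u, v_c)) for u in output_vectors]

def softmax_ce_cost(v_c, output_vectors, o):
    """Cross-entropy cost -log(y_hat[o])."""
    return -math.log(softmax(word_scores(v_c, output_vectors))[o])

def softmax_ce_grad_vc(v_c, output_vectors, o):
    """Analytic gradient: sum_w y_hat[w] * u_w - u_o."""
    y_hat = softmax(word_scores(v_c, output_vectors))
    grad = [0.0] * len(v_c)
    for w, u in enumerate(output_vectors):
        coeff = y_hat[w] - (1.0 if w == o else 0.0)
        for i, u_i in enumerate(u):
            grad[i] += coeff * u_i
    return grad

def numeric_grad_vc(v_c, output_vectors, o, h=1e-5):
    """Centered finite differences, one coordinate of v_c at a time."""
    grad = []
    for i in range(len(v_c)):
        plus = list(v_c)
        plus[i] += h
        minus = list(v_c)
        minus[i] -= h
        diff = softmax_ce_cost(plus, output_vectors, o) - softmax_ce_cost(minus, output_vectors, o)
        grad.append(diff / (2 * h))
    return grad

# Toy data (hypothetical): 4 words, 3-dimensional vectors, target word o = 2.
v_c = [0.3, -0.2, 0.5]
output_vectors = [[0.1, 0.4, -0.3], [-0.5, 0.2, 0.1], [0.7, -0.1, 0.2], [0.0, 0.3, -0.6]]
analytic = softmax_ce_grad_vc(v_c, output_vectors, o=2)
numeric = numeric_grad_vc(v_c, output_vectors, o=2)
max_abs_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
```

If the derivation is right, `max_abs_diff` should be tiny compared with the gradient entries themselves.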

(part b) (3 points)
As in the previous part, derive the gradients with respect to the "output" word vectors $u_w$ (including $u_o$).

Aside: I had better just keep quietly laying bricks here at home.

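Solution (with the $u_w$ as the columns of $U$):

$$\frac{\partial J}{\partial U} = v_c(\hat{y} - y)^\top$$

or, equivalently:

$$\frac{\partial J}{\partial u_w} = \begin{cases} (\hat{y}_w - 1)\, v_c, & w = o \\ \hat{y}_w\, v_c, & \text{otherwise} \end{cases}$$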
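
The per-word form of this gradient can be checked entry by entry. The following sketch perturbs each coordinate of each output vector and compares the result with the coefficient (prediction minus label) times the predicted vector. It uses plain Python lists, and all names and numbers are illustrative assumptions rather than part of the assignment code.

```python
import math

def ce_cost(v_c, output_vectors, o):
    """Softmax cross-entropy cost, written as log-sum-exp minus the target score."""
    scores = [sum(a * b for a, b in zip(u, v_c)) for u in output_vectors]
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_norm - scores[o]

def grad_output_vectors(v_c, output_vectors, o):
    """Analytic gradient, one entry per word: (y_hat[w] - y[w]) * v_c."""
    scores = [sum(a * b for a, b in zip(u, v_c)) for u in output_vectors]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    grads = []
    for w, e in enumerate(exps):
        coeff = e / total - (1.0 if w == o else 0.0)
        grads.append([coeff * x for x in v_c])
    return grads

# Toy data (hypothetical): 3 words, 3-dimensional vectors, target word o = 1.
v_c = [0.3, -0.2, 0.5]
output_vectors = [[0.1, 0.4, -0.3], [-0.5, 0.2, 0.1], [0.7, -0.1, 0.2]]
o, h = 1, 1e-5
analytic = grad_output_vectors(v_c, output_vectors, o)

# Perturb every entry of every output vector and compare against the formula.
worst = 0.0
for w in range(len(output_vectors)):
    for i in range(len(v_c)):
        plus = [list(u) for u in output_vectors]
        plus[w][i] += h
        minus = [list(u) for u in output_vectors]
        minus[w][i] -= h
        numeric = (ce_cost(v_c, plus, o) - ce_cost(v_c, minus, o)) / (2 * h)
        worst = max(worst, abs(numeric - analytic[w][i]))
```

Note the sign pattern: the target word's vector is pulled along $v_c$ (negative coefficient), while every other word's vector is pushed away in proportion to its predicted probability.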

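(part c) (6 points)
Repeat parts (a) and (b), now assuming that the negative-sampling loss is used for the predicted vector $v_c$ and that the expected output word is $o$. Assume $K$ negative samples (words) are drawn and, to simplify notation, call them $1, \dots, K$, with $o \notin \{1, \dots, K\}$. For a given word $o$, denote its output vector by $u_o$. The negative-sampling loss function is then:

$$J_{\text{neg-sample}}(o, v_c, U) = -\log(\sigma(u_o^\top v_c)) - \sum_{k=1}^{K} \log(\sigma(-u_k^\top v_c))$$

where $\sigma(\cdot)$ is the sigmoid function.

Once you have done this, briefly explain why this loss function is much more efficient to compute than the softmax-CE loss (you may give a speed-up ratio, i.e. the runtime of the softmax-CE loss divided by the runtime of the negative-sampling loss).

Note: because we minimize the objective rather than maximize it, the loss function here is the negative of the one Mikolov et al. gave in their original paper.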

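Aside: it suddenly occurs to me that as a kid I was so anxious about whether I should go to Tsinghua or Peking University when I grew up; later I found I need not have worried. I suspect that even if sheer dumb luck had got me into T University or P University, I would never have managed to graduate.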

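Solution:

$$\frac{\partial J}{\partial v_c} = (\sigma(u_o^\top v_c) - 1)\, u_o - \sum_{k=1}^{K} (\sigma(-u_k^\top v_c) - 1)\, u_k$$

$$\frac{\partial J}{\partial u_o} = (\sigma(u_o^\top v_c) - 1)\, v_c$$

$$\frac{\partial J}{\partial u_k} = -(\sigma(-u_k^\top v_c) - 1)\, v_c, \quad k = 1, \dots, K$$

Evaluating the softmax-CE loss and its gradients touches all $W$ output vectors, whereas the negative-sampling loss touches only $K + 1$ of them, so the speed-up ratio is on the order of $W / K$.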
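
To make the negative-sampling loss concrete, here is a minimal dependency-free sketch that evaluates it for one target vector and $K = 2$ negative samples, then checks the gradient with respect to $v_c$ against centered finite differences. The function names and toy vectors are assumptions for illustration, not the assignment's code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def neg_sampling_cost(v_c, u_o, negatives):
    """J = -log(sigma(u_o . v_c)) - sum_k log(sigma(-u_k . v_c))."""
    cost = -math.log(sigmoid(dot(u_o, v_c)))
    for u_k in negatives:
        cost -= math.log(sigmoid(-dot(u_k, v_c)))
    return cost

def neg_sampling_grad_vc(v_c, u_o, negatives):
    """Analytic gradient of the cost with respect to v_c."""
    coeff_o = sigmoid(dot(u_o, v_c)) - 1.0
    grad = [coeff_o * x for x in u_o]
    for u_k in negatives:
        coeff_k = 1.0 - sigmoid(-dot(u_k, v_c))
        grad = [g + coeff_k * x for g, x in zip(grad, u_k)]
    return grad

# Toy data (hypothetical): one target vector and K = 2 negative samples.
v_c = [0.3, -0.2, 0.5]
u_o = [0.7, -0.1, 0.2]
negatives = [[0.1, 0.4, -0.3], [-0.5, 0.2, 0.1]]

analytic = neg_sampling_grad_vc(v_c, u_o, negatives)
h = 1e-5
numeric = []
for i in range(len(v_c)):
    plus = list(v_c)
    plus[i] += h
    minus = list(v_c)
    minus[i] -= h
    diff = neg_sampling_cost(plus, u_o, negatives) - neg_sampling_cost(minus, u_o, negatives)
    numeric.append(diff / (2 * h))
max_abs_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
```

Only the target vector and the sampled vectors appear in the computation, which is where the saving over the full softmax comes from.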

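(part d) (8 points)
Derive the gradients for all of the word vectors for skip-gram and CBOW, given the previous parts and a set of context words $[\text{word}_{c-m}, \dots, \text{word}_{c-1}, \text{word}_c, \text{word}_{c+1}, \dots, \text{word}_{c+m}]$, where $m$ is the context size. Denote the "input" and "output" word vectors of $\text{word}_k$ by $v_k$ and $u_k$ respectively.
Hint: feel free to use $F(o, v_c)$ (where $o$ is the expected word) as a placeholder for the $J_{\text{softmax-CE}}(o, v_c, \dots)$ or $J_{\text{neg-sample}}(o, v_c, \dots)$ cost functions in this part; you will see that this is a useful abstraction in the coding part. That is, your solution may contain terms of the form $\frac{\partial F(o, v_c)}{\partial \dots}$.
Recall that for skip-gram, the cost for a context centered around $c$ is:

$$J_{\text{skip-gram}}(\text{word}_{c-m \dots c+m}) = \sum_{-m \le j \le m,\ j \ne 0} F(w_{c+j}, v_c)$$

where $w_{c+j}$ refers to the word at the $j$-th index from the center.
CBOW is slightly different. Instead of using $v_c$ as the predicted vector, we use $\hat{v}$, defined below. For (a simpler variant of) CBOW, we sum up the input word vectors in the context:

$$\hat{v} = \sum_{-m \le j \le m,\ j \ne 0} v_{c+j}$$

The CBOW cost is then defined as:

$$J_{\text{CBOW}}(\text{word}_{c-m \dots c+m}) = F(w_c, \hat{v})$$

Note: to be consistent with the $\hat{v}$ notation, such as in the code portion, for skip-gram we set $\hat{v} = v_c$.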

Aside: to be honest, this part really was copied straight from the course slides.

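Solution: to keep the notation clear, denote the collection of all output vectors of all words in the vocabulary by $U$. Given a cost function $F$, we already know how to obtain the following derivatives:

$$\frac{\partial F(w_i, \hat{v})}{\partial U} \quad \text{and} \quad \frac{\partial F(w_i, \hat{v})}{\partial \hat{v}}$$

For skip-gram, the gradients of the cost for one context window are:

$$\frac{\partial J_{\text{skip-gram}}(\text{word}_{c-m \dots c+m})}{\partial U} = \sum_{-m \le j \le m,\ j \ne 0} \frac{\partial F(w_{c+j}, v_c)}{\partial U}$$

$$\frac{\partial J_{\text{skip-gram}}(\text{word}_{c-m \dots c+m})}{\partial v_c} = \sum_{-m \le j \le m,\ j \ne 0} \frac{\partial F(w_{c+j}, v_c)}{\partial v_c}$$

$$\frac{\partial J_{\text{skip-gram}}(\text{word}_{c-m \dots c+m})}{\partial v_j} = 0 \quad \text{for all } j \ne c$$

Similarly, for CBOW:

$$\frac{\partial J_{\text{CBOW}}(\text{word}_{c-m \dots c+m})}{\partial U} = \frac{\partial F(w_c, \hat{v})}{\partial U}$$

$$\frac{\partial J_{\text{CBOW}}(\text{word}_{c-m \dots c+m})}{\partial v_j} = \frac{\partial F(w_c, \hat{v})}{\partial \hat{v}} \quad \text{for all } j \in \{c-m, \dots, c-1, c+1, \dots, c+m\}$$

$$\frac{\partial J_{\text{CBOW}}(\text{word}_{c-m \dots c+m})}{\partial v_j} = 0 \quad \text{otherwise}$$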
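
The difference between the two models is mostly a question of where the gradient with respect to the predicted vector ends up. The sketch below illustrates only that routing step, treating the gradients returned by the placeholder cost $F$ as given numbers. The function names and toy values are hypothetical.

```python
def skipgram_input_grad(center_id, grads_wrt_vc, vocab_size):
    """Skip-gram: only the center word's input vector gets a gradient,
    namely the sum of dF/dv_c over the context words."""
    dim = len(grads_wrt_vc[0])
    grad_in = [[0.0] * dim for _ in range(vocab_size)]
    for g in grads_wrt_vc:
        for i in range(dim):
            grad_in[center_id][i] += g[i]
    return grad_in

def cbow_input_grad(context_ids, grad_wrt_vhat, vocab_size):
    """CBOW: every context position receives dF/dv_hat, so a word that
    occupies two positions receives it twice."""
    dim = len(grad_wrt_vhat)
    grad_in = [[0.0] * dim for _ in range(vocab_size)]
    for idx in context_ids:
        for i in range(dim):
            grad_in[idx][i] += grad_wrt_vhat[i]
    return grad_in

# Toy numbers: a vocabulary of 4 words and 2-dimensional vectors.
sg = skipgram_input_grad(center_id=2,
                         grads_wrt_vc=[[0.1, -0.2], [0.3, 0.4]],
                         vocab_size=4)
cb = cbow_input_grad(context_ids=[0, 1, 0],
                     grad_wrt_vhat=[0.5, -1.0],
                     vocab_size=4)
```

In the skip-gram result only row 2 is nonzero; in the CBOW result row 0 holds twice the incoming gradient because word 0 appears at two context positions.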

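(part e) (12 points)
In this part you will implement the word2vec models and train your own word vectors with stochastic gradient descent (SGD). First, write a helper function in q3_word2vec.py that normalizes the rows of a matrix. In the same file, fill in the implementation of the softmax and negative-sampling cost and gradient functions. Then fill in the cost and gradient functions for the skip-gram model. When you are done, test your implementation by running: python q3_word2vec.py
Note: if you choose not to implement CBOW (part h), simply remove the NotImplementedError so that your tests can complete.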

Aside: brace yourself, a flood of code lies ahead!

import numpy as np
import random

from q1_softmax import softmax
from q2_gradcheck import gradcheck_naive
from q2_sigmoid import sigmoid, sigmoid_grad

def normalizeRows(x):
    """ 
        Row normalization: scales each row of x to unit length, in place 
    """

    N = x.shape[0]
    x /= np.sqrt(np.sum(x**2, axis=1)).reshape((N,1)) + 1e-30

    return x

def test_normalize_rows():
    print("Testing normalizeRows...")
    x = normalizeRows(np.array([[3.0,4.0],[1, 2]])) 
    # the result should be [[0.6, 0.8], [0.4472, 0.8944]]
    print(x)
    assert (np.amax(np.fabs(x - np.array([[0.6,0.8],[0.4472136,0.89442719]]))) <= 1e-6)
    print("")

def softmaxCostAndGradient(predicted, target, outputVectors, dataset):
    """ 
        Softmax cost function for word2vec 
    """

    # Inputs:
    # - predicted: numpy array, the predicted word vector
    # - target: index of the target word
    # - outputVectors: "output" vectors (as rows) for all tokens
    # - dataset: needed for negative sampling; unused here

    # Outputs:
    # - cost: cross-entropy cost of the softmax prediction
    # - gradPred: the gradient with respect to the predicted word
    #        vector
    # - grad: the gradient with respect to all the other word
    #        vectors

    probabilities = softmax(predicted.dot(outputVectors.T))
    cost = -np.log(probabilities[target])
    delta = probabilities
    delta[target] -= 1
    N = delta.shape[0]
    D = predicted.shape[0]
    grad = delta.reshape((N,1)) * predicted.reshape((1,D))
    gradPred = (delta.reshape((1,N)).dot(outputVectors)).flatten()

    return cost, gradPred, grad

def negSamplingCostAndGradient(predicted, target, outputVectors, dataset, 
    K=10):
    """ 
        Negative-sampling cost function and gradients for word2vec
    """

    grad = np.zeros(outputVectors.shape)

    # Index 0 is the target word; indices 1..K are negative samples,
    # redrawn whenever the target itself is sampled
    indices = [target]
    for k in range(K):
        newidx = dataset.sampleTokenIdx()
        while newidx == target:
            newidx = dataset.sampleTokenIdx()
        indices += [newidx]

    # Label +1 for the target, -1 for every negative sample
    labels = np.array([1] + [-1 for k in range(K)])
    vecs = outputVectors[indices,:]

    t = sigmoid(vecs.dot(predicted) * labels)
    cost = -np.sum(np.log(t))

    delta = labels * (t - 1)
    gradPred = delta.reshape((1,K+1)).dot(vecs).flatten()
    gradtemp = delta.reshape((K+1,1)).dot(predicted.reshape(
        (1,predicted.shape[0])))
    # Accumulate row by row so that a word sampled more than once
    # receives every one of its contributions
    for k in range(K+1):
        grad[indices[k]] += gradtemp[k,:]

    return cost, gradPred, grad


def skipgram(currentWord, C, contextWords, tokens, inputVectors, outputVectors, 
    dataset, word2vecCostAndGradient = softmaxCostAndGradient):
    """ Skip-gram model in word2vec """

    # Implementation of the skip-gram model

    # Inputs:
    # - currentWord: string of the current center word
    # - C: context size (window size)
    # - contextWords: list of no more than 2*C context words
    # - tokens: dictionary mapping words to their indices in the word vectors
    # - inputVectors: "input" word vectors (as rows) for all tokens           
    # - outputVectors: "output" word vectors (as rows) for all tokens         
    # - word2vecCostAndGradient: the cost and gradient function for a prediction vector given the target word vectors, could be one of the two cost functions you implemented above

    # Outputs:
    # - cost: the cost function value for the skip-gram model
    # - grad: the gradient with respect to the word vectors


    currentI = tokens[currentWord]
    predicted = inputVectors[currentI, :]

    cost = 0.0
    gradIn = np.zeros(inputVectors.shape)
    gradOut = np.zeros(outputVectors.shape)
    for cwd in contextWords:
        idx = tokens[cwd]
        cc, gp, gg = word2vecCostAndGradient(predicted, idx, outputVectors, dataset)
        cost += cc
        gradOut += gg
        gradIn[currentI, :] += gp

    return cost, gradIn, gradOut


def word2vec_sgd_wrapper(word2vecModel, tokens, wordVectors, dataset, C, word2vecCostAndGradient = softmaxCostAndGradient):
    batchsize = 50
    cost = 0.0
    grad = np.zeros(wordVectors.shape)
    N = wordVectors.shape[0]
    # The first half of the rows are input vectors, the second half output
    # vectors; // keeps the slice index an integer
    inputVectors = wordVectors[:N//2,:]
    outputVectors = wordVectors[N//2:,:]
    for i in range(batchsize):
        C1 = random.randint(1,C)
        centerword, context = dataset.getRandomContext(C1)

        # Both models use the same normalization constant
        denom = 1

        c, gin, gout = word2vecModel(centerword, C1, context, tokens, inputVectors, outputVectors, dataset, word2vecCostAndGradient)
        cost += c / batchsize / denom
        grad[:N//2, :] += gin / batchsize / denom
        grad[N//2:, :] += gout / batchsize / denom

    return cost, grad

def test_word2vec():
    # Interface to the dataset for negative sampling
    dataset = type('dummy', (), {})()
    def dummySampleTokenIdx():
        return random.randint(0, 4)

    def getRandomContext(C):
        tokens = ["a", "b", "c", "d", "e"]
        return tokens[random.randint(0,4)], [tokens[random.randint(0,4)] \
           for i in range(2*C)]
    dataset.sampleTokenIdx = dummySampleTokenIdx
    dataset.getRandomContext = getRandomContext

    random.seed(31415)
    np.random.seed(9265)
    dummy_vectors = normalizeRows(np.random.randn(10,3))
    dummy_tokens = dict([("a",0), ("b",1), ("c",2),("d",3),("e",4)])
    print("==== Gradient check for skip-gram ====")
    gradcheck_naive(lambda vec: word2vec_sgd_wrapper(skipgram, dummy_tokens, vec, dataset, 5), dummy_vectors)
    gradcheck_naive(lambda vec: word2vec_sgd_wrapper(skipgram, dummy_tokens, vec, dataset, 5, negSamplingCostAndGradient), dummy_vectors)
    print("\n==== Gradient check for CBOW      ====")
    gradcheck_naive(lambda vec: word2vec_sgd_wrapper(cbow, dummy_tokens, vec, dataset, 5), dummy_vectors)
    gradcheck_naive(lambda vec: word2vec_sgd_wrapper(cbow, dummy_tokens, vec, dataset, 5, negSamplingCostAndGradient), dummy_vectors)

    print("\n=== Results ===")
    print(skipgram("c", 3, ["a", "b", "e", "d", "b", "c"], dummy_tokens, dummy_vectors[:5,:], dummy_vectors[5:,:], dataset))
    print(skipgram("c", 1, ["a", "b"], dummy_tokens, dummy_vectors[:5,:], dummy_vectors[5:,:], dataset, negSamplingCostAndGradient))
    print(cbow("a", 2, ["a", "b", "c", "a"], dummy_tokens, dummy_vectors[:5,:], dummy_vectors[5:,:], dataset))
    print(cbow("a", 2, ["a", "b", "a", "c"], dummy_tokens, dummy_vectors[:5,:], dummy_vectors[5:,:], dataset, negSamplingCostAndGradient))

if __name__ == "__main__":
    test_normalize_rows()
    test_word2vec()
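
The listing above relies on gradcheck_naive from q2_gradcheck, which is not shown here. As a rough idea of what such a checker does, the following sketch compares an analytic gradient with centered finite differences. It is a hypothetical stand-in written with plain lists, not the assignment's implementation. Because the function under test may sample random contexts (as word2vec_sgd_wrapper does), the sketch restores the random state before every evaluation so that all calls see the same samples.

```python
import random

def gradcheck_sketch(f, x, h=1e-4, tol=1e-5):
    """Return True if f's analytic gradient matches finite differences.

    f takes a list of floats and returns (cost, gradient list).
    """
    state = random.getstate()
    random.setstate(state)
    _, analytic = f(x)
    for i in range(len(x)):
        original = x[i]
        x[i] = original + h
        random.setstate(state)  # same random draws for every evaluation
        cost_plus, _ = f(x)
        x[i] = original - h
        random.setstate(state)
        cost_minus, _ = f(x)
        x[i] = original  # restore the coordinate before moving on
        numeric = (cost_plus - cost_minus) / (2 * h)
        scale = max(1.0, abs(numeric), abs(analytic[i]))
        if abs(numeric - analytic[i]) / scale > tol:
            return False
    return True

def noisy_quadratic(x):
    """Quadratic whose scale is random, standing in for random context sampling."""
    scale = random.choice([1.0, 2.0, 3.0])
    return scale * sum(v * v for v in x), [2 * scale * v for v in x]

def wrong_gradient(x):
    """Deliberately incorrect gradient, to show the check can fail."""
    return sum(v * v for v in x), [3 * v for v in x]

random.seed(31415)
passes = gradcheck_sketch(noisy_quadratic, [1.5, -0.5, 2.0])
fails = gradcheck_sketch(wrong_gradient, [1.5, -0.5, 2.0])
```

Without the calls to `random.setstate`, the two cost evaluations for a coordinate could use different random scales and the comparison would be meaningless.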

(part f) (4 points)
Complete the implementation of the stochastic gradient descent optimizer in q3_sgd.py, and test your implementation by running that file.

Aside: the thought that this article might be read by countless experts who could crush me on IQ alone makes my face burn.

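# Implementation of stochastic gradient descent

# Save the parameters learned so far every 1000 SGD iterations
SAVE_PARAMS_EVERY = 1000

import glob
import os.path as op
import random
import sys

try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3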

def load_saved_params():
    """
        Load previously saved parameters so training need not restart from scratch
    """
    st = 0
    for f in glob.glob("saved_params_*.npy"):
        iter = int(op.splitext(op.basename(f))[0].split("_")[2])
        if (iter > st):
            st = iter

    if st > 0:
        # pickle files must be opened in binary mode
        with open("saved_params_%d.npy" % st, "rb") as f:
            params = pickle.load(f)
            state = pickle.load(f)
        return st, params, state
    else:
        return st, None, None

def save_params(iter, params):
    with open("saved_params_%d.npy" % iter, "wb") as f:
        pickle.dump(params, f)
        pickle.dump(random.getstate(), f)

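def sgd(f, x0, step, iterations, postprocessing = None, useSaved = False, PRINT_EVERY=10, ANNEAL_EVERY = 20000):
    """ Stochastic gradient descent """
    ###########################################################
    # Inputs
    #   - f: the function to optimize; returns (cost, gradient)
    #   - x0: the initial guess for SGD
    #   - step: the step size for SGD
    #   - iterations: total number of iterations to run
    #   - postprocessing: postprocessing of the parameters (e.g. word2vec
    #       needs the word vectors normalized)
    #   - useSaved: resume from the most recent saved parameters, if any
    #   - PRINT_EVERY: print the status every this many iterations
    #   - ANNEAL_EVERY: halve the step size every this many iterations
    # Output:
    #   - x: the parameter value after SGD finishes
    ###########################################################

    if useSaved:
        start_iter, oldx, state = load_saved_params()
        if start_iter > 0:
            x0 = oldx
            # Re-apply the halvings that happened before the checkpoint
            step *= 0.5 ** (start_iter // ANNEAL_EVERY)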

        if state:
            random.setstate(state)
    else:
        start_iter = 0

    x = x0

    if not postprocessing:
        postprocessing = lambda x: x

    expcost = None

    for iter in range(start_iter + 1, iterations + 1):
        cost, grad = f(x)
        x = x - step * grad
        x = postprocessing(x)

        if iter % PRINT_EVERY == 0:
            print("Iter#{}, cost={}".format(iter, cost))
            sys.stdout.flush()

        if iter % SAVE_PARAMS_EVERY == 0 and useSaved:
            save_params(iter, x)

        if iter % ANNEAL_EVERY == 0:
            step *= 0.5

    return x
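
A quick way to exercise an optimizer of this shape is to minimize a simple quadratic whose minimum is known to be at zero. The sketch below strips the loop down to its update rule and annealing schedule, using plain lists instead of NumPy arrays and omitting checkpointing. The names `sgd_sketch` and `quadratic` are illustrative and not part of q3_sgd.py.

```python
def sgd_sketch(f, x0, step, iterations, anneal_every=20000):
    """Plain SGD loop: x <- x - step * grad, halving step periodically."""
    x = list(x0)
    for it in range(1, iterations + 1):
        cost, grad = f(x)
        x = [xi - step * gi for xi, gi in zip(x, grad)]
        if it % anneal_every == 0:
            step *= 0.5
    return x

def quadratic(x):
    """Cost sum(x_i^2) with gradient 2 * x, minimized at the origin."""
    return sum(v * v for v in x), [2 * v for v in x]

result = sgd_sketch(quadratic, [0.5, -1.5], step=0.01, iterations=1000)
```

With a step of 0.01 each update shrinks every coordinate by a factor of 0.98, so after 1000 iterations the result should sit very close to the origin.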

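(part g) (4 points)
Show time! Now we are going to load some real data and train word vectors with everything you have just implemented. We will use the Stanford Sentiment Treebank (SST) dataset to train the word vectors and later apply them to a sentiment analysis task. There is no additional code to write for this part; just run python q3_run.py.
Note: training may take a long time, depending on the efficiency of your implementation (an efficient implementation takes approximately one hour). Try to get close to that figure!
When the script finishes, a visualization of your word vectors will appear. It is also saved as the image q3_word_vectors.png in your project directory. Include the plot in your homework write-up, and briefly explain in at most three sentences what you see in the plot.
Solution: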

(part h) Extra credit (5 points)

Implement the CBOW model in q3_word2vec.py. Note: this part is optional, but the gradient derivations for CBOW in part (d) are not!

def cbow(currentWord, C, contextWords, tokens, inputVectors, outputVectors, 
    dataset, word2vecCostAndGradient = softmaxCostAndGradient):
    """
        CBOW model in word2vec
    """

    gradIn = np.zeros(inputVectors.shape)

    # The predicted vector is the sum of the context words' input vectors
    D = inputVectors.shape[1]
    predicted = np.zeros((D,))

    indices = [tokens[cwd] for cwd in contextWords]
    for idx in indices:
        predicted += inputVectors[idx, :]

    cost, gp, gradOut = word2vecCostAndGradient(predicted, tokens[currentWord], outputVectors, dataset)
    # Every context position receives the gradient with respect to the
    # summed vector, so repeated words accumulate it more than once
    for idx in indices:
        gradIn[idx, :] += gp

    return cost, gradIn, gradOut
