文章来源于网络收集而来，版权归原创者所有，如有侵权请及时联系！

第一部分：随机投影（使用词向量）

发布于 2025-01-01 12:38:39 字数 10057 浏览 0 评论 0 收藏 0

本节的目的是用单词向量的具体例子，来说明随机投影保留结构的想法！

要在机器学习中使用语言（例如，Skype 翻译器如何在语言之间进行翻译，或 Gmail 智能回复如何自动为你的电子邮件建议可能的回复），我们需要将单词表示为向量。

我们可以使用 Google 的 Word2Vec 或 Stanford 的 GloVe 将单词表示为 100 维向量。例如，这里是“python”这个词在 GloVe 中的向量：

vecs[wordidx['python']]

'''
array([ 0.2493,  0.6832, -0.0447, -1.3842, -0.0073,  0.651 , -0.3396,
       -0.1979, -0.3392,  0.2669, -0.0331,  0.1592,  0.8955,  0.54  ,
       -0.5582,  0.4624,  0.3672,  0.1889,  0.8319,  0.8142, -0.1183,
       -0.5346,  0.2416, -0.0389,  1.1907,  0.7935, -0.1231,  0.6642,
       -0.7762, -0.4571, -1.054 , -0.2056, -0.133 ,  0.1224,  0.8846,
        1.024 ,  0.3229,  0.821 , -0.0694,  0.0242, -0.5142,  0.8727,
        0.2576,  0.9153, -0.6422,  0.0412, -0.6021,  0.5463,  0.6608,
        0.198 , -1.1393,  0.7951,  0.4597, -0.1846, -0.6413, -0.2493,
       -0.4019, -0.5079,  0.8058,  0.5336,  0.5273,  0.3925, -0.2988,
        0.0096,  0.9995, -0.0613,  0.7194,  0.329 , -0.0528,  0.6714,
       -0.8025, -0.2579,  0.4961,  0.4808, -0.684 , -0.0122,  0.0482,
        0.2946,  0.2061,  0.3356, -0.6417, -0.6471,  0.1338, -0.1257,
       -0.4638,  1.3878,  0.9564, -0.0679, -0.0017,  0.5296,  0.4567,
        0.6104, -0.1151,  0.4263,  0.1734, -0.7995, -0.245 , -0.6089,
       -0.3847, -0.4797], dtype=float32)
'''

目标：使用随机性将此值从 100 维减少到 20。检查相似的单词是否仍然组合在一起。

更多信息：如果你对词嵌入感兴趣并想要更多细节，我在这里提供了一个更长的学习小组（带有代码演示）。

风格说明：我使用可折叠标题和 jupyter 主题。

加载数据

import pickle
import numpy as np
import re
import json

np.set_printoptions(precision=4, suppress=True)

该数据集可从这里获得。要从命令行下载和解压缩文件，你可以运行：

wget http://files.fast.ai/models/glove_50_glove_100.tgz 
tar xvzf glove_50_glove_100.tgz

你需要更新以下路径，来指定存储数据的位置。

path = "../data/"

vecs = np.load(path + "glove_vectors_100d.npy")

with open(path + "words.txt") as f:
    content = f.readlines()
words = [x.strip() for x in content] 

wordidx = json.load(open(path + "wordsidx.txt"))

我们的数据的样子

我们有一个单词的长列表。

len(words)

# 400000

words[:10]

# ['the', ',', '.', 'of', 'to', 'and', 'in', 'a', '"', "'s"]

words[600:610]

'''
['together',
 'congress',
 'index',
 'australia',
 'results',
 'hard',
 'hours',
 'land',
 'action',
 'higher']
'''

wordidx 允许我们查找单词来找出它的索引：

wordidx['python']

# 20019

words[20019]

# 'python'

作为向量的单词

单词“python”由 100 维向量表示：

vecs[wordidx['python']]

'''
array([ 0.2493,  0.6832, -0.0447, -1.3842, -0.0073,  0.651 , -0.3396,
       -0.1979, -0.3392,  0.2669, -0.0331,  0.1592,  0.8955,  0.54  ,
       -0.5582,  0.4624,  0.3672,  0.1889,  0.8319,  0.8142, -0.1183,
       -0.5346,  0.2416, -0.0389,  1.1907,  0.7935, -0.1231,  0.6642,
       -0.7762, -0.4571, -1.054 , -0.2056, -0.133 ,  0.1224,  0.8846,
        1.024 ,  0.3229,  0.821 , -0.0694,  0.0242, -0.5142,  0.8727,
        0.2576,  0.9153, -0.6422,  0.0412, -0.6021,  0.5463,  0.6608,
        0.198 , -1.1393,  0.7951,  0.4597, -0.1846, -0.6413, -0.2493,
       -0.4019, -0.5079,  0.8058,  0.5336,  0.5273,  0.3925, -0.2988,
        0.0096,  0.9995, -0.0613,  0.7194,  0.329 , -0.0528,  0.6714,
       -0.8025, -0.2579,  0.4961,  0.4808, -0.684 , -0.0122,  0.0482,
        0.2946,  0.2061,  0.3356, -0.6417, -0.6471,  0.1338, -0.1257,
       -0.4638,  1.3878,  0.9564, -0.0679, -0.0017,  0.5296,  0.4567,
        0.6104, -0.1151,  0.4263,  0.1734, -0.7995, -0.245 , -0.6089,
       -0.3847, -0.4797], dtype=float32)
'''

这让我们可以做一些有用的计算。例如，我们可以使用距离度量，看到两个单词有多远：

from scipy.spatial.distance import cosine as dist

较小的数字意味着两个单词更接近，较大的数字意味着它们更加分开。

相似单词之间的距离很短：

dist(vecs[wordidx["puppy"]], vecs[wordidx["dog"]])

# 0.27636240676695256

dist(vecs[wordidx["queen"]], vecs[wordidx["princess"]])

# 0.20527545040329642

并且无关词之间的距离很高：

dist(vecs[wordidx["celebrity"]], vecs[wordidx["dusty"]])

# 0.98835787578057777

dist(vecs[wordidx["avalanche"]], vecs[wordidx["antique"]])

# 0.96211070091611983

偏见

有很多偏见的机会：

dist(vecs[wordidx["man"]], vecs[wordidx["genius"]])

# 0.50985148631697985

dist(vecs[wordidx["woman"]], vecs[wordidx["genius"]])

# 0.6897833082810727

我只是检查了几对词之间的距离，因为这是说明这个概念的快速而简单的方式。这也是一种非常嘈杂的方法，研究人员用更系统的方式解决这个问题。

我在这个学习小组上更深入地讨论了偏见。

可视化

让我们可视化一些单词！

我们将使用 Plotly，一个制作交互式图形的 Python 库（注意：以下所有内容都是在不创建帐户的情况下完成的，使用免费的离线版 Plotly）。

方法

import plotly
import plotly.graph_objs as go    
from IPython.display import IFrame

def plotly_3d(Y, cat_labels, filename="temp-plot.html"):
    trace_dict = {}
    for i, label in enumerate(cat_labels):
        trace_dict[i] = go.Scatter3d(
            x=Y[i*5:(i+1)*5, 0],
            y=Y[i*5:(i+1)*5, 1],
            z=Y[i*5:(i+1)*5, 2],
            mode='markers',
            marker=dict(
                size=8,
                line=dict(
                    color='rgba('+ str(i*40) + ',' + str(i*40) + ',' + str(i*40) + ', 0.14)',
                    width=0.5
                ),
                opacity=0.8
            ),
            text = my_words[i*5:(i+1)*5],
            name = label
        )

    data = [item for item in trace_dict.values()]
    layout = go.Layout(
        margin=dict(
            l=0,
            r=0,
            b=0,
            t=0
        )
    )

    plotly.offline.plot({
        "data": data,
        "layout": layout,
    }, filename=filename)

def plotly_2d(Y, cat_labels, filename="temp-plot.html"):
    trace_dict = {}
    for i, label in enumerate(cat_labels):
        trace_dict[i] = go.Scatter(
            x=Y[i*5:(i+1)*5, 0],
            y=Y[i*5:(i+1)*5, 1],
            mode='markers',
            marker=dict(
                size=8,
                line=dict(
                    color='rgba('+ str(i*40) + ',' + str(i*40) + ',' + str(i*40) + ', 0.14)',
                    width=0.5
                ),
                opacity=0.8
            ),
            text = my_words[i*5:(i+1)*5],
            name = label
        )

    data = [item for item in trace_dict.values()]
    layout = go.Layout(
        margin=dict(
            l=0,
            r=0,
            b=0,
            t=0
        )
    )

    plotly.offline.plot({
        "data": data,
        "layout": layout
    }, filename=filename)

此方法将挑选出 3 个维度，最能将我们的类别彼此分开（存储在 dist_btwn_cats 中），同时最小化给定类别中单词的距离（存储在 dist_within_cats 中）。

def get_components(data, categories, word_indices):
    num_components = 30
    pca = decomposition.PCA(n_components=num_components).fit(data.T)
    all_components = pca.components_
    centroids = {}
    print(all_components.shape)
    for i, category in enumerate(categories):
        cen = np.mean(all_components[:, i*5:(i+1)*5], axis = 1)
        dist_within_cats = np.sum(np.abs(np.expand_dims(cen, axis=1) - all_components[:, i*5:(i+1)*5]), axis=1)
        centroids[category] = cen
    dist_btwn_cats = np.zeros(num_components)
    for category1, averages1 in centroids.items():
        for category2, averages2 in centroids.items():
            dist_btwn_cats += abs(averages1 - averages2)
            clusterness = dist_btwn_cats / dist_within_cats
    comp_indices = np.argpartition(clusterness, -3)[-3:]
    return all_components[comp_indices]

准备数据

让我们绘制几个不同类别的单词：

my_words = [
            "maggot", "flea", "tarantula", "bedbug", "mosquito", 
            "violin", "cello", "flute", "harp", "mandolin",
            "joy", "love", "peace", "pleasure", "wonderful",
            "agony", "terrible", "horrible", "nasty", "failure", 
            "physics", "chemistry", "science", "technology", "engineering",
            "poetry", "art", "literature", "dance", "symphony",
           ]

categories = [
              "bugs", "music", 
              "pleasant", "unpleasant", 
              "science", "arts"
             ]

同样，我们需要使用 wordidx 字典查找单词的索引：

my_word_indices = np.array([wordidx[word] for word in my_words])

vecs[my_word_indices].shape

# (30, 100)

现在，我们将组合我们的单词与我们整个单词集中的前 10,000 个单词（其中一些单词已经存在），并创建嵌入矩阵。

embeddings = np.concatenate((vecs[my_word_indices], vecs[:10000,:]), axis=0); embeddings.shape

# (10030, 100)

在 3D 中查看单词

单词有 100 个维度，我们需要一种在 3D 中可视化它们的方法。

我们将使用主成分分析（PCA），这是一种广泛使用的技术，具有许多应用，包括在较低维度可视化高维数据集！

PCA

from collections import defaultdict
from sklearn import decomposition

components = get_components(embeddings, categories, my_word_indices)
plotly_3d(components.T[:len(my_words),:], categories, "pca.html")

# (30, 10030)

IFrame('pca.html', width=600, height=400)

随机投影

Johnson-Lindenstrauss 引理：（来自维基百科）高维空间中的一小组点可以嵌入到更低维度的空间中，使点之间的距离几乎保留（使用随机投影证明）。

有用的是，能够以保持距离的方式减少数据的维度。 Johnson-Lindenstrauss 引理是这种类型的经典结果。

embeddings.shape

# (10030, 100)

rand_proj = embeddings @ np.random.normal(size=(embeddings.shape[1], 40)); rand_proj.shape

# (10030, 40)

# pca = decomposition.PCA(n_components=3).fit(rand_proj.T)
# components = pca.components_
components = get_components(rand_proj, categories, my_word_indices)
plotly_3d(components.T[:len(my_words),:], categories, "pca-rand-proj.html")

# (30, 10030)

IFrame('pca-rand-proj.html', width=600, height=400)

分享到QQ

分享到微博