36. Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings [2019]

Published 2023-07-17 23:38:25

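Traditional word embeddings are static: each word has a single embedding vector regardless of context. This causes several problems, most notably that all senses of a polysemous word must share the same representation. Recent work on deep neural language models (such as ELMo and BERT) has successfully produced contextualized word representations, that is, word vectors that are sensitive to the context in which they appear. Replacing static embeddings with contextualized representations has yielded significant improvements on a diverse range of NLP tasks, from question answering to co-reference resolution.
The success of contextualized word representations suggests that, despite being trained only on a language modelling task, they learn highly transferable, task-agnostic properties of language. In fact, linear models trained on frozen contextualized representations can predict linguistic properties of words (e.g., part-of-speech tags) almost as well as SOTA models. Still, these representations remain poorly understood: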
- First, just how contextual are these contextualized word representations?
- Second, are there infinitely many context-specific representations that BERT and ELMo can assign to each word, or are words essentially assigned one of a finite number of word-sense representations?
The paper "How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings" answers these questions by studying the geometry of the representation space at each layer of ELMo, BERT, and GPT-2. The analysis yields several surprising findings:
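- In all layers of all three models, the contextualized word representations of all words are not isotropic: they are not uniformly distributed with respect to direction. Instead, they are anisotropic, occupying a narrow cone in the vector space. The anisotropy in GPT-2's last layer is so extreme that two random words will, on average, have almost perfect cosine similarity. Given that isotropy has both theoretical and empirical benefits for static embeddings ("All-but-the-top: Simple and effective postprocessing for word representations"), the degree of anisotropy in contextualized representations is surprising.
  In short: contextualized word embeddings are anisotropic.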
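- Occurrences of the same word in different contexts have non-identical vector representations. With vector similarity defined as cosine similarity, the representations of the same word are more dissimilar to one another in upper layers. This suggests that, just as the upper layers of an LSTM produce more task-specific representations ("Linguistic knowledge and transferability of contextual representations"), the upper layers of contextualizing models produce more context-specific representations.
  In short: the degree of contextualization varies by layer, and higher layers produce more contextualized embeddings.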
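- Context-specificity manifests very differently in ELMo, BERT, and GPT-2:
  - In ELMo, representations of words in the same sentence become more similar to each other as context-specificity increases in upper layers.
  - In BERT, representations of words in the same sentence become more dissimilar to each other in upper layers, but are still more similar on average than randomly sampled words.
  - In GPT-2, however, representations of words in the same sentence are no more similar to each other than those of two randomly sampled words.
  In short: words in the same sentence become increasingly similar in the higher layers of ELMo, become more dissimilar in the higher layers of BERT (while remaining more similar than random words), and in GPT-2 are about as similar as random words.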
- After adjusting for the effect of anisotropy, on average less than 5% of the variance in a word's contextualized representations can be explained by their first principal component. This holds for all layers of all models. It suggests that contextualized representations do not correspond to a finite number of word-sense representations, and that even in the best case, static embeddings would be a poor replacement for contextualized ones. Nevertheless, static embeddings created by taking the first principal component of a word's contextualized representations outperform GloVe and FastText embeddings on many word vector benchmarks.
These insights help explain why using contextualized representations has led to such significant improvements on many NLP tasks.
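Related work:
- Static Word Embeddings: skip-gram with negative sampling (SGNS) and GloVe are among the best-known models for generating static word embeddings. Although in practice they learn embeddings iteratively, it has been shown theoretically that both implicitly factorize a word-context matrix containing co-occurrence statistics. A notable problem with static word embeddings is that, because they create a single representation for each word, all senses of a polysemous word must share a single vector.
- Contextualized Word Representations: given the limitations of static word embeddings, recent work has tried to create context-sensitive word representations. ELMo, BERT, and GPT-2 are deep neural language models that are fine-tuned for a wide range of downstream NLP tasks. Their internal word representations are called contextualized word representations because they are a function of the entire input sentence. The success of this approach suggests that these representations capture highly transferable and task-agnostic properties of language.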
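  - ELMo creates a contextualized representation of each token by concatenating the internal states of a 2-layer biLSTM trained on a bidirectional language modelling task.
  - In contrast, BERT and GPT-2 are bidirectional and unidirectional transformer-based language models, respectively. Each transformer layer of 12-layer BERT (base, cased) and 12-layer GPT-2 creates a contextualized representation of each token by attending to different parts of the input sentence.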
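- Probing Tasks: prior analysis of contextualized word representations has largely been restricted to probing tasks. This involves training linear models to predict syntactic (e.g., part-of-speech tag) and semantic (e.g., word relation) properties of words. Probing models rest on the premise that if a simple linear model can be trained to accurately predict a linguistic property, then the representations must implicitly encode that information. While these analyses have found that contextualized representations encode semantic and syntactic information, they cannot answer how contextual these representations are, or to what extent they can be replaced with static word embeddings. The work in this paper is therefore markedly different from most dissections of contextualized representations. It is more similar to "The strange geometry of skip-gram with negative sampling", which studied the geometry of static word embedding spaces.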

36.1 Method

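Contextualizing Models: the contextualizing models we study are ELMo, BERT, and GPT-2. We choose BERT_base because it is the most comparable to GPT-2 with respect to number of layers and dimensionality. All models were pretrained on their respective language modelling tasks.
Although ELMo, BERT, and GPT-2 have 2, 12, and 12 hidden layers respectively, we also include the input layer of each contextualizing model as its 0th layer. Because the 0th layer is not contextualized, it is a useful baseline against which to compare the contextualization done by subsequent layers.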
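Data: to analyze contextualized word representations, we need input sentences to feed into our pretrained models. Our input data come from the SemEval Semantic Textual Similarity tasks from 2012 to 2016. We use these datasets because they contain sentences in which the same words appear in different contexts. For example, the word "dog" appears in "A panda dog is running on the road." and "A dog is trying to get bacon off his back." If a model generated the same representation for "dog" in both sentences, we could infer that there was no contextualization; conversely, if the two representations were different, we could infer that they were contextualized to some extent.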
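In short: when the same word appears in different contexts, identical embeddings indicate no contextualization, whereas differing embeddings indicate some degree of contextualization.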
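Using these datasets, we map words to the list of sentences they appear in and their index within those sentences. We do not consider words that appear in fewer than 5 unique contexts in our analysis.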
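The word-to-occurrence mapping described above can be sketched roughly as follows. This is a minimal illustration, not the paper's preprocessing code: the function name `build_occurrence_index`, the use of pre-tokenized sentences, and the reading of "unique context" as "distinct sentence" are all assumptions made here.

```python
from collections import defaultdict


def build_occurrence_index(sentences, min_contexts=5):
    """Map each word to the (sentence_id, token_index) pairs where it occurs.

    `sentences` is a list of token lists (tokenization is assumed to have
    happened already). Words seen in fewer than `min_contexts` distinct
    sentences are dropped, mirroring the "fewer than 5 unique contexts" filter.
    """
    occurrences = defaultdict(list)
    for sent_id, tokens in enumerate(sentences):
        for pos, token in enumerate(tokens):
            occurrences[token].append((sent_id, pos))

    kept = {}
    for word, occs in occurrences.items():
        # Assumption: two occurrences share a context only if the whole
        # sentence is identical token for token.
        unique_contexts = {tuple(sentences[sid]) for sid, _ in occs}
        if len(unique_contexts) >= min_contexts:
            kept[word] = occs
    return kept
```

With the two example sentences and `min_contexts=2`, the entry for "dog" would hold one position per sentence, while "panda" would be filtered out because it appears in only one context.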
Measuring Contextuality: we measure how contextual a word representation is using three different metrics: self-similarity, intra-sentence similarity, and maximum explainable variance.
- Self-similarity: let $w$ be a word that appears in sentences $\{s_1, \dots, s_n\}$ at indices $\{i_1, \dots, i_n\}$ respectively, such that $w = s_1[i_1] = \dots = s_n[i_n]$. Let $f_l(s, i)$ be a function that maps $s[i]$ to its word representation in layer $l$ of model $f$. The self-similarity of $w$ in layer $l$ is defined as:
  $$\text{SelfSim}_l(w) = \frac{1}{n^2 - n} \sum_{j} \sum_{k \neq j} \cos\left(f_l(s_j, i_j), f_l(s_k, i_k)\right) \tag{26}$$
  In other words, the self-similarity of a word $w$ in layer $l$ is the average cosine similarity between its contextualized representations across its $n$ unique contexts.
  - If layer $l$ does not contextualize the representations at all, then $\text{SelfSim}_l(w) = 1$ (i.e., the representations are identical across all contexts).
  - The more contextualized the representations of $w$ are, the lower we would expect its self-similarity to be.
  Note that this presupposes that the word's occurrences are spread evenly across its contexts rather than concentrated in a few of them.
- Intra-sentence similarity: let $s$ be a sentence that is a sequence $\langle w_1, \dots, w_n \rangle$ of $n$ words. The intra-sentence similarity of $s$ in layer $l$ is defined as:
  $$\text{IntraSim}_l(s) = \frac{1}{n} \sum_{i} \cos\left(\vec{s}_l, f_l(s, i)\right), \qquad \vec{s}_l = \frac{1}{n} \sum_{i} f_l(s, i) \tag{27}$$
  In other words, the intra-sentence similarity of a sentence is the average cosine similarity between its word representations and the sentence vector, where the sentence vector is simply the mean of those word vectors.
  - If both $\text{IntraSim}_l(s)$ and $\text{SelfSim}_l(w)$ for all $w \in s$ are low, then the model contextualizes words by giving each one a context-specific representation that is still distinct from the representations of the other words in the sentence.
  - If $\text{IntraSim}_l(s)$ is high but $\text{SelfSim}_l(w)$ is low, this suggests a less nuanced contextualization: words in the same sentence are contextualized simply by making their representations converge to a small region of the vector space.
- Maximum explainable variance: let $[f_l(s_1, i_1), \dots, f_l(s_n, i_n)] \in \mathbb{R}^{d \times n}$ be the occurrence matrix of word $w$ in layer $l$, where $d$ is the representation dimensionality. Let $\sigma_1, \dots, \sigma_m$ be the singular values of this matrix in descending order. The maximum explainable variance is defined as:
  $$\text{MEV}_l(w) = \frac{\sigma_1^2}{\sum_{i} \sigma_i^2} \tag{28}$$
  $\text{MEV}_l(w)$ is the proportion of variance in the contextualized representations of $w$ in a given layer that can be explained by their first principal component. It gives an upper bound on how well a static embedding could replace the word's contextualized representations.
  - The closer $\text{MEV}_l(w)$ is to 0, the poorer a replacement a static embedding would be.
  - If $\text{MEV}_l(w) = 1$, a static embedding would be a perfect replacement for the contextualized representations.
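The three raw metrics in equations (26) to (28) can be sketched in NumPy as below. This is a hedged illustration rather than the authors' implementation: the function names are made up here, each matrix is stored as one row per occurrence (shape `(n, d)`, the transpose of the $d \times n$ occurrence matrix in the text, which leaves the singular values unchanged), and MEV follows equation (28) literally, without mean-centering.

```python
import numpy as np


def _unit_rows(x):
    """Scale every row of x to unit Euclidean length."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def self_similarity(occurrences):
    """Eq. (26): mean cosine similarity over all ordered pairs j != k.

    `occurrences` has shape (n, d) with n >= 2: one layer-l vector per context.
    """
    u = _unit_rows(occurrences)
    n = u.shape[0]
    cos = u @ u.T                      # all pairwise cosines, diagonal is 1
    return float((cos.sum() - np.trace(cos)) / (n * n - n))


def intra_sentence_similarity(sentence_vectors):
    """Eq. (27): mean cosine between each word vector and the sentence mean."""
    x = np.asarray(sentence_vectors, dtype=float)
    mean_vec = x.mean(axis=0, keepdims=True)
    return float((_unit_rows(x) @ _unit_rows(mean_vec).T).mean())


def max_explainable_variance(occurrences):
    """Eq. (28): sigma_1^2 / sum_i sigma_i^2 of the occurrence matrix."""
    sigma = np.linalg.svd(np.asarray(occurrences, dtype=float), compute_uv=False)
    return float(sigma[0] ** 2 / np.sum(sigma ** 2))
```

As a sanity check, a word whose vectors are identical in every context gets a self-similarity and an MEV of 1, matching the two limiting cases described above.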
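Adjusting for Anisotropy: it is important to consider isotropy when discussing contextuality. For example:
- If word vectors were perfectly isotropic (i.e., directionally uniform), then $\text{SelfSim}_l(w) = 0.95$ would suggest that the representations of $w$ were poorly contextualized.
- However, consider the scenario where word vectors are so anisotropic that any two words have, on average, a cosine similarity of 0.99. Then $\text{SelfSim}_l(w) = 0.95$ would actually suggest that the representations of $w$ were well contextualized, because representations of $w$ in different contexts would on average be more dissimilar than those of two randomly chosen words.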
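To adjust for the effect of anisotropy, we use three anisotropic baselines, one for each of our contextuality metrics.
- For self-similarity and intra-sentence similarity, the baseline is the average cosine similarity between the representations of uniformly randomly sampled words from different contexts. The more anisotropic the word representations are in a given layer, the closer this baseline is to 1.
  That is: sample two words uniformly at random and compute the cosine similarity between their embeddings; repeat many times and take the average.
- For maximum explainable variance, the baseline is the proportion of variance in uniformly randomly sampled word representations that is explained by their first principal component. The more anisotropic the word representations are in a given layer, the closer this baseline is to 1: even for a random assortment of words, the first principal component would explain a large proportion of the variance.
  That is: sample a set of words uniformly at random and compute the proportion of variance explained by the first principal component of their embedding matrix; repeat many times and take the average.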
Since the contextuality metrics are calculated for each layer of a contextualizing model, we calculate a separate baseline for each layer as well. We then subtract from each metric its respective baseline to get the anisotropy-adjusted contextuality metric. For example, the anisotropy-adjusted self-similarity is:
$$\text{Baseline}(f_l) = \mathbb{E}_{x, y \sim U(\mathcal{O})}\left[\cos\left(f_l(x), f_l(y)\right)\right], \qquad \text{SelfSim}_l^{*}(w) = \text{SelfSim}_l(w) - \text{Baseline}(f_l) \tag{29}$$
where $\mathcal{O}$ is the set of all word occurrences.
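Unless otherwise stated, references to contextuality metrics in the rest of this article refer to the anisotropy-adjusted metrics, where both the raw metric and the baseline are estimated with 1K uniformly randomly sampled word representations.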
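A rough Monte Carlo sketch of the per-layer baselines and the adjustment in equation (29) follows. The function names, the default sample sizes, the fixed seed, and the choice to discard pairs that happen to draw the same occurrence twice are assumptions of this sketch; `all_occurrences` is assumed to be an `(N, d)` array holding the layer-$l$ vectors of all word occurrences $\mathcal{O}$.

```python
import numpy as np


def anisotropy_baseline(all_occurrences, n_pairs=1000, seed=0):
    """Estimate E[cos(f_l(x), f_l(y))] for x, y drawn uniformly from O."""
    x = np.asarray(all_occurrences, dtype=float)
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(x), size=n_pairs)
    j = rng.integers(0, len(x), size=n_pairs)
    keep = i != j                      # sketch choice: skip self-pairs
    a, b = x[i[keep]], x[j[keep]]
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )
    return float(cos.mean())


def mev_baseline(all_occurrences, sample_size=1000, seed=0):
    """Share of variance explained by the first principal component of a
    uniformly random sample of occurrence vectors (uncentered, as in Eq. 28)."""
    x = np.asarray(all_occurrences, dtype=float)
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(x), size=min(sample_size, len(x)), replace=False)
    sigma = np.linalg.svd(x[idx], compute_uv=False)
    return float(sigma[0] ** 2 / np.sum(sigma ** 2))


def adjust(raw_metric, baseline):
    """Eq. (29): anisotropy-adjusted metric = raw metric - layer baseline."""
    return raw_metric - baseline
```

Applied to the example above, a raw self-similarity of 0.95 stays at 0.95 after adjustment when the baseline is 0 (isotropic vectors), but becomes about -0.04 when the baseline is 0.99, so the same raw value reads as poorly contextualized in one case and well contextualized in the other.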

36.2 Experiments

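(An)Isotropy:
- Contextualized representations are anisotropic in all non-input layers.
  If word representations from a particular layer were isotropic (i.e., directionally uniform), then the average cosine similarity between uniformly randomly sampled words would be 0 ("A simple but tough-to-beat baseline for sentence embeddings"). The closer this average is to 1, the more anisotropic the representations.
  The geometric interpretation of anisotropy is:
  - The word representations all occupy a narrow cone in the vector space rather than being uniform in all directions.
  - The greater the anisotropy, the narrower this cone ("The strange geometry of skip-gram with negative sampling").
  As the figure below shows, this implies that in almost all layers of BERT, ELMo, and GPT-2, the representations of all words occupy a narrow cone in the vector space. The only exception is ELMo's input layer, which produces static character-level embeddings without using contextual or even positional information ("Deep contextualized word representations").
  It should be noted, however, that not all static embeddings are necessarily isotropic: "The strange geometry of skip-gram with negative sampling" found that skip-gram embeddings, which are also static, are not isotropic.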
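- Contextualized representations are generally more anisotropic in higher layers.
  As the figure below shows, for GPT-2 the average cosine similarity between uniformly randomly sampled words is roughly 0.6 in layers 2 through 8 but increases exponentially from layers 8 through 12. In fact, word representations in GPT-2's last layer are so anisotropic that any two words have, on average, an almost perfect cosine similarity of close to 1.0.
  This pattern holds for BERT and ELMo as well, though there are exceptions: for example, the anisotropy in BERT's penultimate layer is much higher than in its final layer.
  For static word embeddings, isotropy has both theoretical and empirical benefits.
  - In theory, isotropy allows for stronger "self-normalization" during training ("A simple but tough-to-beat baseline for sentence embeddings").
  - In practice, subtracting the mean vector from static word embeddings leads to improvements on several downstream NLP tasks ("All-but-the-top: Simple and effective postprocessing for word representations").
  The extreme degree of anisotropy seen in contextualized word representations, particularly in higher layers, is therefore surprising. As the figure below shows, for all three models the contextualized hidden-layer representations are almost all more anisotropic than the input-layer representations, which do not incorporate context. This suggests that high anisotropy is inherent to, or at least a by-product of, the process of contextualization.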
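The link between anisotropy and average cosine similarity can be illustrated on synthetic data. The sketch below is a toy example, not an experiment from the paper: it draws Gaussian vectors (roughly isotropic), shifts them all by a shared offset to push them into a narrow cone, and then subtracts the mean vector, the simple post-processing step mentioned above for static embeddings. The dimensions, offset size, and seed are arbitrary choices.

```python
import numpy as np


def mean_pairwise_cosine(x):
    """Average cosine similarity over all pairs of distinct rows of x."""
    u = x / np.linalg.norm(x, axis=1, keepdims=True)
    cos = u @ u.T
    n = len(u)
    return float((cos.sum() - np.trace(cos)) / (n * n - n))


rng = np.random.default_rng(0)

# Roughly isotropic cloud: directions are spread uniformly.
isotropic = rng.standard_normal((500, 64))

# Adding one shared offset squeezes every vector into a narrow cone.
anisotropic = isotropic + 5.0 * np.ones(64)

# Subtracting the mean vector removes the shared direction again.
centered = anisotropic - anisotropic.mean(axis=0)

iso_cos = mean_pairwise_cosine(isotropic)       # close to 0
aniso_cos = mean_pairwise_cosine(anisotropic)   # close to 1
centered_cos = mean_pairwise_cosine(centered)   # close to 0 again

print(iso_cos, aniso_cos, centered_cos)
```

In this toy setting a single common direction is enough to drive the average cosine similarity toward 1, which is the geometric picture behind the "narrow cone" description. Whether mean subtraction helps contextualized representations in the same way is not something this sketch shows.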
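Context-Specificity:
- Contextualized word representations are more context-specific in higher layers.
  By definition, the self-similarity of a word in a given layer of a given model is the average cosine similarity between its representations in different contexts, adjusted for anisotropy.
  - If the self-similarity is 1, the representations are not context-specific at all.
  - If the self-similarity is 0, the representations are maximally context-specific.
  In the figure below, we plot the average self-similarity of uniformly randomly sampled words in each layer of BERT, ELMo, and GPT-2. For example, the self-similarity is 1.0 in ELMo's input layer because representations in that layer are static character-level embeddings.
  In all three models, the higher the layer, the lower the average self-similarity. In other words, the higher the layer, the more context-specific the contextualized representations. This finding makes intuitive sense. In image classification models, lower layers recognize more generic features such as edges, while upper layers recognize more class-specific features ("How transferable are features in deep neural networks?"). Similarly, the upper layers of LSTMs trained on NLP tasks learn more task-specific representations ("Linguistic knowledge and transferability of contextual representations"). It follows that the upper layers of neural language models learn more context-specific representations, so as to predict the next word for a given context more accurately.
  Of the three models, representations in GPT-2 are the most context-specific, with those in GPT-2's last layer being almost maximally context-specific.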
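- Stopwords (e.g., "the", "of", "to") have among the most context-specific representations.
  Across all layers, stopwords have the lowest self-similarity of any words, implying that their contextualized representations are the most context-specific. For example, the words with the lowest average self-similarity across ELMo's layers are "and", "of", "’s", "the", and "to". This is relatively surprising, given that these words are not polysemous. The finding suggests that the variety of contexts a word appears in, rather than its inherent polysemy, is what drives variation in its contextualized representations. This answers one of the questions posed in the introduction: ELMo, BERT, and GPT-2 are not simply assigning one of a finite number of word-sense representations to each word; otherwise, there would not be so much variation in the representations of words with so few word senses.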
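- Context-specificity manifests very differently in ELMo, BERT, and GPT-2.
  As noted earlier, contextualized representations are more context-specific in the upper layers of ELMo, BERT, and GPT-2. But how does this increased context-specificity manifest in the vector space? Do word representations in the same sentence converge to a single point, or do they remain distinct from one another while still being distinct from their representations in other contexts? To answer this question, we can measure a sentence's intra-sentence similarity.
  By definition, the intra-sentence similarity of a sentence in a given layer of a given model is the average cosine similarity between each of its word representations and their mean, adjusted for anisotropy. In the figure below, we plot the average intra-sentence similarity of 500 uniformly randomly sampled sentences.
  - In ELMo, words in the same sentence are more similar to one another in upper layers: as word representations in a sentence become more context-specific in upper layers, the intra-sentence similarity also rises. This suggests that, in practice, ELMo ends up extending the intuition behind the distributional hypothesis of "A synopsis of linguistic theory" to the sentence level: because words in the same sentence share the same context, their contextualized representations should also be similar.
  - In BERT, words in the same sentence are more dissimilar to one another in upper layers: as word representations in a sentence become more context-specific in upper layers, they drift away from one another, although there are exceptions (see layer 12 in the figure below). In all layers, however, the average similarity between words in the same sentence is still greater than the average similarity between randomly chosen words (i.e., the anisotropy baseline). This suggests a more nuanced contextualization than in ELMo: BERT recognizes that although the surrounding sentence informs a word's meaning, two words in the same sentence do not necessarily have similar meanings.
  - In GPT-2, word representations in the same sentence are no more similar to each other than randomly sampled words: on average, the unadjusted intra-sentence similarity is roughly the same as the anisotropy baseline, so, as the figure below shows, the anisotropy-adjusted intra-sentence similarity is close to 0 in most layers of GPT-2. In fact, the intra-sentence similarity is highest in the input layer, which does not contextualize words at all. This is in stark contrast to ELMo and BERT, where the average intra-sentence similarity is above 0.20 for all but one layer.
  As noted in the discussion of BERT, this behavior still makes intuitive sense: two words in the same sentence do not necessarily have similar meanings, even though they share the same context. The success of GPT-2 suggests that high intra-sentence similarity is not inherent to contextualization. Words in the same sentence can have highly contextualized representations without those representations being any more similar to each other than two random word representations. It is unclear, however, whether these differences in intra-sentence similarity can be traced back to differences in model architecture; we leave this question to future work.
  In short: high intra-sentence similarity is not inherent to contextualization, whereas high anisotropy (see the anisotropy discussion above) appears to be inherent to it or at least a by-product of it.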
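Static vs. Contextualized:
- On average, less than 5% of the variance in a word's contextualized representations can be explained by a static embedding.
  By definition, the maximum explainable variance (MEV) of a word, for a given layer of a given model, is the proportion of variance in its contextualized representations that can be explained by their first principal component. This gives an upper bound on how well a static embedding could replace a word's contextualized representations. Because contextualized representations are anisotropic, much of the variation across all words can be explained by a single vector. We adjust the raw MEV for anisotropy by calculating the proportion of variance explained by the first principal component of uniformly randomly sampled word representations and subtracting this proportion from the raw MEV. In the figure below, we plot the average anisotropy-adjusted MEV across uniformly randomly sampled words.
  That is: sample a word at random and compute its MEV in the given layer; repeat this many times and take the average MEV.
  On average, in no layer of ELMo, BERT, or GPT-2 can more than 5% of the variance in a word's contextualized representations be explained by a static embedding. Though not visible in the figure below, the raw MEV of many words is actually below the anisotropy baseline: that is, a greater proportion of the variance across all words can be explained by a single vector than the variance across all representations of a single word. Note that the 5% threshold represents the best-case scenario, and there is no theoretical guarantee that a word vector would be similar to the static embedding that maximizes MEV.
  Here, "the variance across all representations of a single word" corresponds to that word's MEV in the given layer, and "the variance across all words" corresponds to the MEV computed over all words in that layer.
  This suggests that contextualizing models are not simply assigning one of a finite number of word-sense representations to each word; otherwise, the proportion of variance explained would be much higher. Even the average raw MEV is below 5% for all layers of ELMo and BERT. Only for GPT-2 is the raw MEV relatively large, averaging around 30% for layers 2 to 11 because of the extremely high anisotropy.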
- Principal components of contextualized representations in lower layers outperform GloVe and FastText on many benchmarks.
  As noted earlier, we can create a static embedding for each word by taking the first principal component (PC) of its contextualized representations in a given layer. The table below reports the performance of these PC static embeddings on several benchmark tasks. These tasks cover semantic similarity, analogy solving, and concept categorization: SimLex999, MEN, WS353, RW, SemEval-2012, Google analogy solving, MSR analogy solving, BLESS, and AP. Layers 3 to 10 are left out of the table because their performance falls between that of layers 2 and 11.
  - The best-performing PC static embeddings belong to the first layer of BERT, although those from the other layers of BERT and ELMo also outperform GloVe and FastText on most benchmarks.
  - For all three contextualizing models, PC static embeddings created from lower layers are more effective than those created from upper layers.
  - Static embeddings created using GPT-2 also perform markedly worse than those from ELMo and BERT.
    The static embeddings created from GPT-2 are even worse than GloVe and FastText.
  Given that upper layers are more context-specific than lower layers, and that GPT-2's representations are more context-specific than ELMo's and BERT's (see Figure 2), this suggests that the PC static embeddings of highly context-specific representations are less effective on traditional benchmarks. PC static embeddings derived from less context-specific representations, such as those from BERT's first layer, are much more effective.
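Extracting a PC static embedding from a word's occurrence matrix can be sketched with an SVD, as below. The text does not say whether the representations are mean-centered first, so this sketch follows the uncentered form used in equation (28); that choice, the function name `pc_static_embedding`, and the sign convention are assumptions made here.

```python
import numpy as np


def pc_static_embedding(occurrences):
    """Return a unit-length static vector for one word in one layer.

    `occurrences` has shape (n, d): the word's layer-l vector in each of its
    n contexts. The result is the first right-singular vector, i.e. the first
    principal direction of the (uncentered) occurrence matrix.
    """
    x = np.asarray(occurrences, dtype=float)
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    pc = vt[0]
    # Singular vectors are only defined up to sign; orient the result toward
    # the word's mean occurrence vector so it is deterministic (sketch choice).
    if pc @ x.mean(axis=0) < 0:
        pc = -pc
    return pc
```

Stacking the vectors returned for every word in the vocabulary would give a lookup table that can be evaluated on word-vector benchmarks in the same way as GloVe or FastText embeddings.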
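Future work:
- First, as noted earlier, "All-but-the-top: Simple and effective postprocessing for word representations" found that making static embeddings more isotropic (by subtracting the mean vector from each embedding) leads to surprisingly large improvements on downstream tasks. Given that isotropy benefits static embeddings, it may also benefit contextualized word representations, although the latter have already yielded significant improvements despite being highly anisotropic. Adding an anisotropy penalty to the language modelling objective, so as to encourage contextualized representations to be more isotropic, may therefore yield even better results.
- Another direction is generating static word representations from contextualized ones. While contextualized representations offer superior performance, deploying large models such as BERT in production often poses challenges with respect to memory and run-time. Static word representations, in contrast, are much easier to deploy. Our work shows that it is not only possible to extract static representations from contextualizing models, but that these extracted vectors often perform better on a variety of tasks than traditional static embeddings such as GloVe and FastText.
  At inference time, a static embedding requires only a lookup operation and is therefore fast, whereas a contextualized embedding requires a forward pass through the model and is therefore much slower.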