How do I form feature vectors for a named-entity-recognition classifier?
I have a set of tags (different from the conventional Name, Place, Object, etc.). In my case they are domain-specific, and I call them: Entity, Action, Incident. I want to use these as a seed for extracting more named entities.
I came across this paper: "Efficient Support Vector Classifiers for Named Entity Recognition" by Isozaki et al. While I like the idea of using Support Vector Machines for doing named-entity recognition, I am stuck on how to encode the feature vector. For their paper, this is what they say:
For instance, the words in “President George Herbert Bush said Clinton is . . . ” are classified as follows: “President” = OTHER, “George” = PERSON-BEGIN, “Herbert” = PERSON-MIDDLE, “Bush” = PERSON-END, “said” = OTHER, “Clinton” = PERSON-SINGLE, “is” = OTHER. In this way, the first word of a person’s name is labeled as PERSON-BEGIN. The last word is labeled as PERSON-END. Other words in the name are PERSON-MIDDLE. If a person’s name is expressed by a single word, it is labeled as PERSON-SINGLE. If a word does not belong to any named entities, it is labeled as OTHER. Since IREX defines eight NE classes, words are classified into 33 categories.

Each sample is represented by 15 features because each word has three features (part-of-speech tag, character type, and the word itself), and two preceding words and two succeeding words are also used for context dependence. Although infrequent features are usually removed to prevent overfitting, we use all features because SVMs are robust. Each sample is represented by a long binary vector, i.e., a sequence of 0 (false) and 1 (true). For instance, “Bush” in the above example is represented by a vector x = x[1] ... x[D] described below. Only 15 elements are 1.
x[1] = 0 // Current word is not ‘Alice’
x[2] = 1 // Current word is ‘Bush’
x[3] = 0 // Current word is not ‘Charlie’
x[15029] = 1 // Current POS is a proper noun
x[15030] = 0 // Current POS is not a verb
x[39181] = 0 // Previous word is not ‘Henry’
x[39182] = 1 // Previous word is ‘Herbert’
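To check my understanding of the labeling part, here is my own sketch of the scheme (the function is mine, not code from the paper):

    # My reading of the BEGIN/MIDDLE/END/SINGLE scheme; IREX defines 8 NE
    # classes, so 8 * 4 span tags + OTHER gives the 33 categories above.
    def span_labels(entity_class, num_tokens):
        if num_tokens == 1:
            return [entity_class + "-SINGLE"]
        return ([entity_class + "-BEGIN"]
                + [entity_class + "-MIDDLE"] * (num_tokens - 2)
                + [entity_class + "-END"])

    print(span_labels("PERSON", 3))  # ['PERSON-BEGIN', 'PERSON-MIDDLE', 'PERSON-END']
    print(span_labels("PERSON", 1))  # ['PERSON-SINGLE']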
I don't really understand how the binary vector here is being constructed. I know I am missing a subtle point, but can someone help me understand this?
1 Answer
There is a bag-of-words lexicon-building step that they omit.
Basically, you build a map from the (non-rare) words in the training set to indices. Say you have 20k unique words in your training set; then you have a mapping from every word in the training set to an index in [0, 20000).
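In code, that lexicon-building step might look something like this (a toy sketch; the identifiers are mine, not from the paper):

    # Assign every distinct training-set word a unique integer index.
    training_tokens = ["President", "George", "Herbert", "Bush", "said", "Clinton", "is"]
    word_to_index = {}
    for token in training_tokens:
        if token not in word_to_index:
            word_to_index[token] = len(word_to_index)
    print(word_to_index["Bush"])  # 3, i.e. this word's slot in its one-hot block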
The feature vector is then basically a concatenation of a few very sparse one-hot vectors: a 1 in the slot for the particular word and 19,999 0s everywhere else, then a 1 for the particular POS tag and, say, 50 other 0s for the inactive POS tags, and so on. This is generally called a one-hot encoding: http://en.wikipedia.org/wiki/One-hot
So your feature vector is about 100k in size, with a little extra for the POS and character-type tags, and is almost entirely 0s, except for 15 1s in the positions picked out by your feature-to-index mappings.
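Putting it together, here is a toy version of the whole construction (the vocabulary, tag sets, and block order are made up; only the shape of the encoding follows the paper):

    from typing import List

    def one_hot(index: int, size: int) -> List[int]:
        # A vector of `size` zeros with a single 1 at `index`.
        v = [0] * size
        v[index] = 1
        return v

    # Toy lookup tables; in reality word_to_index would have ~20k entries.
    word_to_index = {"President": 0, "George": 1, "Herbert": 2, "Bush": 3,
                     "said": 4, "Clinton": 5, "is": 6}
    pos_to_index = {"NNP": 0, "VBD": 1, "VBZ": 2}
    chartype_to_index = {"capitalized": 0, "lower": 1}

    def encode(words: List[str], pos: List[str], chartypes: List[str]) -> List[int]:
        # Concatenate one-hot blocks for 5 positions x 3 features = 15 ones total.
        vec: List[int] = []
        for w, p, c in zip(words, pos, chartypes):
            vec += one_hot(word_to_index[w], len(word_to_index))
            vec += one_hot(pos_to_index[p], len(pos_to_index))
            vec += one_hot(chartype_to_index[c], len(chartype_to_index))
        return vec

    # "Bush" with two words of context on each side, as in the paper's example:
    x = encode(["George", "Herbert", "Bush", "said", "Clinton"],
               ["NNP", "NNP", "NNP", "VBD", "NNP"],
               ["capitalized"] * 5)
    print(len(x), sum(x))  # 60 15 -- a 60-dim vector with exactly 15 ones

With the real 20k-word vocabulary, each word block alone is 20k wide, which is how you end up near the roughly 100k-dimensional vector described above.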