Baum-Welch algorithm for a POS tagger
Hi everyone,
I'm using the Baum-Welch algorithm to train a POS tagger in a fully unsupervised way.
Here is the problem:
When I get the labeling result, it is only a sequence of numbers.
I can't figure out which number stands for VV, NN, or DT.
How can I solve this problem?
In general, there's no way to do that. Baum-Welch will find classes of word uses that have similar distributions, but there's no particular reason to suppose that those classes will map in any straightforward way to categories posited by any specific linguistic theory. Therefore, unsupervised POS taggers are mainly useful for applications where you care about equivalence classes of words or phrases but not about the specific tags being assigned.
If you really need human-readable labels, though (e.g., during development, to evaluate whether the results you're getting are even remotely plausible), I'd hand-tag a few dozen sentences. Then you could apply your B-W-derived tagger to that labeled mini-corpus to induce a mapping between class numbers and POS labels.
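As a rough illustration of that last step, here is a minimal sketch of a many-to-one mapping by majority vote: each class number is assigned the hand-assigned tag it co-occurs with most often. The function name `induce_mapping` and the toy data are hypothetical; in practice `states` would be your tagger's output on the hand-tagged sentences and `gold` the tags you assigned by hand, token by token.

```python
from collections import Counter, defaultdict


def induce_mapping(predicted_states, gold_tags):
    """Map each class number to the gold tag it co-occurs with most often."""
    counts = defaultdict(Counter)
    for state, tag in zip(predicted_states, gold_tags):
        counts[state][tag] += 1
    # Many-to-one: several class numbers may end up with the same tag.
    return {state: tags.most_common(1)[0][0] for state, tags in counts.items()}


# Hypothetical toy data: tagger output and hand-assigned tags for the same tokens.
states = [2, 0, 1, 2, 0, 1, 0]
gold = ["DT", "NN", "VV", "DT", "NN", "VV", "VV"]

mapping = induce_mapping(states, gold)

# Relabel the numeric output; classes never seen in the mini-corpus stay unmapped.
relabeled = [mapping.get(s, "UNK") for s in states]

# Fraction of tokens whose mapped tag agrees with the hand-assigned tag.
accuracy = sum(r == g for r, g in zip(relabeled, gold)) / len(gold)
print(mapping, accuracy)
```

The agreement score computed this way is optimistic, since the mapping is fitted on the same sentences it is scored on. If the mini-corpus is large enough, holding out part of it for scoring gives a more honest picture. A class that splits evenly across several tags is a sign that it does not correspond to any single POS category, which is the caveat raised above.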