Paraphrase recognition using sentence-level similarity
I'm a newcomer to NLP (Natural Language Processing). As a starter project, I'm developing a paraphrase recognizer (a system that can recognize two similar sentences). For that recognizer I'm going to apply various measures at three levels, namely: lexical, syntactic, and semantic. At the lexical level, there are multiple similarity measures such as cosine similarity, the matching coefficient, the Jaccard coefficient, etc. For these measures I'm using the SimMetrics package developed by the University of Sheffield, which contains a lot of similarity measures. But for the Levenshtein distance and Jaro-Winkler distance measures, the code works only at the character level, whereas I require code at the sentence level (i.e., treating a single word, rather than a character, as the unit). Additionally, SimMetrics has no code for computing the Manhattan distance. Are there any suggestions for how I could develop the required code (or could someone provide me the code) at the sentence level for the above-mentioned measures?
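For the two measures you mention, the usual trick is to tokenize first and then run the same algorithm over word sequences instead of character sequences. Here is a minimal sketch (my own code, not from SimMetrics) of a word-level Levenshtein distance and a Manhattan distance over word-count vectors; the whitespace `split()` is a placeholder for a real tokenizer:

```python
from collections import Counter

def word_levenshtein(s1, s2):
    """Levenshtein distance treating each word, not each character, as a unit."""
    a, b = s1.split(), s2.split()
    # prev[j] holds the edit distance between the first i-1 words of a
    # and the first j words of b
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        curr = [i]
        for j, wb in enumerate(b, 1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def manhattan_distance(s1, s2):
    """L1 distance between the two sentences' word-count vectors."""
    c1, c2 = Counter(s1.split()), Counter(s2.split())
    return sum(abs(c1[w] - c2[w]) for w in set(c1) | set(c2))
```

For example, `word_levenshtein("the cat sat", "the dog sat")` is 1 (one word substitution), and the Manhattan distance is simply the sum of absolute differences in word counts. A word-level Jaro-Winkler can be derived the same way, by replacing character comparisons with word comparisons.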
Thanks a lot in advance for your time and effort helping me.
Comments (2)
I have been working in the area of NLP for a few years now, and I completely agree with those who have provided answers/comments. This really is a hard nut to crack! But, let me still provide a few pointers:
(1) Lexical similarity: Instead of trying to generalize Jaro-Winkler distance to sentence-level, it is probably much more fruitful if you develop a character-level or word-level language model, and compute the log-likelihood. Let me explain further: train your language model based on a corpus. Then take a whole lot of candidate sentences that have been annotated as similar/dissimilar to the sentences in the corpus. Compute the log-likelihood for each of these test sentences, and establish a cut-off value to determine similarity.
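To make the language-model idea concrete, here is a toy word-level bigram model with add-one smoothing (an assumed stand-in for a real LM toolkit); the returned function scores a sentence by its log-likelihood under the trained model:

```python
import math
from collections import Counter

def train_bigram_lm(corpus_sentences):
    """Train a word-level bigram model with add-one (Laplace) smoothing."""
    unigrams, bigrams = Counter(), Counter()
    vocab = set()
    for sent in corpus_sentences:
        toks = ["<s>"] + sent.split() + ["</s>"]
        vocab.update(toks)
        unigrams.update(toks[:-1])          # contexts for the bigrams
        bigrams.update(zip(toks, toks[1:]))
    V = len(vocab)

    def log_likelihood(sentence):
        toks = ["<s>"] + sentence.split() + ["</s>"]
        ll = 0.0
        for w1, w2 in zip(toks, toks[1:]):
            # add-one smoothed conditional probability P(w2 | w1)
            p = (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)
            ll += math.log(p)
        return ll

    return log_likelihood
```

Sentences resembling the training corpus score a higher (less negative) log-likelihood than unrelated ones, which is where the cut-off value comes in. In practice you would use a proper smoothed n-gram or neural LM rather than this sketch.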
(2) Syntactic similarity: So far, only stylometric similarities manage to capture this. For that, you will need to use PCFG parse trees (or TAG parse trees; TAG = tree-adjoining grammar, a generalization of CFGs).
(3) Semantic similarity: Off the top of my head, I can only think of using resources such as WordNet and identifying the similarity between synsets. But this is not simple either. Your first problem will be to identify which words from the two (or more) sentences are "corresponding words", before you can proceed to check their semantics.
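That word-correspondence step can be sketched as a greedy alignment: pair each word in one sentence with its most similar unused counterpart in the other. The similarity function below is a trivial placeholder (1.0 for identical words, otherwise 0); in a real system you would plug in a WordNet synset measure such as NLTK's `wn.path_similarity`:

```python
def align_words(sent1, sent2, sim=lambda a, b: 1.0 if a == b else 0.0):
    """Greedily align words of sent1 to the most similar words of sent2.

    `sim` is a pluggable word-similarity function; the default exact-match
    version is only a placeholder for a semantic measure.
    """
    pairs = []
    remaining = list(sent2.split())
    for w1 in sent1.split():
        if not remaining:
            break
        best = max(remaining, key=lambda w2: sim(w1, w2))
        if sim(w1, best) > 0:          # skip words with no counterpart
            pairs.append((w1, best))
            remaining.remove(best)
    return pairs
```

Greedy alignment is crude (an optimal assignment would use something like the Hungarian algorithm), but it illustrates the shape of the problem.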
As Chris suggests, this is a non-trivial project for a beginner. I would suggest you start off with something simpler (if relatively boring), such as chunking.
Have a look at the docs and the book for the Python NLTK library; there are some examples close to what you are looking for. For example, containment: is it plausible that one statement contains another? Note the 'plausible' there; the state of the art isn't good enough for a simple yes/no or even a probability.
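As a flavour of the kind of measure involved, here is a minimal containment score in that spirit (an illustrative sketch, not NLTK's own implementation): the fraction of one sentence's word n-grams that also occur in the other.

```python
def ngrams(tokens, n):
    """The set of word n-grams of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def containment(a, b, n=2):
    """Fraction of sentence a's n-grams that also appear in sentence b."""
    ga, gb = ngrams(a.split(), n), ngrams(b.split(), n)
    if not ga:
        return 0.0
    return len(ga & gb) / len(ga)
```

A score of 1.0 means every n-gram of `a` appears in `b`; treat intermediate values only as evidence, not as a probability, for the reason given above.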