Algorithm for detecting and comparing phrases
I have a couple of non-English texts. I would like to perform stylistic comparisons on them.
One method of comparing style is to look for similar phrases. If I find "fishing, skiing and hiking" a couple of times in one book and "fishing, hiking and skiing" in another, the similarity in style points to a single author. I also need to be able to find "fishing and even skiing or hiking", though. Ideally I would also find "angling, hiking and skiing", but because these are non-English texts (Koine Greek), synonyms are harder to allow for and this aspect is not vital.
What is the best way to (1) go about detecting these sorts of phrases and then (2) searching for them in a way that is not overly rigid in other texts (so as to find "fishing and even skiing or hiking")?
3 Answers
Technical details:
For the vocabulary, there are several ways to build a good one; unfortunately, I can't remember their names. One of them consists of deleting words that occur often and everywhere. Conversely, you should keep rare words that appear in only a few texts. However, there is no use in keeping words that appear in exactly one text.
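A minimal sketch of that document-frequency filter (the thresholds are made-up values; with only a couple of books you would treat chapters or chunks as the "texts"):

```python
from collections import Counter

def build_vocabulary(texts, min_texts=2, max_fraction=0.9):
    """Keep words that are rare across the corpus: drop words present
    'often and everywhere' and words present in exactly one text.
    texts: a list of token lists (e.g. one per chapter or chunk)."""
    doc_freq = Counter()
    for tokens in texts:
        doc_freq.update(set(tokens))  # count each word once per text
    n_texts = len(texts)
    return {word for word, df in doc_freq.items()
            if df >= min_texts                  # not in exactly one text
            and df / n_texts <= max_fraction}   # not everywhere
```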
For the adjacency matrix, adjacency is measured by counting how far apart the words under consideration are (i.e. the number of words separating them). For example, let's use your very text =)
These are entirely made-up values:
A(method, comparing) += 1.0
A(method, similarity) += 0.5
A(method, Greek) += 0.0
You mainly need a "typical distance": you can say, for example, that after 20 separating words, two words can no longer be considered adjacent.
After a bit of normalization, just compute the L2 distance between the adjacency matrices of the two texts to see how close they are (a sketch of this pipeline follows the examples below). You can do fancier stuff afterwards, but this should yield acceptable results. Now, if you have synonyms, you can update the adjacency in a nice way. For example, if the input contains "beautiful maiden", then
A(beautiful, maiden) += 1.0
A(magnificent, maiden) += 0.9
A(fair, maiden) += 0.8
A(sublime, maiden) += 0.8
...
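Here is a minimal sketch of that pipeline. The linear decay of the weight with distance is an assumption (the values above are explicitly made up), and the typical distance of 20 comes from the answer:

```python
import math
from collections import defaultdict

def adjacency_matrix(tokens, vocab, max_dist=20):
    """Sparse adjacency 'matrix': A[(w1, w2)] accumulates a weight that
    decays linearly from 1.0 (neighbouring words) to 0.0 at max_dist,
    the 'typical distance' above."""
    A = defaultdict(float)
    positions = [(i, w) for i, w in enumerate(tokens) if w in vocab]
    for a, (i, w1) in enumerate(positions):
        for b in range(a + 1, len(positions)):
            j, w2 = positions[b]
            if j - i >= max_dist:
                break  # positions are in order, so later pairs are farther
            A[tuple(sorted((w1, w2)))] += 1.0 - (j - i) / max_dist
    return A

def normalize(A):
    """Scale the matrix to unit L2 norm ('a bit of normalization')."""
    norm = math.sqrt(sum(v * v for v in A.values())) or 1.0
    return {k: v / norm for k, v in A.items()}

def l2_distance(A, B):
    """L2 distance between two sparse adjacency matrices."""
    keys = set(A) | set(B)
    return math.sqrt(sum((A.get(k, 0.0) - B.get(k, 0.0)) ** 2 for k in keys))
```

Comparing two texts is then `l2_distance(normalize(adjacency_matrix(t1, vocab)), normalize(adjacency_matrix(t2, vocab)))`; the synonym update shown above would simply add a second, discounted increment for each synonym pair.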
You should probably use some string similarity measure such as Jaccard, Dice or cosine similarity. You could try these either on words, on (word- or character-level) n-grams, or on lemmas. (For a highly inflected language such as Koine Greek, I would suggest using lemmas if you have a good lemmatizer for it.)
Catching synonyms is hard unless you have something like WordNet, which maps synonyms together.
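For instance, a small sketch of Jaccard and Dice over word n-grams (passing a string instead of a token list gives character n-grams, and lemmas could be fed in place of raw tokens for Koine Greek):

```python
def ngrams(tokens, n=2):
    """Word-level n-grams as a set; pass a string instead of a token
    list to get character-level n-grams."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

# The two phrase variants from the question:
p1 = "fishing skiing and hiking".split()
p2 = "fishing and even skiing or hiking".split()
print(jaccard(ngrams(p1, 1), ngrams(p2, 1)))  # high unigram overlap (~0.67)
print(jaccard(ngrams(p1, 2), ngrams(p2, 2)))  # no shared bigrams (0.0)
```

The gap between the unigram and bigram scores illustrates the trade-off: longer n-grams are stricter about word order, so a loose variant like "fishing and even skiing or hiking" only surfaces at the word level.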
I would follow two guidelines: "angling" is a very close match for "fishing", and, as a self-learning AI, I would use (at least for a start) a neural network. There is an easy and fully working example (in Python) that can be found here, targeting precisely "data mining". You might wish to implement it in some other language, of course.
About your two specific questions:
Other answers to your question have gone into detail about this (and their authors seem to know way more than I do on the subject!), but again: I would start easy and simply use a neural network that tells you how close two terms are. Then I would proceed with "waves" of optimisation (for example, if it were an English text, using only the root of each word; or tweaking the score according to other metadata of the text, like year, author, or geographical origin; or changing the matching algorithm altogether...) until you are satisfied with the outcome.
I would say this is equivalent to asking the AI to return all phrases whose "proximity score" is over a given threshold; a stand-in sketch of that filtering follows.
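A minimal sketch of the thresholding step. The `proximity` function here uses `difflib.SequenceMatcher` from the standard library purely as a stand-in score; a trained network, or any of the similarity measures from the other answers, could be swapped in:

```python
import re
from difflib import SequenceMatcher

def proximity(phrase_a, phrase_b):
    """Stand-in 'proximity score' in [0, 1] over normalized tokens."""
    tokenize = lambda p: re.sub(r"[^\w\s]", "", p).lower().split()
    return SequenceMatcher(None, tokenize(phrase_a), tokenize(phrase_b)).ratio()

def matches_over_threshold(phrases_a, phrases_b, threshold=0.5):
    """All cross-text phrase pairs whose score clears the threshold."""
    return [(a, b, score)
            for a in phrases_a
            for b in phrases_b
            if (score := proximity(a, b)) >= threshold]

# Prints the pair from the question together with its score:
print(matches_over_threshold(
    ["fishing, skiing and hiking"],
    ["fishing and even skiing or hiking"]))
```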
HTH!