Comparing the lexicons of English texts
Let's imagine we can build a statistics table of how often each word is used in some English text or book, and that we can gather such statistics for every text/book in a library.
What is the simplest way to compare these statistics with each other? How can we find groups/clusters of texts with statistically very similar lexicons?
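For concreteness, here is a minimal sketch of gathering such per-text word statistics in Python (the file names are hypothetical placeholders):

```python
import re
from collections import Counter

def word_counts(path):
    """Count how often each word occurs in one text file."""
    with open(path, encoding="utf-8") as f:
        words = re.findall(r"[a-z']+", f.read().lower())
    return Counter(words)

# One word-usage table per text/book in the (hypothetical) library.
library = {path: word_counts(path) for path in ["moby_dick.txt", "emma.txt"]}
```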
3 Answers
First, you'd need to normalize the lexicons (i.e., ensure that both lexicons share the same vocabulary).
Then you could use a similarity metric such as the Hellinger distance or cosine similarity to compare the two lexicons.
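A minimal sketch of both steps, assuming the per-text word counts are plain dicts/Counters as in the question, using only the standard library:

```python
import math

def normalize(a, b):
    """Project two word-count dicts onto a shared vocabulary and
    convert them to relative frequencies (probability distributions)."""
    vocab = sorted(set(a) | set(b))
    ta, tb = sum(a.values()), sum(b.values())
    p = [a.get(w, 0) / ta for w in vocab]
    q = [b.get(w, 0) / tb for w in vocab]
    return p, q

def hellinger(p, q):
    """Hellinger distance: 0 for identical distributions, 1 for disjoint ones."""
    return math.sqrt(0.5 * sum((math.sqrt(x) - math.sqrt(y)) ** 2
                               for x, y in zip(p, q)))

def cosine(p, q):
    """Cosine similarity: 1 for identical direction, 0 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(p, q))
    norm_p = math.sqrt(sum(x * x for x in p))
    norm_q = math.sqrt(sum(y * y for y in q))
    return dot / (norm_p * norm_q)
```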
It may also be a good idea to look into machine learning packages such as Weka.
This book is an excellent resource on machine learning that you may find useful.
I would start by seeing what Lucene (http://lucene.apache.org/java/docs/index.html) has to offer. After that, you will need to use a machine learning method; see http://en.wikipedia.org/wiki/Information_retrieval.
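Lucene itself is a Java search library, so no Lucene API is sketched here. As a hedged illustration of the information-retrieval idea the answer points at, here is a sketch (assuming scikit-learn >= 1.2, which the answer does not mention) that weights word statistics by TF-IDF and clusters the resulting term vectors:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the library's texts.
texts = [
    "the whale and the sea and the ship",
    "whale ship harpoon sea voyage",
    "a ball a dance a party in town",
]

# TF-IDF term vectors: the classic information-retrieval representation.
X = TfidfVectorizer().fit_transform(texts).toarray()

# Group texts whose lexicons point in similar directions;
# cosine affinity matches the metric suggested in the first answer.
labels = AgglomerativeClustering(
    n_clusters=2, metric="cosine", linkage="average"
).fit_predict(X)
print(labels)  # e.g. [0 0 1]: the two nautical texts cluster together
```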
You might consider the Kullback-Leibler distance. For reference, see page 18 of Cover and Thomas (Elements of Information Theory, Chapter 2).
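Concretely, the relative entropy defined there is D(p||q) = sum over x of p(x) * log(p(x)/q(x)). A minimal sketch, reusing the normalized word-frequency lists p and q from the first answer; the epsilon smoothing and the symmetrized variant are assumptions of this sketch, not from Cover and Thomas, since D(p||q) is asymmetric and becomes infinite wherever q(x) = 0 but p(x) > 0:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D(p||q) in bits, with epsilon smoothing so that words present
    in p but absent from q do not make the divergence infinite."""
    return sum(x * math.log2(x / max(y, eps)) for x, y in zip(p, q) if x > 0)

def symmetric_kl(p, q):
    """KL is not symmetric; D(p||q) + D(q||p) is a common symmetrized form."""
    return kl_divergence(p, q) + kl_divergence(q, p)
```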