How to extract semantic relatedness from a text corpus
The goal is to assess semantic relatedness between terms in a large text corpus, e.g. 'police' and 'crime' should have a stronger semantic relatedness than 'police' and 'mountain' as they tend to co-occur in the same context.
The simplest approach I've read about consists of extracting TF-IDF (term frequency-inverse document frequency) information from the corpus.
A lot of people use Latent Semantic Analysis to find semantic correlations.
I've come across the Lucene search engine: http://lucene.apache.org/
Do you think it is suitable for extracting TF-IDF?
What would you recommend to do what I'm trying to do, both in terms of technique and software tools (with a preference for Java)?
Thanks in advance!
Mulone
Comments (2)
Yes, Lucene provides TF-IDF data. The Carrot^2 algorithm is an example of a semantic extraction program built on Lucene. I mention it because, as a first step, they create a correlation matrix. Of course, you can probably build this matrix yourself quite easily.
If you deal with a ton of data, you may want to use Mahout for the harder linear algebra parts.
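To give an idea of what "building the matrix yourself" could look like, here is a minimal sketch in plain Java. It assumes the corpus is already tokenized and small enough to fit in memory, and it uses raw term-document counts with Pearson correlation between term rows. That is an assumption made for illustration, not a description of what Carrot^2 actually does, and `TermCorrelation` and its methods are hypothetical names, not part of Lucene or Mahout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper for illustration; not part of Lucene, Carrot^2, or Mahout.
class TermCorrelation {

    // Sorted list of every distinct term in the corpus.
    static List<String> vocabulary(List<List<String>> docs) {
        TreeSet<String> terms = new TreeSet<>();
        for (List<String> doc : docs) {
            terms.addAll(doc);
        }
        return new ArrayList<>(terms);
    }

    // matrix[t][d] = how often term t occurs in document d.
    static double[][] counts(List<List<String>> docs, List<String> vocab) {
        double[][] matrix = new double[vocab.size()][docs.size()];
        for (int t = 0; t < vocab.size(); t++) {
            for (int d = 0; d < docs.size(); d++) {
                matrix[t][d] = Collections.frequency(docs.get(d), vocab.get(t));
            }
        }
        return matrix;
    }

    // Pearson correlation of two equally long rows; 0 if either row is constant.
    static double pearson(double[] a, double[] b) {
        double meanA = Arrays.stream(a).average().orElse(0.0);
        double meanB = Arrays.stream(b).average().orElse(0.0);
        double cov = 0.0, varA = 0.0, varB = 0.0;
        for (int i = 0; i < a.length; i++) {
            double da = a[i] - meanA;
            double db = b[i] - meanB;
            cov += da * db;
            varA += da * da;
            varB += db * db;
        }
        if (varA == 0.0 || varB == 0.0) {
            return 0.0;
        }
        return cov / Math.sqrt(varA * varB);
    }

    // Term-by-term matrix: entry [i][j] is the correlation of term rows i and j.
    static double[][] correlationMatrix(double[][] matrix) {
        int n = matrix.length;
        double[][] corr = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                corr[i][j] = pearson(matrix[i], matrix[j]);
            }
        }
        return corr;
    }

    public static void main(String[] args) {
        // Toy corpus, invented for this example.
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("police", "crime", "arrest"),
                Arrays.asList("police", "crime"),
                Arrays.asList("mountain", "hiking"),
                Arrays.asList("police", "mountain"));
        List<String> vocab = vocabulary(docs);
        double[][] corr = correlationMatrix(counts(docs, vocab));
        int police = vocab.indexOf("police");
        System.out.println("police/crime:    " + corr[police][vocab.indexOf("crime")]);
        System.out.println("police/mountain: " + corr[police][vocab.indexOf("mountain")]);
    }
}
```

The nested loops are quadratic in the vocabulary size, which is fine for a toy corpus but is exactly the part that would need a proper linear algebra library on real data.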
It is very easy if you have a Lucene index. For example, to get a correlation you can use the simple formula count(term1 AND term2) / (count(term1) * count(term2)), where count is the number of hits in your search results. Moreover, you can easily calculate other semantic metrics such as chi^2 or information gain. All you need is to take the formula and express it in terms of count from Query.
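As a rough sketch of that formula, the snippet below computes count(term1 AND term2) / (count(term1) * count(term2)) over a small in-memory corpus. Note the assumptions: `CountRelatedness` is a hypothetical class, and its `count` method only stands in for the hit count that a Lucene query would return; no Lucene API is used here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for an index: each document is a set of terms.
class CountRelatedness {
    private final List<Set<String>> docs;

    CountRelatedness(List<Set<String>> docs) {
        this.docs = docs;
    }

    // Number of documents containing ALL given terms
    // (plays the role of the hit count of an AND query).
    long count(String... terms) {
        long hits = 0;
        for (Set<String> doc : docs) {
            if (doc.containsAll(Arrays.asList(terms))) {
                hits++;
            }
        }
        return hits;
    }

    // count(a AND b) / (count(a) * count(b)); 0 if either term never occurs.
    double score(String a, String b) {
        long countA = count(a);
        long countB = count(b);
        if (countA == 0 || countB == 0) {
            return 0.0;
        }
        return (double) count(a, b) / ((double) countA * countB);
    }

    public static void main(String[] args) {
        // Toy corpus, invented for this example.
        List<Set<String>> docs = Arrays.asList(
                new HashSet<>(Arrays.asList("police", "crime", "arrest")),
                new HashSet<>(Arrays.asList("police", "crime", "report")),
                new HashSet<>(Arrays.asList("mountain", "hiking", "trail")),
                new HashSet<>(Arrays.asList("police", "mountain", "rescue")),
                new HashSet<>(Arrays.asList("crime", "statistics")));
        CountRelatedness index = new CountRelatedness(docs);
        System.out.println("police/crime:    " + index.score("police", "crime"));
        System.out.println("police/mountain: " + index.score("police", "mountain"));
    }
}
```

The score is not normalized by corpus size, so values are only comparable within the same corpus. Chi^2 or information gain can be written in the same style, since they need nothing beyond these counts and the total number of documents.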