What is the simplest way to implement term association mining in Solr?
Association mining seems to give good results for retrieving related terms in text corpora. There are several works on this topic, including the well-known LSA method. The most straightforward way to mine associations is to build a docs × terms co-occurrence matrix and find the terms that most often occur in the same documents. In my previous projects I implemented this directly in Lucene by iterating over TermDocs (obtained by calling IndexReader.termDocs(Term)). But I can't see anything similar in Solr.
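For reference, a minimal sketch of that Lucene-side approach, assuming the Lucene 3.x API (where IndexReader.termDocs(Term) exists) and a hypothetical index path, field name, and term:

```java
import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.store.FSDirectory;

public class TermDocsExample {
    public static void main(String[] args) throws Exception {
        // Open an existing index (path is hypothetical)
        IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/index")));
        try {
            // Iterate over all documents containing the term "solr" in field "text"
            TermDocs termDocs = reader.termDocs(new Term("text", "solr"));
            while (termDocs.next()) {
                int docId = termDocs.doc();   // internal Lucene document number
                int freq  = termDocs.freq();  // occurrences of the term in this document
                System.out.println("doc=" + docId + " freq=" + freq);
            }
            termDocs.close();
        } finally {
            reader.close();
        }
    }
}
```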
So, my needs are:
- To retrieve the most associated terms within a particular field.
- To retrieve the term that is closest to a specified one within a particular field.
I will rate answers in the following way:
- Ideally, I would like to find a Solr component that directly covers these needs, i.e. something that returns associated terms directly.
- If this is not possible, I'm looking for a way to get co-occurrence matrix information for a specified field.
- If this is not an option either, I would like to know the most straightforward way to 1) get all terms and 2) get the ids (numbers) of the documents these terms occur in.
Comments (3)
You can export a Lucene (or Solr) index to Mahout, and then use Latent Dirichlet Allocation. If LDA is not close enough to LSA for your needs, you can just take the correlation matrix from Mahout and then use Mahout to compute the singular value decomposition.
I don't know of any LSA components for Solr.
Since there are still no answers to my question, I have to write down my own thoughts and accept them. Nevertheless, if someone proposes a better solution, I'll happily accept it instead of mine.
I'll go with the co-occurrence matrix, since it is the most essential part of association mining. In general, Solr provides all the functions needed to build this matrix in some way, though they are not as efficient as direct access via Lucene. To construct the matrix we need:
- all the terms of the field in question, and
- for each term, the ids of the documents it occurs in.
Both of these tasks can easily be done with standard Solr components.
To retrieve the terms, the TermsComponent or faceted search may be used. We can get either only the top terms (the default) or all terms (by setting the maximum number of terms to return; see the documentation of the particular feature for details), as in the sketch below.
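A minimal SolrJ sketch of the terms step, assuming the standard /terms request handler is enabled, a hypothetical field name and Solr URL, and a SolrJ version that provides HttpSolrClient.Builder:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.TermsResponse;

public class TermsExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr =
            new HttpSolrClient.Builder("http://localhost:8983/solr/collection1").build();
        try {
            SolrQuery query = new SolrQuery();
            query.setRequestHandler("/terms"); // use the TermsComponent handler
            query.setTerms(true);
            query.addTermsField("text");       // hypothetical field name
            query.setTermsLimit(-1);           // -1 = return all terms
            QueryResponse response = solr.query(query);
            for (TermsResponse.Term term : response.getTermsResponse().getTerms("text")) {
                System.out.println(term.getTerm() + " -> docFreq=" + term.getFrequency());
            }
        } finally {
            solr.close();
        }
    }
}
```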
Getting the documents that contain a given term is simply a search for that term (see the sketch below). The weak point here is that we need one request per term, and there may be thousands of terms. Another weak point is that neither simple nor faceted search provides information about the number of occurrences of the current term in a found document.
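A sketch of the per-term document lookup with SolrJ, assuming a hypothetical field and a unique-key field named id; each call corresponds to the one-request-per-term weakness mentioned above:

```java
import java.util.HashSet;
import java.util.Set;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.util.ClientUtils;
import org.apache.solr.common.SolrDocument;

public class TermToDocs {
    /** Returns the ids of all documents containing the given term in the given field. */
    static Set<String> docsForTerm(SolrClient solr, String field, String term) throws Exception {
        SolrQuery query = new SolrQuery(field + ":" + ClientUtils.escapeQueryChars(term));
        query.setFields("id");            // only the unique key is needed
        query.setRows(Integer.MAX_VALUE); // fine for a sketch; use paging in practice
        Set<String> ids = new HashSet<>();
        for (SolrDocument doc : solr.query(query).getResults()) {
            ids.add((String) doc.getFieldValue("id"));
        }
        return ids;
    }
}
```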
Having this, it is easy to build the co-occurrence matrix (see the sketch below). To mine associations, it is possible to use other software such as Weka, or to write your own implementation of, say, the Apriori algorithm.
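Given such a term-to-documents mapping, the co-occurrence counts themselves are plain Java; a minimal sketch that counts, for each unordered pair of terms, the number of documents they share:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class CooccurrenceMatrix {
    /** term pair (encoded as "a|b" with a < b) -> number of documents containing both terms */
    static Map<String, Integer> build(Map<String, Set<String>> docsByTerm) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Set<String>> a : docsByTerm.entrySet()) {
            for (Map.Entry<String, Set<String>> b : docsByTerm.entrySet()) {
                if (a.getKey().compareTo(b.getKey()) >= 0) {
                    continue; // count each unordered pair only once
                }
                int shared = 0;
                for (String docId : a.getValue()) {
                    if (b.getValue().contains(docId)) {
                        shared++;
                    }
                }
                if (shared > 0) {
                    counts.put(a.getKey() + "|" + b.getKey(), shared);
                }
            }
        }
        return counts;
    }
}
```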
You can get the count of occurrences of the current term in each found document directly in the query.
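A minimal SolrJ sketch of one way to do this, assuming Solr 4.0 or later (where the termfreq() function query and fl field aliasing are available) and a hypothetical field and term:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.common.SolrDocument;

public class TermFreqExample {
    static void printTermFrequencies(SolrClient solr) throws Exception {
        SolrQuery query = new SolrQuery("text:solr");         // hypothetical field and term
        query.setFields("id", "freq:termfreq(text,'solr')");  // termfreq() as an aliased pseudo-field
        for (SolrDocument doc : solr.query(query).getResults()) {
            System.out.println(doc.getFieldValue("id") + " -> " + doc.getFieldValue("freq"));
        }
    }
}
```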