Long queries against very short documents

Posted 2025-01-03 06:29:25, 486 words, 3 views, 0 comments

I'm fresh out of the nursery as far as Lucene/Solr are concerned, so I may be trying to utilize it completely wrong, but I hope someone can point me in the right direction.

My documents (fewer than 3,000) are short statements from a taxonomy. All are single sentences, some no more than 4-6 words long. There is only one field per document, so searching across multiple fields is not a route I would be looking into. What I would like to do is query with the contents of a work-related document and have the relevant taxonomy statements returned.

Currently I am using the default example setup that came with Solr, with verb synonyms added from WordNet, since performed actions are what I am trying to identify (e.g. the taxonomy statement 'Alter garments to specifications').

Basic word matching works as expected, but I would like to make things somewhat more sophisticated. Because the queries are so long, I never end up with high relevancy scores when searching against the tiny documents. I'm sure this can be resolved by normalizing the scores in some fashion, so I am less concerned about the scores themselves than about which statements (documents) are actually being identified.

Would I be better off indexing the work documents (currently the long queries) on the fly, querying each taxonomy statement against them, and compiling/sorting the results? Or can I perform these long queries against the tiny documents effectively in some other fashion? I presume this may present its own difficulties.
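
To make the reversed idea concrete, here is a minimal sketch in plain Python (it does not use Solr or Lucene). It treats each taxonomy statement as the query and scores it by the fraction of its distinct terms that appear in the long work document, so the score stays between 0 and 1 however long that document is. The names `tokenize`, `coverage`, `rank_statements`, the stop-word list, and the 0.5 threshold are all assumptions made up for illustration. There is no stemming or synonym expansion here, which a real analyzer chain would supply.

```python
import re

# Hypothetical stop-word list; a real setup would reuse the analyzer's own list.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "for", "on", "with"}


def tokenize(text):
    """Lowercase, split on non-letters, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def coverage(statement, doc_terms):
    """Fraction of the statement's distinct terms that occur in the document."""
    terms = set(tokenize(statement))
    if not terms:
        return 0.0
    return len(terms & doc_terms) / len(terms)


def rank_statements(statements, document, threshold=0.5):
    """Return (score, statement) pairs at or above the threshold, best first."""
    doc_terms = set(tokenize(document))
    scored = [(coverage(s, doc_terms), s) for s in statements]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)


if __name__ == "__main__":
    statements = ["Alter garments to specifications", "Operate heavy machinery"]
    document = (
        "The tailor will alter garments according to customer "
        "specifications and press them."
    )
    for score, statement in rank_statements(statements, document):
        print(f"{score:.2f}  {statement}")
```

Because the score is normalized by the length of the short statement rather than the long document, a 5-word statement that is fully covered ranks above a partially covered one regardless of how much unrelated text the work document contains.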

Comments (1)

a√萤火虫的光℡ 2025-01-10 06:29:25

I don't see where you are going with this. Your index of short documents will definitely suffer from a lack of information, and a long query will make every result score almost the same against it. Even expanding the documents by adding WordNet synonyms for every term would, I think, be confusing and misleading. My advice is to check other possible forms of the query.
