termfreq for a phrase

Posted on 2024-12-29 07:58:26

I am using the SOLR 4.x termfreq function in the following example to look for "autozero amplifiers" in the field CONTENTS.

http://localhost:8080/solr/select/?fl=contents,documentPageId,termfreq%28contents,%27autozero%20amplifiers%27%29&defType=func&q=termfreq%28contents,%27autozero%20amplifiers%27%29&fq=documentId%3A49667

I get a frequency of zero for the following paragraph, which contains the phrase "autozero amplifiers".

What do I have to do in solrconfig.xml or schema.xml to be able to use termfreq on a phrase rather than just a single word such as "amplifier"?

Spring初心 2025-01-05 07:58:26

Unless you make Lucene treat "autozero amplifiers" as a single term, you can't use term vectors to get what you are looking for. You could index with KeywordTokenizerFactory, which doesn't actually tokenize the words; it preserves the entire stream of text as one token. But if, for instance, the field you are interested in contains the following text,

 "The quick brown fox jumps over the lazy dog"

how do you define your term boundaries?

 The quick
 The quick brown
 quick brown
 quick brown fox jumps
 over the lazy dog
 .....
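To make the boundary problem concrete, here is a minimal sketch that enumerates every contiguous word sequence in the sample sentence. It assumes plain whitespace splitting, which is not how a Solr analyzer tokenizes, and the function name `contiguous_phrases` is made up for illustration. For n words there are n(n+1)/2 contiguous phrases, and far more if non-adjacent words may be combined.

```python
def contiguous_phrases(text):
    """Return every contiguous word sequence (candidate phrase) in text.

    Assumption: words are separated by whitespace; a real Solr analyzer
    would also lowercase, strip punctuation, and so on.
    """
    words = text.split()
    return [
        " ".join(words[start:end])
        for start in range(len(words))
        for end in range(start + 1, len(words) + 1)
    ]


sentence = "The quick brown fox jumps over the lazy dog"
phrases = contiguous_phrases(sentence)

# 9 words give 9 * 10 / 2 = 45 candidate phrases for this one short value.
print(len(phrases))
```

Indexing each of those candidates as its own term is what a single-token approach would have to anticipate, which is why it does not scale to long field values.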

The combinations grow very quickly, even for a single field value. Since I have been answering some of your questions related to term vectors leading up to this one, my guess is that you are trying to bend Solr/Lucene into counting a word or set of words in a large document. You could consider integrating Solr with Hadoop and letting Hadoop do all the counting for you. Heck, every Hadoop example talks about word count and line count. Solr + Hadoop = Big Data Love. Or perhaps you could do it in your own app layer.
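As a rough idea of what the app-layer option might look like, the sketch below counts phrase occurrences in text that has already been fetched from Solr. Everything here is an assumption for illustration: the helper names `tokenize` and `phrase_freq`, the sample text, and the simple lowercase word-character tokenization, which will not match whatever analyzer the `contents` field actually uses.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into runs of word characters."""
    return re.findall(r"\w+", text.lower())


def phrase_freq(text, phrase):
    """Count how often the words of `phrase` appear consecutively in `text`."""
    words = tokenize(text)
    target = tokenize(phrase)
    if not target:
        return 0
    n = len(target)
    # Slide a window of n words over the text and compare to the phrase.
    return sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == target)


# Hypothetical sample text standing in for a stored `contents` value.
sample = (
    "Autozero amplifiers reduce offset. Most autozero amplifiers sample "
    "the error; an amplifier alone does not."
)

print(phrase_freq(sample, "autozero amplifiers"))
print(phrase_freq(sample, "amplifier"))
```

Note that the match is exact on tokens, so "amplifier" does not match "amplifiers"; any stemming done at index time would have to be reproduced here for the counts to line up with Solr's.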

I don't have much info on your application's data volume, requirement goals, etc., so this is a suggestion at best.
