termfreq for a phrase

Posted on 2024-12-29 07:58:26

I'm using the Solr 4.x termfreq function in the following example to find "autozero amplifiers" in the field CONTENTS.

http://localhost:8080/solr/select/?fl=contents,documentPageId,termfreq%28contents,%27autozero%20amplifiers%27%29&defType=func&q=termfreq%28contents,%27autozero%20amplifiers%27%29&fq=documentId%3A49667

I am getting a frequency of zero for the following paragraph, which contains the phrase "autozero amplifiers".

What do I have to change in either solrconfig.xml or schema.xml in order to use termfreq on a phrase, not just on a single word such as "amplifier"?


Comments (2)

Spring初心 2025-01-05 07:58:26

Unless you make Lucene treat "autozero amplifiers" as a single term, you can't use term vectors to get what you are looking for. You could index with KeywordTokenizerFactory, which doesn't actually tokenize the input; it preserves the entire stream of text as one token. But if, for instance, the field you are interested in contains the following text,

 "The quick brown fox jumps over the lazy dog"

how do you define your term boundaries?

 The quick
 The quick brown
 quick brown
 quick brown fox jumps
 over the lazy dog
 .....

The number of candidate phrases grows quickly for a single field value: quadratically in the number of words, even if you count only contiguous phrases. Since I have been answering some of your questions related to term vectors leading up to this one, my guess is that you are trying to bend Solr/Lucene into counting a word or a set of words in a large document. You could consider integrating Solr with Hadoop and letting Hadoop do all the counting for you. Heck, every Hadoop example talks about word count and line count. Solr + Hadoop = Big Data Love. Or perhaps you could do it in your own app layer.
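To make the term-boundary problem concrete, here is a minimal Python sketch that enumerates every contiguous phrase in the example sentence. The function name `contiguous_phrases` is made up for illustration and is not part of Solr or Lucene.

```python
def contiguous_phrases(text):
    """Return every contiguous run of one or more words in text."""
    words = text.split()
    return [
        " ".join(words[start:end])
        for start in range(len(words))
        for end in range(start + 1, len(words) + 1)
    ]


phrases = contiguous_phrases("The quick brown fox jumps over the lazy dog")
# 9 words give 9 * 10 / 2 = 45 candidate phrases for this one short sentence.
print(len(phrases))
```

Indexing each of these as its own term for every field value is what makes the "one term per phrase" approach impractical for large documents.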

I don't have much information on your application's data volume, requirements, goals, etc., so this is a suggestion at best.
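As a rough illustration of the "do it in your own app layer" option, the sketch below counts phrase occurrences in text that has already been fetched from Solr (for example, the stored value of the contents field). The helper `phrase_frequency` and the sample text are hypothetical, and the matching rules (case-insensitive, whole words, any whitespace between words) are assumptions, not what Solr's analyzers would do.

```python
import re


def phrase_frequency(text, phrase):
    """Count case-insensitive, whole-word occurrences of phrase in text."""
    words = [re.escape(word) for word in phrase.split()]
    if not words:
        return 0
    # Allow any run of whitespace between the words of the phrase.
    pattern = r"\b" + r"\s+".join(words) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


# Hypothetical stand-in for a stored field value returned by Solr.
sample = "Autozero amplifiers reduce offset. Some autozero  amplifiers also chop."
print(phrase_frequency(sample, "autozero amplifiers"))
```

This only looks at the documents you retrieve, so it suits the filtered query in the question (one documentId) better than a scan over the whole index.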

何以畏孤独 2025-01-05 07:58:26

You may try the following trick:

  1. Call termfreq() on each of the two words individually and combine the results with sum() to get the count.

  2. Further, you may use if() to check your values.

Hope this fits your requirement.
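The two steps above could be sketched as a function query built in Python. The host, field names, and filter are taken from the question's URL; the helper names are made up for illustration, and nesting if() so that the sum is returned only when both words occur is one possible reading of step 2, not necessarily what the answerer had in mind. Note that this adds up the counts of the individual words and does not check that they are adjacent, so it is not a true phrase frequency.

```python
from urllib.parse import urlencode


def guarded_termfreq_sum(field, terms):
    """Build sum(termfreq(...),...) wrapped in if() so it is 0 unless every term occurs."""
    calls = [f"termfreq({field},'{term}')" for term in terms]
    expr = f"sum({','.join(calls)})"
    # Wrap from the inside out: if(tf1,if(tf2,sum(...),0),0)
    for call in reversed(calls):
        expr = f"if({call},{expr},0)"
    return expr


def build_url(base, func, doc_filter):
    """Assemble a request in the same shape as the question's URL."""
    params = {
        "defType": "func",
        "q": func,
        "fl": f"contents,documentPageId,{func}",
        "fq": doc_filter,
    }
    return f"{base}?{urlencode(params)}"


func = guarded_termfreq_sum("contents", ["autozero", "amplifiers"])
url = build_url("http://localhost:8080/solr/select/", func, "documentId:49667")
print(url)
```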
