Apache Solr topTerms (LukeRequestHandler) not giving the correct token count

Posted on 2024-12-14 20:25:35

I am using the Solr 4 trunk build, a couple of days old.

According to the Wiki page for the LukeRequestHandler (first example output), we're supposed to get a count of the tokens for each or any specified field. I want to use this to count how many times each word appears across all my documents. For example, if the word 'is' appears in two MS Word documents, twice in the first and three times in the second, I would expect output like this:

<lst name="text">
  <str name="type">text</str>
  <str name="schema">IT-M---------</str>
  <str name="index">(unstored field)</str>
  <int name="docs">2</int>
  <int name="distinct">42</int>
  <lst name="topTerms">
    <int name="is">5</int>

That's because the word "is" occurs a total of five times across the two documents. However, what I actually get is <int name="is">2</int>. I presume this is because the count is per document: the word occurs in two distinct documents.

But again, according to the Wiki, we're supposed to get a total count, summed across all the documents, which is what I actually want.
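
To make the two numbers concrete, here is a minimal Python sketch of the difference between a per-document count (document frequency) and a count summed over all occurrences. It does not call Solr; the `docs` list is a hypothetical stand-in for the two Word documents in the example, tokenized naively on whitespace.

```python
from collections import Counter

# Hypothetical stand-ins for the two documents: "is" appears
# twice in the first and three times in the second.
docs = ["is is", "is is is"]

doc_freq = Counter()    # number of documents containing each term
total_freq = Counter()  # number of occurrences summed over all documents

for text in docs:
    tokens = text.split()          # naive whitespace tokenization
    total_freq.update(tokens)      # every occurrence counts
    doc_freq.update(set(tokens))   # each document counts at most once per term

print(doc_freq["is"], total_freq["is"])
```

The first value matches the 2 that is actually returned; the second is the 5 I am after.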


How can I get the total number of times each word appears across all indexed documents?


Reference:

http://wiki.apache.org/solr/LukeRequestHandler

Comments (1)

是伱的 2024-12-21 20:25:35

Doc frequencies returned by TermsComponent are the number of unique documents that match the term, including any documents that have been marked for deletion but not yet removed from the index.

TermVectorComponent provides the information about documents that is stored when setting the termVector attribute on a field.
TVC can return the term vector, the term frequency, inverse document frequency, and position and offset information.

tv.tf - Return document term frequency info per term in the document.

<lst name="termVectors">
  <lst name="doc-5">
    <str name="uniqueKey">MA147LL/A</str>
    <lst name="includes">
      <lst name="cable">
        <int name="tf">1</int>
      </lst>
      <lst name="earbud">
        <int name="tf">5</int>
      </lst>
      <lst name="headphones">
        <int name="tf">1</int>
      </lst>
      <lst name="usb">
        <int name="tf">1</int>
      </lst>
    </lst>
  </lst>
  ...............
</lst>
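
One way to turn the per-document `tf` values into index-wide totals is to sum them on the client. The following is a rough sketch only, assuming the response has the same shape as the fragment above and that the field was indexed with term vectors enabled. The function name `total_term_counts`, the second document (`doc-9`, `HYPOTHETICAL-2`), and its counts are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Sample shaped like the termVectors fragment above.
# The second document and its counts are hypothetical.
SAMPLE = """<lst name="termVectors">
  <lst name="doc-5">
    <str name="uniqueKey">MA147LL/A</str>
    <lst name="includes">
      <lst name="cable"><int name="tf">1</int></lst>
      <lst name="earbud"><int name="tf">5</int></lst>
    </lst>
  </lst>
  <lst name="doc-9">
    <str name="uniqueKey">HYPOTHETICAL-2</str>
    <lst name="includes">
      <lst name="earbud"><int name="tf">2</int></lst>
      <lst name="usb"><int name="tf">1</int></lst>
    </lst>
  </lst>
</lst>"""


def total_term_counts(xml_text, field):
    """Sum the per-document tf values of `field` over every document."""
    root = ET.fromstring(xml_text)          # the <lst name="termVectors"> element
    totals = Counter()
    for doc in root.findall("lst"):         # one <lst> per document
        for fld in doc.findall("lst"):      # one <lst> per field in that document
            if fld.get("name") != field:
                continue
            for term in fld.findall("lst"):  # one <lst> per term
                tf = term.find("int[@name='tf']")
                if tf is not None:
                    totals[term.get("name")] += int(tf.text)
    return totals


totals = total_term_counts(SAMPLE, "includes")
print(totals.most_common())
```

Note that the response only covers the documents returned by the query, so to cover the whole index you would have to page through all results and add up the counters from each page.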