ElasticSearch count discrepancy

Posted on 2025-01-10 16:05:24


I have an ElasticSearch (v7.4) cluster with 3 master nodes and 4 data nodes. When gathering statistics about the number of documents, I have come across a few apparent inconsistencies:

The number of documents as returned by GET https://<my_ip>/<my_index>/_count: 66717419 (24 shards)
The number of documents as returned by GET https://<my_ip>/<my_index>/_search?track_total_hits=true: 66717419 (same as above)
Now I checked the number of documents where the id starts with 0:

curl 'https://<my_ip>/<my_index>/_search?track_total_hits=true' -H 'content-type: application/json' -d '{
  "query": {
    "prefix": {
      "id": {
        "value": "0"
      }
    }
  },
  "track_total_hits": true, "size": 0
}'
{"took":5,"timed_out":false,"_shards":{"total":24,"successful":24,"skipped":0,"failed":0},"hits":{"total":{"value":57565,"relation":"eq"},"max_score":null,"hits":[]}}

Repeating the same query for all of [0-9a-f] also returns numbers between 57k and 58k for each of these letters/digits for hits.total.value (just as shown in the example query for 0 above).
Repeating the same query for any other letters returns 0 results (as expected).
These totals sum to roughly 912k documents (16 * 57k).
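
For anyone wanting to reproduce the per-prefix tally, here is a minimal Python sketch of the procedure described above. The helper names (`prefix_count_body`, `total_hits`, `count_by_prefix`) and the `search` callable are hypothetical, introduced only for illustration; `search` is assumed to send the given body to `https://<my_ip>/<my_index>/_search` and return the parsed JSON response.

```python
import string
from typing import Callable, Dict

# The 16 characters for which the prefix query returned hits.
HEX_PREFIXES = string.digits + "abcdef"


def prefix_count_body(prefix: str, field: str = "id") -> dict:
    """Build the same request body as the curl example, for one prefix."""
    return {
        "query": {"prefix": {field: {"value": prefix}}},
        "track_total_hits": True,
        "size": 0,
    }


def total_hits(response: dict) -> int:
    """Read hits.total.value from a parsed _search response."""
    return response["hits"]["total"]["value"]


def count_by_prefix(
    search: Callable[[dict], dict], prefixes: str = HEX_PREFIXES
) -> Dict[str, int]:
    """Run one prefix query per character and collect the totals.

    `search` is a caller-supplied (hypothetical) function that performs
    the HTTP request and returns the decoded JSON response.
    """
    return {p: total_hits(search(prefix_count_body(p))) for p in prefixes}
```

Summing `count_by_prefix(search).values()` gives the figure to compare against the `_count` result.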

So I see ~900k documents whose id starts with one of [0-9a-f], and none whose id starts with any other character. At the same time, ES reports a total of 66M documents in the index.
Where does the discrepancy come from? Can there be documents with no id? Does ES count deleted or updated documents somehow?

According to the ID Field documentation

Each document has an _id that uniquely identifies it

Could it be related to sharding? From the results shown above, however, it looks like each of my queries hits all 24 shards.

The documentation for the Count API and the Search API doesn't seem to indicate any peculiar behaviour in that regard. What else could explain these numbers?

Update:
The index statistics:

health status index      uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   <my_index> BUWfFDsBQAGcl64-J7gzHQ 24  1   66717419   23791236     1.6tb      873gb
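
To put `docs.count` and `docs.deleted` side by side, the statistics above can be split into named columns. This is only a sketch: `parse_cat_indices` is a hypothetical helper, and it assumes no column value contains whitespace. It does not by itself show whether deleted documents play any part in the discrepancy.

```python
def parse_cat_indices(text: str) -> list:
    """Split whitespace-separated index statistics into one dict per row.

    Assumes the first non-empty line is the header and that no value
    contains embedded whitespace.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


# The index statistics exactly as posted in the question.
CAT_OUTPUT = """
health status index            uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   <my_index> BUWfFDsBQAGcl64-J7gzHQ  24   1   66717419     23791236      1.6tb          873gb
"""

row = parse_cat_indices(CAT_OUTPUT)[0]
live = int(row["docs.count"])
deleted = int(row["docs.deleted"])
print(f"docs.count={live} docs.deleted={deleted} sum={live + deleted}")
```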
