ElasticSearch count discrepancy
I have an ElasticSearch (v7.4) cluster with 3 master nodes and 4 data nodes. When gathering statistics about the number of documents, I have come across a few apparent inconsistencies:
- The number of documents as returned by GET https://<my_ip>/<my_index>/_count: 66717419 (24 shards)
- The number of documents as returned by GET https://<my_ip>/<my_index>/_search?track_total_hits=true: 66717419 (same as above)
- Now I checked the number of documents where the id starts with 0:
curl 'https://<my_ip>/<my_index>/_search?track_total_hits=true' -H 'content-type: application/json' -d '{
  "query": {
    "prefix": {
      "id": {
        "value": "0"
      }
    }
  },
  "track_total_hits": true,
  "size": 0
}'
{"took":5,"timed_out":false,"_shards":{"total":24,"successful":24,"skipped":0,"failed":0},"hits":{"total":{"value":57565,"relation":"eq"},"max_score":null,"hits":[]}}
- Repeating the same query for all of [0-9a-f] also returns numbers between 57k and 58k for hits.total.value for each of these letters/digits (just as shown in the example query for 0 above).
- Repeating the same query for any other letters returns 0 results (as expected).
- These totals sum up to ~912k total documents (16*57k)
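The per-prefix checks above can be scripted rather than run by hand. The following is only a minimal sketch: the helper names (build_prefix_query, search_total, tally_prefixes) are made up for illustration, the request body mirrors the curl command shown earlier, and the cluster call is passed in as a function so the tally logic can be tried without a live cluster.

```python
import json
import urllib.request
from typing import Callable, Dict

# The sixteen leading characters that returned hits in the post.
HEX_PREFIXES = "0123456789abcdef"


def build_prefix_query(prefix: str, field: str = "id") -> dict:
    """Build the same request body as the curl example (prefix query, size 0)."""
    return {
        "query": {"prefix": {field: {"value": prefix}}},
        "track_total_hits": True,
        "size": 0,
    }


def search_total(base_url: str, index: str, body: dict) -> int:
    """POST a body to <base_url>/<index>/_search and return hits.total.value.

    base_url and index are placeholders, e.g. "https://<my_ip>" and "<my_index>".
    """
    request = urllib.request.Request(
        f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["hits"]["total"]["value"]


def tally_prefixes(
    count_for: Callable[[str], int], prefixes: str = HEX_PREFIXES
) -> Dict[str, int]:
    """Call count_for once per leading character and collect the counts."""
    return {prefix: count_for(prefix) for prefix in prefixes}
```

Against a real cluster, this could be called as tally_prefixes(lambda p: search_total("https://<my_ip>", "<my_index>", build_prefix_query(p))), and summing the returned values gives the figure to compare with the _count result.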
So I see ~900k documents that have an id starting with any of [0-9a-f], and 0 documents whose id starts with any other character. At the same time, ES reports a total of 66M documents in the index.
Where does the discrepancy come from? Can there be documents with no id? Does ES count deleted or updated documents somehow?
According to the ID Field documentation: "Each document has an _id that uniquely identifies it"
Could it be related to sharding? From the results shown above, however, it looks like each of my queries hits all 24 shards.
The documentation for the Count API and the Search API doesn't seem to indicate any peculiar behaviour in that regard. What else could explain these numbers?
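One way to test the "documents with no id" hypothesis is to count the documents in which the queried field is absent, and then see how much of the gap that accounts for. This is a sketch only. It assumes that the id used in the prefix query is a regular mapped field (which is not necessarily the same thing as the _id metadata field), and the helper names are hypothetical. The body uses the standard bool / must_not / exists query form and would be sent to the same _search endpoint as the prefix query.

```python
from typing import Iterable


def build_missing_field_query(field: str = "id") -> dict:
    """Request body matching documents that have no value for the given field."""
    return {
        "query": {"bool": {"must_not": [{"exists": {"field": field}}]}},
        "track_total_hits": True,
        "size": 0,
    }


def unexplained_gap(
    total_docs: int, prefix_counts: Iterable[int], missing_field_count: int = 0
) -> int:
    """Documents not accounted for by the prefix tallies or by a missing field."""
    return total_docs - sum(prefix_counts) - missing_field_count


# Figures from the post: 66717419 total, roughly 57k per hex prefix (assumed
# to be exactly 57000 here purely for illustration).
gap = unexplained_gap(66717419, [57000] * 16)
```

If the count returned for build_missing_field_query() comes close to this gap, most documents simply lack that field. If it is near zero, the difference has to come from somewhere else.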
Update:
The index statistics:
health status index      uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   <my_index> BUWfFDsBQAGcl64-J7gzHQ  24   1   66717419     23791236      1.6tb          873gb