Search engines' inexact result counts ("about xxx results")
When you search in Google (I'm almost sure that AltaVista did the same thing), it says "Results 1-10 of about xxxx"...
This has always amazed me... What does "about" mean?
How can they count roughly?
I do understand why they can't come up with a precise figure in a reasonable time, but how do they even arrive at this "approximate" one?
I'm sure there's a lot of theory behind this that I've missed...
Comments (5)
Most likely it's similar to the estimated row counts that most SQL systems use in query planning: the number of rows in the table (known exactly as of the last time statistics were collected, but generally not up to date), multiplied by an estimated selectivity (usually based on a statistical distribution model calculated by sampling a small subset of rows).
The PostgreSQL manual has a section on statistics used by the planner that is fairly informative, at least if you follow the links out to pg_stats and various other sections. I'm sure that doesn't really describe what Google does, but it at least shows one model where you could get the first N rows and an estimate of how many more there might be.
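As a rough illustration of the "row count times selectivity" idea, here is a minimal Python sketch. It is not PostgreSQL's planner code and certainly not what Google runs; the function name, the per-term selectivities, and the assumption that conditions are independent (so their selectivities multiply) are all made up for the example.

```python
from math import prod


def estimate_matching_rows(total_rows: int, selectivities: list[float]) -> int:
    """Planner-style estimate: a (possibly stale) row count times selectivity."""
    # Assumption: the conditions are statistically independent, so the
    # individual selectivities can simply be multiplied together.
    combined = prod(selectivities)
    return round(total_rows * combined)


# Hypothetical statistics, as if gathered earlier from a sample of the table.
total_rows_at_last_stats_run = 1_000_000
selectivity_by_term = {"search": 0.10, "engine": 0.05}

estimate = estimate_matching_rows(
    total_rows_at_last_stats_run, list(selectivity_by_term.values())
)
print(f"about {estimate:,} results")
```

The estimate costs a couple of multiplications no matter how large the table is, which is why a planner can afford to compute it before running the query at all.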
Not relevant to your question, but this reminds me of a little joke a friend of mine made when doing a simple ego-search (and don't tell me you've never Googled your name). He said something like:
"Wow, about 5,000 results in just 0.22 seconds! Now, imagine how many results this is in one minute, one hour, one day!"
I imagine the estimate is based on statistics. They aren't going to count every matching page, so what they would do (or at least what I would do) is work out roughly what percentage of pages match the query, based on some heuristic, and then use that as the basis for the count.
One heuristic might be a sample count: take a random sample of 1000 or so pages and see what percentage match. It wouldn't take too many pages in the sample to get a statistically significant answer.
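The sample-count heuristic can be sketched in a few lines of Python. This is only an illustration of the idea under simplifying assumptions (the whole corpus fits in a list, and `matches` is a hypothetical predicate standing in for "this page matches the query"); it does not describe any real search engine's implementation.

```python
import random
from typing import Callable, Optional, Sequence


def estimate_match_count(
    pages: Sequence[str],
    matches: Callable[[str], bool],
    sample_size: int = 1000,
    seed: Optional[int] = None,
) -> int:
    """Estimate how many pages match by testing only a random sample."""
    if not pages:
        return 0
    rng = random.Random(seed)
    # Draw the sample without replacement; never ask for more than exists.
    sample = rng.sample(pages, min(sample_size, len(pages)))
    hits = sum(1 for page in sample if matches(page))
    # Scale the observed match rate up to the size of the whole corpus.
    return round(len(pages) * hits / len(sample))


# Synthetic corpus: every 20th page mentions "zebra" (a 5% true match rate).
corpus = [
    f"page {i} about zebra" if i % 20 == 0 else f"page {i}"
    for i in range(100_000)
]
estimate = estimate_match_count(corpus, lambda page: "zebra" in page, seed=42)
print(f"about {estimate:,} results (true count: 5,000)")
```

With a 5% match rate and 1000 sampled pages, the estimate typically lands within several hundred of the true count while inspecting only 1% of the corpus.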
One thing that hasn't been mentioned yet is deduplication. Some search engines (I'm not sure exactly how Google in particular does it) use heuristics to decide whether two different URLs contain the same (or extremely similar) content and are thus duplicate results.
If there are 156 matching URLs, but 9 of those have been marked as duplicates of other results, it is simpler to say "about 150 results" than something like "156 results, which contain 147 unique results and 9 duplicates".
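To make the deduplication point concrete, here is a small Python sketch. The duplicate test is a deliberately crude assumption (pages count as duplicates only if their text is identical after lowercasing and collapsing whitespace); real engines use far more tolerant similarity heuristics, and all names here are hypothetical.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def count_unique(pages_by_url: dict[str, str]) -> int:
    """Count pages whose normalized content differs from all others."""
    return len({fingerprint(body) for body in pages_by_url.values()})


def round_to_significant(n: int, digits: int = 2) -> int:
    """Round a non-negative count to the given number of leading digits."""
    magnitude = len(str(n))
    factor = 10 ** max(magnitude - digits, 0)
    return round(n / factor) * factor


results = {
    "http://example.com/a": "Hello   World",
    "http://example.com/a?ref=1": "hello world",  # same content, other URL
    "http://example.com/b": "Something else",
}
print(count_unique(results), "unique pages out of", len(results))
print(f"about {round_to_significant(147)} results")
```

Reporting the rounded figure sidesteps the question of whether the 9 duplicates should count toward the total.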
An exact result count is not worth the overhead of calculating it accurately. Since there's not much added value in knowing there were 1,004,345 results rather than "about 1,000,000", it's more important from an end-user experience perspective to return the results faster than to spend additional time calculating the total.
From Google themselves:
"Google's calculation of the total number of search results is an estimate. We understand that a ballpark figure is valuable, and by providing an estimate rather than an exact account, we can return quality search results faster."