How do search engines conduct an 'AND' operation?
Consider the following search results:
- Google for 'David' - 591 million hits in 0.28 sec
- Google for 'John' - 785 million hits in 0.18 sec
OK. Pages are indexed, so it only needs to look up the count and the first few items in the index table; that speed is understandable.
Now consider the following search with AND operation:
- Google for 'David John' ('David' AND 'John') - 173 million hits in 0.25 sec
This makes me ticked ;) How on earth can search engines get the result of AND operations on gigantic datasets so fast? I see the following two ways to conduct the task and both are terrible:
- You conduct the search of 'David'. Take the gigantic temp table and conduct a search of 'John' on it. HOWEVER, the temp table is not indexed by 'John', so brute force search is needed. That just won't compute within 0.25 sec no matter what HW you have.
- Indexing by all possible word combinations like 'David John'. Then we face a combinatorial explosion in the number of keys, and not even Google has the storage capacity to handle that.
And you can AND together as many search phrases as you want and you still get answers in under 0.5 sec! How?
Comments (4)
What Markus wrote about Google processing the query on many machines in parallel is correct.
In addition, there are information retrieval algorithms that make this job a little bit easier. The classic way to do it is to build an inverted index which consists of postings lists - a list for each term of all the documents that contain that term, in order.
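To make the structure concrete, here is a minimal sketch of building such an index in Python. The function name `build_inverted_index`, the toy documents, and the whitespace tokenizer are assumptions for illustration only; a real engine tokenizes and stores postings far more carefully.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenizer (assumption): lowercase and split on whitespace.
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Sorting every postings list by document ID gives all lists the same order.
    return {term: sorted(ids) for term, ids in postings.items()}

# Toy corpus, made up for this example.
docs = {
    1: "David met John",
    2: "John wrote a letter",
    3: "David and John went home",
    4: "David stayed",
}
index = build_inverted_index(docs)
print(index["david"])  # [1, 3, 4]
print(index["john"])   # [1, 2, 3]
```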
When a query with two terms is searched, conceptually, you would take the postings lists for each of the two terms ('david' and 'john'), and walk along them, looking for documents that are in both lists. If both lists are ordered the same way, this can be done in O(N). Granted, N is still huge, which is why this will be done on hundreds of machines in parallel.
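The walk described above can be sketched as a two-pointer merge. This assumes both postings lists are sorted by ascending document ID; the names `intersect` and `intersect_all` and the sample lists are made up for illustration.

```python
from functools import reduce

def intersect(a, b):
    """Return the IDs present in both sorted postings lists, in one linear pass."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a is behind, so advance it
        else:
            j += 1  # b is behind, so advance it
    return out

def intersect_all(lists):
    """AND together any number of terms (at least one list is required).

    Starting from the shortest list keeps the intermediate results small.
    """
    return reduce(intersect, sorted(lists, key=len))

david = [1, 3, 4, 8, 15]
john = [1, 2, 3, 9, 15, 20]
print(intersect(david, john))                    # [1, 3, 15]
print(intersect_all([david, john, [3, 15, 30]]))  # [3, 15]
```

Each pointer only ever moves forward, so the work is bounded by the combined length of the two lists.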
Also, there may be additional tricks. For example, if the highest-ranked documents were placed higher on the lists, then maybe the algorithm could decide that it found the 10 best results without walking the entire lists. It would then guess at the remaining number of results (based on the size of the two lists).
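A rough sketch of that idea, under loud assumptions: document IDs are taken to be assigned in rank order (a smaller ID means a better-ranked document), so both lists are sorted best-first, and the total is extrapolated from the fraction of the first list consumed so far. The function `top_k_and` and the extrapolation rule are illustrative guesses, not a description of what any real search engine does.

```python
def top_k_and(a, b, k=10):
    """Return (first k common IDs, estimated total number of common IDs).

    Assumes a and b are sorted best-first by the same ranking.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    i = j = 0
    hits = []
    while i < len(a) and j < len(b) and len(hits) < k:
        if a[i] == b[j]:
            hits.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    if i == len(a) or j == len(b):
        # One list was fully consumed, so the count is exact.
        return hits, len(hits)
    # Stopped early: assume matches are spread evenly through list a.
    fraction_seen = i / len(a)
    return hits, round(len(hits) / fraction_seen)

evens = list(range(0, 1000, 2))   # 500 postings
threes = list(range(0, 1000, 3))  # 334 postings
hits, estimate = top_k_and(evens, threes, k=10)
print(hits)      # [0, 6, 12, 18, 24, 30, 36, 42, 48, 54]
print(estimate)  # a guess; the true number of common IDs here is 167
```

Only a small prefix of each list is read, which is why the reported hit count would be an estimate rather than an exact figure.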
I think you're approaching the problem from the wrong angle.
Google doesn't keep its tables/indices on a single machine. Instead they partition their dataset heavily across their servers. Reports indicate that as many as 1000 physical machines are involved in every single query!
With that amount of computing power it's "simply" (used highly ironically) a matter of ensuring that every machine completes its work in a fraction of a second.
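As a toy illustration of the scatter-gather pattern, the sketch below splits documents across a few shards, gives each shard its own small index, and fans the query out to all of them. Threads stand in for separate machines here, and the sharding rule (`doc_id % NUM_SHARDS`) and all function names are assumptions made for the example, not details of Google's infrastructure.

```python
from concurrent.futures import ThreadPoolExecutor

def build_shard(docs):
    """Build a term -> set of document IDs index for one shard's documents."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index

def search_shard(index, terms):
    """AND the terms together using only this shard's index."""
    sets = [index.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()

def search(shards, terms):
    """Send the query to every shard at once and merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda shard: search_shard(shard, terms), shards)
        return sorted(set().union(*partials))

docs = {
    1: "David met John",
    2: "John wrote a letter",
    3: "David and John went home",
    4: "David stayed",
    5: "John and David again",
}
NUM_SHARDS = 3
shard_docs = [{} for _ in range(NUM_SHARDS)]
for doc_id, text in docs.items():
    shard_docs[doc_id % NUM_SHARDS][doc_id] = text
shards = [build_shard(d) for d in shard_docs]

print(search(shards, ["david", "john"]))  # [1, 3, 5]
```

Because every document lives on exactly one shard, each shard can answer the AND query on its own, and the merge step is a plain union.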
Reading about Google technology and infrastructure is very inspiring and highly educational. I'd recommend reading up on BigTable, MapReduce and the Google File System.
Google have an archive of their publications available with lots of juicy information about their technologies. This thread on MetaFilter also provides some insight into the enormous amount of hardware needed to run a search engine.
I don't know how Google does it, but I can tell you how I did it when a client needed something similar:
It starts with an inverted index, as described by Avi. That's just a table listing, for every word in every document, the document id, the word, and a score for the word's relevance in that document. (Another approach is to index each appearance of the word individually along with its position, but that wasn't required in this case.)
From there, it's even simpler than Avi's description - there's no need to do a separate search for each term. Standard database summary operations can easily do that in a single pass:
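A minimal sketch of such a query, run through Python's `sqlite3` module so it can be tried directly. The table name `rev_index`, its columns, and the sample scores are hypothetical, and this is not necessarily the exact query used for the client. It assumes one row per (document, word) pair and distinct search terms, which is what lets the row count stand in for "all terms matched".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per (document, word) pair with a relevance score.
conn.execute("CREATE TABLE rev_index (document_id INTEGER, word TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO rev_index VALUES (?, ?, ?)",
    [
        (1, "david", 9), (1, "john", 4),
        (2, "john", 8),
        (3, "david", 3), (3, "john", 5),
        (4, "david", 7),
    ],
)

def search_all(conn, terms):
    """Return (document_id, total_score) for documents containing every term."""
    placeholders = ", ".join("?" for _ in terms)
    sql = f"""
        SELECT document_id, SUM(score) AS total_score
          FROM rev_index
         WHERE word IN ({placeholders})
         GROUP BY document_id
        HAVING COUNT(*) = ?
         ORDER BY total_score DESC
    """
    return conn.execute(sql, (*terms, len(terms))).fetchall()

print(search_all(conn, ["david", "john"]))  # [(1, 13), (3, 8)]
```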
This will return the IDs of all documents which have scores for both 'David' and 'John' (i.e., both words appear), ordered by some approximation of relevance. It will take about the same time to execute regardless of how many or how few terms you're looking for, since IN performance is not affected much by the size of the target set and it's using a simple count to determine whether all terms were matched or not.
Note that this simplistic method just adds the 'David' score and the 'John' score together to determine overall relevance; it doesn't take the order/proximity/etc. of the names into account. Once again, I'm sure that Google does factor that into their scores, but my client didn't need it.
I did something similar to this years ago on a 16-bit machine. The dataset had an upper limit of around 110,000 records (it was a cemetery, so there was a finite limit on burials), so I set up a series of bitmaps, each containing 128K bits.
The search for 'david' resulted in me setting the relevant bit in one of the bitmaps to signify that the record had the word 'david' in it. I did the same for 'john' in a second bitmap.
Then all you need to do is a binary 'and' of the two bitmaps, and the resulting bitmap tells you which record numbers had both 'david' and 'john' in them. Quick scan of the resulting bitmap gives you back the list of records that match both terms.
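For illustration, the same idea can be sketched in Python with a `bytearray` standing in for each 128K-bit bitmap. The helper names and the record numbers below are made up; the original was written for a 16-bit machine, not in Python.

```python
NUM_BITS = 128 * 1024  # one bit per possible record number

def make_bitmap(record_ids):
    """Set one bit for every record that contains the search term."""
    bitmap = bytearray(NUM_BITS // 8)
    for r in record_ids:
        bitmap[r // 8] |= 1 << (r % 8)
    return bitmap

def bitmap_and(a, b):
    """Byte-wise AND of two bitmaps of equal size."""
    return bytearray(x & y for x, y in zip(a, b))

def matching_records(bitmap):
    """Scan the bitmap and list the record numbers whose bit is set."""
    return [
        byte_index * 8 + bit
        for byte_index, byte in enumerate(bitmap) if byte
        for bit in range(8) if byte & (1 << bit)
    ]

david = make_bitmap([3, 17, 42, 109999])
john = make_bitmap([17, 58, 109999])
print(matching_records(bitmap_and(david, john)))  # [17, 109999]
```

Each bitmap takes 16 KB no matter how many records match, so the cost of the AND depends only on the size of the dataset.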
This technique wouldn't work for Google though, so consider this my $0.02 worth.