倒排索引和普通旧索引有什么区别?
在软件工程中,我们一直在创建索引(例如,在数据库中),但我也听到很多人谈论倒排索引。两者之间有什么本质上的不同吗?它们听起来像是同一件事。
In software engineering we create indexes all the time (e.g., in databases) but I also hear a lot of people talk about inverted indices. Is there something fundamentally different between the two? They sound like the same thing.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(9)
一种常见用途是“...允许快速全文搜索。”
这两种类型表示方向性。一个带您向前浏览索引,另一个带您向后(反向)浏览索引。就是这样。这里没有什么秘密需要揭开。否则,这两种类型是相同的,问题只是您拥有什么信息,以及您想要查找什么信息。
为了解决您的询问,我我认为实际上没有办法知道为什么使用它是今天的样子。定义哪个是
正向
、哪个是反向
很重要,这样我们就可以就它们进行对话,并且每个人都知道我们在谈论哪个方向关于。想想“左”和“右”这两个词:它们是相对的。哪一个并不重要,只是每个人都需要同意哪一个是“左”,哪一个是“右”才能使这些词有意义。如果作为一种文化,我们决定左右翻转,那么你会遇到同样的问题,因为商定的含义已经改变,所以你会在弄清楚“右转”和“左转”是什么时遇到同样的问题。然而,命名是任意的,所以哪个是哪个(本身)并不重要 - 重要的是我们都同意其含义。在你的评论中,你问“请不要只定义术语”,你没有抓住要点,我认为你只是在措辞上挂了钩,而它们之间绝对没有区别。
为了方便未来的读者,我现在将提供几个“正向”和“反向”索引示例:
示例 1:Web 搜索
如果您认为反向索引类似于 数学中函数的逆,其中逆是具有不同形式的特殊事物,那么您就错了:但这里的情况并非如此。
在搜索引擎中,您有一个文档列表(网站上的页面),您可以在其中输入一些关键字并获取结果。
正向索引(或只是索引)是文档列表,以及其中出现哪些单词。在网络搜索示例中,Google 抓取网络,构建文档列表,找出每个页面中出现的单词。
倒排索引是单词列表,以及其中包含的文档他们出现了。在网络搜索示例中,您提供单词列表(您的搜索查询),然后 Google 生成文档(搜索结果链接)。
它们都是索引——问题只是你要往哪个方向走。正向是从文档->到->单词,反向是从单词->到->文档。
示例 2:DNS
另一个示例是 DNS 查找(获取主机名并返回 IP 地址)和反向查找(获取 IP 地址并返回主机名)。
示例 3:一本书
一本书后面的索引实际上是一个倒排索引,如上面示例所定义 - 单词列表以及在书中的位置。在书中,目录就像一个正向索引:它是本书包含的文档(章节)的列表,除了列出这些部分中的单词之外,目录只是给出这些文档(章节)中包含的内容的名称/一般描述。
示例 4:您的手机
您手机中的转发索引是您的联系人列表,以及与这些联系人关联的电话号码(手机号码、家庭号码、工作号码)。 倒排索引允许您手动输入电话号码,当您点击“拨号”时,您会看到此人的姓名,而不是号码,因为您的手机已获取该电话号码并找到您与其关联的联系人。
One common use is "...to allow fast full-text searching."
The two types denote directionality. One takes you forward through the index, and the other takes you backward (the inverse) through the index. That's it. There's no mystery to uncover here. Otherwise the two types are identical, it's just a question of what information you have, and as a result what information you're trying to find.
To address your inquiry, I don't think there's actually a way to know why the use is what it is today. The only reason it's important to define which is
forward
and which one isinverted
is so that we can all have a conversation about them, and everyone knows which direction we're talking about. Think about the terms "left" and "right": they are relative. Which is which doesn't matter, except that everyone needs to agree which one is "left" and which one is "right" in order for the words to have meaning. If, as a culture, we decided to flip left and right, then you'd have the same issue figuring out what a "right turn" vs a "left turn" is since the agreed upon meaning had changed. However, the naming is arbitrary, so which one is which (in and of itself) doesn't matter - what matters is that we all agree on the meaning.In your comment where you ask, "please don't just define the terms", you're missing the point, and I think you're just getting hung up on the wording when there is absolutely no difference between them.
For the benefit of future readers, I will now provide several "forward" and "inverted" index examples:
Example 1: Web search
If you're thinking that the inverse of an index is something like the inverse of a function in mathematics, where the inverse is a special thing that has a different form, then you're mistaken: that's not the case here.
In a search engine you have a list of documents (pages on web sites), where you enter some keywords and get results back.
A forward index (or just index) is the list of documents, and which words appear in them. In the web search example, Google crawls the web, building the list of documents, figuring out which words appear in each page.
The inverted index is the list of words, and the documents in which they appear. In the web search example, you provide the list of words (your search query), and Google produces the documents (search result links).
They are both indexes - it's just a question of which direction you're going. Forward is from documents->to->words, inverted is from words->to->documents.
Example 2: DNS
Another example is a DNS lookup (which takes a host name, and returns an IP address) and a reverse lookup (which takes an IP address, and gives you the host name).
Example 3: A book
The index in the back of a book is actually an inverted index, as defined by the examples above - a list of words, and where to find them in the book. In a book, the table of contents is like a forward index: it's a list of documents (chapters) which the book contains, except instead of listing the words in those sections, the table of contents just gives a name/general description of what's contained in those documents (chapters).
Example 4: Your cell phone
The forward index in your cell phone is your list of contacts, and which phone numbers (cell, home, work) are associated with those contacts. The inverted index is what allows you to manually enter a phone number, and when you hit "dial" you see the person's name, rather than the number, because your phone has taken the phone number and found you the contact associated with it.
他们称其为倒排,只是因为已经有了前向索引。以搜索引擎为例,它由两部分组成:第一部分是“网络爬虫和解析器”,建立从文档到单词的索引,第二部分是搜索数据库,建立从单词到文档的索引。由于第一个索引的存在,我们很自然地将第二个索引称为倒排索引。
如果将一本书的TOC(目录)命名为索引,那么应该将书末尾的索引称为“倒排索引”。或者,在另一方面,您可以将目录称为倒排索引。
They called it inverted just because there is already a forward index. Take the example of search engine, it composed by two parts: the first part is "web crawler and parser" which build a index from document to word, the second part is search database which build a index from word to document. Because of the first index exist, we naturally call the second index as inverted index.
If you name the TOC (Table of Content) of a book as index, then you should call the index at the end of book as "inverted index". Or, in other side, you can call the TOC as inverted index.
通常,当谈到索引时,您指的是为了加快应用程序的速度而添加的一些计算或存储的过程结果(例如 MySQL 或其他 RDBMS 查阅 MySQL 文档)。索引还可以与缓存等相关。
倒排索引创建的文件的结构主要用于(全文)搜索。
倒排索引由两个主要文件组成:
词汇中是从文本中提取的常见单词(当然是在过滤代词等黑名单单词之后)。事件文件保存单词和文档之间的连接(word1 出现在 doc1 和 doc2 中,而不是出现在 doc3 中)。它以矩阵的形式表示。
上图中显示了创建上述两个文件的过程。
如果您对这个问题进一步感兴趣,我可以向您推荐 Ricardo Yated 写的一本好书 - 现代信息检索 (在亚马逊上查看) - 我认为大约第 200 页。
希望它有帮助:-)
typically when speaking about index, you mean some added calculations or stored results of procedures which have been done in order to speed up application (e.g. MySQL or other RDBMS Consult MySQL the docs). Indexing can also be related to caching etc.
Inverted index creates file with structure that is primarily intender for (fulltext) searching.
Inverted index consists of two main files:
In vocabulary are common words extracted from text (of course after filtering blacklist words like pronouns). The occurences file holds the connection between words and documents (word1 appears in doc1 and doc2, not in doc3). It is represented in a form of a matrix.
In the above image is shown the process of creating the two files mentioned.
If you are further interester in this problematic I can recommend you a great book written by Ricardo Yated - Modern Information Retrieval (See it on Amazon) - about page 200 I think.
Hope it helps :-)
正态已经很好地区分了正向索引和倒排索引,但问题是为什么一个叫正向索引,另一个叫倒排索引,也许这就是他们这么叫的原因吧——
以搜索引擎爬行和索引(或者为一本书建立索引)为例,当你建立一个正向索引时,你可以同时建立一个正向索引。是抓取网页(或阅读书籍)或继续。因此,如果您有 10 个网页要抓取(或一本书中的 10 个章节),您可以抓取第一个网页(阅读第一章),然后列出该网页中出现的单词(该章节中出现的单词)并继续此过程适用于其他网页(其他章节),因此,当您抓取了所有 10 个网页(阅读所有 10 章)时,您的正向索引已完成,每个网页(章节)都指向它的单词列表包含。
但是要创建倒排索引,您必须爬行所有 10 个网页(阅读 10 个章节),然后从每个文档列表中取出每个单词,并找出哪些文档包含该单词。 所以这就像在爬行网页后向后退(阅读本书的章节)。所以它被称为倒排索引。
这只是我的猜测。
normalocity has already wonderfully differentiated between a forward and an inverted index but for the question of why one is called a forward index and the other an inverted index, maybe this is why they are called that way---
Taking example of search engine crawling and indexing (or building index for a book), a forward index can be built simultaneously while you are crawling the web pages(or reading the book) or going forward. So if you have 10 webpages to crawl(or 10 chapters in a book) you can crawl the first webpage(read the first chapter) and then make a list of words which appear in the webpage(words which appear in the chapter) and continue this process for other webpages(other chapters) so by the time you have crawled all the 10 webpages(read all 10 chapters) your forward index is complete with each webpage(chapter) pointing to a list of words it contains.
But to make an inverted index you have to crawl all the 10 webpages(read the 10 chapters) and and then take each word from each documents list and figure out which documents contain that word. So this is like going backward once you have crawled the webpages(read chapters of the book). So its called an inverted index.
This is just my speculation.
术语“倒排词索引”是指关系的变化
包含许多单词的单个文档,每个唯一的单词包含
(或识别)多文档列表。这实际上是采用一对多关系(文档到单词)并将其反转(或反转),以便现在存在一个新的“反转”一对多关系,这是与许多相关的每个唯一单词 -文档(即所有包含该词的文档)。它的起源确实就是这么简单,早在计算机和电子高速索引存在之前,术语“倒排索引”就被用来描述相同类型的手动索引(是的,不可否认,我是一个老了的老程序员,几乎当 COBOL 还是一门闪亮的新语言时,她已经足够大了,可以将 Grace Hopper 视为“可爱的年轻女士”,适合重新追求的年龄)。请暂时不要抛弃我们这些怪人,因为我们偶尔可能会提供一两个有用的、甚至可能有价值的历史花絮——也就是说,当我们的个人 RAM 仍在工作时。 [咧嘴笑]
The term "Inverted Word Index" refers to the change in relationship of
a single-document containing many-words, to each unique word containing
(or identifying) a list of many-documents. This is effectively taking a One-to-Many Relationship (Docs to Words) and Inverting (or reversing) it such that a new "Inverted" One-to-Many Relationship now exists, which is each-unique-word relating to Many-Documents (i.e., all that contain that word). It's origin really is that simple, and the term "inverted index" was used to describe manual indexes of the same type long before computers and electronic high-speed indexing even existed (yes, admittedly, I'm an old, geezer programmer, almost old enough to have considered Grace Hopper a "sweet young lady" age appropriate for courting back when COBOL was a shiny new language). Please don't discard us geezers just yet, as we may occasionally provide a useful, and possibly even valuable, historical tid-bit or two - when our personal RAM is still working, that is. [grin]
索引有很多种类型。例如B树、R树、哈希......针对不同的目的,我们必须选择正确的索引。
倒排索引是一种特殊的索引。倒排索引通常用于全文搜索引擎。使用倒排索引我们可以尽快找到一个单词在文档(或文档集)中的位置。想想内存和cpu的限制,其他索引无法完成这个工作。
您可以阅读 lucene 文档以获取更多详细信息。它是一个开源搜索引擎。 http://lucene.apache.org/java/docs/index.html
There are many types of index. For example, B-tree, R-tree, hash... For different purposes, we must choose correct index.
Inverted index is a special one. Inverted index usually used in full text search engine. Use inverted index we can find out a word's locate in a document(or documents set) as fast as possible. Think about the limit of memory and cpu, other index can't finish this job.
You can read lucene document for more details. It's a open source search engine. http://lucene.apache.org/java/docs/index.html
在倒排索引中,我们有以下形式:
word1->它出现的文档列表(排序顺序)
word2->它出现的文档列表(排序顺序)
对于搜索引擎查询处理非常有用,因为它允许我们找到单词出现的文档。
您可以使用监督机器学习来构建此倒排索引。
in inverted indexes, we have the following form:
word1-> list of docs it occurs in (sorted order)
word2-> list of docs it occurs in (sorted order)
It is very useful for search engine query processing as it allows us to find docs that word occurs in .
You can use supervised machine learing to build this inverted index.
还有一个区别:
与正向索引相比,使用倒排索引处理更新的成本较高。
正向索引通过仅反映相应文档索引中的更改来轻松处理更新,而在倒排索引中,相同的更改必须反映在倒排索引的多个位置中。
One more difference:
Handling updates with the inverted index are expensive in comparison with forward index.
Forward index handles updates easily by reflecting the changes only in the corresponding document index, whereas in the inverted index, the same change has to reflect in multiple positions across the inverted index.
解释的方式是基于“什么指向什么”。
例如,一个实体有许多属性。我们首先需要拥有该实体才能找到它具有哪些属性。
一个实际的例子是,在搜索引擎中,在其入口/网络爬虫阶段,它首先爬行并访问网页。这里的网页就是实体。它创建的用于将网页映射到其所具有的不同单词的索引是前向索引。映射将是,文档 ->为了
方便从属性到实体的查找,我们需要倒排索引。在我的示例中,它将是单词的映射 ->网页文档。
The way to interpret is based on "what points to what".
Example, An entity has many attributes.We first need to have the entity at hand to be able to find what attributes it has.
A practical example is that in a search engine , in its ingress/ web crawler phase, it first crawls and has access to a webpage. Webpage here is the entity. The index that it would create to map a webpage to the different words it has is a forward index. The mapping would be , document -> words
To facilitate the lookup from an attribute to the entity , we need a inverted index. In my example it would be the mapping of words -> webpage documents.