是什么让表查找如此便宜?
不久前,我了解了一些大 O 表示法和不同算法的效率。
例如,循环遍历数组中的每个项目以对其执行某些操作
foreach(item in array)
doSomethingWith(item)
是一个 O(n)
算法,因为程序执行的循环次数与数组的大小成正比。
但令我惊讶的是,表查找的时间复杂度为O(1)
。也就是说,
value = hashTable[key]
无论表有一个键、十个键、一百个键还是十亿个键,在哈希表或字典中查找键都需要相同的周期数。
这真的很酷,我很高兴这是真的,但这对我来说很不直观,我不明白为什么它是真的。
我可以理解第一个 O(n)
算法,因为我可以将它与现实生活中的示例进行比较:如果我有一张想要盖章的纸,我可以浏览每张纸 -逐一并在每一项上盖章。对我来说,如果我有 2,000 张纸,使用这种方法盖章所需的时间将是如果我有 1,000 张纸时的两倍,这对我来说很有意义。
但我不明白为什么查表是O(1)
。我在想,如果我有一本字典,并且我想找到多态性的定义,那么我需要O(logn)
时间才能找到它:我'我会打开字典中的某个页面,看看它是按字母顺序排列在多态性之前还是之后。多态性。比如说,如果它位于 P 部分之后,我可以删除我打开的页面之后字典中的所有内容,并对字典的其余部分重复该过程,直到找到单词 多态性。
这不是一个 O(1)
过程:在一千页词典中查找单词通常比在两页词典中查找单词花费的时间更长。我很难想象一个无论字典大小如何都需要相同时间的过程。
tl;dr:您能否向我解释一下如何以 O(1)
复杂度进行表查找?
(如果你向我展示如何复制令人惊叹的 O(1)
查找算法,我肯定会得到一本厚厚的字典,这样我就可以向我所有的朋友炫耀我的忍者字典 -查找技能)
编辑:大多数答案似乎都取决于这个假设:
您可以在恒定时间内访问给定页码的词典的任何页面
如果这是真的,我很容易看出。但我不知道为什么这个基本假设是正确的:我会使用与按单词查找页面相同的过程按数字查找页面。
与内存地址相同,使用什么算法来加载内存地址?是什么让从一个地址找到一块内存变得如此便宜?换句话说,为什么内存访问是O(1)
?
A while back, I learned a little bit about big O notation and the efficiency of different algorithms.
For example, looping through each item in an array to do something with it
foreach(item in array)
doSomethingWith(item)
is an O(n)
algorithm, because the number of cycles the program performs is directly proportional to the size of the array.
What amazed me, though, was that table lookup is O(1)
. That is, looking up a key in a hash table or dictionary
value = hashTable[key]
takes the same number of cycles regardless of whether the table has one key, ten keys, a hundred keys, or a gigabrajillion keys.
This is really cool, and I'm very happy that it's true, but it's unintuitive to me and I don't understand why it's true.
I can understand the first O(n)
algorithm, because I can compare it to a real-life example: if I have sheets of paper that I want to stamp, I can go through each paper one-by-one and stamp each one. It makes a lot of sense to me that if I have 2,000 sheets of paper, it will take twice as long to stamp using this method than it would if I had 1,000 sheets of paper.
But I can't understand why table lookup is O(1)
. I'm thinking that if I have a dictionary, and I want to find the definition of polymorphism, it will take me O(logn)
time to find it: I'll open some page in the dictionary and see if it's alphabetically before or after polymorphism. If, say, it was after the P section, I can eliminate all the contents of the dictionary after the page I opened and repeat the process with the remainder of the dictionary until I find the word polymorphism.
This is not an O(1)
process: it will usually take me longer to find words in a thousand page dictionary than in a two page dictionary. I'm having a hard time imagining a process that takes the same amount of time regardless of the size of the dictionary.
tl;dr: Can you explain to me how it's possible to do a table lookup with O(1)
complexity?
(If you show me how to replicate the amazing O(1)
lookup algorithm, I'm definitely going to get a big fat dictionary so I can show off to all of my friends my ninja-dictionary-looking-up skills)
EDIT: Most of the answers seem to be contingent on this assumption:
You have the ability to access any page of a dictionary given its page number in constant time
If this is true, it's easy for me to see. But I don't know why this underlying assumption is true: I would use the same process to to look up a page by number as I would by word.
Same thing with memory addresses, what algorithm is used to load a memory address? What makes it so cheap to find a piece of memory from an address? In other words, why is memory access O(1)
?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(9)
您应该阅读维基百科文章。
但本质是,您首先对密钥应用哈希函数,将其转换为整数索引(这是
O(1)
)。然后用它来索引数组,这也是O(1)
。如果哈希函数设计得很好,数组中的每个位置应该只存储一个(或几个项目),这样查找就完成了。因此,在大规模简化的伪代码中:
显然,这不能处理冲突,但您可以阅读本文以了解如何处理冲突。
更新
为了解决已编辑的问题,索引到数组的时间为O(1),因为在幕后,CPU正在执行以下操作:
其中
LOAD
加载存储在内存中指定地址的数据。更新 2
从内存中加载的时间为
O(1)
的原因实际上只是因为这是我们在讨论计算复杂性时通常指定的公理(请参阅 http://en.wikipedia.org/wiki/RAM_model)。如果我们忽略缓存层次结构和数据访问模式,那么这是一个合理的假设。当我们扩展机器的大小时,这可能不是真的(具有 100TB 存储空间的机器可能不会与具有 100kB 存储空间的机器花费相同的时间)。但通常,我们假设机器的存储容量是恒定的,并且比我们可能看到的任何问题大小都要大得多。因此,无论出于何种目的,这都是一个恒定时间操作。You should read the Wikipedia article.
But the essence is that you first apply a hash function to your key, which converts it to an integer index (this is
O(1)
). This is then used to index into an array, which is alsoO(1)
. If the hash function has been well designed, there should only be one (or a few items) stored at each location in the array, so the lookup is complete.So in massively-simplified pseudocode:
Obviously, this doesn't handle collisions, but you can read the article to learn how that's handled.
Update
To address the edited question, indexing into an array is O(1) because underneath the hood, the CPU is doing this:
where
LOAD
loads data stored in memory at the specified address.Update 2
And the reason a load from memory is
O(1)
is really just because this is an axiom we usually specify when we talk about computational complexity (see http://en.wikipedia.org/wiki/RAM_model). If we ignore cache hierarchies and data-access patterns, then this is a reasonable assumption. As we scale the size of the machine,, this may not be true (a machine with 100TB of storage may not take the same amount of time as a machine with 100kB). But usually, we assume that the storage capacity of our machine is constant, and much much bigger than any problem size we're likely to look at. So for all intents and purposes, it's a constant-time operation.我将从与其他人不同的角度来回答这个问题。希望这能够解释为什么访问
x[45]
和访问x[5454563]
花费相同的时间。RAM 布置在电容器网格(即行和列)中。 RAM 可以通过激活网格上的特定列和行来寻址特定的内存单元,所以假设您有一个 16 字节容量的 RAM,布置在 4x4 网格中(对于现代计算机来说非常小,但足以说明问题)目的),并且您尝试访问内存地址 13 (
1101
),您首先将地址拆分为行和列,即row 3 (11) column 1 (01)
。假设 0 表示走左侧交叉点,1 表示走右侧交叉点。所以当你想要激活第3行时,你在该行的起始门发送一支电子军,这支电子军向右走,向右到达第3行激活门;接下来,您在列起始门上发送另一支电子军,列军电子向左然后向右到达第一列激活门。仅当行和列都被激活时,存储单元才能被读/写,因此这将允许被标记的单元被读/写。
所有这些乱码的影响是,内存地址的访问时间取决于地址长度< /strong>,而不是特定的内存地址本身;如果架构使用 32 位地址空间(即 32 个交叉点),则寻址内存地址 45 和寻址内存地址 5454563 仍然必须经过所有 32 个交叉点(实际上行电子有 16 个交叉点,列电子有 16 个交叉点)电子)。
请注意,实际上,与电容器充电和放电相比,内存寻址所需的时间非常少,因此即使我们开始拥有 512 位长度的地址空间(足以容纳 ~1.4*10^130 yottabyte 的 RAM,即足以保留RAM 中阳光下的一切),这意味着电子必须经过 512 个交叉点,这实际上不会给实际内存访问时间增加那么多时间。
请注意,这是对现代 RAM 的过度简化。在现代 DRAM 中,如果要访问后续内存地址,只需更改列,而不需要花时间更改行,因此访问后续内存比访问完全随机地址要快得多。另外,这个描述完全不了解CPU缓存的影响(尽管CPU缓存也使用类似的网格寻址方案,但是由于CPU缓存使用更快的基于晶体管的电容器,因此拥有大缓存地址空间的负面影响变得非常关键)。然而,这一点仍然认为,如果你在内存中跳转,访问其中任何一个都将花费相同的时间。
I'll address the question from a different perspective from every one else. Hopefully this will give light to why the accessing
x[45]
and accessingx[5454563]
takes the same amount of time.A RAM is laid out in a grid (i.e. rows and columns) of capacitors. A RAM can address a particular cell of memory by activating a particular column and row on the grid, so let's say if you have a 16-byte capacity RAM, laid out in a 4x4 grid (insanely small for modern computer, but sufficient for illustrative purpose), and you're trying to access the memory address 13 (
1101
), you first split the address into rows and column, i.erow 3 (11) column 1 (01)
.Let's suppose a 0 means taking the left intersection and a 1 means taking a right intersection. So when you want to activate row 3, you send an army of electrons in the row starting gate, the row-army electrons went right, right to reach row 3 activation gate; next you send another army of electrons on the column starting gate, the column-army electrons went left then right to reach the 1st column activation gate. A memory cell can only be read/written if the row and column are both activated, so this would allow the marked cell to be read/written.
The effect of all this gibberish is that the access time of a memory address depends on the address length, and not the particular memory address itself; if an architecture uses a 32-bit address space (i.e. 32 intersections), then addressing memory address 45 and addressing memory address 5454563 both will still have to pass through all 32 intersections (actually 16 intersections for the row electrons and 16 intersections for the columns electrons).
Note that in reality memory addressing takes very little amount of time compared to charging and discharging the capacitors, therefore even if we start having a 512-bit length address space (enough for ~1.4*10^130 yottabyte of RAM, i.e. enough to keep everything under the sun in your RAM), which mean the electrons would have to go through 512 intersections, it wouldn't really add that much time to the actual memory access time.
Note that this is a gross oversimplification of modern RAM. In modern DRAM, if you want to access subsequent memory addresses you only change the columns and not spend time changing the rows, therefore accessing subsequent memory is much faster than accessing totally random addresses. Also, this description is totally ignorant about the effect of CPU cache (although CPU cache also uses a similar grid addressing scheme, however since CPU cache uses the much faster transistor-based capacitor, the negative effect of having large cache address space becomes very critical). However, the point still holds that if you're jumping around the memory, accessing any one of them will take the same amount of time.
你是对的,要找到一个现实世界的例子是非常困难的。当然,这个想法是您正在通过地址而不是值来寻找某些东西。
字典示例失败,因为您无法立即知道页面(例如 278)的位置。您仍然必须像查找单词一样查找该页面,因为页面位置不在您的记忆中。
但是假设我在你的每根手指上标记了一个数字,然后我告诉你扭动上面写着 15 的手指。您必须查看其中的每一个(假设其未排序),如果不是 15 个,则检查下一个。在)。
如果我告诉你扭动你的右手小指。你不必查找任何东西。你知道它在哪里,因为我刚刚告诉过你它在哪里。我刚刚传递给你的值是它在你的“内存”中的地址。
这有点像数据库,但规模比 10 个手指大得多。
You're right, it's surprisingly difficult to find a real-world example of this. The idea of course is that you're looking for something by address and not value.
The dictionary example fails because you don't immediately know the location of page say 278. You still have to look that up the same as you would a word because the page locations are not in your memory.
But say I marked a number on each of your fingers and then I told you to wiggle the one with 15 written on it. You'd have to look at each of them (assuming its unsorted), and if it's not 15 you check the next one. O(n).
If I told you to wiggle your right pinky. You don't have to look anything up. You know where it is because I just told you where it is. The value I just passed to you is its address in your "memory."
It's kind of like that with databases, but on a much larger scale than just 10 fingers.
因为工作是预先完成的——该值被放入一个可以根据键的哈希码轻松访问的存储桶中。这就好像你想在字典中查找你的作品,但标记了该单词所在的确切页面。
Because work is done up front -- the value is put in a bucket that is easily accessible given the hashcode of the key. It would be like if you wanted to look up your work in the dictionary but had marked the exact page the word was on.
想象一下,您有一本字典,其中以字母 A 开头的所有内容都位于第 1 页,以字母 B 开头的内容位于第 2 页......等等。因此,如果您想查找“气球”,您就会确切地知道要转到哪个页面。这是 O(1) 查找背后的概念。
任意数据输入=>映射到特定的内存地址
当然,权衡是您需要更多的内存来分配所有潜在的地址,其中许多可能永远不会被使用。
Imagine you had a dictionary where everything starting with letter A was on page 1, letter B on page 2...etc. So if you wanted to look up "balloon" you would know exactly what page to go to. This is the concept behind O(1) lookups.
Arbitrary data input => maps to a specific memory address
The trade-off of course being you need more memory to allocate for all the potential addresses, many of which may never be used.
如果您有一个包含 999999999 个位置的数组,那么根据社会安全号码查找记录需要多长时间?
假设您没有那么多内存,则分配比您打算存储的记录数多大约 30% 的数组位置,然后编写一个哈希函数来查找它。
一个非常简单(可能也很糟糕)的哈希函数是social % numElementsInArray。
问题是冲突——你不能保证每个位置只包含一个元素。但没关系,您可以存储记录的链接列表,而不是将记录存储在数组位置。然后,一旦哈希得到正确的数组位置,就可以线性扫描所需的元素。
最坏的情况是 O(n)——一切都进入同一个桶。平均情况是 O(1),因为一般来说,如果您分配足够的存储桶并且哈希函数很好,那么记录通常不会经常发生冲突。
If you have an array with 999999999 locations, how long does it take to find a record by social security number?
Assuming you don't have that much memory, then allocate about 30% more array locations that the number of records you intend to store, and then write a hash function to look it up instead.
A very simple (and probably bad) hash function would be social % numElementsInArray.
The problem is collisions--you can't guarantee that every location holds only one element. But thats ok, instead of storing the record at the array location, you can store a linked list of records. Then you scan linearly for the element you want once you hash to get the right array location.
Worst case this is O(n)--everything goes to the same bucket. Average case is O(1) because in general if you allocate enough buckets and your hash function is good, records generally don't collide very often.
好的,简而言之,哈希表:
您采用常规数组(O(1) 访问),并且不使用常规 Int 值来访问它,而是使用 MATH。
您要做的就是将键值(假设是一个字符串)计算为数字(字符上的某个函数),然后使用众所周知的数学公式,为您提供数组范围上相对较好的分布。
因此,在这种情况下,您只需执行 4-5 次计算 (O(1)) 即可使用非 int 的键从该数组中获取对象。
现在,避免碰撞并找到正确的数学公式以实现良好的分布是困难的部分。这就是维基百科中解释得很好的内容: en.wikipedia.org/wiki/Hash_table
Ok, hash-tables in a nutshell:
You take a regular array (O(1) access), and instead of using regular Int values to access it, you use MATH.
What you do, is to take the key value (lets say a string) calculate it into a number (some function on the characters) and then use a well known mathematical formula that gives you a relatively good distribution on the array's range.
So, in that case you are just doing like 4-5 calculations (O(1)) to get an object from that array, using a key which isn't an int.
Now, avoiding collisions, and finding the right mathematical formula for good distribution is the hard part. That's what is explained pretty well in wikipedia: en.wikipedia.org/wiki/Hash_table
查找表事先确切地知道如何访问表中的给定项目。
与按排序数组中的值查找项目完全相反,您必须访问项目来检查它是否是您想要的。
Lookup tables know exactly how to access the given item in the table before hand.
Completely the opposite of say, finding an item by it's value in a sorted array, where you have to access items to check that it is what you want.
理论上,哈希表是一系列存储桶(内存中的地址)和一个将对象从域映射到这些存储桶的函数。
假设您的域名是 3 个字母的单词,您需要为所有可能的 3 个字母的单词屏蔽 26^3=17,576 个地址,并创建一个将所有 3 个字母的单词映射到这些地址的函数,例如 aaa=0、aab=1、现在,当您想要查找一个单词时,例如“and”,您可以立即从 O(1) 函数知道它是地址号 367。
In theory, a hashtable is a series of buckets (addresses in memory) and a function that maps objects from a domain into those buckets.
Say your domain is 3 letter words, you'd block out 26^3=17,576 addresses for all the possible 3 letter words and create a function that maps all 3 letter words to those addresses, e.g., aaa=0, aab=1, etc. Now when you have a word you'd like to look up, say, "and", you know immediately from your O(1) function that it is address number 367.