在超过一百万个字符串的列表中按相反顺序排列字符串对?
最近在一些采访中被问到“如果存在于超过一百万个字符串的列表中,如何找到所有字符串的反转?
对于例如 str[1] =“abc”,我需要准确地检查“cba”,没有字谜。
方法1. 将所有字符串存储在哈希集中,从第一个字符串开始遍历,如果存在,则将其配对移动到下一个元素,
如果内存是限制,您能建议任何方法吗?
Recently asked during some interview that "How to find reverse of all strings if exists in a list of more than million strings?
For E.g. str[1] = "abc", I need to check for "cba" exactly, no anagrams.
Method 1. Store all the strings in a hashset, start traversing from the first string and check for the reversed form exists in Hashset. if yes, then pair else move to next element.
Can you suggest any method if memory is the constraint?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
如果允许,您可以对字符串进行就地排序,这样当您查找字符串的反向时,您可以进行二分搜索。
If allowed, you could in-place sort the strings so when you look up the reverse of a string you can do a binary search.
您可以使用 Bloom Filter 它会告诉您字符串是否已存在于类似哈希表的结构中,但每个桶只有 0 或 1,因此使用的空间非常少。
恰好 1 000 000 位 == 125 KB
You can use a Bloom Filter which will tell you if a string already exists within a hash table like structure, but each bucket is only 0 or 1 so very little space is used.
Exactly 1 000 000 bits == 125 KB
您可以选择HashTable并使用桶来减少哈希冲突。我们现在需要对特定的查询字符串做的只是将其反转、散列并在 HashTable 中查找,而不是从头到尾遍历。
You can choose HashTable and use buckets to reduce hash conflict. What we now need to do for a specific query string is just reverse it, hash it and find in the HashTable instead of traverse from start to then end.
这只是我的观点:
创建一个哈希
我将使用key=character
value=以该字符开头的字符串列表< /em>
This is jus my opinion:
I would create a hash with
key=character
value=List of string which start with that character
<罢工>
使用“内存作为约束”,那么我什至不会选择 HashSet(据我所知,它也会删除原始列表中的重复字符串),因为您将使用 a 的附加结构HashSet 需要一些内存。
排序,也不会提高内存使用率。
我将使用原始列表(已经存在,因此不会使用额外的内存)+ 3 字节整数变量来迭代列表。 3 个字节可以迭代 2^24 = 16777216 个字符串的列表
如果“内存作为约束”,我会进行 2 个 for 循环。我认为类似 C 的伪代码会比我的简单英语更容易理解。
注意:
这是最少的记忆-我能想到的消耗代码:
关于内存使用情况,这个解决方案将采用2个整数变量(通常每个4字节)+原始列表,我认为我们无法摆脱
关于CPU使用情况(实际上,基于不相关 。的问题),字符串反转的次数将为:(N*(N+1))/2,其中 N 是列表的长度
With "memory as a constraint" then I wouldn't even go for a HashSet (which, afaik will also remove the duplicated strings in the original list) because you'll be using the additional structure of a HashSet which takes some memory.
Sorting, wouldn't improve memory usage either.
I would use the original list (which is already there, so no extra memory will be used) + a 3 byte integer variable to iterate the list. 3 bytes can iterate over a list of 2^24 = 16777216 strings
With "memory as a constraint" I would go for 2 for loops. I think a C-Like pseudocode will be easier to understand that my plain english.
Notes:
Here is the least memory-consumption code I can think of:
Regarding memory usage, this solution will take 2 integer variables (usually 4 bytes each) + the original list, which I assume we can not get rid of.
Regarding CPU usage (actually, not relevant based on the question), the amount of times a strings will be reversed will be: (N*(N+1))/2 where N is the length of the list
首先,我将使用与方向无关的散列对字符串进行散列。这可能是一个简单的字符总和,尽管肯定有更好的方案可以从两端进行散列。为了“使交易更顺利”,人们可以将字符串长度附加到哈希值中,或者以其他方式将其合并到哈希中。
然后,当您将字符串分成相同的哈希组时,进行“长手”比较。
请注意,使用此方案或仅向前或向后使用方向相关哈希的方案,要做的事情是不要立即将字符串插入哈希集中,而是检查它(如果需要,使用反向哈希)首先,如果你得到一个匹配(并且随后的长比较为真),则删除已经散列的字符串并将两者配对。第二个字符串永远不会进入集合,并且,如果所有字符串最多都匹配,则哈希集中只会有 500,000 个条目,并且,如果字符串是随机的,则可能接近 250,000 个(我没有坐过)来计算概率)。
因此,您只需要遍历一组字符串即可完成整个操作。
First I would hash the strings using a hash that was independent of direction. This could be a simple sum of characters, though there are certainly better schemes that would hash from both ends. And to "sweeten the deal" one could append string length to the hash value, or otherwise incorporate it in the hash.
Then when you have the strings broken into identical hash groups, do the "long hand" compare.
Note that, using either this scheme or the one where you simply use a direction dependent hash forwards or backwards, the thing to do is to not immediately insert the string into the hash set, but rather check it (with the reversed hash if needed) first, and if you get a match (and the subsequent long compare is true) remove the already-hashed string and pair the two. The second string never goes into the set, and, if all the strings have matches at most you'd only ever have 500,000 entries in the hash set, and, if the strings were random, probably closer to 250,000 (I haven't sat down to figure the probabilities).
So you'd need only one pass through the set of strings to do the whole thing.