Unique word count
This is a generic question that applies to (probably) any high-level programming language.
Here is the situation:
Suppose I have an array of strings. Say I managed to put 500,000 strings from a short story into an array (just suppose you have no choice of input format). Consequently, the array will most likely contain an arbitrary number of duplicated items.
I want to take this array of strings and create another array that contains a unique subset(?) of that array (i.e. no duplicates). In this scenario, both the input and the output must be arrays, so that may rule out some options.
Performance-wise, what's the fastest way to accomplish this? I'm currently using a linear search to check whether a word already exists, but since that means scanning the output array for every input word, I feel there might be faster ways, especially if I have an unreasonable number of strings to work with. Like a bigger novel!
Comments (2)
Using a hash set is probably the most sensible thing to do: the expected complexity is O(N) overall, compared with O(N^2) in the worst case for a linear search per word.
Note: most high-level programming languages already ship a function that removes duplicates from an array, e.g. array_unique in PHP.
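To make the hash set suggestion concrete, here is a minimal sketch in Python (the language choice and the function name unique_words are my own assumptions, not something from the question). It takes an array in, returns an array out, and keeps each word at its first occurrence:

```python
def unique_words(words):
    """Return a new list with duplicates removed, keeping first-occurrence order."""
    seen = set()      # hash set: average O(1) membership test and insert
    result = []       # output array, as the question requires
    for word in words:
        if word not in seen:
            seen.add(word)
            result.append(word)
    return result


words = ["the", "cat", "and", "the", "dog", "and", "the", "bird"]
print(unique_words(words))  # ['the', 'cat', 'and', 'dog', 'bird']
```

If the order of the output does not matter, list(set(words)) is enough in Python; list(dict.fromkeys(words)) is a built-in one-liner that also preserves order on Python 3.7 and later.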
If you are going to be putting gazillions of words into it, a directed acyclic word graph is the most efficient data structure I know of.
And yet it is conceptually a very simple data structure.
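A full directed acyclic word graph also merges shared suffixes, which takes more code than fits here. As a rough illustration of the idea, the sketch below uses a plain trie instead (shared prefixes only, no suffix minimization), so treat it as a simpler relative of a DAWG rather than a DAWG itself. The names TrieNode, insert_if_new and unique_words_trie are hypothetical.

```python
class TrieNode:
    """One node per character; a plain trie, NOT a minimized DAWG."""
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}    # maps a character to the next TrieNode
        self.is_word = False  # True if a word ends at this node


def insert_if_new(root, word):
    """Insert word into the trie; return True only if it was not already present."""
    node = root
    for ch in word:
        nxt = node.children.get(ch)
        if nxt is None:
            nxt = TrieNode()
            node.children[ch] = nxt
        node = nxt
    if node.is_word:
        return False
    node.is_word = True
    return True


def unique_words_trie(words):
    """Array in, array out: keep each word the first time it is seen."""
    root = TrieNode()
    return [w for w in words if insert_if_new(root, w)]


print(unique_words_trie(["car", "cart", "car", "cat", "cart"]))  # ['car', 'cart', 'cat']
```

Each lookup costs time proportional to the length of the word rather than to the number of words stored. Whether this beats a hash set in practice depends on the language and the data, so it is worth measuring before committing to it.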