Appropriate data structure for faster retrieval (data size: around 200,000 values, all strings)

Posted on 2024-12-04 17:40:01

I have a large data set of around 200,000 values, all of them strings. Which data structure should I use so that searching and retrieval are fast? Insertion happens only once, so it doesn't matter much if it is slow.

A hash map could be one solution, but what are the other choices?
Thanks

Edit:
Some pointers:
1. I am looking for exact matches, not partial ones.
2. I have to accomplish this in PHP.
3. Is there any way I can keep this amount of data in cache, in the form of a tree or in some other format?


Comments (5)

娇妻 2024-12-11 17:40:01

You really should consider not using maps or hash dictionaries if all you need is a string lookup. With those, the complexity guarantees for looking up a string of length M among N items are O(M × log(N)) for a tree-based map or, at best, amortised O(M) with a large constant multiplier for a hash. It is much more efficient to use an acyclic deterministic finite automaton (ADFA) for basic lookups, or a trie if you need to associate data with the keys. These walk the data structure one character at a time, giving O(M) with a very small constant multiplier.

Basically, you want a data structure that parses your string as it consumes it, not one that must do a full string compare at each node of the lookup. The common orders of complexity you see thrown around for red-black trees and the like assume an O(1) compare, which is not true for strings. String compares are O(M), and that cost propagates to every compare used.
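To make the "one character at a time" walk concrete, here is a minimal trie sketch for exact-match lookup. It is written in Python purely for illustration (the question targets PHP), and the names `TrieNode`, `Trie`, `insert`, and `contains` are hypothetical, not taken from any library. Note that each child lookup here is itself a dictionary access, so this shows the shape of the structure rather than demonstrating the small-constant claim above.

```python
class TrieNode:
    """One node of the trie; hypothetical helper for this sketch."""
    __slots__ = ("children", "terminal")

    def __init__(self):
        self.children = {}     # maps a single character to the child node
        self.terminal = False  # True if a stored string ends exactly here


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        """Add a string, creating nodes along its path as needed."""
        node = self.root
        for ch in word:
            child = node.children.get(ch)
            if child is None:
                child = TrieNode()
                node.children[ch] = child
            node = child
        node.terminal = True

    def contains(self, word):
        """Exact-match lookup: one step per character, O(M) steps in total."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False  # no stored string has this prefix
        return node.terminal  # a mere prefix of a stored string does not count


trie = Trie(["apple", "app", "banana"])
```

With this data, `trie.contains("app")` is true because "app" was inserted, while `trie.contains("appl")` is false because it is only a prefix of a stored string.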

当梦初醒 2024-12-11 17:40:01

Maybe a trie data structure.

A trie, or prefix tree, is an ordered tree data structure that is used to store an associative array where the keys are usually strings.

装纯掩盖桑 2024-12-11 17:40:01

Use a TreeMap in that case. Search and retrieval will be O(log n). With a HashMap, lookup is O(1) on average but can degrade to O(n) in the worst case.

For 200,000 values it probably won't matter much, though, unless you are working under hardware constraints. I have used HashMaps with 2 million strings and they were still fast enough. YMMV.
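As a rough illustration of the O(log n) ordered lookup, the sketch below uses a sorted list with binary search as a stand-in for a TreeMap. This is a swapped-in technique, not a balanced tree, and it is in Python rather than PHP or Java; `build_index` and `contains` are hypothetical helper names. It fits the question's constraint that insertion happens once: sort up front, then only search.

```python
from bisect import bisect_left


def build_index(values):
    """One-time build step: deduplicate and sort the strings."""
    return sorted(set(values))


def contains(index, key):
    """Binary search: O(log n) string comparisons per lookup."""
    i = bisect_left(index, key)  # leftmost position where key could be inserted
    return i < len(index) and index[i] == key


index = build_index(["pear", "apple", "fig", "apple", "kiwi"])
```

Each of the O(log n) steps compares whole strings, so the per-lookup cost also depends on string length.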

一花一树开 2024-12-11 17:40:01

You can use B+ trees if you want to keep search cost minimal at the expense of insertion time.

You can also try bucket push and search.

枕梦 2024-12-11 17:40:01

Use a hash map. Assuming an implementation similar to Java's and a normal collision rate, retrieval is O(m): the main cost is computing the hash code, followed by one string compare. That's hard to beat.

For any tree or trie implementation, factor in the hard-to-quantify cost of the extra pipeline stalls caused by additional non-localized data fetches. The only reason to use one (a trie, in particular) would be to possibly save memory, and memory is saved only with long strings. With short strings, the savings from reduced character storage are more than offset by all the additional pointers or indices.

Fine print: worse behavior can occur when there are lots of hash code collisions due to an ill-chosen hash function. Your mileage may vary. But it probably won't.

I don't do PHP, so there may be language characteristics that skew the answer here.
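A minimal sketch of the hash-based approach, assuming the 200,000 strings fit in memory. It is in Python for illustration only; the generated `key-…` strings are made-up sample data and `build_lookup` is a hypothetical helper. In PHP the closest analogue would presumably be an associative array keyed by the strings and checked with `isset()`, since PHP arrays are hash tables.

```python
def build_lookup(values):
    """One-time build: hash every string into an immutable set."""
    return frozenset(values)


# Made-up sample data standing in for the real 200,000 strings.
values = [f"key-{i}" for i in range(200_000)]
lookup = build_lookup(values)

# Exact-match membership test: hash the key, then compare within the bucket.
print("key-199999" in lookup)  # True
print("key-200000" in lookup)  # False
```

If each string needs associated data, a dict keyed by the string works the same way for lookups.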
