BST or hash table?

Published 2024-11-06 04:02:33

I have large text files upon which all kinds of operations need to be performed, mostly involving row by row validations. The data are generally of a sales / transaction nature, and thus tend to contain a huge amount of redundant information across rows, such as customer names. Iterating and manipulating this data has become such a common task that I'm writing a library in C that I hope to make available as a Python module.

In one test, I found that out of 1.3 million column values, only ~300,000 were unique. Memory overhead is a concern, as our Python-based web application could be handling simultaneous requests for large data sets.

My first attempt was to read in the file and insert each column value into a binary search tree. If the value has never been seen before, memory is allocated to store the string, otherwise a pointer to the existing storage for that value is returned. This works well for data sets of ~100,000 rows. Much larger and everything grinds to a halt, and memory consumption skyrockets. I assume the overhead of all those node pointers in the tree isn't helping, and using strcmp for the binary search becomes very painful.
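For reference, the lookup-or-insert just described amounts to something like the sketch below. This is a minimal illustration of the idea, not the asker's actual code, and the names (struct node, intern) are made up; it stores one canonical copy of each string and hands back a pointer to that copy on every later occurrence.

#include <stdlib.h>
#include <string.h>

struct node {
    char        *str;
    struct node *left, *right;
};

/* Return the canonical copy of s, allocating it only the first time it is seen. */
const char *intern(struct node **root, const char *s)
{
    while (*root) {
        int cmp = strcmp(s, (*root)->str);
        if (cmp == 0)
            return (*root)->str;                /* seen before: share the existing storage */
        root = (cmp < 0) ? &(*root)->left : &(*root)->right;
    }

    struct node *n = malloc(sizeof *n);
    if (!n)
        return NULL;
    size_t len = strlen(s) + 1;
    n->str = malloc(len);
    if (!n->str) {
        free(n);
        return NULL;
    }
    memcpy(n->str, s, len);                     /* first occurrence: keep one copy */
    n->left = n->right = NULL;
    *root = n;
    return n->str;
}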

This unsatisfactory performance leads me to believe I should invest in using a hash table instead. This, however, raises another point -- I have no idea ahead of time how many records there are. It could be 10, or ten million. How do I strike the right balance of time / space to prevent resizing my hash table again and again?

What are the best data structure candidates in a situation like this?

Thank you for your time.

Comments (3)

在巴黎塔顶看东京樱花 2024-11-13 04:02:33

Hash table resizing isn't a concern unless you have a requirement that each insert into the table should take the same amount of time. As long as you always expand the hash table size by a constant factor (e.g. always increasing the size by 50%), the computational cost of adding an extra element is amortized O(1). This means that n insertion operations (when n is large) will take an amount of time that is proportionate to n - however, the actual time per insertion may vary wildly (in practice, one of the insertions will be very slow while the others will be very fast, but the average of all operations is small). The reason for this is that when you insert an extra element that forces the table to expand from e.g. 1000000 to 1500000 elements, that insert will take a lot of time, but now you've bought yourself 500000 extremely fast future inserts before you need to resize again. In short, I'd definitely go for a hash table.
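To make the growth policy concrete, here is a minimal sketch of the kind of check this answer describes. It is an illustration rather than any particular implementation; struct table, table_maybe_grow, and the 0.75 load-factor threshold are assumptions, and the actual rehash of existing entries is left as a comment.

#include <stdlib.h>

struct table {
    void  **slots;   /* bucket array; bucket layout not shown here */
    size_t  cap;     /* current number of slots */
    size_t  count;   /* number of stored entries */
};

/* Grow by a constant factor (50%) once the table is about 75% full.
   Each expensive resize pays for the roughly cap/2 cheap inserts that
   follow it, which is what makes the per-insert cost amortized O(1). */
static int table_maybe_grow(struct table *t)
{
    if (t->count * 4 < t->cap * 3)              /* load factor still below 0.75 */
        return 0;

    size_t new_cap = t->cap < 8 ? 8 : t->cap + t->cap / 2;
    void **grown = calloc(new_cap, sizeof *grown);
    if (!grown)
        return -1;

    /* ... reinsert every entry from t->slots into grown here ... */

    free(t->slots);
    t->slots = grown;
    t->cap   = new_cap;
    return 0;
}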

只是我以为 2024-11-13 04:02:33

You need to use incremental resizing of your hash table. In my current project, I keep track of the hash key size used in every bucket, and if that size is below the current key size of the table, then I rehash that bucket on an insert or lookup. On a resizing of the hash table, the key size doubles (add an extra bit to the key) and in all the new buckets, I just add a pointer back to the appropriate bucket in the existing table. So if n is the number of hash buckets, the hash expand code looks like:

n = n * 2;
bucket = realloc(bucket, sizeof(*bucket) * n);  /* sizeof(*bucket): size of one slot, not of the pointer */
for (i = 0, j = n / 2; j < n; i++, j++) {
  bucket[j] = bucket[i];                        /* each new slot starts out aliasing its partner in the old half */
}
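
The snippet above only covers the expansion step; the lazy, per-bucket rehash it relies on could look roughly like the sketch below. This is an illustration rather than the answerer's code: it assumes chained buckets, a per-slot record of the key size the chain was last hashed with, and that one doubling is fully absorbed before the next one happens (struct entry, split_bucket, and find are made-up names).

#include <stddef.h>
#include <string.h>

struct entry {
    unsigned long  hash;      /* full hash of the key */
    const char    *key;
    struct entry  *next;
};

struct bucket {
    struct entry *head;
    unsigned      bits;       /* key size this chain was last hashed with */
};

/* Bring a stale slot up to date: split the chain it still shares with its
   partner from before the last doubling, using the newly added hash bit. */
static void split_bucket(struct bucket *tab, size_t idx, unsigned table_bits)
{
    unsigned bit = table_bits - 1;                  /* the bit added by the last doubling */
    size_t lo = idx & (((size_t)1 << bit) - 1);     /* slot in the old half */
    size_t hi = lo | ((size_t)1 << bit);            /* its partner in the new half */
    struct entry *e = tab[lo].head, *keep = NULL, *move = NULL;

    while (e) {
        struct entry *next = e->next;
        if (e->hash & ((unsigned long)1 << bit)) { e->next = move; move = e; }
        else                                     { e->next = keep; keep = e; }
        e = next;
    }
    tab[lo].head = keep;  tab[lo].bits = table_bits;
    tab[hi].head = move;  tab[hi].bits = table_bits;
}

/* Lookup that rehashes a stale bucket on the way in, as described above. */
struct entry *find(struct bucket *tab, unsigned table_bits,
                   unsigned long hash, const char *key)
{
    size_t idx = hash & (((unsigned long)1 << table_bits) - 1);

    if (tab[idx].bits < table_bits)                 /* not touched since the last doubling */
        split_bucket(tab, idx, table_bits);

    for (struct entry *e = tab[idx].head; e; e = e->next)
        if (e->hash == hash && strcmp(e->key, key) == 0)
            return e;
    return NULL;
}

With something like this in place, the doubling loop shown in the answer stays cheap per slot, and the actual moving of entries is spread across later lookups and inserts.
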
亣腦蒛氧 2024-11-13 04:02:33

"library in C that I hope to make available as a Python module"

Python already has very efficient finely-tuned hash tables built in. I'd strongly suggest that you get your library/module working in Python first. Then check the speed. If that's not fast enough, profile it and remove any speed-humps that you find, perhaps by using Cython.

setup code:

shared_table = {}
string_sharer = shared_table.setdefault

scrunching each input row:

for i, field in enumerate(fields):
    # setdefault(field, field): returns the copy already stored, or stores this one
    fields[i] = string_sharer(field, field)

You may of course find after examining each column that some columns don't compress well and should be excluded from "scrunching".
