What data structure to use? (hashmap vs. trie vs. ?)
I have a C function that produces about 6 million unique arrays. These arrays always have 17 elements each, and each element is an integer from 0 to 16. I also have a slightly modified version of that function that will also produce about 6 million unique arrays of the same kind. My problem is that the second one produces about 45,000 fewer results than the first, and I'd like to see what those results are.
So my approach is to simply store all the results of the second function (a quick calculation tells me this should not take more than 400 MB, which is fine to keep in memory) and then look up the results of the first, printing out the ones that don't exist.
Assuming the general approach makes sense (and if not, do tell), what I am looking for is an appropriate data structure (ideally with a good implementation in C) that can hold about 6 million unique permutations of
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
(or some transformation thereof) and then perform fast membership testing on them. As the title says, I do have some suspicions about which data structures may do the job, but I am not certain tries or hashmaps are the best choice for this.
This is an algorithm to detect a flaw in another algorithm, not something that will be used in production. I am interested in doing this in a way that can be coded and will return results relatively quickly in human terms, not necessarily shaving off milliseconds, so the existence of easy-to-grok libraries that will do most of the job is definitely a plus.
Optimality would kind of depend on how the permutations are distributed and the ratio of insertions to searches. Since you are not concerned with optimality, but just want a straightforward way to test a hypothesis without waiting all night for results, my gut says:
An integer in [0,16] can be represented as a five-bit number, so seventeen of them can be represented as an 85-bit (11-byte) binary string. So you can just use one of the many libraries available for storing sorted/hashed sets of strings with membership tests on them, and be done. It won't be quite as fast or cache-coherent as a tuned trie, but it'll be good enough to grind through 66 MB of data in a few seconds, and you'll be done by lunch.
If no such library is conveniently to hand and you have to work from scratch, I'd just make a sorted list of the strings and then do the membership tests via binary search. That works out to something like O(n log n) for the sort plus O(m log n) for the m lookups, i.e. O((n+m) log n) overall. If this is only being run as an offline job once or twice, that might be good enough; if you're going to do this more than once a day, I'd worry about cache locality and use a trie or B-tree.
I think you need to weigh up the value of doing this in C just to avoid communication between languages.
I would print each array from C line-by-line as space-separated integers. Then load that file to create a set of byte arrays like this (F# code):
and then compute the set difference between two files like this:
That would probably take a few minutes to run.
Keeping it simple:

1. store all the results of the second run in one big array of 17-byte records;
2. sort it with qsort, using a comparator that just calls memcmp(left, right, 17);
3. for each result of the first run, look it up with bsearch (using the same comparator).

Each of the last two steps will perform something of the order of 6M * log(6M) comparisons, which is about 138M. Which is probably still less time than it takes to write the code. Which isn't long, since everything is so simple :-)
Depends on which one would get better memory performance in your case. Also on what hash function you use, how you resolve collisions, etc. How about checking out a Hash Array Mapped Trie (HAMT)?
@Steve Jessop You can do the last step in linear time, doing a smarter search that skips over values of the array being searched once they can no longer match:
Let's suppose n is the size of A and m is the size of B.
This should run in O(n+m) time, since every step of the algorithm advances at least one counter.
a) create a struct that contains two 64-bit ints
b) since each result has 17 elements, combine the first 8 into the first int and the remaining 9 into the second (e.g. treating the elements as base-17 digits, so that distinct arrays get distinct keys)
c) create an operator< for your struct
d) create a set of your struct and insert all of your results from your first run
e) iterate through your second run's results and do a set::find()
Edwin