We don’t allow questions seeking recommendations for software libraries, tutorials, tools, books, or other off-site resources. You can edit the question so it can be answered with facts and citations.
Closed 4 years ago.
I recently went down this path as well; though it sounds like my application was slightly different. I was interested in approximating set operations on a large number of strings.
You do make the key observation that a fast bit vector is required. Depending on what you want to put in your bloom filter, you may also need to give some thought to the speed of the hashing algorithm(s) used. You might find this library useful. You may also want to tinker with the random number technique used below that only hashes your key a single time.
In terms of non-Java bit array implementations:
I built my bloom filter using BitVector. I spent some time profiling and optimizing the library and contributed my patches back to Avi. Go to that BitVector link and scroll down to the acknowledgments in v1.5 to see details. In the end, I realized that performance was not a goal of this project and decided against using it.
Here's some code I had lying around. I may put this up on google code at python-bloom. Suggestions welcome.
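The code itself didn't survive here; as a hedged sketch of the approach described above (hash the key only once, then use that hash to seed a PRNG that yields the probe positions — the class and method names are hypothetical, not the original code):

```python
import random
from array import array

class BloomFilter:
    # A minimal sketch, not the author's original code. One hash of the
    # key seeds random.Random, which then yields num_probes bit positions,
    # so the key is hashed only a single time. Note that Python's built-in
    # hash() is randomized per process for strings, so the filter is only
    # stable within one run.
    def __init__(self, num_bits, num_probes):
        self.num_bits = num_bits
        self.num_probes = num_probes
        self.bits = array('L', [0]) * ((num_bits + 31) // 32)

    def _positions(self, key):
        rng = random.Random(hash(key))
        for _ in range(self.num_probes):
            yield rng.randrange(self.num_bits)

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos >> 5] |= 1 << (pos & 0x1F)

    def __contains__(self, key):
        return all(self.bits[pos >> 5] & (1 << (pos & 0x1F))
                   for pos in self._positions(key))
```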
Also, in my case I found it useful to have a faster count_bits function for BitVector. Drop this code into BitVector 1.5 and it should give you a more performant bit counting method:
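The patch itself isn't reproduced here; a table-driven bit count in the same spirit might look like this (a sketch, not the actual BitVector 1.5 patch):

```python
# A sketch of a table-driven popcount, not the original BitVector patch.
# Precompute the bit count of every byte value once, then sum per-byte
# lookups instead of testing bits one at a time.
_BYTE_COUNTS = [bin(i).count('1') for i in range(256)]

def count_bits(int_value):
    count = 0
    while int_value:
        count += _BYTE_COUNTS[int_value & 0xFF]
        int_value >>= 8
    return count
```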
In reaction to Parand's comment that "common practice seems to be using something like SHA1 and split up the bits to form multiple hashes": while that may be true in the sense that it's common practice (PyBloom also uses it), that still doesn't mean it's the right thing to do ;-)
For a Bloom filter, the only requirement a hash function has is that its output space must be uniformly distributed given the expected input. While a cryptographic hash certainly fulfils this requirement, it's also a little bit like shooting a fly with a bazooka.
Instead, try the FNV hash, which uses just one XOR and one multiplication per input byte; I estimate it's a few hundred times faster than SHA1 :)
The FNV hash is not cryptographically secure, but you don't need it to be. It has slightly imperfect avalanche behaviour, but you're not using it for integrity checking either.
About uniformity, note that the second link only did a Chi-square test for the 32-bit FNV hash. It's better to use more bits and the FNV-1 variant, which swaps the XOR and the MUL steps for better bit dispersion. For a Bloom filter there are a few more catches, such as mapping the output uniformly to the index range of the bit array. If possible, I'd say round up the size of the bit array to the nearest power of 2 and adjust k accordingly. That way you get better accuracy, and you can use simple XOR-folding to map the range.
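As an illustration, 32-bit FNV-1 (multiply first, then XOR) plus XOR-folding down to a power-of-2 index range could be sketched like this, using the published FNV parameters; the folding helper is my own naming, not from any particular library:

```python
# Published 32-bit FNV parameters.
FNV_PRIME_32 = 16777619
FNV_OFFSET_32 = 2166136261

def fnv1_32(data):
    # FNV-1: multiply by the prime first, then XOR in the byte.
    # (FNV-1a does these two steps in the opposite order.)
    h = FNV_OFFSET_32
    for byte in data:
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF
        h ^= byte
    return h

def fold_to_index(h, num_bits_log2):
    # XOR-fold the hash down into [0, 2**num_bits_log2), assuming the
    # bit array size was rounded up to a power of 2 as suggested above.
    mask = (1 << num_bits_log2) - 1
    while h >> num_bits_log2:
        h = (h >> num_bits_log2) ^ (h & mask)
    return h
```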
Additionally, here's a reference explaining why you don't want SHA1 (or any cryptographic hash) when you need a general purpose hash.
Eventually I found pybloomfiltermap. I haven't used it, but it looks like it'd fit the bill.
Look at the array module.

FWIW, all of the // 8 and % 8 operations can be replaced with >> 3 and & 0x07. This may lead to slightly better speed at the risk of some obscurity.

Also, changing 'B' and 8 to 'L' and 32 should be faster on most hardware. [Changing to 'H' and 16 might be faster on some hardware, but it's doubtful.]
It's almost a decade since the most recent answers on here. And times do change.
Looks like the most popular maintained bloom filter package in late 2019 is now this one: https://github.com/joseph-fox/python-bloomfilter, available on PyPi as pybloom_live: https://pypi.org/project/pybloom_live/
I've put up a python bloom filter implementation at http://stromberg.dnsalias.org/~strombrg/drs-bloom-filter/
It's in pure python, has good hash functions, good automated tests, a selection of backends (disk, array, mmap, and more), and more intuitive arguments to the __init__ method, so you can specify an ideal number of elements and a desired maximum error rate, instead of somewhat ethereal, datastructure-specific tunables.
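For reference, the standard sizing math behind that kind of interface, for an ideal element count n and target error rate p, is m = -n·ln(p)/(ln 2)² bits and k = (m/n)·ln 2 hash functions. A quick sketch of that conversion (my own helper, not this library's actual code):

```python
import math

def bloom_parameters(n, p):
    # Classic Bloom filter sizing: given n expected elements and a target
    # false-positive rate p, return (m bits, k hash functions).
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round((m / n) * math.log(2)))
    return m, k
```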