Parsing bulk text with Hadoop: best practices for generating keys
I have a 'large' set of line delimited full sentences that I'm processing with Hadoop. I've developed a mapper that applies some of my favorite NLP techniques to it. There are several different techniques that I'm mapping over the original set of sentences, and my goal during the reducing phase is to collect these results into groups such that all members in a group share the same original sentence.
I feel that using the entire sentence as a key is a bad idea. I felt that generating some hash value of the sentence may not work because of a limited number of keys (unjustified belief).
Can anyone recommend the best idea/practice for generating unique keys for each sentence? Ideally, I would like to preserve order. However, this isn't a main requirement.
Aντίο,
Standard hashing should work fine. Most hash algorithms have a value space far greater than the number of sentences you're likely to be working with, and thus the likelihood of a collision will still be extremely low.
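To make "extremely low" concrete, the usual birthday-bound approximation can be checked directly. This is a sketch of that arithmetic, not anything Hadoop-specific; the `collision_probability` helper is just an illustration of the standard formula:

```python
import math

def collision_probability(n: int, bits: int) -> float:
    """Approximate probability of any collision among n values drawn
    uniformly from a space of 2**bits, via the birthday bound:
    p ~= 1 - exp(-n^2 / 2^(bits+1)).
    -expm1 is used instead of 1 - exp to keep precision when the
    exponent is tiny."""
    return -math.expm1(-n * n / 2.0 ** (bits + 1))

# Even a billion sentences under a 128-bit hash (e.g. MD5) gives a
# collision probability on the order of 1e-21.
print(collision_probability(10**9, 128))
```

In other words, the "limited number of keys" worry is unjustified for any standard-width hash: the value space dwarfs any realistic sentence corpus.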
Despite the answer that I've already given you about what a proper hash function might be, I would really suggest you just use the sentences themselves as the keys unless you have a specific reason why this is problematic.
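For a Hadoop Streaming job, using the sentence itself as the key is a one-liner in the mapper. A minimal sketch, where `apply_technique` is a hypothetical stand-in for whichever NLP step the mapper runs:

```python
import sys

def apply_technique(sentence: str) -> str:
    # Hypothetical placeholder for one of your NLP techniques.
    return sentence.upper()

def run_mapper(lines, out=sys.stdout):
    """Emit tab-separated (key, value) records in Hadoop Streaming
    format, with the original sentence as the key. The shuffle phase
    then groups every technique's output for the same sentence into
    the same reduce call."""
    for line in lines:
        sentence = line.rstrip("\n")
        out.write(f"{sentence}\t{apply_technique(sentence)}\n")

if __name__ == "__main__":
    run_mapper(sys.stdin)
```

One caveat worth noting: Streaming treats the first tab as the key/value separator, so sentences containing tabs would need escaping or a different separator before this approach is safe.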
Though you might want to avoid simple hash functions (for example, any half-baked idea that you could think up quickly) because they might not mix up the sentence data enough to avoid collisions in the first place, one of the standard cryptographic hash functions would probably be quite suitable, for example, MD5, SHA-1, or SHA-256.
You can use MD5 for this, even though collisions have been found and the algorithm is considered unsafe for security-sensitive purposes. This isn't a security-critical application, and the collisions that have been found arose through carefully constructed data and probably won't arise randomly in your own NLP sentence data. (See, for example, Johannes Schindelin's explanation of why it's probably unnecessary to change git to use SHA-256 hashes, so that you can appreciate the reasoning behind this.)
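Deriving an MD5 key per sentence takes a couple of lines with the standard library. A sketch (the function name is just illustrative):

```python
import hashlib

def md5_key(sentence: str) -> str:
    """Return a fixed-width 32-character hex key for a sentence.
    MD5's known collisions require adversarially crafted inputs,
    which natural-language sentences are not, so it is fine for
    grouping keys even though it is broken for security uses."""
    return hashlib.md5(sentence.encode("utf-8")).hexdigest()

print(md5_key("I have a 'large' set of line delimited full sentences."))
```

Note that hashing does discard the order of the original file; if order matters later, you could prepend or carry along the input line number as part of the value rather than the key.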