Constructing a unique number for a string in Java

Posted 2024-09-06 01:57:40


We have a requirement of reading/writing more than 10 million strings into a file. Also, we do not want duplicates in the file. Since the strings are flushed to the file as soon as they are read, we are not keeping them in memory.

We cannot use hashCode because collisions in the hash code could cause us to miss a duplicate string.
Two other approaches I found while googling:

1. Use a message digest algorithm like MD5 - but it might be too costly to calculate and store.

2. Use a checksum algorithm. [I am not sure if this produces a unique key for a string - can someone please confirm?]

Is there any other approach available?
Thanks.
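
For reference, here is a rough sketch of what I mean by the two options (class and method names are mine; HexFormat needs Java 17+):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.HexFormat;
    import java.util.zip.CRC32;

    public class KeyDemo {
        // 128-bit MD5 digest: collisions are astronomically unlikely, but not impossible.
        static String md5Key(String s) throws NoSuchAlgorithmException {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest);
        }

        // 32-bit CRC32 checksum: with 10 million strings, collisions are practically
        // guaranteed, so a checksum alone cannot act as a unique key.
        static long crc32Key(String s) {
            CRC32 crc = new CRC32();
            crc.update(s.getBytes(StandardCharsets.UTF_8));
            return crc.getValue();
        }

        public static void main(String[] args) throws NoSuchAlgorithmException {
            System.out.println(md5Key("hello world"));
            System.out.println(crc32Key("hello world"));
        }
    }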


Comments (6)

天生の放荡 2024-09-13 01:57:40


If you're okay with a microscopic risk of collisions, you could use some hash function such as MD5, as you suggest, and rely on the hashes.

Another alternative, possibly with a larger memory footprint, is to store the already-encountered strings in a trie (a special type of tree).


Update: Yet another alternative would be to use a Bloom filter. This still relies on hashing, but it can be tuned to have an arbitrarily small probability of collisions.
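
A rough sketch of the Bloom filter route, assuming Guava's BloomFilter is on the classpath (the capacity and false-positive rate below are just illustrative):

    import java.nio.charset.StandardCharsets;

    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;

    public class BloomDedup {
        public static void main(String[] args) {
            // Sized for ~10 million strings with a 0.1% false-positive rate.
            // A false positive means a new string is wrongly treated as a duplicate.
            BloomFilter<CharSequence> seen = BloomFilter.create(
                    Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000_000, 0.001);

            for (String s : new String[] {"alpha", "beta", "alpha"}) {
                // put() returns true when the bits changed, i.e. the string was
                // definitely not seen before; false means "probably a duplicate".
                if (seen.put(s)) {
                    System.out.println("writing: " + s);
                } else {
                    System.out.println("skipping probable duplicate: " + s);
                }
            }
        }
    }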

落墨 2024-09-13 01:57:40


Storing 10 million strings in memory is indeed a lot, so I understand the reason to write them to the file immediately instead of storing them in e.g. a TreeSet<String> first. But where would you like to store the 10 million unique numerical keys which you want to compare against? If you want the keys to be unique and numerical (digits have a much smaller base/radix than letters), you can't make a key shorter than the string itself, so you won't save any memory. At best you could use data compression such as GZIP, but that would only add a lot of overhead. MD5 is also inappropriate since two different strings can yield the same hash.

I really see no better solution for this than using a decent RDBMS (SQL database) in which you set the column as UNIQUE and handle the constraint violation accordingly. An RDBMS is highly optimized for this kind of task.
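
A rough sketch of that approach with plain JDBC; the connection URL, table, and column names are placeholders, and some drivers report the violation only as a generic SQLException whose SQL state you have to inspect:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.sql.SQLIntegrityConstraintViolationException;

    public class DbDedup {
        public static void main(String[] args) throws SQLException {
            // Assumes a table created as: CREATE TABLE strings (value VARCHAR(255) UNIQUE)
            try (Connection con = DriverManager.getConnection("jdbc:h2:./dedup");
                 PreparedStatement insert =
                         con.prepareStatement("INSERT INTO strings (value) VALUES (?)")) {
                for (String s : new String[] {"alpha", "beta", "alpha"}) {
                    try {
                        insert.setString(1, s);
                        insert.executeUpdate();                    // first occurrence: accepted
                    } catch (SQLIntegrityConstraintViolationException dup) {
                        // UNIQUE constraint rejected the row: duplicate, ignore it.
                    }
                }
            }
        }
    }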

If you really can't consider a database, then you need to re-read the file for any existing entry before the write/flush. Maybe not very fast, but certainly memory efficient.

╰沐子 2024-09-13 01:57:40


There is no way to make a function that produces a unique key for a string that is shorter than the string itself.
There are data structures which can solve your task. A B-tree might fit if your data is large enough. Depending on the nature of your input, there might be more effective ways.

客…行舟 2024-09-13 01:57:40


Reliably removing duplicates is pretty much as difficult as sorting the file. As another answer indicates, there is no guaranteed way of precisely detecting duplicates without keeping a full copy of each string in memory, which seems to be exactly what you're trying to avoid.

You could keep an in-memory or on-disk index of hashcodes, and use these to retrieve actual strings from file storage for comparison, but this would essentially duplicate what a database would be able to do for you.

An alternative is to post-process the file once it's complete. The UNIX sort command is pretty good at large files (How could the UNIX sort command sort a very large file?), so I'd expect the standard UNIX command-line approach to work reasonably:

    sort my-file-of-strings.txt | uniq > my-filtered-file-of-strings.txt

(Note that files have to be sorted first before passing to uniq to remove duplicates).

If you haven't got these tools (or equivalents) available, then you can always try implementing some variant of an external merge sort yourself.
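
If you prefer to stay in Java, the uniq step itself is a single streaming pass once the file is sorted (whether by the UNIX sort command or your own external merge sort), because duplicates end up on adjacent lines. A rough sketch with made-up file names:

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class UniqPass {
        public static void main(String[] args) throws IOException {
            // The input must already be sorted, so only adjacent lines need comparing.
            try (BufferedReader in = Files.newBufferedReader(Path.of("sorted-strings.txt"));
                 BufferedWriter out = Files.newBufferedWriter(Path.of("unique-strings.txt"))) {
                String previous = null;
                String line;
                while ((line = in.readLine()) != null) {
                    if (!line.equals(previous)) {     // first time this string appears
                        out.write(line);
                        out.newLine();
                    }
                    previous = line;
                }
            }
        }
    }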

背叛残局 2024-09-13 01:57:40


If the strings are from a fixed pool of possible strings (N), then you can use minimal perfect hashing to create an array 0...N-1. A zero in the slot determined by the perfect hash function means the string has not been seen so far.
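
Wired up, that looks roughly like the sketch below; perfectHash is a stand-in for a real minimal perfect hash built over your fixed pool of N strings (the JDK does not provide one), so treat it as a placeholder:

    import java.util.BitSet;

    public class PerfectHashDedup {
        static final int N = 10_000_000;           // size of the fixed pool of possible strings
        static final BitSet seen = new BitSet(N);  // one bit per possible string

        // Placeholder: a real minimal perfect hash maps every string in the known
        // pool to a distinct slot in 0..N-1 with no collisions. Building one needs
        // a library or a table precomputed over the pool.
        static int perfectHash(String s) {
            throw new UnsupportedOperationException("plug in a minimal perfect hash here");
        }

        // Returns true exactly once per distinct string from the pool.
        static boolean firstTimeSeen(String s) {
            int slot = perfectHash(s);
            if (seen.get(slot)) {
                return false;                       // already encountered
            }
            seen.set(slot);
            return true;
        }
    }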

Otherwise, short of keeping a lot of data in memory or using the solutions suggested so far, the only strictly correct approach is to re-read the file before deciding to write the string to it.

You could do this about as efficiently as possible by memory-mapping portions of the file.

(り薆情海 2024-09-13 01:57:40


I really think the best solution is - as someone else already suggested - to use a database.

If for some reason you cannot use a database, you can still use a hash code. Sure, there will be collisions. Just add some code so that when you detect a duplicate hash code, your program checks the file to determine whether it is a genuine duplicate or a collision.
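
A rough sketch of that idea (the file name and helper names are mine): keep only the 32-bit hash codes in memory and fall back to scanning the file when a hash has already been seen:

    import java.io.BufferedReader;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.HashSet;
    import java.util.Set;

    public class HashThenVerify {
        static final Path FILE = Path.of("strings.txt");        // output file (made-up name)
        static final Set<Integer> seenHashes = new HashSet<>(); // one boxed hash per distinct string

        static void writeIfNew(String s) throws IOException {
            // New hash: the string cannot be in the file yet. Seen hash: scan the
            // file to tell a genuine duplicate from a mere hashCode collision.
            if (seenHashes.add(s.hashCode()) || !existsInFile(s)) {
                try (PrintWriter out = new PrintWriter(new FileWriter(FILE.toFile(), true))) {
                    out.println(s);                              // append to the file
                }
            }
        }

        // Slow path, only taken when the hash code matched an earlier string.
        static boolean existsInFile(String s) throws IOException {
            try (BufferedReader in = Files.newBufferedReader(FILE)) {
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.equals(s)) {
                        return true;
                    }
                }
            }
            return false;
        }
    }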
