Finding duplicates in a large file

Posted on 2025-01-04 06:21:16

I have a really large file with approximately 15 million entries.
Each line in the file contains a single string (call it a key).

I need to find the duplicate entries in the file using java.
I tried to use a hashmap and detect duplicate entries.
Apparently that approach is throwing me a "java.lang.OutOfMemoryError: Java heap space" error.

How can I solve this problem?

I think I could increase the heap space and try it, but I wanted to know if there are more efficient solutions that don't require tweaking the heap space.

9 Answers

守护在此方 2025-01-11 06:21:16

The key is that your data will not fit into memory. You can use external merge sort for this:

Partition your file into multiple smaller chunks that fit into memory. Sort each chunk and eliminate the duplicates (which are now neighboring elements).

Merge the chunks and again eliminate the duplicates while merging. Since you will have an n-way merge here, you can keep the next k elements from each chunk in memory; once the items for a chunk are depleted (they have already been merged), grab more from disk.
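
A rough Java sketch of this approach, assuming the keys are plain text lines; the chunk size, the ChunkCursor helper and the use of temp files are illustrative choices, not part of the original answer:

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

public class ExternalDedup {
    static final int CHUNK_LINES = 1_000_000;  // tune to the available heap

    public static void main(String[] args) throws IOException {
        Path input = Paths.get(args[0]);

        // Phase 1: read the input in chunks, sort each chunk in memory,
        // drop duplicates inside the chunk and write it to a temp file.
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            List<String> buffer = new ArrayList<>(CHUNK_LINES);
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() == CHUNK_LINES) {
                    chunks.add(writeSortedChunk(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) chunks.add(writeSortedChunk(buffer));
        }

        // Phase 2: n-way merge of the sorted chunks; equal keys from
        // different chunks become adjacent, so duplicates are easy to spot.
        PriorityQueue<ChunkCursor> heap =
                new PriorityQueue<>(Comparator.comparing((ChunkCursor c) -> c.current));
        for (Path chunk : chunks) {
            ChunkCursor c = new ChunkCursor(chunk);
            if (c.current != null) heap.add(c);
        }
        String previous = null;
        while (!heap.isEmpty()) {
            ChunkCursor c = heap.poll();
            if (c.current.equals(previous)) {
                System.out.println("Duplicate key: " + c.current);
            }
            previous = c.current;
            if (c.advance()) heap.add(c);
        }
    }

    // Sort one chunk, report duplicates inside it and write the unique lines.
    static Path writeSortedChunk(List<String> lines) throws IOException {
        Collections.sort(lines);
        Path tmp = Files.createTempFile("chunk", ".txt");
        try (BufferedWriter out = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
            String prev = null;
            for (String s : lines) {
                if (!s.equals(prev)) { out.write(s); out.newLine(); }
                else System.out.println("Duplicate key: " + s);  // duplicate within the chunk
                prev = s;
            }
        }
        return tmp;
    }

    // Small helper that walks one sorted chunk file line by line.
    static class ChunkCursor {
        final BufferedReader reader;
        String current;
        ChunkCursor(Path p) throws IOException {
            reader = Files.newBufferedReader(p, StandardCharsets.UTF_8);
            current = reader.readLine();
        }
        boolean advance() throws IOException {
            current = reader.readLine();
            if (current == null) { reader.close(); return false; }
            return true;
        }
    }
}

Duplicates inside a chunk are reported while the chunk is written; duplicates across chunks become adjacent during the n-way merge, so a single "previous key" comparison is enough there.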

合约呢 2025-01-11 06:21:16

I'm not sure if you'd consider doing this outside of Java, but if so, this is very simple in a shell (uniq -d prints only the lines that occur more than once in the sorted input):

sort file | uniq -d
笨死的猪 2025-01-11 06:21:16

You probably can't load the entire file at one time, but you can store each line's hash and line number in a HashMap without a problem.

Pseudo code...

for line in file
    entries.put(line.hashCode, line-number)
for entry in entries
    if entry.lineNumbers > 1
         fetch each line by line number and compare
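
A minimal Java sketch of that two-pass idea, assuming the keys are text lines; because different strings can share a hash code, the second pass re-reads the candidate lines and compares the actual strings (class and variable names are illustrative):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.*;

public class HashPassDuplicates {
    public static void main(String[] args) throws IOException {
        String path = args[0];

        // Pass 1: map each line's hash code to the line numbers where it occurs.
        Map<Integer, List<Integer>> byHash = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            int lineNo = 0;
            while ((line = in.readLine()) != null) {
                lineNo++;
                byHash.computeIfAbsent(line.hashCode(), h -> new ArrayList<>()).add(lineNo);
            }
        }

        // Keep only hash buckets with more than one line number: these are
        // duplicate *candidates* (different strings can share a hash code).
        Set<Integer> candidateLines = new HashSet<>();
        for (List<Integer> lines : byHash.values()) {
            if (lines.size() > 1) candidateLines.addAll(lines);
        }

        // Pass 2: re-read only the candidate lines and compare the actual strings.
        Map<String, Integer> seen = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            int lineNo = 0;
            while ((line = in.readLine()) != null) {
                lineNo++;
                if (!candidateLines.contains(lineNo)) continue;
                Integer first = seen.putIfAbsent(line, lineNo);
                if (first != null) {
                    System.out.println("Duplicate: \"" + line + "\" (lines " + first + " and " + lineNo + ")");
                }
            }
        }
    }
}
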
无敌元气妹 2025-01-11 06:21:16

I don't think you need to sort the data to eliminate duplicates. Just use a quicksort-inspired approach:

  1. Pick k pivots from the data (unless your data is really wacky, this should be pretty straightforward)
  2. Using these k pivots divide the data into k+1 small files
  3. If any of these chunks are too large to fit in memory repeat the process just for that chunk
  4. Once you have manageable sized chunks just apply your favorite method (hashing?) to find duplicates

Note that k can be equal to 1.
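
A hedged sketch of steps 2 and 4 in Java, assuming the k pivots have already been chosen and that every partition then fits in memory; the pivot values, the file handling and the omission of the recursive step 3 are simplifications:

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

public class PivotPartitionDuplicates {
    public static void main(String[] args) throws IOException {
        Path input = Paths.get(args[0]);
        String[] pivots = {"g", "n", "t"};   // illustrative: k = 3 pre-chosen pivots, in sorted order

        // Step 2: route every line into one of k+1 partition files by comparing it to the pivots.
        List<Path> partitions = new ArrayList<>();
        List<BufferedWriter> writers = new ArrayList<>();
        for (int i = 0; i <= pivots.length; i++) {
            Path p = Files.createTempFile("part" + i + "-", ".txt");
            partitions.add(p);
            writers.add(Files.newBufferedWriter(p, StandardCharsets.UTF_8));
        }
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                int idx = 0;
                while (idx < pivots.length && line.compareTo(pivots[idx]) >= 0) idx++;
                BufferedWriter w = writers.get(idx);
                w.write(line);
                w.newLine();
            }
        }
        for (BufferedWriter w : writers) w.close();

        // Step 4: equal keys always land in the same partition, so each
        // partition can be checked for duplicates independently with a HashSet.
        for (Path part : partitions) {
            Set<String> seen = new HashSet<>();
            try (BufferedReader in = Files.newBufferedReader(part, StandardCharsets.UTF_8)) {
                String line;
                while ((line = in.readLine()) != null) {
                    if (!seen.add(line)) {
                        System.out.println("Duplicate key: " + line);
                    }
                }
            }
        }
    }
}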

心如狂蝶 2025-01-11 06:21:16

One way I can imagine solving this is to first sort the file with an external sorting algorithm (searching for external sort java yields lots of results with code). Then you can iterate over the file line by line; duplicates will now directly follow each other, so you only need to remember the previous line while iterating.
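
Assuming the file has already been sorted by such an external sort, the duplicate scan itself is only a few lines of Java (the file name is an illustrative placeholder):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ScanSortedForDuplicates {
    public static void main(String[] args) throws IOException {
        // The input is assumed to be sorted, so duplicates are adjacent lines.
        try (BufferedReader in = Files.newBufferedReader(
                Paths.get("keys.sorted.txt"), StandardCharsets.UTF_8)) {
            String previous = null;
            String line;
            while ((line = in.readLine()) != null) {
                if (line.equals(previous)) {
                    System.out.println("Duplicate key: " + line);
                }
                previous = line;
            }
        }
    }
}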

夏末染殇 2025-01-11 06:21:16

If you cannot build up a complete list because you don't have enough memory, you might try doing it in loops. I.e. create a hashmap but only store a small portion of the items (for example, those starting with 'A'). Then you gather the duplicates and continue with 'B', etc.

Of course you can select any kind of 'grouping' (e.g. the first 3 characters, the first 6, etc.).

It will only take (many) more iterations.
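
A sketch of that multi-pass grouping in Java, assuming keys start with an ASCII letter and that one letter's group fits in memory; the grouping rule (first character, upper-cased) is the illustrative part, and keys starting with other characters would need an extra pass:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class GroupedDuplicateScan {
    public static void main(String[] args) throws IOException {
        // One full pass over the file per group: here, one pass per first letter A..Z.
        for (char group = 'A'; group <= 'Z'; group++) {
            Set<String> seen = new HashSet<>();
            try (BufferedReader in = Files.newBufferedReader(
                    Paths.get(args[0]), StandardCharsets.UTF_8)) {
                String line;
                while ((line = in.readLine()) != null) {
                    // Only keys belonging to the current group are held in memory.
                    if (line.isEmpty() || Character.toUpperCase(line.charAt(0)) != group) continue;
                    if (!seen.add(line)) {
                        System.out.println("Duplicate key: " + line);
                    }
                }
            }
        }
    }
}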

暗地喜欢 2025-01-11 06:21:16

You might try a Bloom filter, if you're willing to accept a certain amount of statistical error. Guava provides one, but there's a pretty major bug in it right now that should probably be fixed next week with release 11.0.2.
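
A hedged sketch of the Bloom-filter idea using Guava, assuming a reasonably recent Guava version where BloomFilter.create and Funnels.stringFunnel(Charset) are available; because of false positives, the candidates still need an exact check afterwards:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class BloomCandidateScan {
    public static void main(String[] args) throws IOException {
        // ~15 million expected insertions, 1% false-positive rate.
        BloomFilter<String> filter = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), 15_000_000, 0.01);

        // Pass 1: keys the filter claims to have seen are *candidate* duplicates.
        Set<String> candidates = new HashSet<>();
        try (BufferedReader in = Files.newBufferedReader(
                Paths.get(args[0]), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (filter.mightContain(line)) {
                    candidates.add(line);   // may include false positives
                } else {
                    filter.put(line);
                }
            }
        }

        // The candidate set is much smaller than the full key set, so an exact
        // check (e.g. a second pass counting only candidate keys) is now cheap.
        System.out.println("Candidate duplicate keys: " + candidates.size());
    }
}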

等待我真够勒 2025-01-11 06:21:16
#!/bin/bash

# Sort a file and remove duplicate lines using an external merge sort.
# Sorted chunks, intermediate merge results and the final output each
# get their own temporary sub-directory.
#
# Parameters:
#   $1 - the file to sort
#   $2 - the number of lines per chunk
#   $3 - the number of chunks to merge at a time (must be at least 2)
#   $4 - the temporary directory to use
#
# The sorted, de-duplicated file is written to stdout.
# Returns 0 on success, 1 if the file does not exist,
# 2 if the temporary directory does not exist.

# Check that the file exists
if [ ! -f "$1" ]; then
    echo "The file $1 does not exist" >&2
    exit 1
fi

# Check that the temporary directory exists
if [ ! -d "$4" ]; then
    echo "The temporary directory $4 does not exist" >&2
    exit 2
fi

# Temporary directories for the sorted chunks, the merge results
# and the final sorted file
chunk_dir="$4/chunks"
merge_dir="$4/merge"
sort_dir="$4/sort"
mkdir -p "$chunk_dir" "$merge_dir" "$sort_dir"

# Split the file into chunks of $2 lines each
split -l "$2" "$1" "$chunk_dir/chunk"

# Sort each chunk, dropping duplicates within the chunk,
# and remove the unsorted original
for chunk in "$chunk_dir"/chunk*; do
    sort "$chunk" | uniq > "$chunk.sorted"
    rm "$chunk"
done

# Repeatedly merge groups of $3 sorted chunks until only one file remains
while [ "$(ls "$chunk_dir" | wc -l)" -gt 1 ]; do
    # Pick up to $3 chunks for this merge round
    mapfile -t merge_chunks < <(ls -d "$chunk_dir"/* | head -n "$3")
    merge_file="$merge_dir/merge$(date +%s%N)"
    # Merge the already-sorted chunks and drop adjacent duplicates
    sort -m "${merge_chunks[@]}" | uniq > "$merge_file"
    rm "${merge_chunks[@]}"
    # The merged result becomes a chunk for the next round
    mv "$merge_file" "$chunk_dir"
done

# Move the final sorted file to the sort directory
mv "$chunk_dir"/* "$sort_dir"

# Output the sorted file to stdout
cat "$sort_dir"/*

# Remove the temporary directories
rm -rf "$chunk_dir" "$merge_dir" "$sort_dir"
雨落□心尘 2025-01-11 06:21:16

The key is that your data will not fit into memory. (BrokenGlass)
With enough memory to store a Map from key hash values to something that locates the key, such as an offset for RandomAccessFile.seek() or a line number as Andrew White suggested, you can process non-unique keys as they are identified.

Otherwise, in a first pass build a map from hash values to a "maybe seen before" flag (e.g. a 3MB bitmap indexed with key.hashCode() % (3<<23)), and in a second pass handle only the keys from buckets that were hit at least twice.
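
A sketch of the second, two-pass variant, with a java.util.BitSet standing in for the 3MB bitmap; the assumption is that the keys from buckets hit at least twice fit in memory for the exact comparison in the second pass:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class BitmapDuplicateScan {
    static final int BUCKETS = 3 << 23;   // ~25 million bits, about 3 MB

    static int bucket(String key) {
        return Math.floorMod(key.hashCode(), BUCKETS);
    }

    public static void main(String[] args) throws IOException {
        BitSet seenOnce = new BitSet(BUCKETS);
        BitSet seenTwice = new BitSet(BUCKETS);

        // Pass 1: mark which hash buckets are hit at least twice.
        try (BufferedReader in = Files.newBufferedReader(
                Paths.get(args[0]), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                int b = bucket(line);
                if (seenOnce.get(b)) seenTwice.set(b); else seenOnce.set(b);
            }
        }

        // Pass 2: only keys whose bucket was hit twice can be duplicates,
        // so only those are kept in memory and compared exactly.
        Map<String, Integer> counts = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(
                Paths.get(args[0]), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (!seenTwice.get(bucket(line))) continue;
                counts.merge(line, 1, Integer::sum);
            }
        }
        counts.forEach((key, n) -> {
            if (n > 1) System.out.println("Duplicate key: " + key + " (" + n + " occurrences)");
        });
    }
}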
