Finding duplicate entries in a large file
I have a really large file with approximately 15 million entries.
Each line in the file contains a single string (call it a key).
I need to find the duplicate entries in the file using Java.
I tried to use a HashMap and detect duplicate entries.
Apparently that approach throws a "java.lang.OutOfMemoryError: Java heap space" error.
How can I solve this problem?
I think I could increase the heap space and try it, but I wanted to know if there are more efficient solutions that don't require tweaking the heap space.
9 Answers
The key is that your data will not fit into memory. You can use an external merge sort for this:
Partition your file into multiple smaller chunks that fit into memory. Sort each chunk and eliminate the duplicates (which are now neighboring elements).
Merge the chunks and eliminate the duplicates again while merging. Since you will have an n-way merge here, you can keep the next k elements from each chunk in memory; once the items of a chunk are depleted (they have already been merged), grab more from disk.
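A minimal Java sketch of that approach, assuming one key per line as in the question (the chunk size, temp-file handling, and duplicate reporting are my own placeholder choices, not part of the original answer):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

public class ExternalSortDedup {

    // Phase 1: read the big file in chunks that fit in memory, sort each
    // chunk, report/drop duplicates inside the chunk, and write it out.
    static List<Path> writeSortedChunks(Path input, int chunkSize) throws IOException {
        List<Path> chunks = new ArrayList<>();
        List<String> buffer = new ArrayList<>(chunkSize);
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() == chunkSize) {
                    chunks.add(flushChunk(buffer));
                }
            }
        }
        if (!buffer.isEmpty()) {
            chunks.add(flushChunk(buffer));
        }
        return chunks;
    }

    static Path flushChunk(List<String> buffer) throws IOException {
        Collections.sort(buffer);
        Path chunk = Files.createTempFile("chunk", ".txt");
        try (BufferedWriter out = Files.newBufferedWriter(chunk, StandardCharsets.UTF_8)) {
            String previous = null;
            for (String key : buffer) {
                if (key.equals(previous)) {
                    System.out.println("duplicate: " + key);  // duplicate within this chunk
                } else {
                    out.write(key);
                    out.newLine();
                }
                previous = key;
            }
        }
        buffer.clear();
        return chunk;
    }

    // Phase 2: n-way merge of the sorted chunks. Equal keys coming from
    // different chunks appear back to back in the merged order, so they are
    // reported here. (A key duplicated both inside a chunk and across chunks
    // may be reported more than once.)
    static void mergeAndReportDuplicates(List<Path> chunks) throws IOException {
        PriorityQueue<ChunkReader> queue = new PriorityQueue<>();
        for (Path chunk : chunks) {
            ChunkReader reader = new ChunkReader(chunk);
            if (reader.head != null) queue.add(reader);
        }
        String previous = null;
        while (!queue.isEmpty()) {
            ChunkReader reader = queue.poll();
            if (reader.head.equals(previous)) {
                System.out.println("duplicate: " + reader.head);
            }
            previous = reader.head;
            if (reader.advance()) queue.add(reader);
        }
    }

    // One open chunk file plus its current (smallest unread) line.
    static class ChunkReader implements Comparable<ChunkReader> {
        final BufferedReader in;
        String head;

        ChunkReader(Path chunk) throws IOException {
            in = Files.newBufferedReader(chunk, StandardCharsets.UTF_8);
            head = in.readLine();
        }

        boolean advance() throws IOException {
            head = in.readLine();
            if (head == null) in.close();
            return head != null;
        }

        public int compareTo(ChunkReader other) {
            return head.compareTo(other.head);
        }
    }

    public static void main(String[] args) throws IOException {
        List<Path> chunks = writeSortedChunks(Paths.get(args[0]), 1_000_000);
        mergeAndReportDuplicates(chunks);
    }
}
```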
I'm not sure if you'd consider doing this outside of Java, but if so, it's very simple in a shell, for example with something like sort file.txt | uniq -d (uniq -d prints only the lines that occur more than once in the sorted output).
You probably can't load the entire file at one time, but you can store the hash and line number in a HashSet without a problem.
Pseudo code...
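One way to flesh this out in Java (a sketch of my own reading of the suggestion, not the answer's original pseudo code): remember only 32-bit hash codes in a first pass, then keep the actual strings and line numbers only for hashes that occurred more than once.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

public class HashFirstPass {

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]);

        // Pass 1: remember only the 32-bit hash of each key. Hashes seen more
        // than once mark *candidate* duplicates (two different keys can share
        // a hash, so this is not proof yet). Boxed Integers are still fairly
        // heavy for 15 million keys; a primitive-int set or a bitmap would be
        // cheaper, but this keeps the sketch close to the suggestion.
        Set<Integer> seenHashes = new HashSet<>();
        Set<Integer> candidateHashes = new HashSet<>();
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String key;
            while ((key = in.readLine()) != null) {
                int h = key.hashCode();
                if (!seenHashes.add(h)) {
                    candidateHashes.add(h);
                }
            }
        }

        // Pass 2: keep the actual strings only for candidate hashes, which
        // should be a small fraction of the keys, and report the ones that
        // really repeat, together with their line numbers.
        Map<String, Integer> firstLineOfKey = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String key;
            int lineNumber = 0;
            while ((key = in.readLine()) != null) {
                lineNumber++;
                if (!candidateHashes.contains(key.hashCode())) {
                    continue;
                }
                Integer firstLine = firstLineOfKey.putIfAbsent(key, lineNumber);
                if (firstLine != null) {
                    System.out.println("duplicate \"" + key + "\" at lines "
                            + firstLine + " and " + lineNumber);
                }
            }
        }
    }
}
```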
I don't think you need to sort the data to eliminate duplicates. Just use a quicksort-inspired approach.
Note that k can be equal to 1.
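The answer's original pseudo code isn't shown, but one plausible reading (my assumption) is a recursive, quicksort-style partitioning of the file on disk, where k would be the number of pivots used per step; the sketch below uses a single pivot (k = 1). Duplicates of the pivot are found while partitioning, and each partition is processed recursively until it fits in memory, so the data is never fully sorted.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.HashSet;
import java.util.Set;

public class PartitionDedup {

    // Pieces at or below this size are handled with a plain in-memory HashSet.
    static final long IN_MEMORY_BYTES = 64L * 1024 * 1024;

    static void findDuplicates(Path file) throws IOException {
        if (Files.size(file) <= IN_MEMORY_BYTES) {
            Set<String> seen = new HashSet<>();
            for (String key : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                if (!seen.add(key)) {
                    System.out.println("duplicate: " + key);
                }
            }
            return;
        }
        // Too big: take the first line as the pivot (a randomly chosen line
        // would be a safer pivot) and split the rest into a "less" and a
        // "greater" file, like one step of a 3-way quicksort. Keys equal to
        // the pivot are only counted: every occurrence beyond the first one
        // is a duplicate.
        Path less = Files.createTempFile("less", ".txt");
        Path greater = Files.createTempFile("greater", ".txt");
        String pivot;
        long equalCount = 0;
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8);
             BufferedWriter lessOut = Files.newBufferedWriter(less, StandardCharsets.UTF_8);
             BufferedWriter greaterOut = Files.newBufferedWriter(greater, StandardCharsets.UTF_8)) {
            pivot = in.readLine();
            equalCount = 1;
            String key;
            while ((key = in.readLine()) != null) {
                int cmp = key.compareTo(pivot);
                if (cmp < 0) {
                    lessOut.write(key);
                    lessOut.newLine();
                } else if (cmp > 0) {
                    greaterOut.write(key);
                    greaterOut.newLine();
                } else {
                    equalCount++;
                }
            }
        }
        for (long i = 1; i < equalCount; i++) {
            System.out.println("duplicate: " + pivot);
        }
        findDuplicates(less);
        findDuplicates(greater);
        Files.delete(less);
        Files.delete(greater);
    }

    public static void main(String[] args) throws IOException {
        findDuplicates(Paths.get(args[0]));
    }
}
```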
One way I can imagine solving this is to first use an external sorting algorithm to sort the file (searching for
external sort java
yields lots of results with code). Then you can iterate over the file line by line; duplicates will now obviously follow each other directly, so you only need to remember the previous line while iterating.
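Once the file has been sorted externally (see the merge-sort sketch above), the scan itself is short; a sketch, assuming the sorted file path is passed as the first argument:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class SortedFileScan {

    // Assumes the file has already been sorted, so equal keys are adjacent.
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            String previous = null;
            String key;
            while ((key = in.readLine()) != null) {
                if (key.equals(previous)) {
                    System.out.println("duplicate: " + key);
                }
                previous = key;  // only the previous line is kept in memory
            }
        }
    }
}
```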
If you cannot build up a complete list because you don't have enough memory, you might try doing it in loops. I.e., create a hashmap but only store a small portion of the items (for example, those starting with 'A'). Then gather the duplicates, then continue with 'B', etc.
Of course you can select any kind of 'grouping' (e.g. the first 3 characters, the first 6, etc.).
It will just take (many) more iterations.
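A sketch of this multi-pass idea, assuming keys are grouped by their first character (here only 'A' through 'Z' for brevity; a real grouping would have to cover every character a key can start with):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class PrefixPassDedup {

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]);
        // One full pass over the file per group; only keys belonging to the
        // current group are held in memory at any time. A longer prefix
        // (first 2-3 characters) means more passes but smaller sets.
        for (char group = 'A'; group <= 'Z'; group++) {
            Set<String> seen = new HashSet<>();
            try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                String key;
                while ((key = in.readLine()) != null) {
                    if (key.isEmpty() || Character.toUpperCase(key.charAt(0)) != group) {
                        continue;  // not part of this pass
                    }
                    if (!seen.add(key)) {
                        System.out.println("duplicate: " + key);
                    }
                }
            }
        }
    }
}
```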
You might try a Bloom filter, if you're willing to accept a certain amount of statistical error. Guava provides one, but there's a pretty major bug in it right now that should probably be fixed next week with release 11.0.2.
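A sketch using Guava's BloomFilter (assuming a recent Guava release where that bug is fixed; the expected-insertions count and the 1% false-positive rate are my own choices):

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class BloomDedup {

    public static void main(String[] args) throws IOException {
        // Sized for the ~15 million keys from the question; the 1% false
        // positive rate is the "statistical error" mentioned above: a small
        // fraction of unique keys will be flagged as possible duplicates.
        BloomFilter<String> seen = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), 15_000_000, 0.01);

        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            String key;
            while ((key = in.readLine()) != null) {
                if (seen.mightContain(key)) {
                    // Probably a duplicate (or a false positive); an exact
                    // second pass over just these keys could confirm it.
                    System.out.println("possible duplicate: " + key);
                } else {
                    seen.put(key);
                }
            }
        }
    }
}
```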
"The key is that your data will not fit into memory." (BrokenGlass)
With enough memory to store a Map of key hash values to something that locates the key, such as an offset for RandomAccessFile.seek() or a line number as Andrew White suggested, you can process non-unique keys as they are identified.
Otherwise, establish a map of hash values to "maybe seen before" (e.g. a 3 MB bitmap indexed with key.hashCode() % (3<<23)) in a first pass, and in a second pass handle only keys from buckets that were hit at least twice.
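A sketch of the second variant (the bitmap pre-pass), with two adaptations of my own: java.util.BitSet stands in for the raw bitmap, and a second bitmap tracks "hit at least twice" so the second pass knows which buckets to keep:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

public class BitmapTwoPass {

    // 3<<23 buckets: roughly 25 million bits, i.e. ~3 MB per bitmap.
    private static final int BUCKETS = 3 << 23;

    private static int bucketOf(String key) {
        // Math.floorMod avoids a negative index for negative hash codes.
        return Math.floorMod(key.hashCode(), BUCKETS);
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args[0]);

        // Pass 1: mark which hash buckets are hit once and which at least twice.
        BitSet seenOnce = new BitSet(BUCKETS);
        BitSet seenTwice = new BitSet(BUCKETS);
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String key;
            while ((key = in.readLine()) != null) {
                int bucket = bucketOf(key);
                if (seenOnce.get(bucket)) {
                    seenTwice.set(bucket);
                } else {
                    seenOnce.set(bucket);
                }
            }
        }

        // Pass 2: only keys whose bucket was hit at least twice can be
        // duplicates, so only those strings are kept in memory and compared.
        Set<String> candidates = new HashSet<>();
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String key;
            while ((key = in.readLine()) != null) {
                if (!seenTwice.get(bucketOf(key))) {
                    continue;
                }
                if (!candidates.add(key)) {
                    System.out.println("duplicate: " + key);
                }
            }
        }
    }
}
```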