How to find the intersection of two large files efficiently using Python?
I have two large files. Their contents look like this:
134430513
125296589
151963957
125296589
Each file contains an unsorted list of ids. Some ids may appear more than once within a single file.
Now I want to find the intersection of the two files, that is, the ids that appear in both files.
I just read the two files into 2 sets, s1 and s2, and get the intersection by s1.intersection(s2). But it consumes a lot of memory and seems slow.
So is there any better or more Pythonic way to do this? If the files contain so many ids that they cannot be read into a set with limited memory, what can I do?
EDIT: I read the files into 2 sets using a generator:

def id_gen(path):
    # yield each id as an int, so the set stores integers rather than strings
    for line in open(path):
        tmp = line.split()
        yield int(tmp[0])

c1 = id_gen(path)
s1 = set(c1)
All of the ids are numeric, and the max id may be 5000000000. If I use a bitarray, it will consume more memory.
7 Answers
Others have shown the more idiomatic ways of doing this in
Python, but if the size of the data really is too big, you can
use the system utilities to sort and eliminate duplicates, then
use the fact that a file object is an iterator which returns one line
at a time, doing something like:
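The snippet that went with this answer is not reproduced in this copy; a minimal sketch of such a merge-style scan, assuming both files have already been sorted numerically and de-duplicated (e.g. with sort -n -u) into the placeholder names file1.sorted and file2.sorted:

def intersect_sorted(path1, path2):
    # both inputs must be numerically sorted and free of duplicates
    result = []
    with open(path1) as f1, open(path2) as f2:
        line1, line2 = f1.readline(), f2.readline()
        while line1 and line2:
            a, b = int(line1), int(line2)
            if a == b:
                result.append(a)
                line1, line2 = f1.readline(), f2.readline()
            elif a < b:
                line1 = f1.readline()
            else:
                line2 = f2.readline()
    return result

common = intersect_sorted('file1.sorted', 'file2.sorted')

Only the current line of each file is held in memory, plus the ids that turn out to be common to both.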
This avoids having more than one line at a time (for each file)
in memory (and the system sort should be faster than anything
Python can do, as it is optimized for this one task).
Reading each file into a set and combining them with &, which is equivalent to using intersection, is the most Pythonic way. You might be able to speed it up by converting each line to an integer as you read it, since then you'll be storing and comparing integers rather than strings. This only works if all ids are numeric, of course.
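The code snippets this answer refers to are missing from this copy; a plausible reconstruction (file1 and file2 are placeholder names) is:

# string-based version: each set holds the raw lines, trailing newline included
common = set(open('file1')) & set(open('file2'))

# integer-based version: convert while reading, so the sets hold ints
common = (set(int(line) for line in open('file1')) &
          set(int(line) for line in open('file2')))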
If it's still not fast enough, you can turn to a slightly more imperative style:
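That snippet is also missing; a sketch of what the imperative version might look like (again with placeholder file names):

ids = set()
for line in open('file1'):
    ids.add(int(line))

common = set()
for line in open('file2'):
    n = int(line)
    if n in ids:
        common.add(n)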
If both files are guaranteed not to contain duplicates, you can also use a list to speed things up:
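The original code is likewise not shown; one reading of this suggestion (an assumption on my part) is to keep the first file in a set and collect matches from the second file in a plain list, which only yields the true intersection because neither file repeats an id:

ids1 = set(int(line) for line in open('file1'))
common = [n for n in (int(line) for line in open('file2')) if n in ids1]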
Be sure to measure various solutions, and check whether you're not actually waiting for disk or network input.
So the algorithm is not necessarily tied to Python, but rather generic, if you cannot represent all ids in a set in memory. If the range of the integers is limited, an approach would be to use a large bitarray. You read the first file and mark each integer in the bitarray as present. Then you read the second file, and output all numbers that are also present in the bitarray.
If even this is not sufficient, you can split the range using multiple sweeps. I.e. in the first pass, you only consider integers smaller than 0x200000000 (a 1GB bitarray). Then you reset the bitarray and read both files again, only considering integers from 0x200000000 to 0x400000000 (and subtracting 0x200000000 before handling each integer). This way you can handle LARGE amounts of data with reasonable runtime.
A sample for a single sweep would be:
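The sample code is missing from this copy; a minimal sketch of a single sweep, assuming the third-party bitarray package and the maximum id of 5000000000 mentioned in the question (roughly 600 MB of bits):

from bitarray import bitarray

MAX_ID = 5000000000

def intersect(path1, path2):
    seen = bitarray(MAX_ID + 1)
    seen.setall(False)
    # mark every id from the first file as present
    for line in open(path1):
        seen[int(line)] = True
    # collect the ids from the second file that were marked
    common = set()
    for line in open(path2):
        n = int(line)
        if seen[n]:
            common.add(n)
    return common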
AFAIK there is no efficient way to do this with Python, especially if you are dealing with massive amounts of data.
I like rumpel's solution. But please note that bitarray is a C extension.
I would use shell commands to handle this. You can pre-process files to save time & space:
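The commands are not shown in this copy; the pre-processing step was presumably something along these lines (the file names are placeholders):

sort -u file1 > file1.sorted
sort -u file2 > file2.sorted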
Then you can use diff to find out the similarities. Of course it is possible to combine everything into a single command, without creating intermediary files.
UPDATE
According to Can's recommendation, comm is the more appropriate command:
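The command itself is missing here; with the files sorted as above, the usual invocation would look like this, where -12 suppresses the lines unique to each file and leaves only the common ones:

comm -12 file1.sorted file2.sorted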
You need not create both s1 and s2. First read in the lines from the first file, convert each line to an integer (this saves memory), and put it in s1. Then, for each line in the second file, convert it to an integer and check whether that value is in s1. That way, you'll save the memory spent on storing strings and on keeping two sets.
For data larger than memory, you can split each data file into 10 files, grouped by the lowest digit of the ids.
So all ids in s1.txt that end with 0 will be saved in s1_0.txt.
Then use set() to find the intersection of s1_0.txt and s2_0.txt, s1_1.txt and s2_1.txt, ...
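A rough sketch of this bucketing approach (the helper name and the loop are only illustrative, following the s1.txt/s1_0.txt naming used above):

def split_by_last_digit(path, prefix):
    # one output file per final digit, e.g. s1_0.txt ... s1_9.txt
    outs = [open('%s_%d.txt' % (prefix, d), 'w') for d in range(10)]
    for line in open(path):
        n = int(line)
        outs[n % 10].write('%d\n' % n)
    for f in outs:
        f.close()

split_by_last_digit('s1.txt', 's1')
split_by_last_digit('s2.txt', 's2')

common = set()
for d in range(10):
    a = set(int(line) for line in open('s1_%d.txt' % d))
    b = set(int(line) for line in open('s2_%d.txt' % d))
    common |= a & b

Each bucket pair is small enough to intersect in memory, and the pairs can be processed independently.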
I encountered the same problem: I had two files, one around 1GB and the other around 1.5GB, and when I tried to store their contents in a set I got an OutOfMemory heap size error.
So I divided these files into 100 small files based on the last two digits of each id, so temp1_00 contains the ids from the first file whose last two digits are 00 and temp2_00 contains the ids from the second file whose last two digits are 00. Then I calculated the intersection using a set for each of the 100 file pairs and summed the results.
Algorithm
Load one file into a set, then traverse the second file's content one id at a time and check whether it is in the set; if it is found, increase the count. Remember to always store the smaller file in the set to save memory.
Further Optimization:
The task I was working on only needed an approximate value, so I just calculated the intersection of temp1_00 and temp2_00 and multiplied the result by 100 to get an approximation. The error was around 0-5%, which I could afford for that large dataset.
If you need a more accurate result, you can calculate for 10 file pairs and multiply the result by 10.
Stats: