Compress each line of a file individually and independently? (Or preserve newlines?)
I have a very large file (~10 GB) that can be compressed to < 1 GB using gzip. I'm interested in using sort FILE | uniq -c | sort to see how often a single line is repeated; however, the 10 GB file is too large to sort and my computer runs out of memory.
Is there a way to compress the file while preserving newlines (or an entirely different method altogether) that would reduce the file to a small enough size to sort, yet still leave the file in a condition that's sortable?
Or is there any other method of finding out / counting how many times each line is repeated inside a large file (a ~10 GB CSV-like file)?
Thanks for any help!
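For reference, the counting pipeline mentioned above would look something like this (the -nr flags and the trailing head are assumptions added here, just to show the most repeated lines first):
sort FILE | uniq -c | sort -nr | head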
2 Answers
Are you sure you're running out of memory (RAM) with your sort? My experience debugging sort problems leads me to believe that you have probably run out of disk space for sort to create its temporary files. Also recall that the disk space used for sorting is usually in /tmp or /var/tmp. So check your available disk space with the command below (some systems don't support -g; try -m (megabytes) or -k (kilobytes)).
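Presumably the command meant here is df; a minimal sketch (flag support varies by system, and the paths are just the usual temp locations mentioned above):
df -g /tmp /var/tmp     # free space in gigabytes
df -m /tmp /var/tmp     # megabytes, if -g is not supported
df -k /tmp /var/tmp     # kilobytes, supported nearly everywhere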
If you have an undersized /tmp partition, do you have another partition with 10-20GB free? If yes, then tell your sort to use that directory with -T, as in the sketch below. Note that the behaviour depends on your sort version, so check what its help says about -T.
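A minimal sketch, assuming GNU sort and a hypothetical directory /bigdisk/tmp on the larger partition (the final sort -nr is an addition here, to list the most repeated lines first):
sort -T /bigdisk/tmp FILE | uniq -c | sort -nr > line_counts.txt     # -T: where sort writes its temporary files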
I'm not sure whether this means you can combine a bunch of -T=/dr1/ -T=/dr2 ... options to get to your 10GB*sortFactor of space or not. My experience was that it only used the last dir in the list, so try to use one dir that is big enough. Also, note that you can go into whatever dir you are using for sort, and you'll see the activity of the temporary files used for sorting.
There are some possible solutions:
1 - use any text processing language (perl, awk) to extract each line and save the line number and a hash for that line, and then compare the hashes (see the first sketch after this list)
2 - Can / Want to remove the duplicate lines, leaving just one occurrence per file? Could use a script (command) like:
awk '!x[$0]++' oldfile > newfile
3 - Why not split the file, but with some criterion? Supposing all your lines begin with letters (see the second sketch after this list):
- break your original_file into smaller files, one per starting letter: grep "^a" original_file > a_file
- sort each small file: a_file, b_file, and so on
- verify the duplicates, count them, do whatever you want.
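For option 1, a minimal sketch assuming perl with the core Digest::MD5 module: each line is replaced by its fixed-width hash, so what actually gets sorted is much smaller than the original lines.
perl -MDigest::MD5=md5_hex -ne 'print md5_hex($_), "\n"' original_file | sort | uniq -c | sort -nr
Note that this only gives counts per hash; saving the line numbers as well would let you map each hash back to an example line.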
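And for option 3, a sketch assuming bash and lines that start with lowercase letters; each per-letter chunk should be small enough to sort on its own:
for letter in {a..z}; do
    grep "^$letter" original_file > "${letter}_file"                  # split by first character
    sort "${letter}_file" | uniq -c | sort -nr >> duplicate_counts.txt
done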