Compress each line of a file separately and independently of the others? (Or preserve newlines)

Posted 2024-11-17 01:27:38

I have a very large file (~10 GB) that can be compressed to < 1 GB using gzip. I'm interested in using sort FILE | uniq -c | sort to see how often a single line is repeated, however the 10 GB file is too large to sort and my computer runs out of memory.

Is there a way to compress the file while preserving newlines (or an entirely different method altogether) that would reduce the file to a small enough size to sort, yet still leave the file in a condition that's sortable?

Or is there any other method of finding out / counting how many times each line is repeated inside a large file (a ~10 GB CSV-like file)?

Thanks for any help!
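A minimal sketch of one way to run the intended pipeline with bounded memory, assuming GNU coreutils sort (the -S, -T and --compress-program options are GNU extensions, and /spare/disk/tmp is a placeholder path, not something from the question):

    # Keep the sort buffer small, spill temporary runs to a disk with free space,
    # and gzip-compress those temporary files so they take far less room on disk.
    LC_ALL=C sort -S 512M -T /spare/disk/tmp --compress-program=gzip FILE \
        | uniq -c | sort -rn > line_counts.txt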

Comments (2)

随遇而安 2024-11-24 01:27:38

Are you sure you're running out of memory (RAM) with your sort?

My experience debugging sort problems leads me to believe that you have probably run out of disk space for sort to create its temporary files. Also recall that the disk space used for sorting is usually in /tmp or /var/tmp.

So check your available disk space with:

df -g 

(some systems don't support -g; try -m (megabytes) or -k (kilobytes))

If you have an undersized /tmp partition, do you have another partition with 10-20GB free? If yes, then tell your sort to use that dir with

 sort -T /alt/dir

Note that for sort version

sort (GNU coreutils) 5.97

The help says

 -T, --temporary-directory=DIR  use DIR for temporaries, not $TMPDIR or /tmp;
                          multiple options specify multiple directories

I'm not sure if this means you can combine a bunch of -T=/dr1/ -T=/dr2 ... to get to your 10GB*sortFactor space or not. My experience was that it only used the last dir in the list, so try to use one dir that is big enough.
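For what it's worth, the syntax for passing more than one directory (going by the help text quoted above) would look like the sketch below; whether the free space in the listed directories actually adds up depends on the sort version, so treat it as an assumption to test rather than a guarantee (the paths are placeholders):

    # Hypothetical: spread sort's temporary files over two disks.
    sort -T /mnt/disk1/tmp -T /mnt/disk2/tmp FILE > FILE.sorted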

Also, note that you can go to whatever dir you are using for sort, and you'll see the activity of the temporary files used for sorting.
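One way to watch that activity while the sort runs (assuming the watch utility from procps is available; /alt/dir is the placeholder temp dir from above):

    # Refresh a listing of sort's temporary files every 2 seconds.
    watch -n 2 'ls -lh /alt/dir'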

不离久伴 2024-11-24 01:27:38

There are some possible solutions:

1 - use any text processing language (perl, awk) to extract each line and save the line number and a hash for that line, and then compare the hashes

2 - Can / Want to remove the duplicate lines, leaving just one occurrence per file? Could use a script (command) like:
awk '!x[$0]++' oldfile > newfile

3 - Why not split the file, but with some criteria? Supposing all your lines begin with letters (a rough sketch follows after this list):
- break your original_file into 20 smaller files: grep "^a" original_file > a_file
- sort each small file: a_file, b_file, and so on
- verify the duplicates, count them, do whatever you want.
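A rough sketch of option 3 as a loop, assuming every line starts with a lowercase ASCII letter (the file names and the combined output file are illustrative, not from the answer):

    # All copies of a given line land in the same piece, so the per-piece counts
    # from sort | uniq -c are already the global counts; append them together.
    : > all_counts.txt
    for letter in a b c d e f g h i j k l m n o p q r s t u v w x y z; do
        grep "^$letter" original_file > "${letter}_file"
        sort "${letter}_file" | uniq -c >> all_counts.txt
    done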
