Compress each line of a file individually and independently? (Or preserve newlines?)
I have a very large file (~10 GB) that can be compressed to < 1 GB using gzip. I'm interested in using sort FILE | uniq -c | sort to see how often a single line is repeated; however, the 10 GB file is too large to sort and my computer runs out of memory.
Is there a way to compress the file while preserving newlines (or an entirely different method altogether) that would reduce the file to a small enough size to sort, yet still leave the file in a condition that's sortable?
Or is there any other method of finding out / counting how many times each line is repeated inside a large file (a ~10 GB CSV-like file)?
Thanks for any help!
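For reference, the counting pipeline mentioned above would look something like this (the -nr flags and the trailing head are assumptions added here, just to show the most repeated lines first):
sort FILE | uniq -c | sort -nr | head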
2 Answers
Are you sure you're running out of memory (RAM) with your sort? My experience debugging sort problems leads me to believe that you have probably run out of disk space for sort to create its temporary files. Also recall that the disk space used for sorting is usually in /tmp or /var/tmp. So check your available disk space with the command below (some systems don't support -g; try -m (megabytes) or -k (kilobytes)).
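Presumably the command meant here is df; a minimal sketch (flag support varies by system, and the paths are just the usual temp locations mentioned above):
df -g /tmp /var/tmp     # free space in gigabytes
df -m /tmp /var/tmp     # megabytes, if -g is not supported
df -k /tmp /var/tmp     # kilobytes, supported nearly everywhere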
If you have an undersized /tmp partition, do you have another partition with 10-20GB free? If yes, then tell your sort to use that directory with -T, as in the sketch below. Note that the behaviour depends on your sort version, so check what its help says about -T.
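A minimal sketch, assuming GNU sort and a hypothetical directory /bigdisk/tmp on the larger partition (the final sort -nr is an addition here, to list the most repeated lines first):
sort -T /bigdisk/tmp FILE | uniq -c | sort -nr > line_counts.txt     # -T: where sort writes its temporary files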
I'm not sure whether this means you can combine a bunch of -T=/dr1/ -T=/dr2 ... options to get to your 10GB*sortFactor of space or not. My experience was that it only used the last dir in the list, so try to use one dir that is big enough. Also, note that you can go into whatever dir you are using for sort, and you'll see the activity of the temporary files used for sorting.
There are some possible solutions:
1 - use any text processing language (perl, awk) to extract each line and save the line number and a hash for that line, and then compare the hashes (see the first sketch after this list)
2 - Can / Want to remove the duplicate lines, leaving just one occurrence per file? Could use a script (command) like:
awk '!x[$0]++' oldfile > newfile
3 - Why not split the file, but with some criterion? Supposing all your lines begin with letters (see the second sketch after this list):
- break your original_file into smaller files, one per starting letter: grep "^a" original_file > a_file
- sort each small file: a_file, b_file, and so on
- verify the duplicates, count them, do whatever you want.
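For option 1, a minimal sketch assuming perl with the core Digest::MD5 module: each line is replaced by its fixed-width hash, so what actually gets sorted is much smaller than the original lines.
perl -MDigest::MD5=md5_hex -ne 'print md5_hex($_), "\n"' original_file | sort | uniq -c | sort -nr
Note that this only gives counts per hash; saving the line numbers as well would let you map each hash back to an example line.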
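And for option 3, a sketch assuming bash and lines that start with lowercase letters; each per-letter chunk should be small enough to sort on its own:
for letter in {a..z}; do
    grep "^$letter" original_file > "${letter}_file"                  # split by first character
    sort "${letter}_file" | uniq -c | sort -nr >> duplicate_counts.txt
done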