Concatenating a large number of files

Posted on 2024-11-28 00:18:40

I have around 3-4 million files in a directory, with file names ending in, say, type1.txt or type2.txt (the files are 1type1.txt, 1type2.txt, 2type1.txt, 2type2.txt, etc.).

Now I want to concatenate all the files ending with type1.txt into one file, and all those ending with type2.txt into another.

Currently I am doing cat *type1.txt > allTtype1.txt, and similarly for type2.txt.
I want to preserve the order in both final output files; my guess is that cat does that.
But it is too slow.

Please suggest a faster way to do the same.

Thanks,
Ravi

Comments (2)

孤凫 2024-12-05 00:18:40

You can do this with the following command:

ls | while IFS= read -r file; do cat -- "$file" >> "allTtype${file#*type}"; done

But as snap said in his answer, each time cat needs to open a file it has to do an inode lookup, which takes a long time in a directory with lots of files. To try to speed things up, you could cat by inode using icat from the Sleuth Kit (http://www.sleuthkit.org/sleuthkit/). Replace /dev/sda1 with the device that holds the directory:

ls -i | while read -r -a file_array; do icat /dev/sda1 "${file_array[0]}" >> "allTtype${file_array[1]#*type}"; done

Even better, you can put the resulting files in another directory:

ls -i | while read -r -a file_array; do icat /dev/sda1 "${file_array[0]}" >> "/newdir/allTtype${file_array[1]#*type}"; done
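The first loop in this answer reopens the output file with >> for every input file. A minimal sketch of a variant that opens each output once, on file descriptors 3 and 4, is shown here. The temporary directories and sample files are hypothetical stand-ins for the real data, and the sketch assumes file names without newlines. It still starts one cat process per file, so it may only remove part of the overhead.

```shell
#!/bin/sh
# Hypothetical demo data; substitute the real source and output directories.
src=$(mktemp -d)
out=$(mktemp -d)
printf 'a\n' > "$src/1type1.txt"
printf 'b\n' > "$src/1type2.txt"
printf 'c\n' > "$src/2type1.txt"
printf 'd\n' > "$src/2type2.txt"

cd "$src" || exit 1

# Descriptors 3 and 4 are opened once for the whole loop, so the
# output files are not reopened for every input file.
ls | while IFS= read -r file; do
    case $file in
        *type1.txt) cat -- "$file" >&3 ;;
        *type2.txt) cat -- "$file" >&4 ;;
    esac
done 3> "$out/allTtype1.txt" 4> "$out/allTtype2.txt"
```

Writing the outputs to a separate directory also keeps them out of the listing that ls produces.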
仙气飘飘 2024-12-05 00:18:40

cat itself is not slow. But every time you expand a shell wildcard (? and *), the shell reads and searches through all the file names in that directory, which is very slow.

The kernel also takes time to find each file when you open it by name, which you cannot avoid. How long depends on the file system in use (unspecified in the question): some file systems handle huge directories more intelligently than others.

To sort this out, you might benefit from taking a file listing once:

ls > /tmp/filelist

...and then using grep or similar for selecting the files out of that list:

cat `grep foo /tmp/filelist` > /out/bar
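With millions of names, the backtick substitution above may exceed the kernel's limit on argument length. One hedged alternative is to feed the list to xargs, which splits it into as many cat invocations as needed while every invocation writes, in order, to the same redirected output. The sketch below uses hypothetical temporary directories and sample files, and assumes the names contain no whitespace or quote characters, as in the question's naming pattern.

```shell
#!/bin/sh
# Hypothetical demo data; substitute the real source and output directories.
src=$(mktemp -d)
out=$(mktemp -d)
printf 'a\n' > "$src/1type1.txt"
printf 'b\n' > "$src/1type2.txt"
printf 'c\n' > "$src/2type1.txt"
printf 'd\n' > "$src/2type2.txt"

cd "$src" || exit 1

# Read the directory once and keep the listing outside of it.
ls > "$out/filelist"

# xargs batches the names into several cat calls if necessary; the
# redirection is opened once, so the batches are appended in list order.
grep 'type1\.txt$' "$out/filelist" | xargs cat > "$out/allTtype1.txt"
grep 'type2\.txt$' "$out/filelist" | xargs cat > "$out/allTtype2.txt"
```

The order of the result follows the sorted ls output, which is lexical, so 10type1.txt comes before 2type1.txt.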

After you have sorted this mess out, make sure to structure your storage/application in such a way that this never happens again. :) Also make sure to rmdir the existing directory after you have gotten your files out of it (using it again for any purpose will not be efficient, even if there is just a single file in it).