Quickly deleting a line by index from a file

Posted on 2024-12-12 09:02:41


I have a huge file of 10 GB. I want to remove line 188888 from this file.

I use sed as follows:

sed -i '188888d' file

The problem is that it is really slow. I understand it is because of the size of the file, but is there any way I can do this faster?

Thanks


Comments (2)

叶落知秋 2024-12-19 09:02:41


Try

sed -i '188888{;d;q;}' file

You may need to experiment with which of those semicolons you keep; {d;q} ... is the second thing to try.

This will stop scanning the file after it deletes that one line, but you'll still have to spend the time re-writing the file (and beware that combining q with -i can leave the file truncated after that line, since sed stops copying output when it quits, so test on a copy first). It would also be worth testing

sed '188888{;q;d;}' file > /path/to/alternate/mountpoint/newFile

where the alternate mountpoint is on a separate disk drive.

final edit
Ah, one other option would be to edit the data as it is being written, through a pipe (note that -i makes no sense on a pipe, so it is dropped here):

 yourLogFileProducingProgram | sed '188888d' > logFile

But this assumes that the data you want to delete is always at line 188888; is that the case?

I hope this helps.
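As a further illustration (not from the answer above, and the file name and line number are placeholders): the same single-pass rewrite can be sketched with head and tail, which stream sequentially and avoid sed's per-line pattern machinery entirely.

```shell
# Demo on a small sample; for the real case set N=188888 and use the 10G file.
printf 'line1\nline2\nline3\nline4\nline5\n' > file
N=3  # line to delete
# head emits lines 1..N-1, tail resumes at line N+1: one sequential read,
# one sequential write, no per-line regex work. Write the result to a new
# file (ideally on another disk, as suggested above), then move it back.
{ head -n "$((N - 1))" file; tail -n "+$((N + 1))" file; } > newFile
mv newFile file
```

Like any rewrite-based approach, this is still O(n) in the file size; it only trims sed's per-line overhead.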

没有你我更好 2024-12-19 09:02:41


File lines are determined by counting \n characters; if line sizes are variable, you cannot compute the offset of a given line directly but have to count the newlines before it.

This will always be O(n), where n is the number of bytes in the file.

Parallel algorithms do not help either, because this operation is disk-I/O bound; divide and conquer will be even slower.

If you will do this many times on the same file, there are ways to preprocess the file and make lookups faster.

An easy way is to build an index of

line#:offset

Then, when you want to find a line, do a binary search (O(log n)) over the index for the line number you want, and use the offset to locate the line in the original file.
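The index idea above can be sketched in shell with awk (a minimal sketch: the file name `logFile` and its contents are placeholders, the index assumes single-byte characters since awk's length() counts characters, and a real implementation would binary-search the index rather than grep it):

```shell
# Small stand-in for the 10G file.
printf 'alpha\nbeta\ngamma\ndelta\n' > logFile

# Build the index once: each record is "line#:byte-offset-of-line-start".
# off accumulates line length plus one byte for the trailing newline.
awk '{ printf "%d:%d\n", NR, off; off += length($0) + 1 }' logFile > logFile.idx

# To fetch line 3, read its offset from the index (grep shown for brevity
# instead of a binary search) and seek straight to it with tail -c.
off=$(grep '^3:' logFile.idx | cut -d: -f2)
tail -c "+$((off + 1))" logFile | head -n 1   # prints "gamma"
```

Note the index must be rebuilt (or patched) after any edit that shifts offsets, which is why it pays off mainly for repeated lookups on a file that changes rarely.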
