Moving data quickly from a file into a StringIO

Posted 2024-12-17 17:46:26

In Python I have a file stream, and I want to copy part of it into a StringIO. I want this to be as fast as possible, with as few copies as possible.

But if I do:

data = file.read(SIZE)
stream = StringIO(data)

I think two copies are made, no? One copy from the file into data, and another inside StringIO into its internal buffer. Can I avoid one of them? I don't need the temporary data, so I think one copy should be enough.

Comments (4)

冰魂雪魄 2024-12-24 17:46:26

In short: you can't avoid 2 copies using StringIO.

Some assumptions:

  • You're using cStringIO, otherwise it would be silly to optimize this much.
  • It's speed and not memory efficiency you're after. If not, see Jakob Bowyer's solution, or use a variant using file.read(SOME_BYTE_COUNT) if your file is binary.
  • You've already stated this in the comments, but for completeness: you want to actually edit the contents, not just view it.

Long answer: Since Python strings are immutable and the StringIO buffer is not, a copy has to be made sooner or later; otherwise you'd be altering an immutable object! For what you want to be possible, the StringIO object would need a dedicated method that reads directly from a file object given as an argument. There is no such method.
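To make the immutability point concrete, here is a minimal sketch. It assumes Python 3, where io.BytesIO plays the role that cStringIO plays in this answer (that substitution is an assumption of this example, not part of the original answer). Writing through the stream changes the stream's buffer, never the bytes object it was built from, so the two cannot keep sharing storage once an edit happens.

```python
import io

# Assumption: Python 3, with io.BytesIO standing in for cStringIO.
data = b"hello world"        # immutable, like a Python 2 str
stream = io.BytesIO(data)    # the stream starts with the same contents

stream.seek(0)
stream.write(b"J")           # overwrite the first byte through the stream

original_after_write = data
stream_after_write = stream.getvalue()
print(original_after_write)  # the source object is untouched
print(stream_after_write)    # only the stream's buffer changed
```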

Outside of StringIO, there are solutions that avoid the extra copy. Off the top of my head, this will read a file directly into a modifiable byte array, no extra copy:

import numpy as np
a = np.fromfile("filename.ext", dtype="uint8")

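It may be cumbersome to work with, depending on the usage you intend, since it's an array of values from 0 to 255, not an array of characters. But it can stand in for a StringIO object for most editing purposes: np.fromstring, the array's tostring and tofile methods, and slicing notation should get you where you want (in current NumPy, fromstring and tostring are deprecated in favor of np.frombuffer and tobytes). You might also need np.insert, np.delete and np.append; note that these return new arrays rather than editing in place.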
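A rough sketch of that workflow, assuming Python 3 and a reasonably recent NumPy. The helper name edit_first_byte and the file name sample.bin are made up for the illustration; tobytes is used in place of the older tostring.

```python
import os
import tempfile

import numpy as np


def edit_first_byte(path, value):
    """Load a whole file as a mutable uint8 array and change byte 0."""
    a = np.fromfile(path, dtype=np.uint8)
    a[0] = value             # in-place edit of the array's own buffer
    return a


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.bin")   # hypothetical demo file
    with open(path, "wb") as f:
        f.write(b"hello")

    a = edit_first_byte(path, ord("j"))
    as_bytes = a.tobytes()                   # array back to an immutable bytes object

    # np.append returns a new array (a copy); it does not grow `a` in place.
    grown = np.append(a, np.uint8(ord("!")))
    grown.tofile(path)                       # write the raw bytes back to disk

    with open(path, "rb") as f:
        on_disk = f.read()
```

The in-place edit is cheap, but anything that changes the length (np.insert, np.delete, np.append) copies the whole array, so a size-changing workload loses most of the advantage.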

I'm sure there are other modules that will do similar things.

TIMEIT:

How much does all this really matter? Well, let's see. I've made a 100MB file, largefile.bin. Then I read in the file using both methods and change the first byte.

$ python -m timeit -s "import numpy as np" "a = np.fromfile('largefile.bin', 'uint8'); a[0] = 1"
10 loops, best of 3: 132 msec per loop
$ python -m timeit -s "from cStringIO import StringIO" "a = StringIO(); a.write(open('largefile.bin').read()); a.seek(0); a.write('1')"
10 loops, best of 3: 203 msec per loop

So in my case, using StringIO is about 50% slower than using numpy.

Lastly, for comparison, editing the file directly:

$ python -m timeit "a = open('largefile.bin', 'r+b'); a.seek(0); a.write('1')"
10000 loops, best of 3: 29.5 usec per loop

So, it's nearly 4500 times faster than the numpy approach. Of course, it's extremely dependent on what you're going to do with the file. Altering the first byte is hardly representative. But using this method, you do have a head start on the other two, and since most OSes buffer disk access well, the speed may be very good too.
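For reference, the direct edit can be wrapped like this. It is only a sketch assuming Python 3; overwrite_at is a hypothetical helper name, and the demo runs against a throwaway temp file rather than largefile.bin.

```python
import os
import tempfile


def overwrite_at(path, offset, payload):
    """Overwrite len(payload) bytes at `offset` in place; return bytes written."""
    with open(path, "r+b") as f:   # r+b: read/write without truncating the file
        f.seek(offset)
        return f.write(payload)


# Demo on a throwaway file.
fd, demo_path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as f:
        f.write(b"0123456789")
    written = overwrite_at(demo_path, 0, b"X")
    with open(demo_path, "rb") as f:
        result = f.read()
finally:
    os.remove(demo_path)
```

Only the overwritten bytes travel through Python here; the rest of the file is never read into memory, which is where the large timing gap comes from.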

(If you're not allowed to edit the file and so want to avoid the cost of making a working copy, there are a couple of possible ways to increase the speed. If you can choose the filesystem, Btrfs has a copy-on-write file copy operation -- making the act of taking a copy of a file virtually instant. The same effect can be achieved using an LVM snapshot of any filesystem.)

请别遗忘我 2024-12-24 17:46:26

No, no extra copy is made. The buffer used to store the data is the same: data and the value returned by StringIO.getvalue() are two names for the same object.

Python 2.7 (r27:82500, Jul 30 2010, 07:39:35) 
[GCC 4.1.2 20080704 (Red Hat 4.1.2-48)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import StringIO
>>> data = open("/dev/zero").read(1024)
>>> hex(id(data))
'0xea516f0'
>>> stream = StringIO.StringIO(data)
>>> hex(id(stream.getvalue()))
'0xea516f0'

A quick skim through the source shows that cStringIO doesn't make a copy on construction either, but it does make a copy on calling cStringIO.getvalue(), so I can't repeat the above demonstration.
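The same kind of check can be tried on Python 3 with io.BytesIO (an assumption of this sketch; the session above is Python 2). Whether the unmodified stream hands back the very same object is an implementation detail, so the sketch only records it. What is guaranteed is that after a write the stream's contents and the original bytes differ, so they can no longer be one object.

```python
import io

data = b"\x00" * 1024
stream = io.BytesIO(data)

# Implementation detail: may be True or False depending on the interpreter.
shared_before_write = stream.getvalue() is data

stream.seek(0)
stream.write(b"\x01")          # the first modification forces separate storage
after = stream.getvalue()

print("same object before write:", shared_before_write)
print("same object after write:", after is data)
```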

我家小可爱 2024-12-24 17:46:26

Maybe what you're looking for is a buffer/memoryview:

>>> data = file.read(SIZE)
>>> buf = buffer(data, 0, len(data))

This way you can access a slice of the original data without copying it. However, this only helps if byte-oriented access is all you need, since that is what the buffer protocol provides. (buffer exists only in Python 2; memoryview is the equivalent that is also available in Python 3.)
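A small sketch of the memoryview variant, assuming Python 3. The view keeps a reference to the original object and exposes a window into it; since the source is an immutable bytes object, the view is read-only.

```python
data = bytes(range(16))          # stand-in for file.read(SIZE)

view = memoryview(data)[4:8]     # slicing a memoryview does not copy the bytes

same_source = view.obj is data   # the view still refers to the original object
window = bytes(view)             # an explicit copy, made only when requested
read_only = view.readonly        # views over bytes cannot be written through
```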

You can find more information in this related question.

Edit: In this blog post I found through reddit, some more information is given regarding the same problem:

>>> import os
>>> f = open(filename, 'rb')
>>> data = bytearray(os.path.getsize(filename))
>>> f.readinto(data)

According to the author, no extra copy is created, and data can be modified since bytearray is mutable.
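Applied to the original question (copying only part of a file), the readinto approach might look like the following. This is a sketch assuming Python 3; read_part is a hypothetical helper, not something from the blog post.

```python
import os
import tempfile


def read_part(path, offset, size):
    """Read up to `size` bytes starting at `offset` into a mutable bytearray."""
    buf = bytearray(size)          # the one buffer; the file is read straight into it
    with open(path, "rb") as f:
        f.seek(offset)
        n = f.readinto(buf)        # number of bytes actually read
    del buf[n:]                    # trim if the file ended early
    return buf


# Demo on a throwaway file.
fd, demo_path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as f:
        f.write(b"0123456789")
    part = read_part(demo_path, 2, 4)
    part[0] = ord("X")             # editable in place, no second buffer needed
    tail = read_part(demo_path, 8, 100)
finally:
    os.remove(demo_path)
```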

因为看清所以看轻 2024-12-24 17:46:26
stream = StringIO()
for line in file:
    # each line already ends with its newline, so don't append another
    stream.write(line)
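The loop above copies line by line, which only makes sense for text. For binary data, a fixed-size chunk variant keeps memory use bounded in the same way. This is a sketch assuming Python 3 and io.BytesIO; copy_in_chunks and CHUNK_SIZE are hypothetical names chosen for the example.

```python
import io

CHUNK_SIZE = 64 * 1024  # hypothetical default; tune for your workload


def copy_in_chunks(src, dst, size, chunk_size=CHUNK_SIZE):
    """Copy up to `size` bytes from src to dst; return the number copied."""
    remaining = size
    while remaining > 0:
        chunk = src.read(min(chunk_size, remaining))
        if not chunk:              # source ended before `size` bytes were read
            break
        dst.write(chunk)
        remaining -= len(chunk)
    return size - remaining


src = io.BytesIO(b"abcdefghij")    # stands in for an open binary file
dst = io.BytesIO()
copied = copy_in_chunks(src, dst, 7, chunk_size=3)
```

This trades speed for memory: each chunk is still copied twice, but the temporary object is never larger than chunk_size.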