Fast data move from a file to a StringIO
In Python I have a file stream, and I want to copy some part of it into a StringIO. I want this to be as fast as possible, with minimal copying.
But if I do:
    data = file.read(SIZE)
    stream = StringIO(data)
I think two copies are made, no? One copy from the file into data, and another inside StringIO into its internal buffer. Can I avoid one of the copies? I don't need the temporary data, so I think one copy should be enough.
Comments (4)
In short: you can't avoid two copies using StringIO.
Some assumptions: you are reading with some variant of file.read(SOME_BYTE_COUNT), if your file is binary.
Long answer: Since Python strings are immutable and the StringIO buffer is not, a copy will have to be made sooner or later; otherwise you'd be altering an immutable object! For what you want to be possible, the StringIO object would need a dedicated method that reads directly from a file object given as an argument. There is no such method.
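A small sketch of the point about immutability, written for Python 3, where io.BytesIO plays the role that StringIO plays above (that substitution and the sample bytes are assumptions for illustration). The stream can be overwritten in place while the immutable object it was built from stays unchanged, so the two cannot go on sharing one buffer once a write happens.

```python
import io

data = b"hello world"          # immutable bytes, as file.read(SIZE) would return
stream = io.BytesIO(data)      # mutable in-memory stream built from it

stream.seek(0)
stream.write(b"J")             # overwrite the first byte inside the stream

print(data)                    # the original object is untouched
print(stream.getvalue())       # the stream holds its own, modified contents
```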
Outside of StringIO, there are solutions that avoid the extra copy. Off the top of my head, NumPy can read a file directly into a modifiable byte array, with no extra copy.
The array may be cumbersome to work with, depending on the usage you intend, since it's an array of values from 0 to 255, not an array of characters. But it's functionally equivalent to a StringIO object, and using np.fromstring, np.tostring, np.tofile and slicing notation should get you where you want. (In current NumPy releases, np.frombuffer and ndarray.tobytes replace the deprecated fromstring and tostring.) You might also need np.insert, np.delete and np.append. I'm sure there are other modules that will do similar things.
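As a rough illustration of the NumPy route, the sketch below assumes np.fromfile with a uint8 dtype is the kind of call meant here; the sample file and its contents are made up for the example.

```python
import os
import tempfile

import numpy as np

# Hypothetical sample file standing in for the real input.
path = os.path.join(tempfile.mkdtemp(), "sample.bin")
with open(path, "wb") as f:
    f.write(bytes(range(16)))

# Read the file straight into a writable array of unsigned bytes.
a = np.fromfile(path, dtype=np.uint8)

a[0] = 255                     # modify in place, no intermediate bytes object
chunk = a[4:8]                 # basic slicing returns a view, not a copy
chunk[:] = 0                   # so this also changes `a`

raw = a.tobytes()              # copy out as immutable bytes only when needed
print(a[:8], len(raw))
```

Slices taken with basic indexing are views onto the same memory, which is what keeps partial edits cheap.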
TIMEIT:
How much does all this really matter? Well, let's see. I made a 100 MB file, largefile.bin. Then I read in the file using both methods and changed the first byte. In my case, using StringIO was 50% slower than using numpy.
Lastly, for comparison, I timed editing the file directly. That was nearly 4500 times faster. Of course, it's extremely dependent on what you're going to do with the file. Altering the first byte is hardly representative. But using this method, you do have a head start on the other two, and since most operating systems have good disk buffering, the speed may be very good too.
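Editing a file in place could look something like the sketch below. The small temporary file is a hypothetical stand-in for largefile.bin, and no timing is attempted here.

```python
import os
import tempfile

# Hypothetical small file standing in for largefile.bin.
path = os.path.join(tempfile.mkdtemp(), "largefile.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * 1024)

# "r+b" opens for reading and writing without truncating the file.
with open(path, "r+b") as f:
    f.seek(0)
    f.write(b"\x01")           # overwrite only the first byte

with open(path, "rb") as f:
    first = f.read(1)
print(first, os.path.getsize(path))
```

Only the byte that changes is written, which is why this approach never pays for loading the whole file into memory.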
(If you're not allowed to edit the file and so want to avoid the cost of making a working copy, there are a couple of possible ways to increase the speed. If you can choose the filesystem, Btrfs has a copy-on-write file copy operation -- making the act of taking a copy of a file virtually instant. The same effect can be achieved using an LVM snapshot of any filesystem.)
No, there is not an extra copy made. The buffer used to store the data is the same. Both data and the internal attribute accessible using StringIO.getvalue() are different names for the same data.
A quick skim through the source shows that cStringIO doesn't make a copy on construction either, but it does make a copy on calling cStringIO.getvalue(), so I can't repeat the demonstration for it.
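One way to check a claim like this on your own interpreter is an identity test. The sketch below uses Python 3's io.BytesIO as a stand-in for the Python 2 classes discussed here (an assumption). Whether the same object comes back is an implementation detail, so the result may differ between versions and implementations.

```python
import io

data = b"x" * 4096             # sample payload (an assumption)
stream = io.BytesIO(data)
value = stream.getvalue()

# True means getvalue() handed back the very same object, i.e. no copy.
print("same object:", value is data)
print("equal contents:", value == data)
```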
Maybe what you're looking for is a buffer/memoryview. This way you can access a slice of the original data without copying it. However, you must be interested in accessing that data only in a byte-oriented format, since that's what the buffer protocol provides.
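A minimal sketch of the idea in Python 3, using memoryview over a bytearray; the sample buffer is made up for the example.

```python
data = bytearray(b"0123456789")    # sample buffer (an assumption)
view = memoryview(data)

part = view[2:6]                   # slice of the view: no bytes are copied
print(bytes(part))                 # copying happens only here, on request

data[2] = ord("X")                 # change the underlying buffer...
print(bytes(part))                 # ...and the slice reflects it
```

Because the slice tracks later changes to the buffer, it is clearly a window onto the original memory rather than a snapshot.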
You can find more information in this related question.
Edit: In this blog post, which I found through reddit, some more information is given regarding the same problem. According to the author, no extra copy is created and the data can be modified, since bytearray is mutable.
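The blog post's own code is not reproduced here. As a sketch of the general idea, and assuming it relies on readinto(), a preallocated bytearray can be filled straight from the file; the file name, contents, and SIZE below are illustrative.

```python
import os
import tempfile

SIZE = 8                           # number of bytes wanted (an assumption)

# Hypothetical sample file for the sketch.
path = os.path.join(tempfile.mkdtemp(), "sample.bin")
with open(path, "wb") as f:
    f.write(b"abcdefghijklmnop")

buf = bytearray(SIZE)              # preallocated, mutable destination
with open(path, "rb") as f:
    n = f.readinto(buf)            # fills buf in place, returns bytes read

buf[0] = ord("A")                  # the data can be modified directly
print(n, bytes(buf[:n]))
```

readinto() may return fewer bytes than the buffer holds, for example near the end of the file, so the returned count should be used when slicing.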