Reading multiple Python pickled objects at once, buffering and newlines?
To give you context:

I have a large file f, several gigs in size. It contains consecutive pickles of different objects that were generated by running

    for obj in objs: cPickle.dump(obj, f)

I want to take advantage of buffering when reading this file. What I want is to read several pickled objects into a buffer at a time. What is the best way of doing this? I want an analogue of readlines(buffsize) for pickled data. In fact, if the pickled data were indeed newline-delimited, one could use readlines, but I am not sure whether that is true.
Another option that I have in mind is to dumps() each object to a string first and then write the strings to a file, each separated by a newline. To read the file back I could use readlines() and loads(). But I fear that a pickled object may contain the "\n" character, and that it would throw off this file-reading scheme. Is my fear unfounded?
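A quick way to test that fear directly (a minimal sketch, assuming Python 2's cPickle):

    import cPickle

    # Do pickles ever contain "\n"? Yes, on both text and binary protocols:
    print "\n" in cPickle.dumps(42)      # True: protocol 0 opcodes are newline-terminated text
    print "\n" in cPickle.dumps(10, 2)   # True: in binary protocols the byte 0x0a can occur as data

Both lines print True, so splitting on newlines alone is not safe.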
One option is to pickle everything out as one huge list of objects, but that would require more memory than I can afford. The setup could be sped up by multi-threading, but I do not want to go there before I get the buffering working properly. What's the "best practice" for situations like this?
EDIT:
I can also read raw bytes into a buffer and invoke loads() on that, but I need to know how many bytes of that buffer were consumed by loads(), so that I can throw the head away.
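One way to get that byte count is to unpickle from a file-like wrapper and check its position afterwards. A minimal sketch, assuming Python 2 (load_one is a hypothetical helper name):

    import cPickle
    from cStringIO import StringIO

    def load_one(buf):
        # Unpickle a single object from the front of a byte string and
        # report how many bytes cPickle consumed.
        stream = StringIO(buf)
        obj = cPickle.load(stream)    # raises EOFError if buf ends mid-pickle
        return obj, stream.tell()     # tell() gives the bytes consumed

After obj, n = load_one(buf), the unconsumed tail is buf[n:], which can be kept and prepended to the next chunk read from disk.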
Comments (4)
You don't need to do anything, I think.
file.readlines() returns a list of the entire contents of the file. You'll want to read a few lines at a time. I think this naive code should unpickle your data:
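(The code block itself did not survive in this copy. As a stand-in, here is a minimal sketch that streams one pickle at a time and lets the file object's own buffering do the work; it is not necessarily the author's original snippet, and 'objects.pkl' is a hypothetical filename.)

    import cPickle

    objs = []
    with open("objects.pkl", "rb") as f:
        while True:
            try:
                objs.append(cPickle.load(f))   # each call reads exactly one pickle
            except EOFError:
                break                          # clean end of file: no more pickles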
If you have any control over the program that generates the pickles, I'd pick one of:

the shelve module.

By the way, I suspect that the file's built-in buffering should get you 99% of the performance gains you're looking for. If you're convinced that I/O is blocking you, have you thought about trying mmap() and letting the OS handle paging in blocks at a time?
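A minimal sketch of the mmap() idea, assuming Python 2 ('objects.pkl' is a hypothetical filename; mmap objects are file-like, providing the read()/readline() that cPickle.load() needs):

    import mmap
    import cPickle

    f = open("objects.pkl", "rb")
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)  # map the whole file; the OS pages it in on demand
    objs = []
    while m.tell() < m.size():
        objs.append(cPickle.load(m))   # works because mmap supports read()/readline()
    m.close()
    f.close()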
You might want to look at the shelve module. It uses a database module such as dbm to create an on-disk dictionary of objects. The objects themselves are still serialized using pickle. That way you could read sets of objects instead of one big pickle at a time.
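A minimal sketch of that approach, assuming Python 2 ('objects.db' is a hypothetical path, and objs is the same iterable as in the question):

    import shelve

    # Write: one key per object instead of one long stream of pickles.
    db = shelve.open("objects.db")
    for i, obj in enumerate(objs):
        db[str(i)] = obj               # shelve keys must be strings
    db.close()

    # Read back: fetch only the objects you need, not the whole file.
    db = shelve.open("objects.db")
    first_ten = [db[str(i)] for i in range(10)]
    db.close()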
If you want to add buffering to any file, open it via io.open(). Here is an example which will read from the underlying stream in 128K chunks. Each call to cPickle.load() will be fulfilled from the internal buffer until it is exhausted, then another chunk will be read from the underlying file:
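(The example itself was lost in this copy; the following is a reconstruction under the stated assumptions, with 'objects.pkl' as a hypothetical filename.)

    import io
    import cPickle

    f = io.open("objects.pkl", "rb", buffering=128 * 1024)  # BufferedReader with a 128K buffer
    while True:
        try:
            obj = cPickle.load(f)   # served from the internal buffer, refilled in 128K chunks
        except EOFError:
            break                   # end of stream
        # ... process obj here ...
    f.close()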