Reading multiple Python pickled objects at a time: buffering and newlines?

Posted 2024-10-28 16:20:31

To give you context:

I have a large file f, several gigabytes in size. It contains consecutive pickles of different objects that were generated by running

for obj in objs: cPickle.dump(obj, f)

I want to take advantage of buffering when reading this file. What I want is to read several pickled objects into a buffer at a time. What is the best way of doing this? I want an analogue of readlines(buffsize) for pickled data. In fact, if the pickled data is indeed newline delimited, one could use readlines, but I am not sure if that is true.

Another option that I have in mind is to dumps() each object to a string first and then write the strings to a file, each separated by a newline. To read the file back I can use readlines() and loads(). But I fear that a pickled object may contain the "\n" character, which would throw off this file reading scheme. Is my fear unfounded?

One option is to pickle it out as a huge list of objects, but that will require more memory than I can afford. The setup can be sped up by multi-threading, but I do not want to go there before I get the buffering working properly. What's the "best practice" for situations like this?

EDIT:
I can also read raw bytes into a buffer and invoke loads on that, but I need to know how many bytes of that buffer were consumed by loads so that I can throw the head away.
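
For readers weighing the two ideas above, here is a minimal sketch of how they could be checked. It is written for Python 3 (where cPickle has been folded into pickle), and the sample values are made up. It suggests that a newline byte can sit inside a pickle, and that wrapping a raw buffer in io.BytesIO lets tell() report how many bytes one load() consumed:

```python
import io
import pickle

# Hypothetical sample data: two objects pickled back to back,
# as a loop of dump() calls would produce.
first = {"name": "apples", "note": "line one\nline two"}
second = ["bananas", 42]
first_bytes = pickle.dumps(first, protocol=2)
raw = first_bytes + pickle.dumps(second, protocol=2)

# Binary protocols store string bytes verbatim, so a newline
# can appear in the middle of a pickle.
contains_newline = b"\n" in first_bytes

# Wrap the raw bytes in a file-like object; load() stops right
# after the end of one pickle.
stream = io.BytesIO(raw)
obj1 = pickle.load(stream)
consumed = stream.tell()  # bytes used by the first pickle

# The rest of the buffer starts at the next pickle.
obj2 = pickle.loads(raw[consumed:])
```

In this sketch the head of the buffer can be dropped by slicing at consumed, although simply calling pickle.load() on the same stream again avoids the copy.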


Comments (4)

壹場煙雨 2024-11-04 16:20:31

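You don't need to do anything, I think. Each pickle.load() call stops at the end of one pickle, so repeated calls on the same file object return the consecutive objects until EOFError is raised: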

>>> import pickle
>>> import StringIO
>>> s = StringIO.StringIO(pickle.dumps('apples') + pickle.dumps('bananas'))
>>> pickle.load(s)
'apples'
>>> pickle.load(s)
'bananas'
>>> pickle.load(s)

Traceback (most recent call last):
  File "<pyshell#25>", line 1, in <module>
    pickle.load(s)
  File "C:\Python26\lib\pickle.py", line 1370, in load
    return Unpickler(file).load()
  File "C:\Python26\lib\pickle.py", line 858, in load
    dispatch[key](self)
  File "C:\Python26\lib\pickle.py", line 880, in load_eof
    raise EOFError
EOFError
>>> 
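
The session above could be wrapped in a small generator. This is only a sketch for Python 3 (the transcript itself is Python 2.6), and iter_pickles is a made-up helper name, not part of the pickle module:

```python
import io
import pickle


def iter_pickles(fileobj):
    """Yield objects from a stream of back-to-back pickles until end of file."""
    while True:
        try:
            yield pickle.load(fileobj)
        except EOFError:
            # Raised when load() is called with no data left.
            return


# Same data as in the interactive session, held in memory.
stream = io.BytesIO(pickle.dumps("apples") + pickle.dumps("bananas"))
fruits = list(iter_pickles(stream))
```

Note that a truncated final pickle can also surface as EOFError, so this sketch would silently stop there instead of reporting the damage.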

乄_柒ぐ汐 2024-11-04 16:20:31

Called without a size hint, file.readlines() returns a list of the entire contents of the file. You'll want to read a few lines at a time. I think this naive code should unpickle your data:

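import pickle

# Assumes protocol 0 (text) pickles, each followed by a newline, so that the
# STOP opcode "." ends a line. cPickle.dump() alone writes no such newline.
infile = open('/tmp/pickle', 'rb')
buf = []
while True:
    line = infile.readline()
    if not line:
        break
    buf.append(line)
    if line.endswith('.\n'):  # taken as the end of one pickle
        print 'Decoding', buf
        print pickle.loads(''.join(buf))
        buf = []
infile.close()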

If you have any control over the program that generates the pickles, I'd pick one of:

  1. Use the shelve module.
  2. Write the length (in bytes) of each pickle before writing it to the file so that you know exactly how many bytes to read in each time.
  3. Same as above, but write the list of integers to a separate file so that you can use those values as an index into the file holding the pickles.
  4. Pickle a list of K objects at a time. Write the length of that pickle in bytes. Write the pickle. Repeat.
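
Options 2 and 4 could look roughly like the sketch below. It is written for Python 3, and the 8-byte little-endian length header and the names write_framed and read_framed are assumptions chosen for illustration, not an established format:

```python
import io
import pickle
import struct

# Assumed framing: each record is an 8-byte little-endian length
# followed by that many bytes of pickle data.
HEADER = struct.Struct("<Q")


def write_framed(fileobj, obj):
    """Write one length-prefixed pickle."""
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    fileobj.write(HEADER.pack(len(payload)))
    fileobj.write(payload)


def read_framed(fileobj):
    """Yield objects from a stream written by write_framed()."""
    while True:
        header = fileobj.read(HEADER.size)
        if not header:
            return  # clean end of file
        if len(header) < HEADER.size:
            raise EOFError("truncated length header")
        (length,) = HEADER.unpack(header)
        payload = fileobj.read(length)
        if len(payload) < length:
            raise EOFError("truncated pickle payload")
        yield pickle.loads(payload)


# Option 4: each record is a list of K objects (K varies here).
buffer = io.BytesIO()
batches = [[1, 2, 3], ["a", "b"], [{"k": "v"}]]
for batch in batches:
    write_framed(buffer, batch)
buffer.seek(0)
restored = list(read_framed(buffer))
```

Because every record announces its size, the reader never has to guess where a pickle ends, and the same lengths could be written to a separate index file for option 3.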

By the way, I suspect that the file's built-in buffering should get you 99% of the performance gains you're looking for.

If you're convinced that I/O is blocking you, have you thought about trying mmap() and letting the OS handle paging the file in, a block at a time?

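#!/usr/bin/env python

import mmap
import cPickle

fname = '/tmp/pickle'
infile = open(fname, 'rb')
# Map the whole file read-only; the OS pages it in on demand.
m = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)
start = 0
while True:
    # Look for the next '.\n', taken here as the end of a pickle.
    end = m.find('.\n', start + 1) + 2
    if end == 1:  # find() returned -1: no more pickles
        break
    print cPickle.loads(m[start:end])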
    start = end
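
A variant of the mmap idea that does not search for a terminator is sketched below for Python 3. It assumes the mmap object's read() and readline() methods are enough for pickle.load(), and it uses a temporary file and a made-up helper name, iter_pickles_mmap:

```python
import mmap
import os
import pickle
import tempfile


def iter_pickles_mmap(path):
    """Yield back-to-back pickles from a memory-mapped file."""
    with open(path, "rb") as infile:
        # Length 0 maps the whole file; an empty file cannot be mapped.
        with mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # load() advances the map's position past each pickle.
            while m.tell() < len(m):
                yield pickle.load(m)


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "objects.pkl")
    with open(path, "wb") as out:
        for obj in ("apples", "bananas", {"n": 3}):
            pickle.dump(obj, out)
    loaded = list(iter_pickles_mmap(path))
```

Letting the unpickler find the end of each pickle avoids false matches when the terminator bytes happen to occur inside the data.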

夜吻♂芭芘 2024-11-04 16:20:31

You might want to look at the shelve module. It uses a database module such as dbm to create an on-disk dictionary of objects. The objects themselves are still serialized using pickle. That way you could read sets of objects instead of one big pickle at a time.
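
A minimal sketch of the shelve approach, for Python 3. The shelf name, the string keys "0", "1", "2", and the sample objects are all made up for illustration:

```python
import os
import shelve
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "objects_shelf")

    # Write: each object is pickled and stored under its own string key.
    with shelve.open(path) as shelf:
        for index, obj in enumerate(["apples", "bananas", {"n": 3}]):
            shelf[str(index)] = obj

    # Read: fetch only the entries that are needed, in any order.
    with shelve.open(path, flag="r") as shelf:
        some = [shelf[key] for key in ("0", "2")]
```

Shelf keys must be strings, so sequential data needs some key scheme such as the stringified index used here.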

装迷糊 2024-11-04 16:20:31

If you want to add buffering to any file, open it via io.open(). Here is an example which will read from the underlying stream in 128K chunks. Each call to cPickle.load() will be fulfilled from the internal buffer until it is exhausted, then another chunk will be read from the underlying file:

import cPickle
import io

# 128 KiB read buffer; each cPickle.load(buf) call is served from it
buf = io.open('objects.pkl', 'rb', buffering=(128 * 1024))
obj = cPickle.load(buf)
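
To get something closer to a readlines(buffsize) analogue, the buffered file could be consumed a few objects at a time. The following is a sketch for Python 3, where the built-in open() accepts the same buffering argument. The helper names iter_pickles and iter_batches and the batch size of 3 are assumptions for illustration:

```python
import itertools
import os
import pickle
import tempfile


def iter_pickles(fileobj):
    """Yield objects from back-to-back pickles until end of file."""
    while True:
        try:
            yield pickle.load(fileobj)
        except EOFError:
            return


def iter_batches(fileobj, batch_size):
    """Yield lists of up to batch_size unpickled objects."""
    objects = iter_pickles(fileobj)
    while True:
        batch = list(itertools.islice(objects, batch_size))
        if not batch:
            return
        yield batch


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "objects.pkl")
    with open(path, "wb") as out:
        for number in range(7):
            pickle.dump({"id": number}, out)

    # 128 KiB buffer, as in the example above.
    with open(path, "rb", buffering=128 * 1024) as buffered:
        batches = list(iter_batches(buffered, batch_size=3))
```

Only one batch is held in memory at a time when the caller iterates over iter_batches() directly instead of collecting everything into a list as this demo does.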