Reading several Python pickled objects at a time: buffering and newlines?

Posted 2024-10-28 16:20:31

To give you context:

I have a large file f, several gigabytes in size. It contains consecutive pickles of different objects that were generated by running

for obj in objs: cPickle.dump(obj, f)

I want to take advantage of buffering when reading this file. What I want is to read several pickled objects into a buffer at a time. What is the best way of doing this? I want an analogue of readlines(buffsize) for pickled data. In fact, if the pickled data were newline-delimited one could use readlines, but I am not sure that is true.

Another option I have in mind is to dumps() each object to a string first and then write the strings to a file, each separated by a newline. To read the file back I can use readlines() and loads(). But I fear that a pickled object may contain the "\n" character, which would throw off this file-reading scheme. Is my fear unfounded?

One option is to pickle everything out as one huge list of objects, but that would require more memory than I can afford. The setup could be sped up by multi-threading, but I do not want to go there before I get the buffering working properly. What is the "best practice" for situations like this?

EDIT:
I can also read raw bytes into a buffer and invoke loads() on that, but I need to know how many bytes of that buffer were consumed by loads() so that I can throw the head away.
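
For reference, one way to get that byte count: load() on a file-like object stops right after the pickle it just read, so tell() afterwards should report how many bytes were consumed. A minimal sketch, with StringIO standing in for the raw-byte buffer and placeholder sample strings:

import cPickle
import StringIO

# StringIO stands in for a chunk of raw bytes read from the big file;
# the two sample strings are placeholders.
raw = cPickle.dumps('apples') + cPickle.dumps('bananas')
buf = StringIO.StringIO(raw)

obj = cPickle.load(buf)    # unpickle the first object only
consumed = buf.tell()      # how many bytes that load() consumed
tail = raw[consumed:]      # the still-pickled remainder ("throw the head away")
print obj, consumed, len(tail)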

Comments (4)

壹場煙雨 2024-11-04 16:20:31

You don't need to do anything, I think.

>>> import pickle
>>> import StringIO
>>> s = StringIO.StringIO(pickle.dumps('apples') + pickle.dumps('bananas'))
>>> pickle.load(s)
'apples'
>>> pickle.load(s)
'bananas'
>>> pickle.load(s)

Traceback (most recent call last):
  File "<pyshell#25>", line 1, in <module>
    pickle.load(s)
  File "C:\Python26\lib\pickle.py", line 1370, in load
    return Unpickler(file).load()
  File "C:\Python26\lib\pickle.py", line 858, in load
    dispatch[key](self)
  File "C:\Python26\lib\pickle.py", line 880, in load_eof
    raise EOFError
EOFError
>>> 
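
Applied to the asker's multi-gigabyte file, that amounts to calling load() in a loop until it raises EOFError. A rough sketch (the file name is a placeholder):

import cPickle

with open('objects.pkl', 'rb') as f:     # placeholder file name
    while True:
        try:
            obj = cPickle.load(f)        # reads exactly one pickled object
        except EOFError:
            break                        # end of the pickle stream
        print obj                        # process obj here instead
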
乄_柒ぐ汐 2024-11-04 16:20:31

file.readlines() returns the entire contents of the file as a list of lines. You'll want to read a few lines at a time. I think this naive code should unpickle your data:

import pickle
infile = open('/tmp/pickle', 'rb')
buf = []
while True:
    line = infile.readline()
    if not line:
        break
    buf.append(line)
    if line.endswith('.\n'):
        print 'Decoding', buf
        print pickle.loads(''.join(buf))
        buf = []

If you have any control over the program that generates the pickles, I'd pick one of:

  1. Use the shelve module.
  2. Print the length (in bytes) of each pickle before writing it to the file so that you know exactly how many bytes to read in each time (see the sketch after this list).
  3. Same as above, but write the list of integers to a separate file so that you can use those values as an index into the file holding the pickles.
  4. Pickle a list of K objects at a time. Write the length of that pickle in bytes. Write the pickle. Repeat.
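
A rough sketch of the length-prefix framing from options 2 and 4, with placeholder file names and a made-up objs list. Each pickle is preceded by a 4-byte size, so the reader always knows exactly how many bytes to pull into its buffer next:

import cPickle
import struct

objs = [{'id': i, 'payload': 'x' * i} for i in range(5)]    # placeholder data

# write: prefix each pickle with its length in bytes
with open('framed.pkl', 'wb') as out:
    for obj in objs:
        data = cPickle.dumps(obj, cPickle.HIGHEST_PROTOCOL)
        out.write(struct.pack('>I', len(data)))    # 4-byte big-endian length
        out.write(data)

# read: the prefix says exactly how many bytes the next object needs
with open('framed.pkl', 'rb') as f:
    while True:
        header = f.read(4)
        if len(header) < 4:
            break
        size = struct.unpack('>I', header)[0]
        print cPickle.loads(f.read(size))

Because the framing is explicit, nothing depends on line structure, so any "\n" bytes inside the pickles are harmless here.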

By the way, I suspect that the file's built-in buffering should get you 99% of the performance gains you're looking for.

If you're convinced that I/O is blocking you, have you thought about trying mmap() and letting the OS handle packing in blocks at a time?

#!/usr/bin/env python

import mmap
import cPickle

fname = '/tmp/pickle'
infile = open(fname, 'rb')
m = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)
start = 0
while True:
    end = m.find('.\n', start + 1) + 2
    if end == 1:
        break
    print cPickle.loads(m[start:end])
    start = end
夜吻♂芭芘 2024-11-04 16:20:31

You might want to look at the shelve module. It uses a database module such as dbm to create an on-disk dictionary of objects. The objects themselves are still serialized using pickle. That way you could read sets of objects instead of one big pickle at a time.
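
A minimal sketch of that idea, with a placeholder shelf name and made-up objects; each object is stored under its own string key, so individual objects can be fetched without unpickling everything else:

import shelve

objs = ['apples', 'bananas', 'cherries']    # placeholder data

# write each object under its own key instead of one long pickle stream
db = shelve.open('objects.shelf')
for i, obj in enumerate(objs):
    db[str(i)] = obj                        # shelve keys must be strings
db.close()

# later, pull back only the objects you need, one key at a time
db = shelve.open('objects.shelf', flag='r')
print db['1']
db.close()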

装迷糊 2024-11-04 16:20:31

If you want to add buffering to any file, open it via io.open(). Here is an example which will read from the underlying stream in 128K chunks. Each call to cPickle.load() will be fulfilled from the internal buffer until it is exhausted, then another chunk will be read from the underlying file:

import cPickle
import io

buf = io.open('objects.pkl', 'rb', buffering=(128 * 1024))
obj = cPickle.load(buf)