Numpy array from a cStringIO object while avoiding copies

Posted on 2024-11-17 11:22:26

This is to understand things better; it is not an actual problem that I need to fix. A cStringIO object is supposed to emulate a string, a file, and an iterator over its lines. Does it also emulate a buffer? In any case, ideally one should be able to construct a numpy array as follows:

import numpy as np
import cStringIO

c = cStringIO.StringIO('\x01\x00\x00\x00\x01\x00\x00\x00')

# Trying the iterator abstraction
b = np.fromiter(c, int)
# The above fails with: ValueError: setting an array element with a sequence.

# Trying the file abstraction
b = np.fromfile(c, int)
# The above fails with: IOError: first argument must be an open file

# Trying the sequence abstraction
b = np.array(c, int)
# The above fails with: TypeError: long() argument must be a string or a number

# Trying the string abstraction
b = np.fromstring(c)
# The above fails with: TypeError: argument 1 must be string or read-only buffer

b = np.fromstring(c.getvalue(), int)  # does work

My question is: why does it behave this way?

The practical problem where this came up is the following: I have an iterator which yields tuples. I am interested in making a numpy array from one component of each tuple with as little copying and duplication as possible. My first cut was to keep writing the interesting component of each yielded tuple into a StringIO object and then use its memory buffer for the array. I can of course use getvalue(), but that will create and return a copy. What would be a good way to avoid the extra copying?

┼── 2024-11-24 11:22:26

The problem seems to be that numpy doesn't like being given characters instead of numbers. Remember, in Python, single characters and strings have the same type — numpy must have some type detection going on under the hood, and takes '\x01' to be a nested sequence.

The other problem is that a cStringIO iterates over its lines, not its characters.

Something like the following iterator should get around both of these problems:

def chariter(filelike):
    octet = filelike.read(1)
    while octet:
        yield ord(octet)
        octet = filelike.read(1)

Use it like so (note the seek!):

c.seek(0)
b = np.fromiter(chariter(c), int)
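
For readers on Python 3, where cStringIO no longer exists, the same idea can be sketched with io.BytesIO. This is only an illustration under that assumption: the names byte_iter, stream, per_byte and as_int32 are hypothetical and not part of the answer above. Note that this approach yields one array element per byte; reinterpreting the bytes as 32-bit little-endian integers is a separate step.

```python
import io
import numpy as np

def byte_iter(filelike):
    """Yield each byte of a binary file-like object as an int in 0..255."""
    chunk = filelike.read(1)
    while chunk:
        # Indexing a bytes object in Python 3 already returns an int,
        # so ord() is not needed.
        yield chunk[0]
        chunk = filelike.read(1)

stream = io.BytesIO(b'\x01\x00\x00\x00\x01\x00\x00\x00')
stream.seek(0)  # make sure reading starts at the beginning

# One element per byte.
per_byte = np.fromiter(byte_iter(stream), dtype=np.uint8)

# Reinterpret the same 8 bytes as two little-endian 32-bit integers.
as_int32 = per_byte.view('<i4')
```

Reading one byte at a time from Python is slow for large inputs, so this is mainly useful for understanding why the original np.fromiter(c, int) call failed.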

[浮城] 2024-11-24 11:22:26

As cStringIO does not implement the buffer interface, if its getvalue returns a copy of the data, then there is no way to get its data without copying.

If getvalue returns the buffer as a string without making a copy, numpy.frombuffer(x.getvalue(), dtype='S1') will give a (read-only) numpy array referring to the string, without an additional copy.
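
As a point of comparison, and assuming Python 3 rather than the Python 2 cStringIO discussed here, io.BytesIO does expose its internal buffer through getbuffer(), which numpy.frombuffer can wrap without copying. The sketch below is illustrative only; the variable names are hypothetical.

```python
import io
import struct
import numpy as np

stream = io.BytesIO()
for value in (1, 1):
    # Write each value as a little-endian 32-bit integer.
    stream.write(struct.pack('<i', value))

# getbuffer() returns a memoryview over the BytesIO contents (no copy).
view = stream.getbuffer()

# frombuffer wraps that memory instead of copying it.
arr = np.frombuffer(view, dtype='<i4')
first_before = int(arr[0])

# Changing a byte through the memoryview is visible through the array,
# which shows that both refer to the same memory.
view[0] = 7
first_after = int(arr[0])
```

While such a view exists, the BytesIO object cannot be resized, so all writes have to be finished before the array is created.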

The reason why np.fromiter(c, int) and np.array(c, int) do not work is that cStringIO, when iterated, returns one line at a time, just as a file does:

>>> list(iter(c))
['\x01\x00\x00\x00\x01\x00\x00\x00']

Such a long string cannot be converted to a single integer.
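
The same line-oriented iteration can be observed on Python 3, assuming io.BytesIO as a stand-in for cStringIO. The names below are hypothetical and the snippet is only a sketch of the behaviour.

```python
import io

# No newline in the data: iteration yields the whole content as one "line".
single = list(io.BytesIO(b'\x01\x00\x00\x00\x01\x00\x00\x00'))

# With a newline byte in the data, iteration splits after it.
split = list(io.BytesIO(b'ab\ncd'))
```

Here single holds one 8-byte element and split holds b'ab\n' followed by b'cd', so the iterator never produces individual numbers.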

***

It's best not to worry too much about making copies unless it really turns out to be a problem. The extra overhead of, for example, using a generator and passing it to numpy.fromiter may actually be larger than that of constructing a list and then passing it to numpy.array; making the copies may be cheap compared to Python runtime overhead.
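
The two approaches compared here could look as follows. This is a minimal sketch with a made-up records() generator standing in for the iterator of tuples from the question; which variant is faster depends on the data and should be measured rather than assumed.

```python
import numpy as np

def records():
    """Hypothetical source of tuples; only the second component is wanted."""
    for i in range(5):
        yield (i, i * 0.5, 'row%d' % i)

# Variant 1: feed a generator expression to numpy.fromiter (no temporary list).
via_iter = np.fromiter((r[1] for r in records()), dtype=np.float64)

# Variant 2: build a temporary list, then convert it in one step.
via_list = np.array([r[1] for r in records()], dtype=np.float64)
```

If the number of items is known in advance, passing it as the count argument of numpy.fromiter lets the output array be allocated once.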

However, if the issue is with memory, then one solution is to put the items directly into the final Numpy array. If you know the size beforehand, you can pre-allocate it. If the size is unknown, you can use the .resize() method in the array to grow it as needed.
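
One way the grow-as-needed idea could be sketched is shown below. The function name column_to_array, its parameters, and the doubling strategy are assumptions made for illustration, not something prescribed by the answer.

```python
import numpy as np

def column_to_array(records, index, dtype=np.int64, initial_capacity=4):
    """Collect record[index] for every record into a 1-D array.

    The array starts with a small capacity and is doubled in place
    whenever it fills up, then trimmed to the number of items stored.
    """
    out = np.empty(initial_capacity, dtype=dtype)
    count = 0
    for record in records:
        if count == out.shape[0]:
            # refcheck=False skips numpy's reference check; that is only
            # safe here because no other array or view refers to `out`.
            out.resize((2 * out.shape[0],), refcheck=False)
        out[count] = record[index]
        count += 1
    out.resize((count,), refcheck=False)  # drop the unused tail
    return out

# Hypothetical iterator of tuples; the second component is the one of interest.
records = ((i, i * i, str(i)) for i in range(10))
squares = column_to_array(records, 1)
```

Doubling the capacity keeps the number of reallocations logarithmic in the final size, at the cost of temporarily holding up to twice the needed memory.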
