numpy recarray strings of variable length

Posted on 2025-01-01 14:59:43

Is it possible to initialise a numpy recarray that will hold strings, without knowing the length of the strings beforehand?

As a (contrived) example:

mydf = np.empty( (numrows,), dtype=[ ('file_name','STRING'), ('file_size_MB',float) ] )

The problem is that I'm constructing my recarray in advance of populating it with information, and I don't necessarily know the maximum length of file_name in advance.

All my attempts result in the string field being truncated:

>>> mydf = np.empty( (2,), dtype=[('file_name',str),('file_size_mb',float)] )
>>> mydf['file_name'][0]='foobarasdf.tif'
>>> mydf['file_name'][1]='arghtidlsarbda.jpg'
>>> mydf
array([('', 6.9164002347457e-310), ('', 9.9413127e-317)], 
      dtype=[('file_name', 'S'), ('file_size_mb', '<f8')])
>>> mydf['file_name']
array(['f', 'a'], 
      dtype='|S1')

(As an aside, why does mydf['file_name'] show 'f' and 'a' whilst mydf shows '' and ''?)

Similarly, if I initialise with type (say) |S10 for file_name then things get truncated at length 10.

The only similar question I could find is this one, but this calculates the appropriate string length a priori and hence is not quite the same as mine (as I know nothing in advance).

Is there any alternative other than initialising the file_name with (e.g.) |S9999999999999 (i.e. some ridiculous upper limit)?

水中月 2025-01-08 14:59:43

Instead of using the STRING dtype, one can always use object as dtype. That will allow any object to be assigned to an array element, including Python variable length strings. For example:

>>> import numpy as np
>>> mydf = np.empty( (2,), dtype=[('file_name',object),('file_size_mb',float)] )
>>> mydf['file_name'][0]='foobarasdf.tif'
>>> mydf['file_name'][1]='arghtidlsarbda.jpg'
>>> mydf
array([('foobarasdf.tif', 0.0), ('arghtidlsarbda.jpg', 0.0)], 
      dtype=[('file_name', '|O8'), ('file_size_mb', '<f8')])

It is against the spirit of the array concept to have variable length elements, but this is as close as one can get. The idea of an array is that elements are stored in memory at well-defined and regularly spaced memory addresses, which prohibits variable length elements. By storing pointers to the strings in the array instead of the strings themselves, one can circumvent this limitation. (This is basically what the above example does.)
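
To make the contrast concrete, here is a minimal sketch that puts a fixed-width field and an object field side by side. It is not part of the original answer and assumes Python 3 with a reasonably recent NumPy; 'U10' is used as the Unicode analogue of the |S10 mentioned in the question, and the names `fixed` and `flexible` are just illustrative.

```python
import numpy as np

names = ['foobarasdf.tif', 'arghtidlsarbda.jpg']

# Fixed-width field: every element occupies exactly 10 characters,
# so longer values are silently truncated on assignment.
fixed = np.zeros((2,), dtype=[('file_name', 'U10'), ('file_size_mb', float)])
# Object field: each element holds a reference to an ordinary Python str,
# so the string length is not constrained by the array layout.
flexible = np.zeros((2,), dtype=[('file_name', object), ('file_size_mb', float)])

for i, name in enumerate(names):
    fixed['file_name'][i] = name
    flexible['file_name'][i] = name

print(fixed['file_name'])     # truncated to 10 characters each
print(flexible['file_name'])  # full strings preserved
```

The trade-off is that an object field gives up the contiguous, fixed-size layout, so vectorised string operations and binary dumps of the raw buffer no longer behave as they would for a fixed-width field.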
