numpy: efficiently reading a large array

Posted on 2024-10-06 09:36:49

I have a binary file that contains a dense n*m matrix of 32-bit floats. What's the most efficient way to read it into a Fortran-ordered numpy array?

The file is multi-gigabyte in size. I get to control the format, but it must be compact (i.e. about 4*n*m bytes in length) and must be easy to produce from non-Python code.

edit: It is imperative that the method produces a Fortran-ordered matrix directly (due to the size of the data, I can't afford to create a C-ordered matrix and then transform it into a separate Fortran-ordered copy.)

Comments (2)

悲凉≈ 2024-10-13 09:36:50

NumPy provides fromfile() to read binary data.

a = numpy.fromfile("filename", dtype=numpy.float32)

will create a one-dimensional array containing your data. To access it as a two-dimensional Fortran-ordered n x m matrix, you can reshape it:

a = a.reshape((n, m), order="F")

[EDIT: According to the comments, reshape() copies the data in this case. To do it without copying, use

a = a.reshape((m, n)).T

Thanks to Joe Kington for pointing this out.]
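The approach above can be tried end to end on a small file. This is a minimal sketch, assuming the file stores the matrix column by column (column-major); the shape `n, m` and the file name `matrix.bin` are made up for illustration. It also shows `numpy.memmap` with `order="F"` as a possible alternative that maps the file instead of reading it into memory.

```python
import os
import tempfile

import numpy as np

n, m = 3, 4  # hypothetical matrix shape: n rows, m columns

# Build a small test file whose bytes are in column-major order.
# tofile() always writes in C order, so writing the transpose stores
# column 0 first, then column 1, and so on.
expected = np.arange(n * m, dtype=np.float32).reshape(n, m)
path = os.path.join(tempfile.mkdtemp(), "matrix.bin")  # hypothetical file name
expected.T.tofile(path)

# Read the flat data, then view it as an n x m Fortran-ordered matrix.
flat = np.fromfile(path, dtype=np.float32)
a = flat.reshape((m, n)).T

# Alternative: memory-map the file and ask for Fortran order directly.
mapped = np.memmap(path, dtype=np.float32, mode="r", shape=(n, m), order="F")

print(a.flags.f_contiguous, np.shares_memory(a, flat))
```

Checking `a.flags.f_contiguous` and `np.shares_memory(a, flat)` is a cheap way to confirm that the result is Fortran-ordered and that no second copy of the data was made.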

But to be honest, if your matrix is several gigabytes in size, I would go for an HDF5 tool like h5py or PyTables. Each tool has an FAQ entry comparing it to the other. I generally prefer h5py, though PyTables seems to be more commonly used (and the scopes of the two projects are slightly different).

HDF5 files can be written from most programming languages used in data analysis. The list of interfaces in the linked Wikipedia article is not complete; for example, there is also an R interface. But I don't actually know which language you want to use to write the data...

情感失落者 2024-10-13 09:36:50

Basically, NumPy stores arrays as flat vectors. The multiple dimensions are just an illusion created by the different views and strides that the NumPy iterator uses.
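As a small illustration of that point (a sketch, with variable names chosen here for the example), the same flat buffer can be viewed as a row-major or a column-major 3 x 4 matrix without copying; only the strides differ.

```python
import numpy as np

flat = np.arange(12)             # one flat buffer of 12 elements
c_view = flat.reshape(3, 4)      # row-major interpretation of the buffer
f_view = flat.reshape(4, 3).T    # column-major interpretation of the same buffer

# Both views share memory with the flat buffer; nothing was copied.
shared = np.shares_memory(c_view, flat) and np.shares_memory(f_view, flat)

# Express strides in elements rather than bytes so they do not depend on dtype size.
item = flat.itemsize
c_strides_in_items = tuple(s // item for s in c_view.strides)
f_strides_in_items = tuple(s // item for s in f_view.strides)

print(shared, c_strides_in_items, f_strides_in_items)
```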

For a thorough but easy-to-follow explanation of how NumPy works internally, see the excellent chapter 19 of the book Beautiful Code.

At least NumPy's array() and reshape() take an order argument: C ('C'), Fortran ('F'), or preserved order ('A').
See also the question "How to force numpy array order to fortran style?"
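A short sketch of the order argument on array() follows; the variable names are illustrative. Note that array() with order='F' makes a converted copy, so for multi-gigabyte data it is not a substitute for the no-copy approaches discussed in the other answer.

```python
import numpy as np

c = np.arange(12).reshape(3, 4)      # C-ordered by default
f = np.array(c, order="F")           # copy laid out in Fortran (column-major) order
kept_f = np.array(f, order="A")      # 'A' keeps Fortran order for a Fortran-contiguous input
kept_c = np.array(c, order="A")      # ... and C order otherwise

print(c.flags.c_contiguous, f.flags.f_contiguous)
print(kept_f.flags.f_contiguous, kept_c.flags.c_contiguous)
```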

An example with the default C indexing (row-major order):

>>> a = np.arange(12).reshape(3,4) # <- C order by default
>>> a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> a[1]
array([4, 5, 6, 7])

>>> a.strides
(32, 8)

Indexing using Fortran order (column-major order):

>>> a = np.arange(12).reshape(3,4, order='F')
>>> a
array([[ 0,  3,  6,  9],
       [ 1,  4,  7, 10],
       [ 2,  5,  8, 11]])
>>> a[1]
array([ 1,  4,  7, 10])

>>> a.strides
(8, 24)

The other view

Also, you can always get the other kind of view using the T attribute of an array:

>>> a = np.arange(12).reshape(3,4, order='C')
>>> a.T
array([[ 0,  4,  8],
       [ 1,  5,  9],
       [ 2,  6, 10],
       [ 3,  7, 11]])

>>> a = np.arange(12).reshape(3,4, order='F')
>>> a.T
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

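You can also set the strides manually (the NumPy documentation discourages assigning to strides directly and recommends numpy.lib.stride_tricks.as_strided instead):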

>>> a = np.arange(12).reshape(3,4, order='C')
>>> a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> a.strides
(32, 8)
>>> a.strides = (8, 24)
>>> a
array([[ 0,  3,  6,  9],
       [ 1,  4,  7, 10],
       [ 2,  5,  8, 11]])
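
The same reinterpretation can be sketched with numpy.lib.stride_tricks.as_strided, which returns a new view rather than modifying the array in place. The strides are computed from the item size here, so the example does not assume 8-byte integers. as_strided performs no bounds checking, so the shape and strides must stay within the original buffer.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

a = np.arange(12).reshape(3, 4)   # C-ordered 3 x 4 array
item = a.itemsize

# View the same buffer with column-major strides: stepping one row moves
# one element, stepping one column moves three elements.
b = as_strided(a, shape=(3, 4), strides=(item, 3 * item))

print(b)
```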