Very large matrices using Python and NumPy

Posted 2024-07-26 10:15:58

NumPy is an extremely useful library, and from using it I've found that it's capable of handling matrices which are quite large (10000 x 10000) easily, but begins to struggle with anything much larger (trying to create a matrix of 50000 x 50000 fails). Obviously, this is because of the massive memory requirements.

Is there a way to create huge matrices in NumPy natively (say 1 million by 1 million) without having several terabytes of RAM?
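
For scale, a dense float64 matrix costs rows x cols x 8 bytes, which a quick back-of-envelope check confirms (this calculation is an editorial illustration, not part of the original question):

print(50_000 * 50_000 * 8 / 1e9)        # 20.0 GB for the failing 50000 x 50000 case
print(1_000_000 * 1_000_000 * 8 / 1e12) # 8.0 TB for the 1 million x 1 million case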

Answers (9)

拥有 2024-08-02 10:15:59

As far as I know about numpy, no, but I could be wrong.

I can propose an alternative solution: write the matrix to disk and access it in chunks. I suggest the HDF5 file format. If you need transparent access, you can reimplement the ndarray interface to page your disk-stored matrix into memory. If you modify the data, be careful to sync it back to disk.
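
A minimal sketch of that chunked-access idea using the h5py package (the file name, dataset name, matrix size, and block size below are illustrative assumptions, not part of the original answer):

import h5py
import numpy as np

n, block = 50_000, 5_000  # hypothetical matrix size and chunk edge

with h5py.File('matrix.h5', 'w') as f:
    # lay the matrix out on disk in (block, block) chunks
    m = f.create_dataset('m', shape=(n, n), dtype='float64',
                         chunks=(block, block))
    for i in range(0, n, block):
        for j in range(0, n, block):
            # only one block (~200 MB here) is ever in RAM at a time
            m[i:i+block, j:j+block] = np.random.rand(block, block)

with h5py.File('matrix.h5', 'r') as f:
    top_left = f['m'][:block, :block]  # read back a single block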

楠木可依 2024-08-02 10:15:58

PyTables and NumPy are the way to go.

PyTables will store the data on disk in HDF format, with optional compression. My datasets often get 10x compression, which is handy when dealing with tens or hundreds of millions of rows. It's also very fast; my 5-year-old laptop can crunch through data doing SQL-like GROUP BY aggregation at 1,000,000 rows/second. Not bad for a Python-based solution!

Accessing the data as a NumPy recarray again is as simple as:

data = table[row_from:row_to]

The HDF library takes care of reading in the relevant chunks of data and converting to NumPy.
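
A rough sketch of that workflow, using a compressed PyTables CArray rather than a Table (the file name, shape, and compression settings are illustrative choices, not from the original answer); the slicing shown above works the same way:

import numpy as np
import tables

f = tables.open_file('data.h5', mode='w')
m = f.create_carray(f.root, 'm', tables.Float64Atom(),
                    shape=(1_000_000, 100),
                    filters=tables.Filters(complevel=5, complib='zlib'))

for i in range(0, 1_000_000, 10_000):
    # write 10,000 rows at a time; only one slab lives in RAM
    m[i:i+10_000, :] = np.random.rand(10_000, 100)

data = m[5_000:15_000]  # slicing returns a plain NumPy array
f.close()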

送君千里 2024-08-02 10:15:58

numpy.arrays are meant to live in memory. If you want to work with matrices larger than your RAM, you have to work around that. There are at least two approaches you can follow (a sketch of the first follows this list):

  1. Try a more efficient matrix representation that exploits any special structure that your matrices have. For example, as others have already pointed out, there are efficient data structures for sparse matrices (matrices with lots of zeros), like scipy.sparse.csc_matrix.
  2. Modify your algorithm to work on submatrices. You can read from disk only the matrix blocks that are currently being used in computations. Algorithms designed to run on clusters usually work blockwise, since the data is scattered across different computers and passed along only when needed. For example, the Fox algorithm for matrix multiplication (PDF file).
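
A small sketch of option 1, assuming the matrix is mostly zeros (the size and number of nonzeros below are arbitrary illustrations):

import numpy as np
from scipy import sparse

n, nnz = 1_000_000, 5_000_000
rng = np.random.default_rng(0)

# a million-by-million matrix with ~5 million nonzeros: only the
# nonzero values and their indices are stored, a few hundred MB
rows = rng.integers(0, n, size=nnz)
cols = rng.integers(0, n, size=nnz)
vals = rng.random(nnz)
m = sparse.csc_matrix((vals, (rows, cols)), shape=(n, n))

x = rng.random(n)
y = m @ x  # matrix-vector product without ever forming the dense matrix
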
妄断弥空 2024-08-02 10:15:58

You should be able to use numpy.memmap to memory-map a file on disk. With a newer Python and a 64-bit machine, you should have the necessary address space without loading everything into memory. The OS should handle keeping only part of the file in memory.
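
A minimal sketch, assuming a 50000 x 50000 float32 matrix (the file name and dtype are illustrative):

import numpy as np

# backs a ~10 GB array with a file on disk instead of RAM
m = np.memmap('big.dat', dtype=np.float32, mode='w+',
              shape=(50_000, 50_000))

m[:1000, :1000] = 1.0  # touched pages get written through to disk
m.flush()

# reopen later; nothing is read until you actually index it
m2 = np.memmap('big.dat', dtype=np.float32, mode='r',
               shape=(50_000, 50_000))
print(m2[:3, :3])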

赢得她心 2024-08-02 10:15:58

To handle sparse matrices, you need the scipy package that sits on top of numpy -- see here for more details about the sparse-matrix options that scipy gives you.
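
For a sense of the savings, a sketch comparing a dense array with its sparse counterpart (the size and fill pattern are arbitrary):

import numpy as np
from scipy import sparse

dense = np.zeros((5_000, 5_000))
dense[::100, ::100] = 1.0  # 2,500 nonzeros out of 25 million entries

s = sparse.csr_matrix(dense)
print(dense.nbytes)  # 200000000 bytes (~200 MB)
print(s.data.nbytes + s.indices.nbytes + s.indptr.nbytes)  # ~50 KB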

叫思念不要吵 2024-08-02 10:15:58

Stefano Borini's post got me to look into how far along this sort of thing already is.

This is it. It appears to do basically what you want. HDF5 will let you store very large datasets, and then access and use them in the same ways NumPy does.

作业与我同在 2024-08-02 10:15:58

Make sure you're using a 64-bit operating system and a 64-bit version of Python/NumPy. Note that on 32-bit architectures you can address typically 3GB of memory (with about 1GB lost to memory mapped I/O and such).

With 64-bit and arrays larger than the available RAM, you can get away with virtual memory, though things will get slower if you have to swap. Also, memory maps (see numpy.memmap) are a way to work with huge files on disk without loading them into memory, but again, you need a 64-bit address space for this to be of much use. PyTables will do most of this for you as well.
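
A quick way to verify which interpreter you are running (a small editorial sketch, not from the original answer):

import struct
import sys

print(struct.calcsize('P') * 8, 'bit Python')  # pointer size: 64 on a 64-bit build
print(sys.maxsize > 2**32)  # True on 64-bit builds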

若水微香 2024-08-02 10:15:58

Sometimes a simple solution is to use a custom type for your matrix items. Based on the range of numbers you need, you can pick a smaller dtype by hand for your items. Because NumPy defaults to the largest type otherwise, this can be a helpful idea in many cases. Here is an example:

In [70]: a = np.arange(5)

In [71]: a[0].dtype
Out[71]: dtype('int64')

In [72]: a.nbytes
Out[72]: 40

In [73]: a = np.arange(0, 2, 0.5)

In [74]: a[0].dtype
Out[74]: dtype('float64')

In [75]: a.nbytes
Out[75]: 32

And with custom type:

In [80]: a = np.arange(5, dtype=np.int8)

In [81]: a.nbytes
Out[81]: 5

In [76]: a = np.arange(0, 2, 0.5, dtype=np.float16)

In [78]: a.nbytes
Out[78]: 8
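
Scaled up to the sizes in the question, the dtype choice matters a great deal (a back-of-envelope sketch, not part of the original answer):

import numpy as np

# a 50000 x 50000 matrix at different element widths, in GB
print(50_000 * 50_000 * np.dtype(np.float64).itemsize / 1e9)  # 20.0
print(50_000 * 50_000 * np.dtype(np.int8).itemsize / 1e9)     # 2.5
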
笑红尘 2024-08-02 10:15:58

Are you asking how to handle a 2,500,000,000 element matrix without terabytes of RAM?

The way to handle 2 billion items without 8 billion bytes of RAM is by not keeping the matrix in memory.

That means much more sophisticated algorithms to fetch it from the file system in pieces.
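
As one hedged illustration of fetching the matrix in pieces, here is a sketch that streams row blocks from a disk-backed file and reduces as it goes (the file, shape, and block size are assumptions):

import numpy as np

n, block = 50_000, 1_000
m = np.memmap('big.dat', dtype=np.float64, mode='r', shape=(n, n))

# accumulate a column sum one row-block at a time;
# peak RAM is one block (~400 MB), never the whole 20 GB matrix
col_sum = np.zeros(n)
for i in range(0, n, block):
    col_sum += m[i:i+block, :].sum(axis=0)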
