Very large matrices using Python and NumPy
NumPy is an extremely useful library, and from using it I've found that it's capable of handling matrices which are quite large (10000 x 10000) easily, but begins to struggle with anything much larger (trying to create a matrix of 50000 x 50000 fails). Obviously, this is because of the massive memory requirements.
Is there a way to create huge matrices natively in NumPy (say 1 million by 1 million) in some way (without having several terabytes of RAM)?
9 Answers
As far as I know about numpy, no, but I could be wrong.
I can propose an alternative solution: write the matrix to disk and access it in chunks. I suggest the HDF5 file format. If you need transparent access, you can reimplement the ndarray interface to page your disk-stored matrix into memory. Be careful to sync the data back to disk if you modify it.
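A minimal sketch of that idea using h5py (one of several Python HDF5 bindings; the file name, dataset name and chunk shape below are arbitrary choices):

    import h5py
    import numpy as np

    # Create a 50000 x 50000 dataset on disk (~20 GB), stored in chunks and
    # never held in RAM all at once.
    with h5py.File("big_matrix.h5", "w") as f:
        dset = f.create_dataset("matrix", shape=(50000, 50000),
                                dtype="float64", chunks=(1000, 1000))
        # Fill it one block of rows at a time.
        for i in range(0, 50000, 1000):
            dset[i:i + 1000, :] = np.random.rand(1000, 50000)

    # Later, read back only the slice you actually need.
    with h5py.File("big_matrix.h5", "r") as f:
        block = f["matrix"][0:1000, 0:1000]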
PyTables and NumPy are the way to go.
PyTables will store the data on disk in HDF format, with optional compression. My datasets often get 10x compression, which is handy when dealing with tens or hundreds of millions of rows. It's also very fast; my 5-year-old laptop can crunch through data doing SQL-like GROUP BY aggregation at 1,000,000 rows/second. Not bad for a Python-based solution!
Accessing the data as a NumPy recarray again is as simple as:
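The snippet that followed is not reproduced above; slicing a PyTables table looks roughly like this (file and table names are placeholders):

    import tables

    # Open an HDF5 file previously written with PyTables.
    h5file = tables.open_file("data.h5", mode="r")
    table = h5file.root.mytable          # hypothetical table node

    # Slicing a Table returns a NumPy structured array; only the requested
    # rows are read from disk.
    row_from, row_to = 0, 100_000
    data = table[row_from:row_to]

    h5file.close()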
The HDF library takes care of reading in the relevant chunks of data and converting to NumPy.
numpy.arrays are meant to live in memory. If you want to work with matrices larger than your RAM, you have to work around that. There are at least two approaches you can follow: use a more efficient representation that exploits any special structure your matrices have (for example, a sparse format such as scipy.sparse.csc_matrix), or modify your algorithm so that it works on submatrices and only keeps part of the data in memory at a time.
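If the matrix is mostly zeros, the sparse route looks like the sketch below, which never allocates the dense array; the sizes and entries are made up for illustration:

    import numpy as np
    from scipy import sparse

    # A dense 1,000,000 x 1,000,000 float64 matrix would need ~8 TB of RAM;
    # a sparse matrix only stores the nonzero entries.
    n = 1_000_000
    rows = np.array([0, 10, 999_999])
    cols = np.array([5, 10, 0])
    vals = np.array([1.0, 2.0, 3.0])

    m = sparse.csc_matrix((vals, (rows, cols)), shape=(n, n))
    print(m.shape, m.nnz)        # (1000000, 1000000) 3

    # Matrix-vector products work without ever building the dense matrix.
    v = m @ np.ones(n)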
You should be able to use numpy.memmap to memory-map a file on disk. With a newer Python and a 64-bit machine, you should have the necessary address space without loading everything into memory; the OS will take care of keeping only part of the file in memory.
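A small sketch of the memmap approach (file name, dtype and shape are arbitrary):

    import numpy as np

    # Create a disk-backed array; only the pages you touch are loaded into RAM.
    big = np.memmap("big.dat", dtype="float32", mode="w+",
                    shape=(100_000, 100_000))       # ~40 GB file on disk

    # Work on it one block of rows at a time.
    for i in range(0, big.shape[0], 1_000):
        big[i:i + 1_000, :] = 1.0
    big.flush()

    # Reopen later (read-only) with the same dtype and shape.
    big_ro = np.memmap("big.dat", dtype="float32", mode="r",
                       shape=(100_000, 100_000))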
To handle sparse matrices, you need the scipy package that sits on top of numpy -- see here for more details about the sparse-matrix options that scipy gives you.
Stefano Borini's post got me to look into how far along this sort of thing already is.
This is it. It appears to do basically what you want. HDF5 will let you store very large datasets, and then access and use them in the same ways NumPy does.
Make sure you're using a 64-bit operating system and a 64-bit version of Python/NumPy. Note that on 32-bit architectures you can typically address 3GB of memory (with about 1GB lost to memory-mapped I/O and such).
With 64 bits and arrays larger than the available RAM you can get away with virtual memory, though things will get slower if you have to swap. Also, memory maps (see numpy.memmap) are a way to work with huge files on disk without loading them into memory, but again, you need a 64-bit address space for this to be of much use. PyTables will do most of this for you as well.
Sometimes one simple solution is to use a custom type for your matrix items. Based on the range of numbers you need, you can use a manual dtype that is as small as possible for your items. Because NumPy falls back to the largest type by default, this can be a helpful idea in many cases. Here is an example, first with the default type and then with a custom type:
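The example code itself is not included above; a minimal sketch of the idea looks like this (shape and dtypes are illustrative):

    import numpy as np

    shape = (10_000, 10_000)

    # Default dtype is float64: 8 bytes per element, ~800 MB for this shape.
    a = np.zeros(shape)
    print(a.dtype, a.nbytes)      # float64 800000000

    # If the values fit into 8-bit integers, the same shape takes ~100 MB.
    b = np.zeros(shape, dtype=np.int8)
    print(b.dtype, b.nbytes)      # int8 100000000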
Are you asking how to handle a 2,500,000,000 element matrix without terabytes of RAM?
The way to handle 2 billion items without 8 billion bytes of RAM is by not keeping the matrix in memory.
That means much more sophisticated algorithms to fetch it from the file system in pieces.
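As a rough illustration of that kind of piecewise processing, the sketch below computes per-row sums of a 50000 x 50000 float32 matrix stored in a raw binary file, reading one block of rows at a time (file name and layout are assumptions):

    import numpy as np

    n_rows, n_cols, block = 50_000, 50_000, 1_000
    row_sums = np.empty(n_rows, dtype="float64")

    # Read the row-major float32 matrix block by block instead of all at once.
    with open("matrix.dat", "rb") as f:
        for i in range(0, n_rows, block):
            chunk = np.fromfile(f, dtype="float32", count=block * n_cols)
            row_sums[i:i + block] = chunk.reshape(block, n_cols).sum(axis=1)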