在Python中处理大型密集矩阵

发布于 2024-09-08 21:39:55 字数 1320 浏览 9 评论 0原文

基本上，在 python 中存储和使用密集矩阵的最佳方法是什么？

我有一个项目，可以生成数组中每个项目之间的相似性度量。

每个项目都是一个自定义类，并存储一个指向另一个类的指针和一个表示它与该类的“接近度”的数字。

目前，它在处理大约 8000 个项目时表现出色，之后它会因内存不足错误而失败。
基本上，如果您假设每次比较使用 ~30（根据测试似乎是准确的）字节来存储相似性，则意味着所需的总内存为：
numItems^2 * itemSize = 内存
因此，内存使用量根据项目数量呈指数级增长。
就我而言，每个链接的内存大小约为 30 字节，因此：
8000 * 8000 * 30 = 1,920,000,000 字节，或 1.9 GB
这正是单线程的内存限制。

在我看来，必须有一种更有效的方法来做到这一点。我研究过内存映射，但仅仅为了生成相似性值就已经需要大量计算，并且通过硬盘驱动器来限制它似乎有点荒谬。

编辑
我看过 numpy 和 scipy。不幸的是，它们也不支持非常大的数组。

>>> np.zeros((20000,20000), dtype=np.uint16)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
MemoryError
>>>

进一步编辑
Numpy 似乎很受欢迎。然而，numpy 不会真正做我想做的事，至少没有另一个抽象层。

我不想存储数字，我想存储对类的引用。 Numpy 支持对象，但这并不能真正解决数组大小问题。我提出 numpy 只是作为一个不起作用的例子。

有什么建议吗？

编辑好吧，我最终只是重写了所有逻辑，使其不再存储任何冗余值，从而将内存使用量从 O*n^2 减少到 O* ((n*(n-1))/2)。

基本上，这整个事件是握手问题的一个版本，所以我已经从存储切换到了所有链接仅指向每个链接的一个版本。

这不是一个完整的解决方案，但我通常没有足够大的数据集来溢出它，所以我认为它会解决。 PyTables 真的很有趣，但我不懂任何 SQL，而且似乎没有任何好的传统切片或基于索引的方式来访问表数据。我将来可能会重新讨论这个问题。

原文

Basically, what is the best way to go about storing and using dense matrices in python?

I have a project that generates similarity metrics between every item in an array.

Each item is a custom class, and stores a pointer to the other class and a number representing it's "closeness" to that class.

Right now, it works brilliantly up to about ~8000 items, after which it fails with a out-of-memory error.
Basically, if you assume that each comparison uses ~30 (seems accurate based on testing) bytes to store the similarity, that means the total required memory is:
numItems^2 * itemSize = Memory
So the memory usage is exponential based on the number of items.
In my case, the memory size is ~30 bytes per link, so:
8000 * 8000 * 30 = 1,920,000,000 bytes, or 1.9 GB
which is right at the memory limit for a single thread.

It seems to me that there has to be a more effective way of doing this. I've looked at memmapping, but it's already computationally intensive just to generate the similarity values, and bottlenecking it all through a harddrive seems a little ridiculous.

Edit
I've looked at numpy and scipy. Unfortunatly, they don't support very large arrays either.

>>> np.zeros((20000,20000), dtype=np.uint16)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
MemoryError
>>>

Further Edit
Numpy seems to be popular. However, numpy won't really do what I want, at least without another abstraction layer.

I don't want to store numbers, I want to store reference to classes. Numpy supports objects, but that doesn't really address the array size issues. I brought up numpy just as an example of what isn't working.

Any advice?

Edit Well, I wound up just rewriting all the logic so it no longer stores any redundant values, reducing the memory useage from O*n^2 to O*((n*(n-1))/2).

Basically, this whole affair is a version of the handshake problem, so I've switched from storing all links to only a single version of each link.

It's not a complete solution, but I generally don't have any datasets large enough to overflow it, so I think it will work out. PyTables is really interesting, but I don't know any SQL, and there doesn't appear to be any nice traditional slicing or index based way to access the table data. I may revisit the issue in the future.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

你好，陌生人 2024-09-15 21:39:56

您可以通过使用 uint8 来减少内存使用，但要小心避免溢出错误。 uint16 需要两个字节，因此示例中的最小内存要求是 8000*8000*30*2 字节 = 3.84 Gb。

如果第二个示例失败，那么您需要一台新机器。内存要求为 20000*20000*2*字节 =800 Mb。

我的建议是，您尝试创建较小的矩阵并使用“top”、“ps v”或 gnome 系统监视器来检查 python 进程使用的内存。开始用一个小矩阵检查单个线程并进行数学计算。请注意，您可以通过编写 del(x) 来释放变量 x 的内存。这对于测试很有用。

你机器上的内存是多少？ pytables 创建一个 20000*20000 的表需要多少内存？ numpy 使用 uint8 创建 20000*20000 表需要多少内存？

回复收藏 0 原文

枕花眠 2024-09-15 21:39:55

好吧，我找到了解决方案：
h5py

库，基本上呈现类似 numpy 的界面，但使用压缩的内存映射文件来存储任意大小的数组（它基本上是 HDF5 的包装器）。

PyTables 是建立在它的基础上的，PyTables 实际上引导我实现了它。但是，我不需要 PyTables 主要提供的任何 SQL 功能，并且 PyTables 没有提供我真正想要的干净的类似数组的接口。

h5py 基本上就像一个 numpy 数组，只是以不同的格式存储数据。

除了磁盘空间之外，它似乎对数组大小也没有限制。我目前正在 100,000 * 100,000 uint16 数组上进行测试。

回复收藏 0 原文

匿名。 2024-09-15 21:39:55

PyTables 可以通过使用 memmap 和一些巧妙的压缩来处理任意大小的表（数百万列！）。

表面上，它为 python 提供了类似 SQL 的性能。然而，这将需要大量的代码修改。

在我进行更彻底的审查之前，我不会接受这个答案，以确保它确实能达到我想要的效果。或者有人提供更好的解决方案。

回复收藏 0 原文

伴我心暖 2024-09-15 21:39:55

对于 20,000 x 20,000，您正在寻找 12GB RAM？

尝试在 win32 中使用 12GB 人为地限制操作系统可以寻址的内存时，您是否会最终陷入交换地狱？

我正在寻找一个可以支持 12GB 的操作系统（如果您需要坚持使用 32 位 Windows，则 32 bin win 2003 服务器可以），但具有 64 位操作系统和 16GB RAM 的 64 位机器似乎更适合。

升级的好借口:)

64 位 numpy 可以支持你的矩阵

Python 2.5.2 (r252:60911, Jan 20 2010, 23:14:04) 
[GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> np.zeros((20000,20000),dtype=np.uint16)
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=uint16)

For the 20,000 x 20,000 you are looking at 12GB of RAM?

Aren't you going to end up in swap hell trying to work with a 12GB in win32 which artificially limits the memory that the OS can address?

I'd be looking for an OS that can support the 12GB (32 bin win 2003 server can if you need to stick with 32bit windows), but a 64 bit machine with 64bit OS and 16GB of RAM would seem a better fit.

Good excuse for an upgrade :)

64 bit numpy can support your matrix

Python 2.5.2 (r252:60911, Jan 20 2010, 23:14:04) 
[GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> np.zeros((20000,20000),dtype=np.uint16)
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=uint16)

回复收藏 0 原文

痴骨ら 2024-09-15 21:39:55

如果您有 N 个对象，保存在列表 L 中，并且您希望存储每个对象与其他对象之间的相似度，则相似度为 O(N**2) 。在相似度(A, B) ==相似度(B, A)和相似度(A, A) == 0的常见条件下，您所需要的只是相似性三角数组S。该数组中的元素数量将为 N*(N-1)//2。您应该能够使用 array.array 来实现此目的。保持浮点数的相似度只需要 8 个字节。如果您可以将相似度表示为 range(256) 中的整数，则可以使用无符号字节作为 array.array 元素。

所以大约是 8000 * 8000 / 2 * 8，即大约 256 MB。仅使用一个字节来表示相似性意味着仅 32 MB。您可以通过模拟方形数组而不是使用 S[i*N+j] 来避免三角形事物的缓慢 S[i*Ni*(i+1)//2+j] 索引计算`;内存将增加一倍（浮点数为 512 MB，字节为 64 MB）。

如果上面的内容不适合您，那么也许您可以解释“”“每个项目[在哪个容器中？]都是一个自定义类，并存储一个指向另一个类的指针和一个表示它与该类“接近”的数字。 “”和“”“我不想存储数字，我想存储对类的引用”“”即使在用“对象”替换“类”之后，我仍然很难理解什么。你的意思是。