在Python中处理大型密集矩阵
基本上,在 python 中存储和使用密集矩阵的最佳方法是什么?
我有一个项目,可以生成数组中每个项目之间的相似性度量。
每个项目都是一个自定义类,并存储一个指向另一个类的指针和一个表示它与该类的“接近度”的数字。
目前,它在处理大约 8000 个项目时表现出色,之后它会因内存不足错误而失败。
基本上,如果您假设每次比较使用 ~30(根据测试似乎是准确的)字节来存储相似性,则意味着所需的总内存为:numItems^2 * itemSize = 内存
因此,内存使用量根据项目数量呈指数级增长。
就我而言,每个链接的内存大小约为 30 字节,因此:8000 * 8000 * 30 = 1,920,000,000 字节,或 1.9 GB
这正是单线程的内存限制。
在我看来,必须有一种更有效的方法来做到这一点。我研究过内存映射,但仅仅为了生成相似性值就已经需要大量计算,并且通过硬盘驱动器来限制它似乎有点荒谬。
编辑
我看过 numpy 和 scipy。不幸的是,它们也不支持非常大的数组。
>>> np.zeros((20000,20000), dtype=np.uint16)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
MemoryError
>>>
进一步编辑
Numpy 似乎很受欢迎。然而,numpy 不会真正做我想做的事,至少没有另一个抽象层。
我不想存储数字,我想存储对类的引用。 Numpy 支持对象,但这并不能真正解决数组大小问题。我提出 numpy 只是作为一个不起作用的例子。
有什么建议吗?
编辑 好吧,我最终只是重写了所有逻辑,使其不再存储任何冗余值,从而将内存使用量从 O*n^2
减少到 O* ((n*(n-1))/2)
。
基本上,这整个事件是握手问题的一个版本,所以我已经从存储切换到了所有链接仅指向每个链接的一个版本。
这不是一个完整的解决方案,但我通常没有足够大的数据集来溢出它,所以我认为它会解决。 PyTables 真的很有趣,但我不懂任何 SQL,而且似乎没有任何好的传统切片或基于索引的方式来访问表数据。我将来可能会重新讨论这个问题。
Basically, what is the best way to go about storing and using dense matrices in python?
I have a project that generates similarity metrics between every item in an array.
Each item is a custom class, and stores a pointer to the other class and a number representing it's "closeness" to that class.
Right now, it works brilliantly up to about ~8000 items, after which it fails with a out-of-memory error.
Basically, if you assume that each comparison uses ~30 (seems accurate based on testing) bytes to store the similarity, that means the total required memory is:numItems^2 * itemSize = Memory
So the memory usage is exponential based on the number of items.
In my case, the memory size is ~30 bytes per link, so:8000 * 8000 * 30 = 1,920,000,000 bytes, or 1.9 GB
which is right at the memory limit for a single thread.
It seems to me that there has to be a more effective way of doing this. I've looked at memmapping, but it's already computationally intensive just to generate the similarity values, and bottlenecking it all through a harddrive seems a little ridiculous.
Edit
I've looked at numpy and scipy. Unfortunatly, they don't support very large arrays either.
>>> np.zeros((20000,20000), dtype=np.uint16)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
MemoryError
>>>
Further Edit
Numpy seems to be popular. However, numpy won't really do what I want, at least without another abstraction layer.
I don't want to store numbers, I want to store reference to classes. Numpy supports objects, but that doesn't really address the array size issues. I brought up numpy just as an example of what isn't working.
Any advice?
Edit Well, I wound up just rewriting all the logic so it no longer stores any redundant values, reducing the memory useage from O*n^2
to O*((n*(n-1))/2)
.
Basically, this whole affair is a version of the handshake problem, so I've switched from storing all links to only a single version of each link.
It's not a complete solution, but I generally don't have any datasets large enough to overflow it, so I think it will work out. PyTables is really interesting, but I don't know any SQL, and there doesn't appear to be any nice traditional slicing or index based way to access the table data. I may revisit the issue in the future.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
您可以通过使用 uint8 来减少内存使用,但要小心避免溢出错误。 uint16 需要两个字节,因此示例中的最小内存要求是 8000*8000*30*2 字节 = 3.84 Gb。
如果第二个示例失败,那么您需要一台新机器。内存要求为 20000*20000*2*字节 =800 Mb。
我的建议是,您尝试创建较小的矩阵并使用“top”、“ps v”或 gnome 系统监视器来检查 python 进程使用的内存。开始用一个小矩阵检查单个线程并进行数学计算。请注意,您可以通过编写 del(x) 来释放变量 x 的内存。这对于测试很有用。
你机器上的内存是多少? pytables 创建一个 20000*20000 的表需要多少内存? numpy 使用 uint8 创建 20000*20000 表需要多少内存?
You can reduce the memory use by using uint8, but be careful to avoid overflow errors. A uint16 requires two bytes, so the minimal memory requirement in your example is 8000*8000*30*2 bytes = 3.84 Gb.
If the second example fails then you need a new machine. The memory requirement is 20000*20000*2*bytes =800 Mb.
My advice is that you try to create smaller matrices and use "top", "ps v" or the gnome system monitor to check the memory used by your python proces. Start out examining a single thread with a small matrix and do the math. Note that you can release the memory of a variable x by writing del(x). This is useful for testing.
What is the memory on your machine? How much memory does pytables use to create a 20000*20000 table? How much memory does numpy use to create a 20000*20000 table using uint8?
好吧,我找到了解决方案:
h5py
库,基本上呈现类似 numpy 的界面,但使用压缩的内存映射文件来存储任意大小的数组(它基本上是 HDF5 的包装器)。
PyTables 是建立在它的基础上的,PyTables 实际上引导我实现了它。但是,我不需要 PyTables 主要提供的任何 SQL 功能,并且 PyTables 没有提供我真正想要的干净的类似数组的接口。
h5py 基本上就像一个 numpy 数组,只是以不同的格式存储数据。
除了磁盘空间之外,它似乎对数组大小也没有限制。我目前正在 100,000 * 100,000 uint16 数组上进行测试。
Well, I've found my solution:
h5py
It's a library that basically presents a numpy-like interface, but uses compressed memmapped files to store arrays of arbitrary size (It's basically a wrapper for HDF5).
PyTables is built on it, and PyTables actually led me to it. However, I do not need any of the SQL functionality that is the main offering of PyTables, and PyTables does not provide the clean array-like interface I was really looking for.
h5py basically acts like a numpy array, and simply stores the data in a different format.
It also seems to have no limit to array size, except perhaps disk space. I'm currently doing testing on a 100,000 * 100,000 array of uint16.
PyTables 可以通过使用 memmap 和一些巧妙的压缩来处理任意大小的表(数百万列!)。
表面上,它为 python 提供了类似 SQL 的性能。然而,这将需要大量的代码修改。
在我进行更彻底的审查之前,我不会接受这个答案,以确保它确实能达到我想要的效果。或者有人提供更好的解决方案。
PyTables can handle tables of arbitrary size (Millions of columns!), through using memmap and some clever compression.
Ostensibly, it provides SQL like performance, to python. It will, however, require significant code modifications.
I'm not going to accept this answer until I've done a more thorough vetting, to ensure it can actually do what I want. Or someone provides a better solution.
对于 20,000 x 20,000,您正在寻找 12GB RAM?
尝试在 win32 中使用 12GB 人为地限制操作系统可以寻址的内存时,您是否会最终陷入交换地狱?
我正在寻找一个可以支持 12GB 的操作系统(如果您需要坚持使用 32 位 Windows,则 32 bin win 2003 服务器可以),但具有 64 位操作系统和 16GB RAM 的 64 位机器似乎更适合。
升级的好借口:)
64 位 numpy 可以支持你的矩阵
For the 20,000 x 20,000 you are looking at 12GB of RAM?
Aren't you going to end up in swap hell trying to work with a 12GB in win32 which artificially limits the memory that the OS can address?
I'd be looking for an OS that can support the 12GB (32 bin win 2003 server can if you need to stick with 32bit windows), but a 64 bit machine with 64bit OS and 16GB of RAM would seem a better fit.
Good excuse for an upgrade :)
64 bit numpy can support your matrix
如果您有 N 个对象,保存在列表 L 中,并且您希望存储每个对象与其他对象之间的相似度,则相似度为
O(N**2)
。在相似度(A, B) ==相似度(B, A)
和相似度(A, A) == 0
的常见条件下,您所需要的只是相似性三角数组S。该数组中的元素数量将为N*(N-1)//2
。您应该能够使用 array.array 来实现此目的。保持浮点数的相似度只需要 8 个字节。如果您可以将相似度表示为range(256)
中的整数,则可以使用无符号字节作为 array.array 元素。所以大约是 8000 * 8000 / 2 * 8,即大约 256 MB。仅使用一个字节来表示相似性意味着仅 32 MB。您可以通过模拟方形数组而不是使用 S[i*N+j] 来避免三角形事物的缓慢
S[i*Ni*(i+1)//2+j]
索引计算`;内存将增加一倍(浮点数为 512 MB,字节为 64 MB)。如果上面的内容不适合您,那么也许您可以解释“”“每个项目[在哪个容器中?]都是一个自定义类,并存储一个指向另一个类的指针和一个表示它与该类“接近”的数字。 “”和“”“我不想存储数字,我想存储对类的引用”“”即使在用“对象”替换“类”之后,我仍然很难理解什么。你的意思是。
If you have N objects, kept in a list L, and you wish to store the similarity betwen each object and each other object, that's
O(N**2)
similarities. Under the common conditions thatsimilarity(A, B) == similarity(B, A)
andsimilarity(A, A) == 0
, all that you need is a triangular array S of similarities. The number of elements in that array will beN*(N-1)//2
. You should be able to use an array.array for this purpose. Keeping your similarity as a float will take only 8 bytes. If you can represent your similarity as an integer inrange(256)
, you use an unsigned byte as the array.array element.So that's about 8000 * 8000 / 2 * 8 i.e. about 256 MB. Using only a byte for the similarity means only 32 MB. You could avoid the slow
S[i*N-i*(i+1)//2+j]
index calculation of the triangle thingie by simulating a square array instead using S[i*N+j]`; memory would double to (512 MB for float, 64 MB for byte).If the above doesn't suit you, then perhaps you could explain """Each item [in which container?] is a custom class, and stores a pointer to the other class and a number representing it's "closeness" to that class."" and """I don't want to store numbers, I want to store reference to classes""". Even after replacing "class(es)" by "object(s)", I'm struggling to understand what you mean.
您可能会在 NumPy(请参阅 SciPy) 文档(数组/矩阵)中找到一些建议:
You might find some advice in the NumPy (see SciPy) documentation (arrays/matrices):