Let RG = A for dense unstructured matrices with shapes (e.g. roughly) R: (30k x 40k, entries float32) and G: (40k x 50k, entries either 0.0 or 1.0, roughly equally often) and of course A: (30k x 50k, entries float32).
Given A and G, I want to find the least squares solution for R.
I can use hundreds of CPU cores, hundreds of GB of RAM and also an A40 GPU. What is the best way to use such resources to solve the problem? I'm using Julia 1.7 in the examples below but I'm open to other options!
First question: Can I somehow exploit that the entries of G are only zeros and ones?
Trying to use Julia LinearAlgebra with many CPUs
I've tried two methods: "Penrose inverse" and "right division"
using LinearAlgebra
@show BLAS.get_num_threads()
# defaults to 8. Can change using BLAS.set_num_threads(N)
# build toy problem (order of magnitude smaller sizes)
R_true = rand(Float32, 3_000, 4_000)
G = rand([0., 1.], 4_000, 5_000)
# note: using true/false here gives same results but is much slower!
A = R_true * G
# solve toy problem using matrix (right) division
R_fitted_rdiv = A / G
# solve toy problem using Penrose inverse
R_fitted_pinv = (pinv(G') * A')'
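As a sanity check (my own addition, assuming G has full row rank so the least-squares solution is unique), the two fits should agree up to round-off, and since the toy A is an exact product they should also recover R_true:
# compare the two solutions and the ground truth (loose tolerance for round-off)
isapprox(R_fitted_rdiv, R_fitted_pinv; rtol=1e-4)  # expected: true
isapprox(R_fitted_rdiv, R_true; rtol=1e-4)         # expected: true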
First, setting BLAS.set_num_threads(64) (or any bigger number) actually only gives me BLAS.get_num_threads() returning 32. Apparently that's an upper limit. Second, using 32 BLAS threads is actually slower than using 8.
(e.g. performing right division with sizes (4000, 9800) / (8500, 9800) takes less than 50 seconds on 8 threads but more than 55 seconds on 32 threads. I ran things multiple times to exclude compilation time issues.) I don't know why this is or if it's normal. How can I make use of my computing power for this problem?
I think that the matrix division is faster than the Penrose inverse method. Should this be expected? I don't know what either function does exactly for these inputs. The docs say that left division (\) uses pivoted QR factorization. I couldn't find what algorithm(s) are used for pinv or right division (/) (although it's probably the same as \, since they are related by transposing the matrices). I'd rather not delve too deeply because my knowledge of numerical linear algebra is quite limited.
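For what it's worth, right division is documented to be equivalent to left division on the adjoints, so / should hit the same pivoted QR path as \ for non-square inputs. A quick check on small matrices (a sketch, not from my benchmarks):
using LinearAlgebra
A_small = rand(Float32, 30, 50)
G_small = rand(Float32, 40, 50)
# A / B is equivalent to (B' \ A')'
A_small / G_small ≈ (G_small' \ A_small')'  # true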
The issue is that for my large matrices either method takes forever. Is there a way to make use of my ~100 cores somehow?
Trying to use the GPU:
Using CUDA.jl, matrices of size around 10k work fine and take about a minute to pinv:
using CUDA
@time matrix = CUDA.rand(Float32, 10_000, 10_500) # 0.003037 seconds (5 allocations: 160 bytes)
@time pinv(matrix) # 57.417559 seconds (678 allocations: 172.094 KiB)
However, when I try matrices around size 20k, I immediately get the error InexactError: trunc(Int32, 4811456640). I assume this is due to CUBLAS using Int32 for indexing, even though I don't understand why it leads to an error in this case. (Edit: it's about whether the size of the array in bytes fits into 31 bits.)
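Indeed, the offending number looks like a buffer size in bytes that no longer fits into a signed 32-bit integer (a quick check, assuming that interpretation):
typemax(Int32)                  # 2147483647, i.e. 2^31 - 1
4_811_456_640 > typemax(Int32)  # true: the byte count overflows Int32
4_811_456_640 / 2^30            # ≈ 4.5, i.e. about 4.5 GiB of workspace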
Trying to use right division with CuArrays gives the error DimensionMismatch("LU factored matrix A must be square!"). I guess I have to choose a different algorithm manually? I don't know what it's called. (Although it probably would still crash for large matrices...?)
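One workaround I could imagine (an untested sketch of my own, not the method CUDA.jl picks): form the normal equations so the system to solve becomes square, which is the case the LU-based CuArray solve handles. The caveat is that this squares the condition number compared to QR-based least squares.
using CUDA, LinearAlgebra
# solve R * G = A in the least-squares sense via the normal equations
# R * (G * G') = A * G'; the Gram matrix G * G' is square, so LU applies
G_gpu = CuArray(rand(Float32[0, 1], 4_000, 5_000))
A_gpu = CuArray(rand(Float32, 3_000, 5_000))
Gram  = G_gpu * G_gpu'              # 4_000 x 4_000, symmetric
R_gpu = (Gram \ (G_gpu * A_gpu'))'  # Gram' == Gram, so this is (A * G') / Gram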
To summarize, it doesn't look like I can use the GPU from Julia easily to solve my problem. Should I keep trying to use the GPU for this task or stick to the many CPUs?
Yes, this is really my problem; please refrain from commenting "nobody should ever need such large least squares".
Naive answer
Using pytorch, this will require at least 30GB of GPU memory. If the system can sustain the same operation throughput as my laptop, you should have an answer in about 15 minutes.
I would suggest trying a generalized version, scaling up the dimensions, to get a better feeling of how your system will handle it.
I transposed the dimensions in the generation (see the sketch below) in order to make sure G.T and A.T would be contiguous.
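A minimal sketch of what I mean (my code below assumes torch.linalg.lstsq and solves the transposed system G.T @ R.T = A.T; start with a small scale to gauge memory and runtime):
import torch

scale = 0.1  # work up towards scale = 1.0 (the full problem)
m, k, n = (int(s * scale) for s in (30_000, 40_000, 50_000))
device = "cuda" if torch.cuda.is_available() else "cpu"

# generate the transposed matrices directly so G.T and A.T are contiguous
Gt = (torch.rand(n, k, device=device) < 0.5).float()  # G.T, entries 0.0 / 1.0
Rt_true = torch.rand(k, m, device=device)             # R.T
At = Gt @ Rt_true                                     # A.T = G.T @ R.T

# least-squares solve of G.T @ R.T = A.T, the transpose of R @ G = A
Rt_fitted = torch.linalg.lstsq(Gt, At).solution
R_fitted = Rt_fitted.T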
You can't take much advantage of the entries being integer. This type of problem is easier to solve over the reals than over the integers: finding integer solutions requires searching, while the real solution can be found by algebraic manipulation.