
Part 3: Randomized SVD for Topic Modeling

Published 2025-01-01 12:38:40

Randomized SVD

Reminder: a full SVD is slow. This is the computation we ran with SciPy's linalg.svd:

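import numpy as np
from scipy import linalg            # provides linalg.svd used below
from sklearn import decomposition   # provides decomposition.randomized_svd used below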

vectors = np.load("topics/vectors.npy")

vectors.shape

# (2034, 26576)

%time U, s, Vh = linalg.svd(vectors, full_matrices=False)

'''
CPU times: user 27.2 s, sys: 812 ms, total: 28 s
Wall time: 27.9 s
'''

print(U.shape, s.shape, Vh.shape)

# (2034, 2034) (2034,) (2034, 26576)

Fortunately, there is a faster way:

%time u, s, v = decomposition.randomized_svd(vectors, 5)

'''
CPU times: user 144 ms, sys: 8 ms, total: 152 ms
Wall time: 154 ms
'''

The runtime complexity of a full SVD is O(min(m^2 n, m n^2)).
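To get a feel for this bound, it can be evaluated for the 2034×26576 matrix used above. The snippet below is only a rough operation count with constant factors ignored; the helper name svd_cost_bound and the 15-column comparison shape are made up for illustration.

```python
def svd_cost_bound(m, n):
    """Order-of-magnitude bound O(min(m^2 n, m n^2)); constant factors ignored."""
    return min(m * m * n, m * n * n)

m, n = 2034, 26576                 # shape of `vectors` in this article

full = svd_cost_bound(m, n)        # m < n, so the bound is m^2 * n
thin = svd_cost_bound(m, 15)       # same rows, only 15 columns: the bound is m * 15^2

print(f"full matrix : ~{full:.2e} operations")
print(f"15 columns  : ~{thin:.2e} operations")
print(f"ratio       : ~{full / thin:.0f}x")
```

The bound grows with the smaller dimension squared, so shrinking the number of columns well below m pays off quickly.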

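Question: how can we speed this up (without a new breakthrough in SVD research)?

Idea: let's use a smaller matrix (with a smaller n)!

Instead of computing the SVD of the full m×n matrix A, we use B = AQ, which is only m×r, with r << n.

We have not found a better general-purpose SVD method; we are simply applying the method we already have to a smaller matrix.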
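The idea can be sketched on a small synthetic matrix. This is a minimal illustration, assuming a Gaussian random Q and an exactly low-rank A; the sizes and variable names are chosen here for the example and do not come from the article's data.

```python
import numpy as np
from scipy import linalg

rng = np.random.default_rng(0)

# Synthetic stand-in for A: a 200 x 1000 matrix of exact rank 8
m, n, true_rank = 200, 1000, 8
A = rng.normal(size=(m, true_rank)) @ rng.normal(size=(true_rank, n))

r = 12                                   # sketch width, r << n
Q = rng.normal(size=(n, r))              # random test matrix
B = A @ Q                                # m x r: far fewer columns than A

# Orthonormalize the columns of B to get a basis that contains range(B)
basis, _ = linalg.qr(B, mode='economic')

# If the basis captures range(A), projecting A onto it changes almost nothing
rel_err = linalg.norm(A - basis @ (basis.T @ A)) / linalg.norm(A)
print(B.shape, rel_err)
```

Because the random columns of B almost surely span the whole column space of a low-rank A, the thin matrix B retains the information needed for the leading singular vectors.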

%time u, s, v = decomposition.randomized_svd(vectors, 5)

'''
CPU times: user 144 ms, sys: 8 ms, total: 152 ms
Wall time: 154 ms
'''

u.shape, s.shape, v.shape

# ((2034, 5), (5,), (5, 26576))

show_topics(v)

'''
['jpeg image edu file graphics images gif data',
 'jpeg gif file color quality image jfif format',
 'space jesus launch god people satellite matthew atheists',
 'jesus god matthew people atheists atheism does graphics',
 'image data processing analysis software available tools display']
'''

Randomized SVD, Version 2

from scipy import linalg

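The function randomized_range_finder finds an orthonormal matrix whose range approximates the range of A (step 1 of our algorithm). To do this, we use the LU and QR factorizations, both of which we will cover in depth later.

I used the source code of sklearn.utils.extmath.randomized_svd as a guide.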

# Compute an orthonormal matrix whose range approximates the range of A
# power_iteration_normalizer can be safe_sparse_dot (fast but unstable), LU (in between), or QR (slow but most accurate)
def randomized_range_finder(A, size, n_iter=5):
    Q = np.random.normal(size=(A.shape[1], size))

    for i in range(n_iter):
        Q, _ = linalg.lu(A @ Q, permute_l=True)
        Q, _ = linalg.lu(A.T @ Q, permute_l=True)

    Q, _ = linalg.qr(A @ Q, mode='economic')
    return Q
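As a quick sanity check, the returned Q should have orthonormal columns and should capture most of A. The sketch below repeats the function so that it runs on its own and applies it to a synthetic matrix (low rank plus small noise); the test matrix and the two error measures are assumptions made for this illustration.

```python
import numpy as np
from scipy import linalg

def randomized_range_finder(A, size, n_iter=5):
    # Same steps as the function above, repeated so this snippet is self-contained
    Q = np.random.normal(size=(A.shape[1], size))
    for i in range(n_iter):
        Q, _ = linalg.lu(A @ Q, permute_l=True)     # LU keeps the iterates well scaled
        Q, _ = linalg.lu(A.T @ Q, permute_l=True)
    Q, _ = linalg.qr(A @ Q, mode='economic')        # final orthonormalization
    return Q

np.random.seed(0)
# Synthetic matrix: rank-10 signal plus small dense noise
A = np.random.normal(size=(300, 10)) @ np.random.normal(size=(10, 800))
A = A + 1e-3 * np.random.normal(size=A.shape)

Q = randomized_range_finder(A, size=15)

orth_err = linalg.norm(Q.T @ Q - np.eye(Q.shape[1]))          # ~0 if columns are orthonormal
range_err = linalg.norm(A - Q @ (Q.T @ A)) / linalg.norm(A)   # relative part of A missed by Q
print(Q.shape, orth_err, range_err)
```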

Here is our randomized SVD method.

def randomized_svd(M, n_components, n_oversamples=10, n_iter=4):

    n_random = n_components + n_oversamples

    Q = randomized_range_finder(M, n_random, n_iter)
    print(Q.shape)
    # project M to the (k + p) dimensional space using the basis vectors
    B = Q.T @ M
    print(B.shape)
    # compute the SVD on the thin matrix: (k + p) wide
    Uhat, s, V = linalg.svd(B, full_matrices=False)
    del B
    U = Q @ Uhat
    print(U.shape)

    return U[:, :n_components], s[:n_components], V[:n_components, :]

u, s, v = randomized_svd(vectors, 5)

'''
(2034, 15)
(15, 26576)
(2034, 15)
'''
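The three printed shapes follow the chain M ≈ Q (Qᵀ M) = Q B = (Q Uhat) diag(s) V. One way to check that this recovers the leading singular values is to compare against an exact SVD on a matrix whose spectrum is known. The sketch below uses a compact variant, here called rsvd_sketch, that normalizes with QR instead of LU to stay short; that substitution, the function name, and the synthetic matrix are assumptions of this example rather than the article's code.

```python
import numpy as np
from scipy import linalg

def rsvd_sketch(M, k, p=10, n_iter=4, seed=0):
    """Hypothetical compact variant: QR-normalized power iterations, then SVD of Q.T @ M."""
    rng = np.random.default_rng(seed)
    Q = rng.normal(size=(M.shape[1], k + p))
    for _ in range(n_iter):
        Q, _ = linalg.qr(M @ Q, mode='economic')
        Q, _ = linalg.qr(M.T @ Q, mode='economic')
    Q, _ = linalg.qr(M @ Q, mode='economic')
    B = Q.T @ M                                   # thin matrix: (k + p) x n
    Uhat, s, Vh = linalg.svd(B, full_matrices=False)
    U = Q @ Uhat                                  # lift back to m dimensions
    return U[:, :k], s[:k], Vh[:k, :]

rng = np.random.default_rng(1)
# Synthetic 150 x 400 matrix with known singular values 1, 1/2, 1/4, ...
m, n = 150, 400
U0, _ = linalg.qr(rng.normal(size=(m, m)), mode='economic')
V0, _ = linalg.qr(rng.normal(size=(n, m)), mode='economic')
sigma = 2.0 ** -np.arange(m)
M = (U0 * sigma) @ V0.T

u, s, v = rsvd_sketch(M, k=5)
s_exact = linalg.svd(M, compute_uv=False)[:5]     # exact leading singular values
max_diff = np.max(np.abs(s - s_exact))
print(u.shape, s.shape, v.shape, max_diff)
```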

Test

vectors.shape

# (2034, 26576)

Q = np.random.normal(size=(vectors.shape[1], 10)); Q.shape

# (26576, 10)

Q2, _ = linalg.qr(vectors @ Q, mode='economic'); Q2.shape

# (2034, 10)

End of test

%time u, s, v = randomized_svd(vectors, 5)

'''
CPU times: user 136 ms, sys: 0 ns, total: 136 ms
Wall time: 137 ms
'''

u.shape, s.shape, v.shape

# ((2034, 5), (5,), (5, 26576))

show_topics(v)

'''
['jpeg image edu file graphics images gif data',
 'edu graphics data space pub mail 128 3d',
 'space jesus launch god people satellite matthew atheists',
 'space launch satellite commercial nasa satellites market year',
 'image data processing analysis software available tools display']
'''

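Write a loop that computes the error of the decomposition as the number of topics varies. Plot the results.

Answer

# Write a loop that computes the error of the decomposition as the number of topics varies. Plot the results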

plt.plot(range(0,n*step,step), error)

# [<matplotlib.lines.Line2D at 0x7fe3f8a1b438>]
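The answer above shows only the plotting call; the loop that fills error is not shown. One possible version is sketched below under stated assumptions: the values of step and n, the use of the Frobenius norm as the error measure, the synthetic matrix standing in for vectors, and the compact QR-normalized helper rsvd (a hypothetical stand-in for the randomized_svd defined earlier) are all choices made for this example.

```python
import numpy as np
from scipy import linalg

def rsvd(M, k, p=10, n_iter=2, seed=0):
    """Hypothetical stand-in for the article's randomized_svd (QR-normalized)."""
    rng = np.random.default_rng(seed)
    Q = rng.normal(size=(M.shape[1], k + p))
    for _ in range(n_iter):
        Q, _ = linalg.qr(M @ Q, mode='economic')
        Q, _ = linalg.qr(M.T @ Q, mode='economic')
    Q, _ = linalg.qr(M @ Q, mode='economic')
    Uhat, s, Vh = linalg.svd(Q.T @ M, full_matrices=False)
    return (Q @ Uhat)[:, :k], s[:k], Vh[:k, :]

# Synthetic stand-in for `vectors`, with slowly decaying singular values 1, 1/2, 1/3, ...
rng = np.random.default_rng(1)
rows, cols = 120, 300
U0, _ = linalg.qr(rng.normal(size=(rows, rows)), mode='economic')
V0, _ = linalg.qr(rng.normal(size=(cols, rows)), mode='economic')
M = (U0 * (1.0 / np.arange(1, rows + 1))) @ V0.T

step, n = 5, 10                      # assumed values; the article does not show them
error = np.zeros(n)
for i in range(n):
    u, s, v = rsvd(M, i * step)      # i * step topics, starting from 0
    reconstructed = (u * s) @ v      # same as u @ diag(s) @ v
    error[i] = linalg.norm(M - reconstructed)   # Frobenius norm of the residual

print(error)
```

With error filled this way, the plt.plot call in the answer above draws the error against the number of topics, range(0, n*step, step).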

%time u, s, v = decomposition.randomized_svd(vectors, 5)

'''
CPU times: user 144 ms, sys: 8 ms, total: 152 ms
Wall time: 154 ms
'''

%time u, s, v = decomposition.randomized_svd(vectors.todense(), 5)

'''
CPU times: user 2.38 s, sys: 592 ms, total: 2.97 s
Wall time: 2.96 s
'''