Need help with Latent Semantic Indexing

Posted 2024-08-16 21:45:04


I am sorry if my question sounds stupid :)
Can you please recommend me some pseudo code or a good algorithm for an LSI implementation in Java?
I am not a math expert. I tried to read some articles on Wikipedia and other websites about LSI (latent semantic indexing), but they were full of math.
I know LSI is full of math. But if I see some source code or an algorithm, I understand things more easily. That's why I am asking here, because there are so many gurus here!
Thanks in advance


2 Answers

长发绾君心 2024-08-23 21:45:04


The idea of LSA is based on one assumption: the more often two words occur in the same documents, the more similar they are. Indeed, we can expect that the words "programming" and "algorithm" will occur in the same documents much more often than, say, "programming" and "dog-breeding".

The same goes for documents: the more common/similar words two documents share, the more similar they themselves are. So you can express the similarity of documents through the frequencies of their words, and vice versa.

Knowing this, we can construct a co-occurrence matrix, where the column names represent documents, the row names represent words, and each cell cells[i][j] holds the frequency of the word words[i] in the document documents[j]. The frequency may be computed in many ways; IIRC, the original LSA uses the tf-idf index.
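To make that concrete, here is a minimal Java sketch of this step, using raw counts only (the tf-idf weighting is left out, and all class and variable names here are mine for illustration, not from any particular library):

    import java.util.*;

    // Builds a word-by-document count matrix from tokenized documents.
    // Sketch only: raw counts; real LSA would usually apply a weighting
    // such as tf-idf on top of these counts.
    public class TermDocumentMatrix {
        public static double[][] build(List<List<String>> documents,
                                       List<String> words) {
            // Map each unique word to its row index.
            Map<String, Integer> rowOf = new HashMap<>();
            for (int i = 0; i < words.size(); i++) {
                rowOf.put(words.get(i), i);
            }
            double[][] cells = new double[words.size()][documents.size()];
            for (int j = 0; j < documents.size(); j++) {
                for (String word : documents.get(j)) {
                    Integer i = rowOf.get(word);
                    if (i != null) {
                        cells[i][j] += 1.0;  // frequency of words[i] in documents[j]
                    }
                }
            }
            return cells;
        }
    }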

Given such a matrix, you can find the similarity of two documents by comparing the corresponding columns. How to compare them? Again, there are several ways. The most popular is cosine distance. You may remember from school maths that a matrix can be treated as a bunch of vectors, so each column is just a vector in some multidimensional space. That's why this model is called the "Vector Space Model". More on VSM and cosine distance here.
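Comparing two columns boils down to their dot product divided by the product of their lengths; a small sketch (1.0 means maximally similar, 0.0 means no shared words):

    // Cosine similarity between columns a and b of the word-by-document
    // matrix. Sketch only; returns 0 for an all-zero (empty) document.
    public class ColumnSimilarity {
        public static double cosine(double[][] cells, int a, int b) {
            double dot = 0, normA = 0, normB = 0;
            for (int i = 0; i < cells.length; i++) {
                dot   += cells[i][a] * cells[i][b];
                normA += cells[i][a] * cells[i][a];
                normB += cells[i][b] * cells[i][b];
            }
            if (normA == 0 || normB == 0) return 0;
            return dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }
    }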

But we have one problem with such a matrix: it is big. Very, very big. Working with it is too computationally expensive, so we have to reduce it somehow. LSA uses the SVD technique to keep only the most "important" vectors. After reduction, the matrix is ready to use.

So, the algorithm for LSA will look something like this:

  1. Collect all documents and all unique words from them.
  2. Extract frequency information and build the co-occurrence matrix.
  3. Reduce the matrix with SVD (see the sketch after this list).
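For step 3, the decomposition itself is a one-liner once you have a linear algebra library; a minimal sketch using Jama (one common Java choice here, picked just for illustration; Parallel Colt and UJMP, mentioned below, expose similar SVD classes):

    import Jama.Matrix;
    import Jama.SingularValueDecomposition;

    // Step 3 sketch: decompose the word-by-document matrix from step 2.
    // Note: Jama's SVD expects rows >= columns, which usually holds here,
    // since a corpus normally has far more distinct words than documents.
    public class LsaReduction {
        public static SingularValueDecomposition decompose(double[][] cells) {
            Matrix a = new Matrix(cells);  // words x documents
            return a.svd();                // A = U * S * V^T
        }
    }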

If you're going to write an LSA library yourself, a good place to start is the Lucene search engine, which will make steps 1 and 2 much easier, together with some implementation of high-dimensional matrices with SVD capability, such as Parallel Colt or UJMP.

Also pay attention to other techniques that grew out of LSA, such as Random Indexing. RI uses the same idea and shows approximately the same results, but doesn't use the full-matrix stage and is completely incremental, which makes it much more computationally efficient.

最佳男配角 2024-08-23 21:45:04


This may be a bit late, but I always liked Sujit Pal's blog http://sujitpal.blogspot.com/2008/09/ir-math-with-java-tf-idf-and-lsi.html and I have written a bit on my site, if you are interested.

The process is way less complicated than it is often written up as. Really, all you need is a library that can do a singular value decomposition of a matrix.

If you are interested, I can explain it in a couple of short takeaway bits:

1) You create a matrix/dataset/etc. with the word counts of various documents - the different documents will be your columns and the rows the distinct words.

2) Once you've created the matrix, you use a library like Jama (for Java) or SmartMathLibrary (for C#) and run the singular value decomposition. All this does is take your original matrix and break it up into three different parts/matrices that essentially represent your documents, your words, and a kind of multiplier (sigma); these are called the vectors.

3) Once you have your word, document, and sigma vectors, you shrink them equally (by some k) by copying smaller parts of the vectors/matrices and then multiplying them back together. Shrinking them kind of normalizes your data, and this is LSI (see the sketch after this list).
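As a rough sketch of steps 2 and 3 with Jama (k is the number of dimensions you keep; all names here are illustrative, not Eric's actual code):

    import Jama.Matrix;
    import Jama.SingularValueDecomposition;

    // Run the SVD, keep the k largest singular values and the matching
    // columns of U and V, then multiply back together to get the rank-k
    // ("smoothed") matrix. Sketch only.
    public class LsiShrink {
        public static Matrix reduceToRankK(Matrix a, int k) {
            SingularValueDecomposition svd = a.svd();
            int m = a.getRowDimension();     // number of words
            int n = a.getColumnDimension();  // number of documents
            Matrix uk = svd.getU().getMatrix(0, m - 1, 0, k - 1);  // word part
            Matrix sk = svd.getS().getMatrix(0, k - 1, 0, k - 1);  // sigma part
            Matrix vk = svd.getV().getMatrix(0, n - 1, 0, k - 1);  // document part
            return uk.times(sk).times(vk.transpose());             // rank-k approximation
        }
    }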

Here are some fairly clear resources:

http://puffinwarellc.com/index.php/news-and-articles/articles/30-singular-value-decomposition-tutorial.html

http://lsa.colorado.edu/papers/JASIS.lsi.90.pdf
http://www.soe.ucsc.edu/classes/cmps290c/Spring07/proj/Flynn_talk.pdf

Hope this helps you out a bit.

Eric
