Need help with Latent Semantic Indexing

Posted 2024-08-16 21:45:04


I am sorry if my question sounds stupid :)
Can you please recommend me some pseudo code or a good algorithm for an LSI implementation in Java?
I am not a math expert. I tried to read some articles on Wikipedia and other websites about LSI (latent semantic indexing), but they were full of math.
I know LSI is full of math. But if I see some source code or an algorithm, I understand things more easily. That's why I am asking here, because there are so many gurus here!
Thanks in advance


2 Answers

长发绾君心 2024-08-23 21:45:04


The idea of LSA is based on one assumption: the more often two words occur in the same documents, the more similar they are. Indeed, we can expect that the words "programming" and "algorithm" will occur in the same documents much more often than, say, "programming" and "dog-breeding".

The same goes for documents: the more common/similar words two documents share, the more similar they themselves are. So you can express the similarity of documents through the frequencies of their words, and vice versa.

Knowing this, we can construct a co-occurrence matrix, where the column names represent documents, the row names represent words, and each cell cells[i][j] holds the frequency of the word words[i] in the document documents[j]. The frequency may be computed in many ways; IIRC, the original LSA uses the tf-idf index.
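To make that concrete, here is a minimal Java sketch of this step, using raw counts only (the tf-idf weighting is left out, and all class and variable names here are mine for illustration, not from any particular library):

    import java.util.*;

    // Builds a word-by-document count matrix from tokenized documents.
    // Sketch only: raw counts; real LSA would usually apply a weighting
    // such as tf-idf on top of these counts.
    public class TermDocumentMatrix {
        public static double[][] build(List<List<String>> documents,
                                       List<String> words) {
            // Map each unique word to its row index.
            Map<String, Integer> rowOf = new HashMap<>();
            for (int i = 0; i < words.size(); i++) {
                rowOf.put(words.get(i), i);
            }
            double[][] cells = new double[words.size()][documents.size()];
            for (int j = 0; j < documents.size(); j++) {
                for (String word : documents.get(j)) {
                    Integer i = rowOf.get(word);
                    if (i != null) {
                        cells[i][j] += 1.0;  // frequency of words[i] in documents[j]
                    }
                }
            }
            return cells;
        }
    }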

Given such a matrix, you can find the similarity of two documents by comparing the corresponding columns. How to compare them? Again, there are several ways. The most popular is cosine distance. You may remember from school maths that a matrix can be treated as a bunch of vectors, so each column is just a vector in some multidimensional space. That's why this model is called the "Vector Space Model". More on VSM and cosine distance here.
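Comparing two columns boils down to their dot product divided by the product of their lengths; a small sketch (1.0 means maximally similar, 0.0 means no shared words):

    // Cosine similarity between columns a and b of the word-by-document
    // matrix. Sketch only; returns 0 for an all-zero (empty) document.
    public class ColumnSimilarity {
        public static double cosine(double[][] cells, int a, int b) {
            double dot = 0, normA = 0, normB = 0;
            for (int i = 0; i < cells.length; i++) {
                dot   += cells[i][a] * cells[i][b];
                normA += cells[i][a] * cells[i][a];
                normB += cells[i][b] * cells[i][b];
            }
            if (normA == 0 || normB == 0) return 0;
            return dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }
    }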

But we have one problem with such a matrix: it is big. Very, very big. Working with it is too computationally expensive, so we have to reduce it somehow. LSA uses the SVD technique to keep only the most "important" vectors. After reduction, the matrix is ready to use.

So, the algorithm for LSA will look something like this:

  1. Collect all documents and all unique words from them.
  2. Extract frequency information and build the co-occurrence matrix.
  3. Reduce the matrix with SVD (see the sketch after this list).
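For step 3, the decomposition itself is a one-liner once you have a linear algebra library; a minimal sketch using Jama (one common Java choice here, picked just for illustration; Parallel Colt and UJMP, mentioned below, expose similar SVD classes):

    import Jama.Matrix;
    import Jama.SingularValueDecomposition;

    // Step 3 sketch: decompose the word-by-document matrix from step 2.
    // Note: Jama's SVD expects rows >= columns, which usually holds here,
    // since a corpus normally has far more distinct words than documents.
    public class LsaReduction {
        public static SingularValueDecomposition decompose(double[][] cells) {
            Matrix a = new Matrix(cells);  // words x documents
            return a.svd();                // A = U * S * V^T
        }
    }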

If you're going to write an LSA library yourself, a good place to start is the Lucene search engine, which will make steps 1 and 2 much easier, together with some implementation of high-dimensional matrices with SVD capability, such as Parallel Colt or UJMP.

Also pay attention to other techniques that grew out of LSA, such as Random Indexing. RI uses the same idea and shows approximately the same results, but doesn't use the full-matrix stage and is completely incremental, which makes it much more computationally efficient.

最佳男配角 2024-08-23 21:45:04


This may be a bit late, but I always liked Sujit Pal's blog http://sujitpal.blogspot.com/2008/09/ir-math-with-java-tf-idf-and-lsi.html and I have written a bit on my site, if you are interested.

The process is way less complicated than it is often written up as. Really, all you need is a library that can do a singular value decomposition of a matrix.

If you are interested, I can explain it in a couple of short takeaway bits:

1) You create a matrix/dataset/etc. with the word counts of various documents - the different documents will be your columns and the rows the distinct words.

2) Once you've created the matrix, you use a library like Jama (for Java) or SmartMathLibrary (for C#) and run the singular value decomposition. All this does is take your original matrix and break it up into three different parts/matrices that essentially represent your documents, your words, and a kind of multiplier (sigma); these are called the vectors.

3) Once you have your word, document, and sigma vectors, you shrink them equally (by some k) by copying smaller parts of the vectors/matrices and then multiplying them back together. Shrinking them kind of normalizes your data, and this is LSI (see the sketch after this list).
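As a rough sketch of steps 2 and 3 with Jama (k is the number of dimensions you keep; all names here are illustrative, not Eric's actual code):

    import Jama.Matrix;
    import Jama.SingularValueDecomposition;

    // Run the SVD, keep the k largest singular values and the matching
    // columns of U and V, then multiply back together to get the rank-k
    // ("smoothed") matrix. Sketch only.
    public class LsiShrink {
        public static Matrix reduceToRankK(Matrix a, int k) {
            SingularValueDecomposition svd = a.svd();
            int m = a.getRowDimension();     // number of words
            int n = a.getColumnDimension();  // number of documents
            Matrix uk = svd.getU().getMatrix(0, m - 1, 0, k - 1);  // word part
            Matrix sk = svd.getS().getMatrix(0, k - 1, 0, k - 1);  // sigma part
            Matrix vk = svd.getV().getMatrix(0, n - 1, 0, k - 1);  // document part
            return uk.times(sk).times(vk.transpose());             // rank-k approximation
        }
    }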

Here are some fairly clear resources:

http://puffinwarellc.com/index.php/news-and-articles/articles/30-singular-value-decomposition-tutorial.html

http://lsa.colorado.edu/papers/JASIS.lsi.90.pdf
http://www.soe.ucsc.edu/classes/cmps290c/Spring07/proj/Flynn_talk.pdf

Hope this helps you out a bit.

Eric
