Yahoo! LDA implementation question
All,
I have been running Y!LDA (https://github.com/shravanmn/Yahoo_LDA) on a set of documents and the results look great (or at least like what I would expect). Now I want to use the resulting topics to perform a reverse query against the corpus. Does anyone know whether the three human-readable text files that are generated after the learntopics executable runs are the final output of this library? If so, are those what I need to parse to perform my queries? I'm stuck with a little shoulder-shrugging at this point...
Thanks,
Adam

If LDA is working the way I think it is (I use a Java implementation, so explanations may vary), then what you get out is the following three things:
P(word,concept) -- The probability of getting a word given a concept. So, when LDA finishes figuring out what concepts exist within the corpus, this P(w,c) will tell you (in theory) which words map to which concepts.
A very naive method of determining concepts would be to load this file into a matrix, combine all these probabilities across all possible concepts for a test document by some method (add, multiply, root-mean-square), and rank-order the concepts (see the sketch after this list).
Do note that the above method does not recognize the various biases introduced by weakly represented topics or dominating topics in LDA. To accommodate that, you need more complicated algorithms (Gibbs sampling, for instance), but this will get you some results.
P(concept,document) -- If you are attempting to find the intrinsic concepts in the documents in the corpus, you would look here. You can use the documents as examples of documents that have a particular concept distribution, and compare your documents to the LDA corpus documents... There are uses for this, but it may not be as useful as the P(w,c).
Something else probably relating to the weights of words, documents, or concepts. This could be as simple as a set of concept examples with beta weights (for the concepts), or some other variables that LDA outputs. These may or may not be important depending on what you are doing. (If you are attempting to add a document to the LDA space, having the alpha or beta values is very important.)
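As a concrete illustration of the naive ranking mentioned above, here is a minimal Python sketch. It assumes you have already parsed the human-readable P(w,c) output into a word-to-row dictionary and a NumPy matrix; the names vocab, p_wc, and rank_concepts are my own, not part of Y!LDA.

```python
import numpy as np

# Hypothetical inputs (not Y!LDA's API), assumed parsed from the output files:
#   vocab: dict mapping each word to a row index
#   p_wc:  array of shape (n_words, n_concepts) holding P(word, concept)

def rank_concepts(test_doc_words, vocab, p_wc):
    """Score every concept for a test document by summing log-probabilities
    (the 'multiply' variant, done in log space for numerical stability) and
    return concept indices ranked best-first."""
    scores = np.zeros(p_wc.shape[1])
    for word in test_doc_words:
        row = vocab.get(word)
        if row is None:
            continue  # word unseen by LDA; skip it
        scores += np.log(p_wc[row] + 1e-12)  # epsilon avoids log(0)
    return np.argsort(scores)[::-1]  # highest-scoring concept first

# Usage: ranked = rank_concepts(document_text.split(), vocab, p_wc)
```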
To answer your 'reverse lookup' question: to determine the concepts of a test document, use P(w,c) for each word w in the test document.
To determine which document is most like the test document, determine the above concepts, then compare them to the concepts for each document found in P(c,d) (treating each concept as a dimension in vector space and then computing the cosine between the two documents tends to work all right).
To determine the similarity between two documents, same thing as above, just determine the cosine between the two concept-vectors.
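For the cosine comparisons in the last two steps, here is a minimal sketch (again with hypothetical names; P(c,d) is assumed parsed into a concepts-by-documents NumPy matrix):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two concept vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

def most_similar_document(test_vec, p_cd):
    """test_vec: concept vector for the test document.
    p_cd: array of shape (n_concepts, n_documents) parsed from P(c,d).
    Returns the index of the corpus document closest to the test document."""
    sims = [cosine_similarity(test_vec, p_cd[:, d]) for d in range(p_cd.shape[1])]
    return int(np.argmax(sims))
```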
Hope that helps.