Determining the document ID in Mahout LDA output
I've successfully run mahout lda and displayed the output using the command mahout ldatopics.
For example, my topics are science and sports. Then the output looks like this:
topic 0
basketball,
play,
baseball
topic 1
research,
study,
philosophy
My question now is: how can I identify which group or cluster an individual article belongs to?
Is there an ID number or some sort of tracking, so that every new article I add gets grouped into, or added to, a specific cluster/topic?
If I already have the clusters, what's the next step?
Thanks
Comments (1)
I've been looking through the source code and I can't find any mention of a theta matrix for calculating the probability of topics given a document. There's also no input for an alpha value to estimate the topics per document, and the LDAState class has a logProbWordGivenTopic(int, int) method but nothing like getProbTopicGivenDocument().
I can only assume the Mahout implementation of LDA doesn't deal with discovering the topic distribution for specific documents. I'd love to be wrong, though, if someone else knows better.
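For what it's worth, here is a rough sketch of how the missing step could be done by hand once the per-topic word log-probabilities are exported (the kind of value logProbWordGivenTopic returns). It is plain Python, not Mahout code. The function name infer_topic_mixture, the TOPICS table and its probabilities are all made up for illustration, and the method is a simple EM-style "fold-in" with the topic-word probabilities held fixed, with no alpha prior, so treat it as an approximation rather than what any Mahout version does.

```python
import math

# Hypothetical toy model: log p(word | topic) for the two topics in the question.
# The probabilities are invented for illustration; they are not Mahout output.
TOPICS = [
    {w: math.log(p) for w, p in {
        "basketball": 0.30, "play": 0.30, "baseball": 0.30,
        "research": 0.04, "study": 0.03, "philosophy": 0.03}.items()},
    {w: math.log(p) for w, p in {
        "basketball": 0.03, "play": 0.04, "baseball": 0.03,
        "research": 0.30, "study": 0.30, "philosophy": 0.30}.items()},
]


def infer_topic_mixture(doc_words, log_prob_word_given_topic, num_iters=50):
    """Estimate p(topic | document) while keeping p(word | topic) fixed."""
    num_topics = len(log_prob_word_given_topic)
    # Ignore words that the model has no probability for.
    known = [w for w in doc_words
             if all(w in topic for topic in log_prob_word_given_topic)]
    if not known:
        # Nothing to go on: fall back to a uniform mixture.
        return [1.0 / num_topics] * num_topics

    theta = [1.0 / num_topics] * num_topics
    for _ in range(num_iters):
        counts = [0.0] * num_topics
        for w in known:
            # E-step: share of this word attributed to each topic.
            joint = [theta[k] * math.exp(log_prob_word_given_topic[k][w])
                     for k in range(num_topics)]
            total = sum(joint)
            for k in range(num_topics):
                counts[k] += joint[k] / total
        # M-step: the new mixture is the average share per word.
        theta = [c / len(known) for c in counts]
    return theta


if __name__ == "__main__":
    article = ["play", "baseball", "basketball", "study"]
    mixture = infer_topic_mixture(article, TOPICS)
    best_topic = max(range(len(mixture)), key=lambda k: mixture[k])
    print("topic mixture:", mixture, "-> assigned to topic", best_topic)
```

With something like this, the "ID" for each article is whatever key you already use for your documents. You would store the inferred mixture (or just the highest-scoring topic) against that key, and run the same function on every new article.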