A question about Latent Dirichlet Allocation (MALLET)
Honestly, I'm not familiar with LDA, but I am required to use MALLET's topic modeling for one of my projects.
My question is: given a set of documents from a specific time window as the training data for the topic model, how appropriate is it to use the model (via the inferencer) to track topic trends for documents dated before or after the training data's time window? In other words, are the topic distributions that MALLET provides a suitable metric for tracking the popularity of topics over time if, during the model-building stage, we only provide a subset of the dataset I am required to analyze?
Thanks.
Comments (3)
Are you familiar with Latent Semantic Indexing? Latent Dirichlet Allocation is just a different way of doing the same kind of thing, so LSI or pLSI may be an easier starting point for understanding the goals of LDA.
All three techniques lock on to topics in an unsupervised fashion (you tell them how many topics to look for), and then assume that each document covers each topic in varying proportions. Depending on how many topics you allocate, the topics may behave more like subfields of whatever your corpus is about, and may not be as specific as the "topics" that people have in mind when they think about trending topics in the news.
Somehow I suspect that you want to assume that each document represents a particular topic. LSI/pLSI/LDA don't do this: they model each document as a mixture of topics. That doesn't mean you won't get good results, or that this isn't worth trying, but I suspect (though I don't have a comprehensive knowledge of the LSI literature) that you'd be tackling a brand new research problem.
(FWIW, I suspect that clustering methods like k-means more readily model the assumption that each document has exactly one topic.)
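To make the mixture-versus-single-topic distinction concrete for trend tracking, here is a minimal Python sketch. It assumes you have already obtained per-document topic proportions (for example from an inferencer) and parsed them into `(period, proportions)` pairs; that input format, the function names, and the numbers are all made up for illustration and are not MALLET's API or output format.

```python
from collections import defaultdict


def prevalence_by_period(docs, num_topics):
    """Mixture view: average the topic proportions of all documents in a period.

    docs is an iterable of (period, proportions) pairs, where proportions is a
    list of num_topics non-negative numbers (hypothetical input format).
    """
    sums = defaultdict(lambda: [0.0] * num_topics)
    counts = defaultdict(int)
    for period, proportions in docs:
        for k, p in enumerate(proportions):
            sums[period][k] += p
        counts[period] += 1
    return {period: [s / counts[period] for s in totals]
            for period, totals in sums.items()}


def dominant_topic_share(docs, num_topics):
    """One-topic-per-document view: share of documents whose largest topic is k."""
    wins = defaultdict(lambda: [0] * num_topics)
    counts = defaultdict(int)
    for period, proportions in docs:
        best = max(range(num_topics), key=lambda k: proportions[k])
        wins[period][best] += 1
        counts[period] += 1
    return {period: [w / counts[period] for w in tallies]
            for period, tallies in wins.items()}


# Made-up proportions, for illustration only.
docs = [
    (2009, [0.6, 0.4]),
    (2009, [0.55, 0.45]),
    (2010, [0.2, 0.8]),
]
mixture = prevalence_by_period(docs, 2)
hard = dominant_topic_share(docs, 2)
```

With these made-up numbers, the hard-assignment view says topic 1 is absent in 2009, while the mixture view still credits it with a substantial share. Which view is the better popularity measure depends on whether you believe a document can be about several topics at once.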
You should check out the topic-models mailing list at Princeton. They discuss theoretical and practical issues relating to topic models.
I'm aware of three approaches to tracking the popularity of topics over time.
It sounds like you might benefit from a dynamic topic modeling approach, which looks at how topics change over time. There's a nice video overview of Blei's work on that, and a bunch of PDFs on his home page (http://www.cs.princeton.edu/~blei/publications.html). He has a package in C that does it.
A related approach is Alice Oh's topic string approach, where she obtains topics by LDA for texts from time-slices and then uses a topic similarity metric to link topics from different time slices into strings (video, PDF). Looks like MALLET could be part of a topic string analysis, but she doesn't mention how she did the LDA analysis.
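A rough Python sketch of the linking step might look like the following. It assumes each time slice's topics are already available as word-weight dictionaries, and it uses cosine similarity with an arbitrary threshold purely as a stand-in; I don't know which similarity metric or threshold the original work uses, and all names and numbers here are hypothetical.

```python
import math


def cosine(p, q):
    """Cosine similarity between two sparse word-weight dictionaries."""
    dot = sum(w * q.get(word, 0.0) for word, w in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def link_slices(earlier, later, threshold=0.5):
    """Link each topic in `later` to its most similar topic in `earlier`.

    Returns (earlier_index, later_index, similarity) triples for links whose
    similarity is at least `threshold` (an assumed cut-off, not a published one).
    """
    links = []
    for j, topic_later in enumerate(later):
        best_i, best_sim = None, 0.0
        for i, topic_earlier in enumerate(earlier):
            sim = cosine(topic_earlier, topic_later)
            if sim > best_sim:
                best_i, best_sim = i, sim
        if best_i is not None and best_sim >= threshold:
            links.append((best_i, j, best_sim))
    return links


# Made-up topics for two consecutive time slices.
slice_2009 = [{"election": 0.5, "vote": 0.5}, {"market": 0.7, "stock": 0.3}]
slice_2010 = [{"stock": 0.6, "market": 0.4}, {"football": 1.0}]
links = link_slices(slice_2009, slice_2010)
```

Here the 2010 "stock/market" topic links back to the 2009 "market/stock" topic, while the "football" topic starts a new string because nothing in the earlier slice resembles it.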
The simplest approach might be what David Mimno does in his paper, where he calculates the mean year of a topic from the chronological distribution of the words in the topic. He's involved in the development of MALLET, so it's probably entirely done with that package.
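As I read it, the mean-year idea boils down to a weighted average, which can be sketched in a few lines of Python. This assumes you have, for each document, its year and the number of tokens assigned to each topic; that input format and the numbers are hypothetical, and this is not necessarily the exact computation used in the paper.

```python
def topic_mean_years(docs, num_topics):
    """Weighted mean year per topic.

    docs is an iterable of (year, token_counts) pairs, where token_counts[k]
    is the number of tokens in that document assigned to topic k
    (hypothetical input format). Topics with no tokens get None.
    """
    weighted = [0.0] * num_topics
    totals = [0.0] * num_topics
    for year, token_counts in docs:
        for k, count in enumerate(token_counts):
            weighted[k] += year * count
            totals[k] += count
    return [weighted[k] / totals[k] if totals[k] > 0 else None
            for k in range(num_topics)]


# Made-up counts, for illustration only.
docs = [
    (2000, [30, 0, 0]),
    (2010, [10, 20, 0]),
]
mean_years = topic_mean_years(docs, 3)
```

A topic whose tokens mostly come from early documents gets an early mean year, so sorting topics by this value gives a crude picture of which themes are older and which are more recent.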