Topic modeling with Mallet
I'm trying to use topic modeling with Mallet but have a question.
How do I know when I need to rebuild the model? For instance, I have a set of documents that I crawled from the web. Using the topic modeling provided by Mallet, I can train a model and use it to infer topics for documents. But over time, as I crawl new data, new subjects may appear. In that case, how do I know whether I should rebuild the model over everything from the start up to the present?
I was thinking of doing so for documents I crawled each month. Can someone please advise?
Also, is topic modeling better suited to text with a fixed number of topics (the input parameter k, the number of topics)? If not, how do I determine what number to use?
Comments (1)
The answers to your questions depend in large part on the kind of data you're working with and the size of the corpus.
Regarding frequency, I'm afraid you'll just have to estimate how often your data changes in a meaningful way and remodel at that rate. You could start with a week and see if the new data lead to a significantly different model. If not, try two weeks and so on.
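One rough way to put a number on "significantly different" is to compare the top words of each topic in the old model against the model retrained with the new data. The sketch below is only an illustration under assumptions: the function names and the 0.3 threshold are made up here, and it assumes you have already extracted each topic's top words into plain lists (for example from the topic keys file that Mallet can write out). A new topic whose best Jaccard overlap with every old topic is low is a hint that a new subject has appeared.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of words."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def model_overlap(old_topics, new_topics):
    """For each new topic, return its best Jaccard score against any old topic.

    Both arguments are lists of topics, each topic being a list of top words.
    Assumes old_topics is non-empty.
    """
    return [max(jaccard(new, old) for old in old_topics) for new in new_topics]


def novel_topics(old_topics, new_topics, threshold=0.3):
    """Indices of new topics with no close match in the old model.

    The threshold is an arbitrary illustrative value; tune it on your own data.
    """
    scores = model_overlap(old_topics, new_topics)
    return [i for i, score in enumerate(scores) if score < threshold]


# Hypothetical top words from last month's model and this month's model.
old = [
    ["game", "team", "player", "score"],
    ["stock", "market", "price", "trade"],
]
new = [
    ["team", "game", "player", "coach"],
    ["stock", "price", "market", "bank"],
    ["virus", "vaccine", "health", "outbreak"],
]

print(model_overlap(old, new))  # best match score for each new topic
print(novel_topics(old, new))   # indices of topics that look new
```

If most new topics still match an old one closely, the old model is probably still good enough and you can lengthen the interval between rebuilds; if several come back as novel, that suggests rebuilding more often.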
The number of topics you select is determined by what you're looking for in the model. The higher the number, the more fine-grained the results. If you want a broad overview of what's in your corpus, you could select say 10 topics. For a closer look, you could use 200 or some other suitably high number.
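If you would like a number to go alongside that judgment, one common heuristic (not something the advice above depends on) is to train models at a few values of k and compare a topic coherence score for each. The sketch below computes a UMass-style coherence for a single topic from its top words and a set of tokenized reference documents. The function name and the toy data are assumptions for illustration, and the training step itself is left out. Scores closer to zero mean the top words tend to appear together in the same documents.

```python
import math


def umass_coherence(top_words, documents):
    """UMass-style coherence of one topic.

    top_words: the topic's top words, most probable first.
    documents: tokenized reference documents (lists of words).
    Sums log((D(w_m, w_l) + 1) / D(w_l)) over every pair where w_l ranks
    above w_m, with D counting the documents that contain the word(s).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            d_l = doc_freq(top_words[l])
            if d_l == 0:
                continue  # skip words that never occur in the reference documents
            co_occurrences = doc_freq(top_words[m], top_words[l])
            score += math.log((co_occurrences + 1) / d_l)
    return score


# Toy reference corpus, purely illustrative.
docs = [
    ["cat", "dog"],
    ["cat", "dog"],
    ["cat", "fish"],
    ["stock", "market"],
]

print(umass_coherence(["cat", "dog"], docs))     # words that co-occur
print(umass_coherence(["cat", "market"], docs))  # words that never co-occur
```

Averaging this score over all topics of a model gives one figure per k that you can compare. Treat it as a guide only: a model with the best coherence is not necessarily the one with the granularity you are after.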
I hope that helps.