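How do I interpret the output of the topic model classes in MALLET?
When I tried the example code from the topic modeling developer's guide, I wanted to understand what its output actually means.
First, while it runs, it prints: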
Coded LDA: 10 topics, 4 topic bits, 1111 topic mask
max tokens: 148
total tokens: 1333
<10> LL/token: -9,24097
<20> LL/token: -9,1026
<30> LL/token: -8,95386
<40> LL/token: -8,75353
0 0,5 battle union confederate tennessee american states
1 0,5 hawes sunderland echo war paper commonwealth
2 0,5 test including cricket australian hill career
3 0,5 average equipartition theorem law energy system
4 0,5 kentucky army grant gen confederates buell
5 0,5 years yard national thylacine wilderness parks
6 0,5 gunnhild norway life extinct gilbert thespis
7 0,5 zinta role hindi actress film indian
8 0,5 rings south ring dust 2 uranus
9 0,5 tasmanian back time sullivan london century
<50> LL/token: -8,59033
<60> LL/token: -8,63711
<70> LL/token: -8,56168
<80> LL/token: -8,57189
<90> LL/token: -8,46669
0 0,5 battle union confederate tennessee united numerous
1 0,5 hawes sunderland echo paper commonwealth early
2 0,5 test cricket south australian hill england
3 0,5 average equipartition theorem law energy system
4 0,5 kentucky army grant gen war time
5 0,5 yard national thylacine years wilderness tasmanian
6 0,5 including gunnhild norway life time thespis
7 0,5 zinta role hindi actress film indian
8 0,5 rings ring dust 2 uranus survived
9 0,5 back london modern sullivan gilbert needham
<100> LL/token: -8,49005
<110> LL/token: -8,57995
<120> LL/token: -8,55601
<130> LL/token: -8,50673
<140> LL/token: -8,46388
0 0,5 battle union confederate tennessee war united
1 0,5 sunderland echo paper edward england world
2 0,5 test cricket south australian hill record
3 0,5 average equipartition theorem energy system kinetic
4 0,5 hawes kentucky army gen grant confederates
5 0,5 years yard national thylacine wilderness tasmanian
6 0,5 gunnhild norway including king life devil
7 0,5 zinta role hindi actress film indian
8 0,5 rings ring dust 2 uranus number
9 0,5 london sullivan gilbert thespis back mother
<150> LL/token: -8,51129
<160> LL/token: -8,50269
<170> LL/token: -8,44308
<180> LL/token: -8,47441
<190> LL/token: -8,62186
0 0,5 battle union confederate grant tennessee numerous
1 0,5 sunderland echo survived paper edward england
2 0,5 test cricket south australian hill park
3 0,5 average equipartition theorem energy system law
4 0,5 hawes kentucky army gen time confederates
5 0,5 yard national thylacine years wilderness tasmanian
6 0,5 gunnhild including norway life king time
7 0,5 zinta role hindi actress film indian
8 0,5 rings ring dust 2 uranus number
9 0,5 back london sullivan gilbert thespis 3
<200> LL/token: -8,54771
Total time: 6 seconds
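So, question 1: what does the first line, "Coded LDA: 10 topics, 4 topic bits, 1111 topic mask", mean? I only understand the "10 topics" part.
Question 2: what does LL/token mean in "<10> LL/token: -9,24097 <20> LL/token: -9,1026 <30> LL/token: -8,95386 <40> LL/token: -8,75353"? It looks like a metric for the Gibbs sampling, but shouldn't it be monotonically increasing?
After that, the following is printed: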
elizabeth-9 needham-9 died-7 3-9 1731-6 mother-6 needham-9 english-7 procuress-6 brothel-4 keeper-9 18th-8.......
0 0.008 battle (8) union (7) confederate (6) grant (4) tennessee (4)
1 0.008 sunderland (6) years (6) echo (5) survived (3) paper (3)
2 0.040 test (6) cricket (5) hill (4) park (3) career (3)
3 0.008 average (6) equipartition (6) system (5) theorem (5) law (4)
4 0.073 hawes (7) kentucky (6) army (5) gen (4) war (4)
5 0.008 yard (6) national (6) thylacine (5) wilderness (4) tasmanian (4)
6 0.202 gunnhild (5) norway (4) life (4) including (3) king (3)
7 0.202 zinta (4) role (3) hindi (3) actress (3) film (3)
8 0.040 rings (10) ring (3) dust (3) 2 (3) uranus (3)
9 0.411 london (4) sullivan (3) gilbert (3) thespis (3) back (3)
0 0.55
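The first line of this part is probably the token-topic assignment, right?
Question 3: for the first topic,
0 0.008 battle (8) union (7) confederate (6) grant (4) tennessee (4)
0.008 is said to be the "topic distribution". Is that the distribution of this topic over the whole corpus? If so, there seems to be a conflict: the words of topic 0 shown above occur 8+7+6+4+4+... times in the corpus, whereas those of topic 7 occur only 4+3+3+3+3+... times. So the distribution of topic 7 should be lower than that of topic 0, yet it is higher. This is what I can't understand. Also, what is the "0 0.55" at the end?
Thank you very much for reading this long post. I hope you can answer it, and that it will be useful to others who are interested in MALLET.
Best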
薄荷港 2024-12-27 13:15:26
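My understanding is:
0 0.008 battle (8) union (7) confederate (6) grant (4) tennessee (4)
- 0 is the topic number.
- 0.008 is the weight of that topic.
- battle (8) union (7) [...] are the top keywords in that topic. The number is how many times the word occurs in the topic.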
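Read that way, each topic line can be split mechanically into its three parts. The following is a minimal Python sketch, assuming the interpretation given in the list above (the number in parentheses is a per-topic word count); the function name `parse_topic_line` is made up for illustration and is not part of MALLET.

```python
import re

def parse_topic_line(line):
    """Split a line such as '0 0.008 battle (8) union (7)' into its parts."""
    topic_id, weight, rest = line.split(maxsplit=2)
    # Each keyword is followed by its count in parentheses.
    words = [(w, int(c)) for w, c in re.findall(r"(\S+) \((\d+)\)", rest)]
    return int(topic_id), float(weight), words

topic_id, weight, words = parse_topic_line(
    "0 0.008 battle (8) union (7) confederate (6) grant (4) tennessee (4)"
)
print(topic_id, weight, words)
```

Note that the counts cover only the top words that are printed, so summing them does not give the total number of tokens assigned to the topic.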
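You will also get a .csv file, which I think contains the most important data from the process. Each row has values like this:
0 0 285 10 page make items thing work put dec browsers recipes expressions
That is:
- tree level
- topic ID
- total number of words
- total number of documents
- top 10 words
It's a bit late, but I hope this helps someone.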
I don't think I know enough to give a very complete answer, but here's a shot at some of it. For Q1, you can inspect some code to see how those values are calculated. For Q2, LL/token is the model's log-likelihood divided by the total number of tokens; it measures how likely the data are given the model, and increasing values mean the model is improving. These values are also available in the R packages for topic modeling. As for the first line of the second block of output, yes, I think you're right about that. Q3 is a good question, and the answer is not immediately clear to me. Perhaps the (x) are some kind of index, since token frequency seems unlikely. Presumably most of these are diagnostics of some kind. A more useful set of diagnostics can be obtained with
bin\mallet run cc.mallet.topics.tui.TopicTrainer ...your various options... --diagnostics-file diagnostics.xml
which will produce a large number of measures of topic quality. They're definitely worth checking out. For the full story, I'd suggest emailing David Mimno at Princeton, who is the (main?) maintainer of MALLET, or writing to him via the list at http://blog.gmane.org/gmane.comp.ai.mallet.devel, and then posting the answers back here for those of us curious about the inner workings of MALLET.
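To make Q1 and Q2 a little more concrete, here is a minimal Python sketch. The first function reproduces the "4 topic bits, 1111 topic mask" figures for 10 topics by taking the smallest all-ones bit mask that can hold every topic id; that this matches what MALLET does internally is an assumption worth checking against its source. The second part parses a few LL/token lines from the question (which use a decimal comma) and checks whether they increase at every step. Both function names are made up for illustration.

```python
import re

def topic_bits_and_mask(num_topics):
    """Return (bits, mask) for the smallest all-ones mask covering all topic ids.

    Assumption: this mirrors how MALLET derives the values it prints.
    """
    if num_topics & (num_topics - 1) == 0:
        # Exact power of two: ids 0..num_topics-1 fit in num_topics - 1.
        mask = num_topics - 1
    else:
        # Otherwise round up to the next power of two, minus one.
        mask = (1 << num_topics.bit_length()) - 1
    return bin(mask).count("1"), mask

def parse_ll_per_token(log_text):
    """Extract (iteration, LL/token) pairs, accepting a decimal comma or point."""
    pairs = re.findall(r"<(\d+)> LL/token: (-?\d+[.,]\d+)", log_text)
    return [(int(it), float(val.replace(",", "."))) for it, val in pairs]

bits, mask = topic_bits_and_mask(10)
print(f"{bits} topic bits, {mask:b} topic mask")

# Values copied from the run in the question.
log_text = """
<40> LL/token: -8,75353
<50> LL/token: -8,59033
<60> LL/token: -8,63711
<70> LL/token: -8,56168
"""
values = [ll for _, ll in parse_ll_per_token(log_text)]
# True only when every value is at least as large as the one before it.
monotone = all(b >= a for a, b in zip(values, values[1:]))
print(values, "monotone:", monotone)
```

The check comes out false for this run: LL/token drifts upward overall but dips between iterations 50 and 60. That is expected, because Gibbs sampling is stochastic and each sweep draws new topic assignments at random, so the per-token log-likelihood fluctuates rather than rising at every step.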