How to understand the output of the topic model classes in MALLET?



As I try out the example code from the topic modeling developers' guide, I really want to understand the meaning of that code's output.

First, during the training run it prints:

Coded LDA: 10 topics, 4 topic bits, 1111 topic mask
max tokens: 148
total tokens: 1333
<10> LL/token: -9,24097
<20> LL/token: -9,1026
<30> LL/token: -8,95386
<40> LL/token: -8,75353

0   0,5 battle union confederate tennessee american states 
1   0,5 hawes sunderland echo war paper commonwealth 
2   0,5 test including cricket australian hill career 
3   0,5 average equipartition theorem law energy system 
4   0,5 kentucky army grant gen confederates buell 
5   0,5 years yard national thylacine wilderness parks 
6   0,5 gunnhild norway life extinct gilbert thespis 
7   0,5 zinta role hindi actress film indian 
8   0,5 rings south ring dust 2 uranus 
9   0,5 tasmanian back time sullivan london century 

<50> LL/token: -8,59033
<60> LL/token: -8,63711
<70> LL/token: -8,56168
<80> LL/token: -8,57189
<90> LL/token: -8,46669

0   0,5 battle union confederate tennessee united numerous 
1   0,5 hawes sunderland echo paper commonwealth early 
2   0,5 test cricket south australian hill england 
3   0,5 average equipartition theorem law energy system 
4   0,5 kentucky army grant gen war time 
5   0,5 yard national thylacine years wilderness tasmanian 
6   0,5 including gunnhild norway life time thespis 
7   0,5 zinta role hindi actress film indian 
8   0,5 rings ring dust 2 uranus survived 
9   0,5 back london modern sullivan gilbert needham 

<100> LL/token: -8,49005
<110> LL/token: -8,57995
<120> LL/token: -8,55601
<130> LL/token: -8,50673
<140> LL/token: -8,46388

0   0,5 battle union confederate tennessee war united 
1   0,5 sunderland echo paper edward england world 
2   0,5 test cricket south australian hill record 
3   0,5 average equipartition theorem energy system kinetic 
4   0,5 hawes kentucky army gen grant confederates 
5   0,5 years yard national thylacine wilderness tasmanian 
6   0,5 gunnhild norway including king life devil 
7   0,5 zinta role hindi actress film indian 
8   0,5 rings ring dust 2 uranus number 
9   0,5 london sullivan gilbert thespis back mother 

<150> LL/token: -8,51129
<160> LL/token: -8,50269
<170> LL/token: -8,44308
<180> LL/token: -8,47441
<190> LL/token: -8,62186

0   0,5 battle union confederate grant tennessee numerous 
1   0,5 sunderland echo survived paper edward england 
2   0,5 test cricket south australian hill park 
3   0,5 average equipartition theorem energy system law 
4   0,5 hawes kentucky army gen time confederates 
5   0,5 yard national thylacine years wilderness tasmanian 
6   0,5 gunnhild including norway life king time 
7   0,5 zinta role hindi actress film indian 
8   0,5 rings ring dust 2 uranus number 
9   0,5 back london sullivan gilbert thespis 3 

<200> LL/token: -8,54771

Total time: 6 seconds

So, Question 1: what does "Coded LDA: 10 topics, 4 topic bits, 1111 topic mask" in the first line mean? I only know what "10 topics" refers to.

Question 2: what does LL/token in "<10> LL/token: -9,24097 <20> LL/token: -9,1026 <30> LL/token: -8,95386 <40> LL/token: -8,75353" mean? It seems to be a metric for the Gibbs sampling, but shouldn't it be monotonically increasing?

And after that, the following is printed:

elizabeth-9 needham-9 died-7 3-9 1731-6 mother-6 needham-9 english-7 procuress-6 brothel-4 keeper-9 18th-8.......
0   0.008   battle (8) union (7) confederate (6) grant (4) tennessee (4) 
1   0.008   sunderland (6) years (6) echo (5) survived (3) paper (3) 
2   0.040   test (6) cricket (5) hill (4) park (3) career (3) 
3   0.008   average (6) equipartition (6) system (5) theorem (5) law (4) 
4   0.073   hawes (7) kentucky (6) army (5) gen (4) war (4) 
5   0.008   yard (6) national (6) thylacine (5) wilderness (4) tasmanian (4) 
6   0.202   gunnhild (5) norway (4) life (4) including (3) king (3) 
7   0.202   zinta (4) role (3) hindi (3) actress (3) film (3) 
8   0.040   rings (10) ring (3) dust (3) 2 (3) uranus (3) 
9   0.411   london (4) sullivan (3) gilbert (3) thespis (3) back (3) 
0   0.55

The first line in this part is probably the token-topic assignment, right?

Question 3: for the first topic,

0   0.008   battle (8) union (7) confederate (6) grant (4) tennessee (4)   

0.008 is said to be the "topic distribution". Is it the distribution of this topic over the whole corpus? Then there seems to be a conflict: topic 0, as shown above, has its tokens appear in the corpus 8+7+6+4+4+... times, while topic 7 is recognized only 4+3+3+3+3... times in the corpus. So topic 7 should have a lower distribution than topic 0, yet its value (0.202) is higher than topic 0's (0.008). This is what I can't understand.
Furthermore, what is that "0 0.55" at the end?

Thank you very much for reading this long post. I hope you can answer it, and I hope this will be helpful for others interested in Mallet.

Best


Comments (3)

装纯掩盖桑 2024-12-27 13:15:26


I don't think I know enough to give a very complete answer, but here's a shot at some of it... For Q1 you can inspect some code to see how those values are calculated. For Q2, LL is the model's log-likelihood divided by the total number of tokens; this is a measure of how likely the data are given the model, and increasing values mean the model is improving. These are also available in the R packages for topic modeling. As for the token-topic assignment, yes, I think that's right for the first line. For Q3, good question; it's not immediately clear to me. Perhaps the (x) are some kind of index, token frequency seems unlikely... Presumably most of these are diagnostics of some kind.
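As I understand it (this is my reading, not a documented definition), LL/token is the log-likelihood of the corpus under the current sampling state, normalized by the token count:

LL/token = (1/N) log p(w | z)

where N is the total number of tokens (1333 here) and z is the current set of topic assignments. Since Gibbs sampling draws samples from the posterior rather than climbing it deterministically, the value fluctuates from checkpoint to checkpoint instead of increasing monotonically, which is exactly what the trace above shows.

And for Q1, here is a minimal, self-contained Java sketch (my reconstruction, not MALLET source) of how a sampler could derive those two numbers: to pack a topic index into the low-order bits of an int, you need the smallest number of bits that can represent every topic index, and the mask keeps exactly those bits.

// Sketch only: reconstructs the "topic bits" / "topic mask" arithmetic
// that the "Coded LDA" log line appears to report.
public class TopicMaskDemo {
    public static void main(String[] args) {
        int numTopics = 10;

        // Smallest number of bits that can represent topics 0..numTopics-1.
        int topicBits = 32 - Integer.numberOfLeadingZeros(numTopics - 1);
        int topicMask = (1 << topicBits) - 1;

        // For 10 topics: 4 bits, mask = 15 = binary 1111, matching the line
        // "Coded LDA: 10 topics, 4 topic bits, 1111 topic mask".
        System.out.printf("%d topics, %d topic bits, %s topic mask%n",
                numTopics, topicBits, Integer.toBinaryString(topicMask));
    }
}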

A more useful set of diagnostics can be obtained with bin\mallet run cc.mallet.topics.tui.TopicTrainer ...your various options... --diagnostics-file diagnostics.xml which will produce a large number of measures of topic quality. They're definitely worth checking out.

For the full story about all of this, I'd suggest writing an email to David Mimno at Princeton, who is the (main?) maintainer of MALLET, or writing to him via the list at http://blog.gmane.org/gmane.comp.ai.mallet.devel and then posting the answers back here for those of us curious about the inner workings of MALLET...

薄荷港 2024-12-27 13:15:26


What I understand is (a short sketch for reading these values out of the model follows the list):

0   0.008   battle (8) union (7) confederate (6) grant (4) tennessee (4)   
  • 0 is the topic number.
  • 0.008 is the weight of that topic.
  • battle (8) union (7) [...] are the top keywords in that topic. The numbers are the occurrences of each word in the topic.
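Here is a short Java sketch of how those numbers can be read out programmatically, assuming a model trained with ParallelTopicModel as in the developers' guide. Treat the method names (getTopicProbabilities, getSortedWords, getAlphabet) as my best recollection of that class's API rather than something verified against your MALLET version:

import java.util.ArrayList;
import java.util.Iterator;
import java.util.TreeSet;

import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.types.Alphabet;
import cc.mallet.types.IDSorter;

public class TopicReport {
    // Print, for each topic of an already-trained model: its weight in
    // document 0 and its top 5 words with raw counts -- the same shape
    // as the output quoted in the question.
    static void report(ParallelTopicModel model, int numTopics) {
        Alphabet alphabet = model.getAlphabet();

        // P(topic | document 0): the "0.008 / 0.202 / ..." column.
        double[] topicDist = model.getTopicProbabilities(0);

        // Words per topic, sorted by count: the "battle (8) union (7)" part.
        ArrayList<TreeSet<IDSorter>> sortedWords = model.getSortedWords();

        for (int topic = 0; topic < numTopics; topic++) {
            StringBuilder line = new StringBuilder(
                    String.format("%d\t%.3f\t", topic, topicDist[topic]));
            Iterator<IDSorter> it = sortedWords.get(topic).iterator();
            for (int rank = 0; rank < 5 && it.hasNext(); rank++) {
                IDSorter word = it.next();
                line.append(String.format("%s (%.0f) ",
                        alphabet.lookupObject(word.getID()), word.getWeight()));
            }
            System.out.println(line);
        }
    }
}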

Then, as a result, you also obtain a .csv file. I think it contains the most important data from the process. You will find values like the following in each row:

0   0   285 10   page make items thing work put dec browsers recipes expressions 

That is (a small parser sketch for this layout follows the list):

  1. Tree level
  2. Topic ID
  3. Total words
  4. Total documents
  5. Top-10 words
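In case it's useful, here is a hypothetical parser for one such row, assuming the whitespace-separated layout described above (the class and field names are my own invention, not part of MALLET):

import java.util.Arrays;

public class TopicCsvRow {
    final int treeLevel, topicId, totalWords, totalDocs;
    final String[] topWords;

    // Assumed layout: tree level, topic id, word count, doc count, top words.
    TopicCsvRow(String line) {
        String[] f = line.trim().split("\\s+");
        treeLevel  = Integer.parseInt(f[0]);
        topicId    = Integer.parseInt(f[1]);
        totalWords = Integer.parseInt(f[2]);
        totalDocs  = Integer.parseInt(f[3]);
        topWords   = Arrays.copyOfRange(f, 4, f.length);
    }

    public static void main(String[] args) {
        TopicCsvRow row = new TopicCsvRow(
            "0   0   285 10   page make items thing work put dec browsers recipes expressions");
        System.out.println("topic " + row.topicId + ": " + row.totalWords
            + " words in " + row.totalDocs + " docs, top word: " + row.topWords[0]);
    }
}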

A little bit late, but I hope it helps someone.

花桑 2024-12-27 13:15:26


For question 3, I believe the 0.008 (the "topic distribution") relates to the prior \alpha over topic distributions for documents. Mallet optimises this prior, essentially allowing some topics to carry more "weight". Mallet seems to be estimating that topic 0 accounts for a small proportion of your corpus.
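In other words, using the standard smoothed estimate (my formulation, with a separate \alpha_k per topic, since MALLET's optimised prior is asymmetric), the displayed weight of topic k for a document d would be roughly

\hat{\theta}_{d,k} = (n_{d,k} + \alpha_k) / \sum_{k'} (n_{d,k'} + \alpha_{k'})

where n_{d,k} is the number of tokens in d currently assigned to topic k. With an optimised asymmetric prior, a topic with a large \alpha_k can carry a large weight even when its top-word counts look modest.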

The token counts shown represent only the words with the highest counts. The remaining counts for topic 0 could, for example, be 0, while the remaining counts for topic 9 could be 3. Thus topic 9 can account for many more words in your corpus than topic 0, even though the counts for its top words are lower.

I'd have to check the code for the "0 0.55" at the end, but that's probably the optimised \beta value (which I'm pretty sure isn't done asymmetrically).
