How to understand the output of the topic model classes in MALLET?



As I try out the example code from the topic modeling developers' guide, I really want to understand the meaning of that code's output.

First, during the training run it prints:

Coded LDA: 10 topics, 4 topic bits, 1111 topic mask
max tokens: 148
total tokens: 1333
<10> LL/token: -9,24097
<20> LL/token: -9,1026
<30> LL/token: -8,95386
<40> LL/token: -8,75353

0   0,5 battle union confederate tennessee american states 
1   0,5 hawes sunderland echo war paper commonwealth 
2   0,5 test including cricket australian hill career 
3   0,5 average equipartition theorem law energy system 
4   0,5 kentucky army grant gen confederates buell 
5   0,5 years yard national thylacine wilderness parks 
6   0,5 gunnhild norway life extinct gilbert thespis 
7   0,5 zinta role hindi actress film indian 
8   0,5 rings south ring dust 2 uranus 
9   0,5 tasmanian back time sullivan london century 

<50> LL/token: -8,59033
<60> LL/token: -8,63711
<70> LL/token: -8,56168
<80> LL/token: -8,57189
<90> LL/token: -8,46669

0   0,5 battle union confederate tennessee united numerous 
1   0,5 hawes sunderland echo paper commonwealth early 
2   0,5 test cricket south australian hill england 
3   0,5 average equipartition theorem law energy system 
4   0,5 kentucky army grant gen war time 
5   0,5 yard national thylacine years wilderness tasmanian 
6   0,5 including gunnhild norway life time thespis 
7   0,5 zinta role hindi actress film indian 
8   0,5 rings ring dust 2 uranus survived 
9   0,5 back london modern sullivan gilbert needham 

<100> LL/token: -8,49005
<110> LL/token: -8,57995
<120> LL/token: -8,55601
<130> LL/token: -8,50673
<140> LL/token: -8,46388

0   0,5 battle union confederate tennessee war united 
1   0,5 sunderland echo paper edward england world 
2   0,5 test cricket south australian hill record 
3   0,5 average equipartition theorem energy system kinetic 
4   0,5 hawes kentucky army gen grant confederates 
5   0,5 years yard national thylacine wilderness tasmanian 
6   0,5 gunnhild norway including king life devil 
7   0,5 zinta role hindi actress film indian 
8   0,5 rings ring dust 2 uranus number 
9   0,5 london sullivan gilbert thespis back mother 

<150> LL/token: -8,51129
<160> LL/token: -8,50269
<170> LL/token: -8,44308
<180> LL/token: -8,47441
<190> LL/token: -8,62186

0   0,5 battle union confederate grant tennessee numerous 
1   0,5 sunderland echo survived paper edward england 
2   0,5 test cricket south australian hill park 
3   0,5 average equipartition theorem energy system law 
4   0,5 hawes kentucky army gen time confederates 
5   0,5 yard national thylacine years wilderness tasmanian 
6   0,5 gunnhild including norway life king time 
7   0,5 zinta role hindi actress film indian 
8   0,5 rings ring dust 2 uranus number 
9   0,5 back london sullivan gilbert thespis 3 

<200> LL/token: -8,54771

Total time: 6 seconds

So, Question 1: what does "Coded LDA: 10 topics, 4 topic bits, 1111 topic mask" in the first line mean? I only know what "10 topics" refers to.

Question 2: what does LL/token in "<10> LL/token: -9,24097 <20> LL/token: -9,1026 <30> LL/token: -8,95386 <40> LL/token: -8,75353" mean? It seems to be a metric for the Gibbs sampling, but shouldn't it be monotonically increasing?

And after that, the following is printed:

elizabeth-9 needham-9 died-7 3-9 1731-6 mother-6 needham-9 english-7 procuress-6 brothel-4 keeper-9 18th-8.......
0   0.008   battle (8) union (7) confederate (6) grant (4) tennessee (4) 
1   0.008   sunderland (6) years (6) echo (5) survived (3) paper (3) 
2   0.040   test (6) cricket (5) hill (4) park (3) career (3) 
3   0.008   average (6) equipartition (6) system (5) theorem (5) law (4) 
4   0.073   hawes (7) kentucky (6) army (5) gen (4) war (4) 
5   0.008   yard (6) national (6) thylacine (5) wilderness (4) tasmanian (4) 
6   0.202   gunnhild (5) norway (4) life (4) including (3) king (3) 
7   0.202   zinta (4) role (3) hindi (3) actress (3) film (3) 
8   0.040   rings (10) ring (3) dust (3) 2 (3) uranus (3) 
9   0.411   london (4) sullivan (3) gilbert (3) thespis (3) back (3) 
0   0.55

The first line in this part is probably the token-topic assignment, right?

Question 3: for the first topic,

0   0.008   battle (8) union (7) confederate (6) grant (4) tennessee (4)   

0.008 is said to be the "topic distribution". Is it the distribution of this topic over the whole corpus? Then there seems to be a conflict: topic 0, as shown above, has its tokens appear in the corpus 8+7+6+4+4+... times, while topic 7 is recognized only 4+3+3+3+3... times in the corpus. So topic 7 should have a lower distribution than topic 0, yet its value (0.202) is higher than topic 0's (0.008). This is what I can't understand.
Furthermore, what is that "0 0.55" at the end?

Thank you very much for reading this long post. I hope you can answer it, and I hope this will be helpful for others interested in Mallet.

Best


Comments (3)

装纯掩盖桑 2024-12-27 13:15:26


I don't think I know enough to give a very complete answer, but here's a shot at some of it... For Q1 you can inspect some code to see how those values are calculated. For Q2, LL is the model's log-likelihood divided by the total number of tokens; this is a measure of how likely the data are given the model, and increasing values mean the model is improving. These are also available in the R packages for topic modeling. As for the token-topic assignment, yes, I think that's right for the first line. For Q3, good question; it's not immediately clear to me. Perhaps the (x) are some kind of index, token frequency seems unlikely... Presumably most of these are diagnostics of some kind.
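As I understand it (this is my reading, not a documented definition), LL/token is the log-likelihood of the corpus under the current sampling state, normalized by the token count:

LL/token = (1/N) log p(w | z)

where N is the total number of tokens (1333 here) and z is the current set of topic assignments. Since Gibbs sampling draws samples from the posterior rather than climbing it deterministically, the value fluctuates from checkpoint to checkpoint instead of increasing monotonically, which is exactly what the trace above shows.

And for Q1, here is a minimal, self-contained Java sketch (my reconstruction, not MALLET source) of how a sampler could derive those two numbers: to pack a topic index into the low-order bits of an int, you need the smallest number of bits that can represent every topic index, and the mask keeps exactly those bits.

// Sketch only: reconstructs the "topic bits" / "topic mask" arithmetic
// that the "Coded LDA" log line appears to report.
public class TopicMaskDemo {
    public static void main(String[] args) {
        int numTopics = 10;

        // Smallest number of bits that can represent topics 0..numTopics-1.
        int topicBits = 32 - Integer.numberOfLeadingZeros(numTopics - 1);
        int topicMask = (1 << topicBits) - 1;

        // For 10 topics: 4 bits, mask = 15 = binary 1111, matching the line
        // "Coded LDA: 10 topics, 4 topic bits, 1111 topic mask".
        System.out.printf("%d topics, %d topic bits, %s topic mask%n",
                numTopics, topicBits, Integer.toBinaryString(topicMask));
    }
}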

A more useful set of diagnostics can be obtained with bin\mallet run cc.mallet.topics.tui.TopicTrainer ...your various options... --diagnostics-file diagnostics.xml which will produce a large number of measures of topic quality. They're definitely worth checking out.

For the full story about all of this, I'd suggest writing an email to David Mimno at Princeton, who is the (main?) maintainer of MALLET, or writing to him via the list at http://blog.gmane.org/gmane.comp.ai.mallet.devel and then posting the answers back here for those of us curious about the inner workings of MALLET...

薄荷港 2024-12-27 13:15:26


What I understand is (a short sketch for reading these values out of the model follows the list):

0   0.008   battle (8) union (7) confederate (6) grant (4) tennessee (4)   
  • 0 is the topic number.
  • 0.008 is the weight of that topic.
  • battle (8) union (7) [...] are the top keywords in that topic. The numbers are the occurrences of each word in the topic.
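Here is a short Java sketch of how those numbers can be read out programmatically, assuming a model trained with ParallelTopicModel as in the developers' guide. Treat the method names (getTopicProbabilities, getSortedWords, getAlphabet) as my best recollection of that class's API rather than something verified against your MALLET version:

import java.util.ArrayList;
import java.util.Iterator;
import java.util.TreeSet;

import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.types.Alphabet;
import cc.mallet.types.IDSorter;

public class TopicReport {
    // Print, for each topic of an already-trained model: its weight in
    // document 0 and its top 5 words with raw counts -- the same shape
    // as the output quoted in the question.
    static void report(ParallelTopicModel model, int numTopics) {
        Alphabet alphabet = model.getAlphabet();

        // P(topic | document 0): the "0.008 / 0.202 / ..." column.
        double[] topicDist = model.getTopicProbabilities(0);

        // Words per topic, sorted by count: the "battle (8) union (7)" part.
        ArrayList<TreeSet<IDSorter>> sortedWords = model.getSortedWords();

        for (int topic = 0; topic < numTopics; topic++) {
            StringBuilder line = new StringBuilder(
                    String.format("%d\t%.3f\t", topic, topicDist[topic]));
            Iterator<IDSorter> it = sortedWords.get(topic).iterator();
            for (int rank = 0; rank < 5 && it.hasNext(); rank++) {
                IDSorter word = it.next();
                line.append(String.format("%s (%.0f) ",
                        alphabet.lookupObject(word.getID()), word.getWeight()));
            }
            System.out.println(line);
        }
    }
}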

Then, as a result, you also obtain a .csv file. I think it contains the most important data from the process. You will find values like the following in each row:

0   0   285 10   page make items thing work put dec browsers recipes expressions 

That is (a small parser sketch for this layout follows the list):

  1. Tree level
  2. Topic ID
  3. Total words
  4. Total documents
  5. Top-10 words
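In case it's useful, here is a hypothetical parser for one such row, assuming the whitespace-separated layout described above (the class and field names are my own invention, not part of MALLET):

import java.util.Arrays;

public class TopicCsvRow {
    final int treeLevel, topicId, totalWords, totalDocs;
    final String[] topWords;

    // Assumed layout: tree level, topic id, word count, doc count, top words.
    TopicCsvRow(String line) {
        String[] f = line.trim().split("\\s+");
        treeLevel  = Integer.parseInt(f[0]);
        topicId    = Integer.parseInt(f[1]);
        totalWords = Integer.parseInt(f[2]);
        totalDocs  = Integer.parseInt(f[3]);
        topWords   = Arrays.copyOfRange(f, 4, f.length);
    }

    public static void main(String[] args) {
        TopicCsvRow row = new TopicCsvRow(
            "0   0   285 10   page make items thing work put dec browsers recipes expressions");
        System.out.println("topic " + row.topicId + ": " + row.totalWords
            + " words in " + row.totalDocs + " docs, top word: " + row.topWords[0]);
    }
}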

A little bit late, but I hope it helps someone.

花桑 2024-12-27 13:15:26


For question 3, I believe the 0.008 (the "topic distribution") relates to the prior \alpha over topic distributions for documents. Mallet optimises this prior, essentially allowing some topics to carry more "weight". Mallet seems to be estimating that topic 0 accounts for a small proportion of your corpus.
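In other words, using the standard smoothed estimate (my formulation, with a separate \alpha_k per topic, since MALLET's optimised prior is asymmetric), the displayed weight of topic k for a document d would be roughly

\hat{\theta}_{d,k} = (n_{d,k} + \alpha_k) / \sum_{k'} (n_{d,k'} + \alpha_{k'})

where n_{d,k} is the number of tokens in d currently assigned to topic k. With an optimised asymmetric prior, a topic with a large \alpha_k can carry a large weight even when its top-word counts look modest.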

The token counts shown represent only the words with the highest counts. The remaining counts for topic 0 could, for example, be 0, while the remaining counts for topic 9 could be 3. Thus topic 9 can account for many more words in your corpus than topic 0, even though the counts for its top words are lower.

I'd have to check the code for the "0 0.55" at the end, but that's probably the optimised \beta value (which I'm pretty sure isn't done asymmetrically).
