Creating more relevant results from LDA topic modeling?

Posted on 2025-01-10 11:38:59

I am doing a project for my degree and I have an actual client from another college. They want me to do all this topic modeling work on a SQL file of paper abstracts he's given me. I have zero experience with topic modeling, but I've been using Gensim and NLTK in a Jupyter notebook for this.

What he wants right now is for me to generate 10 or more topics and record the overall top 10 most common words from the LDA results. Then, if a word is very frequent in every topic, remove it from the resulting word clouds entirely; if its frequency varies across topics, remove it only from the topics where it is infrequent and keep it in the more relevant topics.

He also wants me to compare the frequency of each topic across the SQL files from other years. And he wants the topics to have names generated intelligently by the computer.

I have topic models per year and overall, but of course they do not come out exactly the same in each year. My biggest concern is the first thing he wants, the removal process. Is any of this possible? I need help figuring out where to look, as Google is not giving me what I want; I am probably searching for the wrong terms.

Thank you!

Comments (1)

甜是你 2025-01-17 11:38:59

Show some of the code you use so we can give you more useful tips. Also add the nlp tag: the tags you used are rather specific and not followed by many people, so relevant users might have a hard time finding your question.

By the whole word-removal thing, do you mean stop words too? Or have you already removed those? Stop words are very common words ("the", "it", "me", etc.) that often rank high in most-frequent-word lists but carry no real meaning for finding topics.

First, remove the stop words to make the most-common-words list more useful.
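A minimal sketch of that step with NLTK, assuming `abstracts` is a Python list of the abstract strings you have already pulled out of the SQL file:

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")  # one-time downloads of stop word and tokenizer data
nltk.download("punkt")

stop_words = set(stopwords.words("english"))

def preprocess(text):
    """Lowercase, tokenize, and drop stop words and non-alphabetic tokens."""
    tokens = nltk.word_tokenize(text.lower())
    return [t for t in tokens if t.isalpha() and t not in stop_words]

# `abstracts` is an assumed variable: your list of abstract strings
docs = [preprocess(a) for a in abstracts]
```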

Then, as he requested, look at which of the (more common) words are frequent in ALL the topics (with abstracts I can imagine these are words like "hypothesis", "research", "paper", "results", etc., i.e. words that are abstract-specific but useless for telling the topics of different abstracts apart) and remove those. For this kind of analysis, as well as for the initial LDA, it probably makes sense to use all the data from all years, so the model has a large amount of data in which to recognize patterns. But you should try both variations and see whether the per-year or the overall version gets you nicer results.
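Here is one way that could look with Gensim, assuming `lda` is your trained `gensim.models.LdaModel`; the 0.8 threshold for "appears in almost all topics" is an arbitrary choice you would need to tune:

```python
from collections import Counter

topn = 10  # how deep into each topic's word list to look
topics = lda.show_topics(num_topics=-1, num_words=topn, formatted=False)

# count in how many topics each word ranks among the top `topn`
word_topic_counts = Counter()
for _, words in topics:
    for word, _prob in words:
        word_topic_counts[word] += 1

# words that rank highly in (nearly) every topic: remove them everywhere
globally_common = {w for w, c in word_topic_counts.items()
                   if c >= 0.8 * lda.num_topics}

# the remaining ("variant") words are kept only in the topics where they
# rank highly, since they only show up in those topics' top-`topn` lists
per_topic_words = {
    topic_id: [w for w, _ in words if w not in globally_common]
    for topic_id, words in topics
}
```

The filtered `per_topic_words` can then be turned into word-to-weight dicts and fed to `generate_from_frequencies` from the wordcloud package to build the cleaned word clouds.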

After you have your global word lists per topic, go back to the original data (split up by year) and count how often the combined words from each topic occur per year. Viewed over the years, you can probably spot trends, e.g. topics that are popular now or in the last few years but weren't relevant if you go back far enough.
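A sketch of that counting step, reusing `per_topic_words` from above and assuming `docs_by_year` is a dict mapping each year to its tokenized abstracts:

```python
from collections import Counter

topic_freq_by_year = {}
for year, year_docs in docs_by_year.items():
    counts = Counter(token for doc in year_docs for token in doc)
    total = sum(counts.values())
    # relative frequencies, so years with more abstracts stay comparable
    topic_freq_by_year[year] = {
        topic_id: sum(counts[w] for w in words) / total
        for topic_id, words in per_topic_words.items()
    }
```

Plotting each topic's series over the years should make such trends visible.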

The last thing you mentioned (automatically assigning labels to topics) is actually something quite tricky, depending on how you go about it.

The "easy" way would be e.g. just use the most frequent word in each topic as label but the results will probably be underwhelming.

A more advanced approach is dedicated topic labeling. Or you can try an approach like modified text summarization using more powerful models.
