quanteda dfm_weight() results in relative frequencies > 1

Posted on 2025-01-20 08:00:45

I'm using quanteda and trying to compute the relative frequencies of specific words in a corpus that is organized by date and party. However, after converting the corpus to a dfm and applying dfm_weight(dfmat, scheme = "prop") followed by textstat_frequency(), I get scores greater than 1.

Here is my code (I also stem and clean my tokens, but that step is not shown here):

library(quanteda)
library(dplyr)  # provides %>% and arrange()

corp <- corpus(title_df, text_field = "text", meta = list(title_df[,-4]))
toks <- tokens(corp)
dfmat <- dfm(toks, verbose = TRUE)
dfm_rel_freq <- dfm_weight(dfmat, scheme = "prop")
rel_freq_all <- quanteda.textstats::textstat_frequency(dfm_rel_freq, groups = year)
# arrange by max frequency:
rel_freq_all %>% arrange(frequency) %>% tail()

 
 
        feature  frequency  rank   docfreq  group
        <chr>    <dbl>      <dbl>  <dbl>    <chr>
81093   pension  5.802529   1      117      2004
40971   pension  6.117154   1      97       1998
148372  peopl    6.430454   1      220      2014
65747   pension  6.721089   1      138      2002
53303   pension  7.871011   1      153      2000
74391   pension  8.153381   1      156      2003
6 rows

情话墙 replied on 2025-01-27 08:00:45:

This is the expected behaviour: quanteda.textstats::textstat_frequency(x, groups = year) sums the dfm within the year groups. Your per-document proportions from the dfm are therefore being summed across all documents in a year, and these sums can exceed 1.0.
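The arithmetic behind this can be checked without quanteda. The following is a minimal base R sketch with made-up counts and a hypothetical `year` vector; it only mimics what row-wise proportions followed by a grouped sum would do, and is not the library's own implementation.

```r
# Toy counts: 3 documents (rows) x 2 features (columns); values are made up.
counts <- matrix(c(2, 2,
                   3, 1,
                   1, 3),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2", "doc3"), c("pension", "tax")))
year <- c("2003", "2003", "2004")  # hypothetical grouping variable

# Per-document proportions: every row sums to 1, as with scheme = "prop".
props <- counts / rowSums(counts)

# Sum the proportions within each year. A group's row total equals the
# number of documents in that group, so single cells can exceed 1.
summed <- rowsum(props, group = year)
print(summed)

# Dividing by the number of documents per year gives the mean proportion.
n_docs <- as.vector(table(year)[rownames(summed)])
mean_props <- summed / n_docs
print(mean_props)
```

With two documents in 2003, the summed "pension" value is 0.5 + 0.75 = 1.25, while the mean stays within [0, 1].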

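If you want a different operation on the groups, for instance the mean, then you should not use the groups argument. Instead, work from the dfm with some dplyr operations such as

library(dplyr)
# textstat_frequency() output has no year column, so start from the dfm itself
quanteda::convert(dfm_rel_freq, to = "data.frame") %>%
    mutate(year = quanteda::docvars(dfm_rel_freq, "year")) %>%
    group_by(year) %>%
    # mean relative frequency of every feature within each year
    summarize(across(-doc_id, mean))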
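Another way to keep every value at or below 1 is to pool the counts per year before weighting, so that each year is treated as one long document. The sketch below assumes quanteda version 3 or later and uses a made-up `toy_df` in place of the asker's `title_df`. Note that it yields the pooled relative frequency per year, which is generally not the same number as the mean of the per-document proportions.

```r
library(quanteda)
library(quanteda.textstats)

# Hypothetical toy data standing in for title_df.
toy_df <- data.frame(
  text = c("pension pension tax", "pension reform pension", "tax tax people"),
  year = c(2003, 2003, 2004)
)
corp <- corpus(toy_df, text_field = "text")
dfmat <- dfm(tokens(corp))

# Pool the counts of all documents in a year first ...
dfmat_year <- dfm_group(dfmat, groups = year)
# ... then convert to proportions, so each year's row sums to 1.
dfm_rel_year <- dfm_weight(dfmat_year, scheme = "prop")

# Each group now holds a single pooled document, so nothing is summed past 1.
rel_freq_year <- textstat_frequency(dfm_rel_year, groups = year)
print(rel_freq_year)
```

Which variant is appropriate depends on whether long and short documents should count equally (mean of proportions) or in proportion to their length (pooled counts).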

About the author: 末骤雨初歇
