Quanteda dfm_weight() results in relative frequencies > 1
I'm using quanteda and trying to compute the relative frequencies of specific words in a corpus that is organized by date and party. However, after converting the corpus to a dfm and calling dfm_weight(dfmat, scheme = "prop") followed by textstat_frequency(), I get scores greater than 1.
Here is my code (I also stem and clean my tokens, but that step is not shown here):
library(quanteda)
library(dplyr)  # provides %>% and arrange()
corp <- corpus(title_df, text_field = "text", meta = list(title_df[,-4]))
toks <- tokens(corp)
dfmat <- dfm(toks, verbose=TRUE)
dfm_rel_freq <- dfm_weight(dfmat, scheme = "prop")
rel_freq_all <- quanteda.textstats::textstat_frequency(dfm_rel_freq, groups = year)
# sort ascending by frequency and show the six largest values:
rel_freq_all %>% arrange(frequency) %>% tail()
        feature frequency  rank docfreq group
          <chr>     <dbl> <dbl>   <dbl> <chr>
81093   pension  5.802529     1     117  2004
40971   pension  6.117154     1      97  1998
148372    peopl  6.430454     1     220  2014
65747   pension  6.721089     1     138  2002
53303   pension  7.871011     1     153  2000
74391   pension  8.153381     1     156  2003
6 rows
Comments (1)
This is the expected behaviour:
quanteda.textstats::textstat_frequency(x, groups = year)
will sum the dfm within the year groups. So your proportions from the dfm are being summed, and these sums can exceed 1.0. If you wanted a different operation on the groups, for instance the mean, then you should not use the groups argument, and should instead aggregate the result yourself with some dplyr operations.
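The code that originally followed this answer is not available here, so what follows is only a rough sketch of the idea, not the answerer's own solution. It assumes a small made-up data frame (toy_df, with hypothetical texts and years) in place of title_df, and it shows two possible ways to get values that stay at or below 1: dividing the summed proportions by the number of documents in each year to get a mean, or pooling the counts by year with dfm_group() before weighting. The names docs_per_year, mean_frequency and pooled are illustrative.

```r
library(quanteda)
library(quanteda.textstats)
library(dplyr)

# Hypothetical toy data standing in for the asker's title_df
toy_df <- data.frame(
  text = c("pension reform pension", "pension tax", "tax reform people"),
  year = c(2000, 2000, 2001),
  stringsAsFactors = FALSE
)
dfmat <- dfm(tokens(corpus(toy_df, text_field = "text")))

# Per-document proportions: every row of this dfm sums to 1
dfm_prop <- dfm_weight(dfmat, scheme = "prop")

# With groups, the per-document proportions are summed within each year,
# e.g. "pension" in 2000 is 2/3 + 1/2, which is above 1
summed <- as.data.frame(textstat_frequency(dfm_prop, groups = year))
summed$group <- as.character(summed$group)

# Option 1: mean proportion per year = summed proportion / documents in that year
docs_per_year <- data.frame(group = as.character(docvars(dfm_prop, "year"))) %>%
  count(group, name = "n_docs")
mean_prop <- summed %>%
  left_join(docs_per_year, by = "group") %>%
  mutate(mean_frequency = frequency / n_docs)

# Option 2: pool the counts by year first, then weight, so each year sums to 1
pooled <- convert(
  dfm_weight(dfm_group(dfmat, groups = year), scheme = "prop"),
  to = "data.frame"
)
```

The two options answer different questions: the mean treats every document equally regardless of its length, while pooling first gives the share of all tokens in a year, so longer documents count for more. Which one is appropriate depends on what "relative frequency" should mean for the analysis.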