Match bigram frequencies to bigram tokens across multiple columns
I have two dataframes, one is a frequency list with bigram frequencies:
F_bigrams <- structure(list(word_tag = c("it_PNP 's_VBZ", "do_VDB n't_XX0",
"that_DT0 's_VBZ", "you_PNP know_VVB", "i_PNP 'm_VBB", "i_PNP do_VDB",
"in_PRP the_AT0", "i_PNP 've_VHB", "'ve_VHB got_VVN", "i_PNP mean_VVB"
), Freq_bigr = c(31831L, 26273L, 21691L, 14157L, 14010L, 12904L,
10994L, 10543L, 10089L, 9856L)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"))
The other contains bigram tokens:
df <- data.frame(
bigr_1_2 = c("i_PNP 'm_VBB", NA, NA, NA),
bigr_2_3 = c("it_PNP 's_VBZ", "'ve_VHB got_VVN", NA, NA),
bigr_3_4 = c("you_PNP know_VVB", "it_PNP 's_VBZ", "'ve_VHB got_VVN", NA)
)
I want to match the frequencies from the frequency list F_bigrams to each bigram token in df. I can do this without problems in df, which is a tiny snippet of the actual data, with this base R method:
df[, paste0("f_bigr_", 1:3, "_", 2:4)] <- sapply(df[, 1:3], function(x) F_bigrams$Freq_bigr[match(x, F_bigrams$word_tag)])
However, in the actual data, which has far more columns and half a million rows, I consistently get the number 2 where there should be NA. Why is that? And, more importantly, is there an alternative way to match the frequencies to their respective bigram tokens?
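One possible explanation, offered only as an assumption because the full data is not shown: match() treats NA as matching NA, so if the real frequency list happens to contain a row whose word_tag is NA with a frequency of 2, every NA token would pick up that 2. The sketch below reproduces that symptom on toy data (the NA row is hypothetical, not taken from the question) and shows one way to guard against it with the incomparables argument of match().

```r
# Toy data modelled on the question. The NA row in the frequency list is a
# hypothetical addition used to reproduce the symptom.
F_bigrams <- data.frame(
  word_tag  = c("it_PNP 's_VBZ", "'ve_VHB got_VVN", NA),
  Freq_bigr = c(31831L, 10089L, 2L),
  stringsAsFactors = FALSE
)
df <- data.frame(
  bigr_1_2 = c("it_PNP 's_VBZ", NA),
  bigr_2_3 = c("'ve_VHB got_VVN", NA),
  stringsAsFactors = FALSE
)

# Plain match(): an NA token matches the NA entry in the table,
# so its frequency (2) is returned instead of NA.
naive <- F_bigrams$Freq_bigr[match(df$bigr_1_2, F_bigrams$word_tag)]

# incomparables = NA tells match() never to match NA values,
# so NA tokens fall through to the default nomatch value (NA).
lookup <- function(x) {
  F_bigrams$Freq_bigr[match(x, F_bigrams$word_tag, incomparables = NA)]
}

# Select the token columns by name pattern rather than by position,
# then add one frequency column per token column.
bigr_cols <- grep("^bigr_", names(df), value = TRUE)
df[paste0("f_", bigr_cols)] <- lapply(df[bigr_cols], lookup)
```

If this assumption holds for the real data, dropping the NA row from F_bigrams before matching would have the same effect. Using lapply() rather than sapply() keeps the result a list of columns whatever the number of rows.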