Matching bigram frequencies to bigram tokens across multiple columns

Posted 2025-01-24 05:54:53 · 1300 characters · 3 views · 0 comments

I have two dataframes, one is a frequency list with bigram frequencies:

F_bigrams <- structure(list(
  word_tag = c("it_PNP 's_VBZ", "do_VDB n't_XX0", "that_DT0 's_VBZ",
               "you_PNP know_VVB", "i_PNP 'm_VBB", "i_PNP do_VDB",
               "in_PRP the_AT0", "i_PNP 've_VHB", "'ve_VHB got_VVN",
               "i_PNP mean_VVB"),
  Freq_bigr = c(31831L, 26273L, 21691L, 14157L, 14010L, 12904L,
                10994L, 10543L, 10089L, 9856L)),
  row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"))

The other contains bigram tokens:

df <- data.frame(
  bigr_1_2 = c("i_PNP 'm_VBB", NA, NA, NA),
  bigr_2_3 = c("it_PNP 's_VBZ", "'ve_VHB got_VVN", NA, NA),
  bigr_3_4 = c("you_PNP know_VVB", "it_PNP 's_VBZ", "'ve_VHB got_VVN", NA)
)

I want to match the frequencies from the frequency list F_bigrams to each bigram token in df. This works without problems on df, which is a tiny snippet of the actual data, using this base R method:

df[, paste0("f_bigr_", 1:3, "_", 2:4)] <- sapply(df[, 1:3], function(x) F_bigrams$Freq_bigr[match(x, F_bigrams$word_tag)])

However, in the actual data, which has far more columns and half a million rows, I consistently get the number 2 where there should be NA. Why is that? And, more importantly, is there an alternative way to match the frequencies to their respective bigram tokens?


Comments (1)

贱贱哒 2025-01-31 05:54:53
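library(dplyr)   # %>%, left_join()
library(tidyr)   # pivot_longer(), pivot_wider()
library(tibble)  # rowid_to_column()

df %>%
  rowid_to_column() %>%
  # long format: one row per non-NA token; the source column name goes to `name`
  pivot_longer(-rowid, values_to = 'word_tag', values_drop_na = TRUE) %>%
  left_join(F_bigrams, by = 'word_tag') %>%
  pivot_wider(id_cols = rowid, values_from = c(word_tag, Freq_bigr))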

  rowid word_tag_bigr_1_2 word_tag_bigr_2_3 word_tag_bigr_3_4 Freq_bigr_bigr_1_2 Freq_bigr_bigr_2_3 Freq_bigr_bigr_3_4
  <int> <chr>             <chr>             <chr>                          <int>              <int>              <int>
1     1 i_PNP 'm_VBB      it_PNP 's_VBZ     you_PNP know_VVB               14010              31831              14157
2     2 NA                've_VHB got_VVN   it_PNP 's_VBZ                     NA              10089              31831
3     3 NA                NA                've_VHB got_VVN                   NA                 NA              10089
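
The pipeline above does not say why the base R version returns 2 where NA is expected. One possible cause, which nothing in the question confirms, is that the full frequency list contains a row whose word_tag is NA with a count of 2: match() treats NA as a value that can be matched, so every NA token would pick up that row's frequency. The sketch below reproduces this with an abridged lookup table and a hypothetical NA row (the NA row and the helper name `lookup` are assumptions for illustration), then avoids it with the `incomparables` argument of match().

```r
# Abridged lookup table from the question, plus one HYPOTHETICAL NA row.
F_bigrams <- data.frame(
  word_tag  = c("it_PNP 's_VBZ", "you_PNP know_VVB", "i_PNP 'm_VBB",
                "'ve_VHB got_VVN", NA),
  Freq_bigr = c(31831L, 14157L, 14010L, 10089L, 2L),  # NA row with count 2 is an assumption
  stringsAsFactors = FALSE
)

df <- data.frame(
  bigr_1_2 = c("i_PNP 'm_VBB", NA, NA, NA),
  bigr_2_3 = c("it_PNP 's_VBZ", "'ve_VHB got_VVN", NA, NA),
  bigr_3_4 = c("you_PNP know_VVB", "it_PNP 's_VBZ", "'ve_VHB got_VVN", NA),
  stringsAsFactors = FALSE
)

# match() matches NA against NA, so NA tokens inherit the NA row's frequency.
naive <- F_bigrams$Freq_bigr[match(df$bigr_1_2, F_bigrams$word_tag)]
print(naive)

# incomparables = NA tells match() that NA tokens must never match anything.
lookup <- function(x) {
  F_bigrams$Freq_bigr[match(x, F_bigrams$word_tag, incomparables = NA)]
}

token_cols <- c("bigr_1_2", "bigr_2_3", "bigr_3_4")
freq_cols  <- paste0("f_", token_cols)   # f_bigr_1_2, f_bigr_2_3, f_bigr_3_4

# lapply() keeps one vector per column, so nothing is simplified to a matrix.
df[freq_cols] <- lapply(df[token_cols], lookup)
print(df)
```

If this is the cause, running `F_bigrams[is.na(F_bigrams$word_tag), ]` on the real frequency list should show the offending row. The tidyverse pipeline above would not hit the problem either way, since values_drop_na = TRUE removes NA tokens before the join.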