Compound-word tokenization not working

Posted 2025-01-25 17:02:15


I'm trying to create a dataframe containing specific keywords-in-context using the kwic() function, but unfortunately, I'm running into some errors when attempting to tokenize the underlying dataset.

This is the subset of the dataset I'm using as a reproducible example:

test_cluster <- speeches_subset %>%
  filter(grepl('Schwester Agnes',
                speechContent,
                ignore.case = TRUE))

test_corpus <- corpus(test_cluster,
                      docid_field = "id",
                      text_field = "speechContent")

Here, test_cluster contains six observations of 12 variables, that is, six rows in which the column speechContent contains the compound word "Schwester Agnes". test_corpus transforms the underlying data into a quanteda corpus object.

When I then run the following code, I would expect, first, the content of the speechContent variable to be tokenized, and due to tokens_compound, the compound word "Schwester Agnes" to be tokenized as such. In a second step, I would expect the kwic() function to return a dataframe consisting of six rows, with the keyword variable including the compound word "Schwester Agnes". Instead, however, kwic() returns an empty dataframe containing 0 observations of 7 variables. I think this is because of some mistake I'm making with tokens_compound(), but I'm not sure... Any help would be greatly appreciated!

test_tokens <- tokens(test_corpus, 
                      remove_punct = TRUE,
                      remove_numbers = TRUE) %>%
  tokens_compound(pattern = phrase("Schwester Agnes"))

test_kwic <- kwic(test_tokens,
                  pattern = "Schwester Agnes",
                  window = 5)

EDIT: I realize that the examples above are not easily reproducible, so please refer to the reprex below:

speech = c("This is the first speech. Many words are in this speech, but only few are relevant for my research question. One relevant word, for example, is the word stack overflow. However there are so many more words that I am not interested in assessing the sentiment of", "This is a second speech, much shorter than the first one. It still includes the word of interest, but at the very end. stack overflow.", "this is the third speech, and this speech does not include the word of interest so I'm not interested in assessing this speech.")

data <- data.frame(id=1:3, 
                   speechContent = speech)

test_corpus <- corpus(data,
                      docid_field = "id",
                      text_field = "speechContent")

test_tokens <- tokens(test_corpus, 
                      remove_punct = TRUE,
                      remove_numbers = TRUE) %>%
  tokens_compound(pattern = c("stack", "overflow"))

test_kwic <- kwic(test_tokens,
                  pattern = "stack overflow",
                  window = 5)


Answered by 两人的回忆, 2025-02-01 17:02:15

You need to apply phrase("stack overflow") and set concatenator = " " in tokens_compound().

require(quanteda)
#> Package version: 3.2.1
#> Unicode version: 13.0
#> ICU version: 69.1

speech <- c("This is the first speech. Many words are in this speech, but only few are relevant for my research question. One relevant word, for example, is the word stack overflow. However there are so many more words that I am not interested in assessing the sentiment of", 
           "This is a second speech, much shorter than the first one. It still includes the word of interest, but at the very end. stack overflow.", 
           "this is the third speech, and this speech does not include the word of interest so I'm not interested in assessing this speech.")

data <- data.frame(id = 1:3, 
                   speechContent = speech)

test_corpus <- corpus(data,
                      docid_field = "id",
                      text_field = "speechContent")

test_tokens <- tokens(test_corpus, 
                      remove_punct = TRUE,
                      remove_numbers = TRUE) %>%
  tokens_compound(pattern = phrase("stack overflow"), concatenator = " ")

test_kwic <- kwic(test_tokens,
                  pattern = "stack overflow",
                  window = 5)
test_kwic
#> Keyword-in-context with 2 matches.                                                                             
#>  [1, 29] for example is the word | stack overflow | However there are so many
#>  [2, 24]     but at the very end | stack overflow |

Created on 2022-05-06 by the reprex package (v2.0.1)
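As a side note on the design choice above: setting concatenator = " " makes the compounded token contain a literal space, which is what lets the plain pattern "stack overflow" match it. If you instead leave tokens_compound() at its default concatenator "_", the compound becomes the single token "stack_overflow", and kwic() must then be given the underscore form. A minimal sketch of that alternative (same quanteda functions as the answer, one abbreviated example sentence):

```r
require(quanteda)

# Compound with the default concatenator "_" instead of " "
toks <- tokens("One relevant word, for example, is the word stack overflow.",
               remove_punct = TRUE) %>%
  tokens_compound(pattern = phrase("stack overflow"))

# The merged token is now "stack_overflow", so match it with the underscore
kwic(toks, pattern = "stack_overflow", window = 5)
```

Either convention works; the key point in both versions is that the kwic() pattern must match the token as it exists *after* compounding, not as it appeared in the raw text.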
