Tokenization of compound words not working
I'm trying to create a dataframe containing specific keywords-in-context using the kwic() function, but unfortunately, I'm running into an error when attempting to tokenize the underlying dataset.
This is the subset of the dataset I'm using as a reproducible example:
test_cluster <- speeches_subset %>%
filter(grepl('Schwester Agnes',
speechContent,
ignore.case = TRUE))
test_corpus <- corpus(test_cluster,
docid_field = "id",
text_field = "speechContent")
Here, test_cluster contains six observations of 12 variables, that is, six rows in which the column speechContent contains the compound word "Schwester Agnes". test_corpus transforms the underlying data into a quanteda corpus object.
When I then run the following code, I would expect, first, the content of the speechContent variable to be tokenized, and due to tokens_compound, the compound word "Schwester Agnes" to be tokenized as such. In a second step, I would expect the kwic() function to return a dataframe consisting of six rows, with the keyword variable including the compound word "Schwester Agnes". Instead, however, kwic() returns an empty dataframe containing 0 observations of 7 variables. I think this is because of some mistake I'm making with tokens_compound(), but I'm not sure... Any help would be greatly appreciated!
test_tokens <- tokens(test_corpus,
remove_punct = TRUE,
remove_numbers = TRUE) %>%
tokens_compound(pattern = phrase("Schwester Agnes"))
test_kwic <- kwic(test_tokens,
pattern = "Schwester Agnes",
window = 5)
EDIT: I realize that the examples above are not easily reproducible, so please refer to the reprex below:
speech = c("This is the first speech. Many words are in this speech, but only few are relevant for my research question. One relevant word, for example, is the word stack overflow. However there are so many more words that I am not interested in assessing the sentiment of", "This is a second speech, much shorter than the first one. It still includes the word of interest, but at the very end. stack overflow.", "this is the third speech, and this speech does not include the word of interest so I'm not interested in assessing this speech.")
data <- data.frame(id=1:3,
speechContent = speech)
test_corpus <- corpus(data,
docid_field = "id",
text_field = "speechContent")
test_tokens <- tokens(test_corpus,
remove_punct = TRUE,
remove_numbers = TRUE) %>%
tokens_compound(pattern = c("stack", "overflow"))
test_kwic <- kwic(test_tokens,
pattern = "stack overflow",
window = 5)
You need to apply phrase("stack overflow") and set concatenator = " " in tokens_compound().

Created on 2022-05-06 by the reprex package (v2.0.1)
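Applied to the reprex from the question, the fix might look like the sketch below. Only two things change in tokens_compound(): the pattern is wrapped in phrase() so "stack overflow" is treated as a two-token sequence, and concatenator = " " joins the compounded token with a space (the default is "_", which is why kwic(pattern = "stack overflow") found nothing). This assumes quanteda is loaded, which also provides the %>% pipe.

```r
library(quanteda)

data <- data.frame(id = 1:3,
                   speechContent = c(
                     "One relevant word, for example, is the word stack overflow.",
                     "It still includes the word of interest, but at the very end. stack overflow.",
                     "this speech does not include the word of interest."))

test_corpus <- corpus(data,
                      docid_field = "id",
                      text_field = "speechContent")

test_tokens <- tokens(test_corpus,
                      remove_punct = TRUE,
                      remove_numbers = TRUE) %>%
  tokens_compound(pattern = phrase("stack overflow"),  # match the two-token sequence
                  concatenator = " ")                  # join as "stack overflow", not "stack_overflow"

test_kwic <- kwic(test_tokens,
                  pattern = "stack overflow",
                  window = 5)
# test_kwic now has one row per document containing the phrase, instead of 0 rows
```

With the default concatenator, the compounded token would be "stack_overflow", so the kwic() pattern would have to be "stack_overflow" instead; setting concatenator = " " lets the original spelling be used as the pattern.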