kwic() function returns fewer rows than it should
I'm currently trying to perform a sentiment analysis on a kwic object, but I'm afraid that the kwic() function does not return all the rows it should. I'm not quite sure what exactly the issue is, which makes it hard to post a reproducible example, so I hope that a detailed explanation of what I'm trying to do will suffice.
I subsetted the original dataset containing speeches I want to analyze to a new data frame that only includes speeches mentioning certain keywords. I used the following code to create this subset:
ostalgie_cluster <- full_data %>%
  filter(grepl('Schwester Agnes|Intershop|Interflug|Trabant|Trabi|Ostalgie',
               speechContent,
               ignore.case = TRUE))
The resulting data frame consists of 201 observations. When I perform kwic() on the same initial dataset using the following code, however, it returns a data frame with only 82 observations. Does anyone know what might cause this? Again, I'm sorry I can't provide a reproducible example, but when I try to create a reprex from scratch it just... works.
# create quanteda corpus object
qtd_speeches_corp <- corpus(full_data,
                            docid_field = "id",
                            text_field = "speechContent")

# tokenize speeches
qtd_tokens <- tokens(qtd_speeches_corp,
                     remove_punct = TRUE,
                     remove_numbers = TRUE,
                     remove_symbols = TRUE,
                     padding = FALSE) %>%
  tokens_remove(stopwords("de"), padding = FALSE) %>%
  tokens_compound(pattern = phrase(c("Schwester Agnes")), concatenator = " ")

ostalgie_words <- c("Schwester Agnes", "Intershop", "Interflug", "Trabant", "Trabi", "Ostalgie")

test_kwic <- kwic(qtd_tokens,
                  pattern = ostalgie_words,
                  window = 5)
Comments (1)
It's something of a guess without having a reproducible example (your input full_data, namely), but here's my best guess. Your kwic() call is using the default "glob" pattern matching, and what you want is a regular expression match instead. Fix it this way: