kwic() function returns fewer rows than it should

Posted on 2025-01-26 16:12:20

I'm currently trying to perform a sentiment analysis on a kwic object, but I'm afraid that the kwic() function does not return all the rows it should. I'm not quite sure what exactly the issue is, which makes it hard to post a reproducible example, so I hope that a detailed explanation of what I'm trying to do will suffice.

I subsetted the original dataset containing speeches I want to analyze to a new data frame that only includes speeches mentioning certain keywords. I used the following code to create this subset:

ostalgie_cluster <- full_data %>%
  filter(grepl('Schwester Agnes|Intershop|Interflug|Trabant|Trabi|Ostalgie',
                speechContent,
                ignore.case = TRUE))

The resulting data frame consists of 201 observations. When I perform kwic() on the same initial dataset using the following code, however, it returns a data frame with only 82 observations. Does anyone know what might cause this? Again, I'm sorry I can't provide a reproducible example, but when I try to create a reprex from scratch it just.. works...

#create quanteda corpus object
qtd_speeches_corp <- corpus(full_data,
                            docid_field = "id",
                            text_field = "speechContent")

#tokenize speeches
qtd_tokens <- tokens(qtd_speeches_corp, 
                     remove_punct = TRUE,
                     remove_numbers = TRUE,
                     remove_symbols = TRUE,
                     padding = FALSE) %>%
  tokens_remove(stopwords("de"), padding = FALSE) %>%
  tokens_compound(pattern = phrase(c("Schwester Agnes")), concatenator = " ")

ostalgie_words <- c("Schwester Agnes", "Intershop", "Interflug", "Trabant", "Trabi", "Ostalgie")

test_kwic <- kwic(qtd_tokens,
                  pattern = ostalgie_words,
                  window = 5)

Comments (1)

瞳孔里扚悲伤 2025-02-02 16:12:20

It's something of a guess without having a reproducible example (your input full_data, namely) but here's my best guess. Your kwic() call is using the default "glob" pattern matching, and what you want is a regular expression match instead.

Fix it this way:

kwic(qtd_tokens, pattern = ostalgie_words, valuetype = "regex",
     window = 5)
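
To make the difference concrete, here is a minimal sketch on made-up toy texts (not your full_data; the object names toy, toks, words, kw_glob and kw_regex are just for illustration). With the default "glob" matching a pattern has to match a whole token, so "Trabant" does not hit "Trabanten", whereas grepl() and valuetype = "regex" also match inside longer tokens:

```r
library(quanteda)

# Toy speeches, invented for illustration only
toy <- c(d1 = "Der Trabant steht vor dem Intershop.",
         d2 = "Viele Trabanten und Trabis fuhren damals.",
         d3 = "Die Ostalgiewelle rollt.")

toks  <- tokens(toy, remove_punct = TRUE)
words <- c("Intershop", "Trabant", "Trabi", "Ostalgie")

# Default valuetype = "glob": a pattern must match the entire token
kw_glob <- kwic(toks, pattern = words, window = 2)

# valuetype = "regex": a pattern may match anywhere inside a token,
# which is closer to what grepl() does on the raw text
kw_regex <- kwic(toks, pattern = words, valuetype = "regex", window = 2)

nrow(kw_glob)   # matches found with glob matching
nrow(kw_regex)  # matches found with regex matching

# kwic() returns one row per match, not one row per document, so for a
# fair comparison with filter(grepl(...)) count the distinct documents
length(unique(kw_regex$docname))
sum(grepl(paste(words, collapse = "|"), toy, ignore.case = TRUE))
```

Keep in mind that the row counts of the two approaches are not directly comparable anyway: the filter(grepl()) subset has one row per speech, while kwic() has one row per keyword occurrence, so comparing the number of distinct docname values is probably the more meaningful check.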