Removing stop words from text in R
I have a problem removing stop words from text data. The data set was web scraped, contains customer reviews, and looks like this:
data$Review <- c("Won't let me use my camera", "Does not load","I'ts truly mind blowing!")
I did the data manipulation below and created a new variable in the data frame, so the reviews now look like this:
Manipulation Part:
data$Proc_Review <- gsub("'", "", data$Review) # Remove apostrophes
data$Proc_Review <- gsub('[[:punct:] ]+', ' ', data$Proc_Review) # Replace runs of punctuation and spaces with one space
data$Proc_Review <- gsub('[[:digit:]]+', '', data$Proc_Review) # Remove numbers
data$Proc_Review <- as.character(data$Proc_Review) # Ensure the column is character
"wont let me use my camera", "does not load", "its truly mind blowing"
The next step is to remove stop words, for which I use the code below:
library(dplyr)
library(tidytext)
data("stop_words")
for (j in 1:nrow(data)) {
  description <- anti_join(
    data[j, ] %>% unnest_tokens(word, Proc_Review, drop = FALSE, to_lower = FALSE),
    stop_words
  )
  data[j, "Proc_Review"] <- paste(description, collapse = " ")
}
After that, the output is:
c(1, 1) c(17304, 17304) c(\"Won't let me use my camera\", \"Won't let me use my camera\") c(1, 1) c(1, 1) c(32, 32) c(4, 4) c(\"wont let me use my camera\", \"wont let me use my camera\") c(\"wont\", \"camera\")"
I have tried some other approaches, but the result was not what I wanted: they removed stop words from some reviews but not from all of them. For example, "it's" was removed in some reviews but remained in others.
What I want is for the reviews to appear in a new column in the data set without the stop words.
Thank you so much in advance!
Comments (1)
There is no need to use a for loop. Additionally, there is a bug in your data processing: in steps 2 and 3 you use the original vector, so all the processing you did in the previous steps gets overwritten.
Created on 2022-06-04 by the reprex package (v2.0.1)
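A loop-free version could look roughly like the sketch below. It is not the answerer's original reprex. It assumes the dplyr and tidytext packages are installed, rebuilds the three example reviews from the question, and introduces two hypothetical names for illustration: an `id` column used as a row key and a `Clean_Review` column for the result. Lowercasing and trimming are added here as an assumption, because the entries in `stop_words` are lowercase.

```r
library(dplyr)
library(tidytext)

data("stop_words")  # tidytext's stop word table (columns: word, lexicon)

# Example reviews copied from the question.
data <- data.frame(
  Review = c("Won't let me use my camera", "Does not load", "I'ts truly mind blowing!"),
  stringsAsFactors = FALSE
)

# Chain the cleaning steps so each one works on the previous result.
data$Proc_Review <- gsub("'", "", data$Review)                    # drop apostrophes
data$Proc_Review <- gsub("[[:punct:] ]+", " ", data$Proc_Review)  # punctuation/space runs -> one space
data$Proc_Review <- gsub("[[:digit:]]+", "", data$Proc_Review)    # drop digits
data$Proc_Review <- trimws(tolower(data$Proc_Review))             # assumption: lowercase to match stop_words

# Hypothetical row key, used to put the remaining words back together per review.
data$id <- seq_len(nrow(data))

no_stop <- data %>%
  unnest_tokens(word, Proc_Review) %>%     # one row per word
  anti_join(stop_words, by = "word") %>%   # drop rows whose word is a stop word
  group_by(id) %>%
  summarise(Clean_Review = paste(word, collapse = " "), .groups = "drop")

# left_join keeps reviews made up only of stop words; Clean_Review is NA for those.
data <- left_join(data, no_stop, by = "id")
```

The odd `c(1, 1) c(17304, 17304) ...` output in the question most likely comes from calling `paste()` on the whole data frame, which converts every column to a deparsed string. Pasting only the `word` column, as above, avoids that. Tokenizing with `to_lower = FALSE` may also explain why some stop words survived, since capitalized words such as "Does" would not match the lowercase entries in `stop_words`.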