Removing stop words from text in R

Posted on 2025-02-04 06:55:54 · 1,329 characters · 1 view · 0 comments


I have a problem with removing stop_words from text data. The data set was web scraped, contains customer reviews, and looks like:

data$Review <- c("Won't let me use my camera", "Does not load","I'ts truly mind blowing!")

I did the below data manipulation and created a new variable in the data frame; the reviews now look like this:

Manipulation Part: 
data$Proc_Review <- gsub("'", "", data$Review) # Remove apostrophes
data$Proc_Review <- gsub('[[:punct:] ]+',' ',data$Proc_Review) # Replace punctuation and repeated spaces with a single space
data$Proc_Review <- gsub('[[:digit:]]+', '', data$Proc_Review) # Remove numbers
data$Proc_Review <- as.character(data$Proc_Review)
"wont let me use my camera", "does not load", "its truly mind blowing"

The next step is to remove stop words, for which I use the below code:

    data("stop_words")

j<-1
for (j in 1:nrow(data)) {
  description<-  anti_join((data[j,] %>% unnest_tokens(word,Proc_Review, drop=FALSE,to_lower=FALSE) ),stop_words)
  data[j,"Proc_Review"]<-paste((description),collapse = " ")
}

After that, the output is:

c(1, 1) c(17304, 17304) c(\"Won't let me use my camera\", \"Won't let me use my camera\") c(1, 1) c(1, 1) c(32, 32) c(4, 4) c(\"wont let me use my camera\", \"wont let me use my camera\") c(\"wont\", \"camera\")"

I have tried some other approaches; however, the result was not the one I wanted, as they removed some stop_words from some reviews but not from all of them. For example, "it's" was removed in some reviews but remained in others.

What I want is for the reviews to appear in a new column in the data set without the stop words!
Thank you so much in advance!!
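
For reference (this was not in the original post), a minimal sketch of one direct way to get such a column, assuming the tm package is an option; Clean_Review is just an illustrative column name chosen here:

library(tm)

# tm's English stopword list is lower-case, hence tolower() on the text.
# removeWords() blanks out whole stop words, so each review stays a single
# string and can go straight into a new column.
data$Clean_Review <- removeWords(tolower(data$Proc_Review), stopwords("en"))
# Removing words leaves extra spaces behind; squeeze and trim them.
data$Clean_Review <- trimws(gsub("\\s+", " ", data$Clean_Review))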


Comments (1)

魔法少女 2025-02-11 06:55:54


There is no need to use a for loop. Additionally, there was a bug in your data processing: in steps 2 and 3 you use the original vector, so all the processing you did in the previous steps gets overwritten.

library(tidytext)
library(dplyr)

data("stop_words")

df <- data.frame(
  Review = c("Won't let me use my camera", "Does not load","I'ts truly mind blowing!")
)

df$Proc_Review <- gsub("\\'", "", df$Review) # Remove apostrophes
df$Proc_Review <- gsub('[[:punct:] ]+',' ',df$Proc_Review) # Replace punctuation and repeated spaces with a single space
df$Proc_Review <- gsub('[[:digit:]]+', '', df$Proc_Review) # Remove numbers
df$Proc_Review <- as.character(df$Proc_Review)

df %>%
  unnest_tokens(word, Proc_Review, drop = FALSE, to_lower = FALSE)  %>%
  anti_join(stop_words)
#> Joining, by = "word"
#>                       Review               Proc_Review    word
#> 1 Won't let me use my camera Wont let me use my camera    Wont
#> 2 Won't let me use my camera Wont let me use my camera  camera
#> 3              Does not load             Does not load    Does
#> 4              Does not load             Does not load    load
#> 5   I'ts truly mind blowing!   Its truly mind blowing      Its
#> 6   I'ts truly mind blowing!   Its truly mind blowing     mind
#> 7   I'ts truly mind blowing!   Its truly mind blowing  blowing

Created on 2022-06-04 by the reprex package (v2.0.1)
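
The question asked for the cleaned reviews in a new column rather than one row per token, so here is a follow-up sketch (not part of the original answer). The garbled output of the for loop comes from calling paste() on the whole tokenized data frame, which deparses every column; pasting only the word column, grouped per review, gives the intended result. This builds on the df above, and Clean_Review is just an illustrative column name:

df %>%
  unnest_tokens(word, Proc_Review, drop = FALSE, to_lower = FALSE) %>%
  anti_join(stop_words, by = "word") %>%
  # Glue the surviving tokens back together: one cleaned string per review.
  group_by(Review) %>%
  summarise(Clean_Review = paste(word, collapse = " "), .groups = "drop")

Grouping by Review merges identical review texts into a single row; if duplicate reviews matter, group by a row id (e.g. added with mutate(id = row_number()) before tokenizing) instead.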
