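How can I create a new variable that takes the value 1 if a string in another column contains a word written with varying punctuation and letter case?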

Posted on 2025-01-12 17:30:08

I have a column that looks something like this

col1 
"business"
"BusinesS"
"education"
"some BUSINESS ."
"business of someone, that is cool"
" not the b word"
"busi ness"
"busines." 
"businesses"
"something else"

I need an efficient way to convert all of this string data into new values:

col1                col2
NA                  1
NA                  1
"education"         NA
NA                  1
NA                  1
" not the b word"   NA
NA                  1
NA                  1
NA                  1
"something else"    NA

So the common denominator is "busines", but I don't know how to efficiently handle all the spaces, punctuation, upper/lower case, surrounding words, etc. in a single mutate that creates the new column.

Comments (2)

盗心人 2025-01-19 17:30:08
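library(dplyr)
library(stringr)

# Flag rows whose text contains any form of "business":
# (?i) ignores case, \\s? allows one optional whitespace, s? an optional final "s"
df %>%
  mutate(col2 = ifelse(str_detect(col1, "(?i)busi\\s?ness?"),
                       1,
                       NA))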

We can use ifelse to set col2 to 1 if str_detect detects any form of business, and to NA if it doesn't. Note that (?i) makes the match case-insensitive, and the ? in \\s? and s? makes the preceding item optional: \\s? matches an optional whitespace character and s? matches an optional literal s.
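The snippet above only fills col2, while the desired output in the question also sets col1 to NA wherever the word is found. A minimal sketch of doing both in one mutate follows, assuming the data are in a data frame with a character column col1. The helper column is_business is a name made up for this illustration, and regex(..., ignore_case = TRUE) is used here as an alternative to the inline (?i) flag.

```r
library(dplyr)
library(stringr)

# Sample data copied from the question
df <- data.frame(
  col1 = c("business", "BusinesS", "education", "some BUSINESS .",
           "business of someone, that is cool", " not the b word",
           "busi ness", "busines.", "businesses", "something else"),
  stringsAsFactors = FALSE
)

# Same pattern as in the answer, with case-insensitivity set via regex()
pattern <- regex("busi\\s?ness?", ignore_case = TRUE)

result <- df %>%
  mutate(
    is_business = str_detect(col1, pattern),           # hypothetical helper column
    col2 = if_else(is_business, 1, NA_real_),          # 1 where the word is found
    col1 = if_else(is_business, NA_character_, col1)   # blank out matched text
  ) %>%
  select(col1, col2)

result$col2
# [1]  1  1 NA  1  1 NA  1  1  1 NA
```

if_else is stricter about types than ifelse, which is why the typed NA constants are spelled out.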

动次打次papapa 2025-01-19 17:30:08

You can remove all non-word characters using gsub and then use grepl to detect busines. The leading + converts the logical result to 0/1:

+grepl("busines", gsub("\\W+", "", s), ignore.case = TRUE)
# [1] 1 1 0 1 1 0 1 1 1 0

Another way would be to use agrepl for approximate string matching, where 1L is the maximum allowed distance from the given pattern.

+agrepl("busines", s, 1L, ignore.case = TRUE)
# [1] 1 1 0 1 1 0 1 1 1 0
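To make the role of the maximum distance more concrete, here is a small sketch comparing exact and approximate matching on three strings. The string "busyness" is not from the question; it is added here as an assumed example of a false positive that a distance of 1 lets through.

```r
# Edit distance between the pattern and the split spelling: one inserted space
split_distance <- drop(adist("busines", "busi nes"))

# "busyness" is a made-up extra case, not part of the original data
x <- c("busi ness", "education", "busyness")

exact_hit  <- grepl("busines", x, ignore.case = TRUE)
approx_hit <- agrepl("busines", x, max.distance = 1L, ignore.case = TRUE)

exact_hit
# [1] FALSE FALSE FALSE
approx_hit
# [1]  TRUE FALSE  TRUE
```

The approximate match catches "busi ness" without any preprocessing, but it also accepts "busyness" (one substitution away), so it is worth checking whether near-misses like that can occur in the real data.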

agrepl can also be a solution in case you are looking for business instead of busines:

+agrepl("business", gsub("\\W+", "", s), 1L, ignore.case = TRUE)
# [1] 1 1 0 1 1 0 1 1 1 0

Data:

s <- c("business","BusinesS","education","some BUSINESS .",
       "business of someone, that is cool"," not the b word",
       "busi ness","busines." ,"businesses","something else")