How to remove Unicode representations of emojis in strings using regular expressions in R?

Posted on 2025-01-20 09:01:19

I am working with data from the Twitter API and wherever users had included Emojis in their name field, they have been translated to Unicode string representations in my dataframe. The structure of my data is somewhat like this:

user_profiles <- as.data.frame(c("Susanne Bold", "Julian K. Peard <U+0001F41C>", 
"<U+0001F30A> Alexander K Miller <U+0001F30A>", "John Mason"))
colnames(user_profiles) <- "name"

which looks like this:

                                          name
1                                 Susanne Bold
2                 Julian K. Peard <U+0001F41C>
3 <U+0001F30A> Alexander K Miller <U+0001F30A>
4                                   John Mason

I am now trying to isolate the actual name into a new column using regexp:

user_profiles <- user_profiles %>%
  mutate(clean_name = str_remove_all(name, "\\<U\\+[[:alnum:]]\\>[ ]?"))

But this expression (1) seems rather complicated and (2) doesn't work for identifying the pattern. I have already tried multiple variations of the regexp. Weirdly enough, grepl is able to detect the pattern with the following version, which str_remove_all doesn't accept because it is missing a closing bracket:

grepl("\\<U\\+[[:alnum:]\\>[ ]?", user_profiles$name)
[1] FALSE  TRUE  TRUE FALSE
# note that the second bracket around alnum is left opened

Can somebody explain this or offer an easier solution?

Thanks a lot!

Comments (3)

善良天后 2025-01-27 09:01:19

The first str_remove_all does not work because you missed the + quantifier after the alphanumeric pattern. Also, note that after <U+, only hex chars are used, so instead of [:alnum:], you can use a more precise [:xdigit:] POSIX character class.

You can use

user_profiles <- user_profiles %>%
  mutate(clean_name = str_remove_all(name, "<U\\+[[:xdigit:]]+>\\s*"))

Do not escape < and >: on their own they are not special in any regex flavor, and in the TRE flavor (used by base R regex functions without perl=TRUE) \< and \> are word boundaries.

Pattern details

  • <U - <U string
  • \+ - a literal +
  • [[:xdigit:]]+ - one or more hex chars
  • > - a > char
  • \s* - zero or more whitespaces.
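The pattern described above can also be tried without stringr. The following is only a minimal base R sketch, not part of the original answer: it assumes the sample data from the question, swaps \s for the equivalent POSIX class [[:space:]], and adds a trimws() call (an addition of this sketch) to drop the space left behind when a code precedes the end of the name.

```r
# Sample data from the question (stringsAsFactors = FALSE keeps plain strings)
user_profiles <- data.frame(
  name = c("Susanne Bold", "Julian K. Peard <U+0001F41C>",
           "<U+0001F30A> Alexander K Miller <U+0001F30A>", "John Mason"),
  stringsAsFactors = FALSE
)

# "<U+", one or more hex digits, ">", then any trailing whitespace
pattern <- "<U\\+[[:xdigit:]]+>[[:space:]]*"

# Remove every match, then trim leftover leading/trailing whitespace
user_profiles$clean_name <- trimws(gsub(pattern, "", user_profiles$name))

print(user_profiles$clean_name)
```

Because < and > are left unescaped, the same pattern string should behave the same way whether it is handed to base R functions or to stringr.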

Why does the grepl regex work? This is interesting: because you omitted the closing ] of the [[:alnum:]] bracket expression, the regex is parsed differently and matches like this:

  • \<U\+ - a word boundary (in TRE, \< matches a left-hand word boundary) and then U+ string
  • [[:alnum:]\>[ ]? - this is an optional bracket expression that matches one or zero chars from the set:
    • [:alnum:] - any alphanumeric char
    • \ - a backslash (yes, because in the TRE regex flavor a backslash inside a bracket expression is treated literally)
    • > - a > char
    • [ - a [ char
    • (a space character) - a space.

So, it matches U+0 in <U+0001F41C>, for example. The leading \< is a zero-width word boundary, so the literal < is not part of the match.
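To see what the unclosed-bracket pattern actually picks up, the matched text can be extracted instead of only tested with grepl. This is a small illustrative sketch in base R; the vector name x is hypothetical and the data are the sample names from the question.

```r
x <- c("Susanne Bold", "Julian K. Peard <U+0001F41C>",
       "<U+0001F30A> Alexander K Miller <U+0001F30A>", "John Mason")

# The pattern from the question, with the bracket after [:alnum:] left open
broken <- "\\<U\\+[[:alnum:]\\>[ ]?"

# Which elements contain a match at all?
hits <- grepl(broken, x)

# First match per matching element: only a short prefix of each code
matched <- regmatches(x, regexpr(broken, x))

print(hits)
print(matched)
```

Since only a fragment of each code is matched, this pattern is fine for detection but would leave most of the <U+...> text in place if used for removal.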

韬韬不绝 2025-01-27 09:01:19

Here is an alternative way how we could do it:

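library(dplyr)
library(tidyr)
library(stringr)  # provides str_detect()

user_profiles %>% 
  # split each name at every literal < or >
  separate_rows(name, sep = '[<>]') %>% 
  # drop the pieces that are emoji codes; the + is escaped so that
  # names merely containing a capital U are not dropped as well
  filter(!str_detect(name, 'U\\+')) %>% 
  # turn the empty pieces left by the split into NA and remove them
  mutate(name = na_if(name, "")) %>% 
  na.omit()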
  name                  
  <chr>                 
1 "Susanne Bold"        
2 "Julian K. Peard "    
3 " Alexander K Miller "
4 "John Mason" 
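
Note that the output above still carries the spaces that surrounded the codes. As a rough sketch of the same split-and-filter idea in base R, with whitespace trimmed at the end, something like the following could be used; the helper name clean_one and the vector names_raw are hypothetical and not part of the original answer.

```r
names_raw <- c("Susanne Bold", "Julian K. Peard <U+0001F41C>",
               "<U+0001F30A> Alexander K Miller <U+0001F30A>", "John Mason")

# Hypothetical helper: split one name at < or >, drop the code pieces
clean_one <- function(x) {
  pieces <- strsplit(x, "[<>]")[[1]]
  # keep non-empty pieces that do not start with the literal text "U+"
  kept <- pieces[nzchar(pieces) & !startsWith(pieces, "U+")]
  # glue the remaining pieces together and trim surrounding whitespace
  trimws(paste(kept, collapse = ""))
}

clean_names <- vapply(names_raw, clean_one, character(1), USE.NAMES = FALSE)
print(clean_names)
```

Unlike the pipeline above, this keeps one result per input name, so it can be assigned straight back to a column.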

偏爱你一生 2025-01-27 09:01:19

We can add a one-or-more quantifier (+) after the [[:alnum:]] bracket expression, so that the whole run of characters following U+ is matched:

library(dplyr)
library(stringr)
user_profiles <- user_profiles %>%
  mutate(clean_name = str_remove_all(name, "\\s*\\<U\\+[[:alnum:]]+\\>\\s*")) 

-output

user_profiles
                                          name         clean_name
1                                 Susanne Bold       Susanne Bold
2                 Julian K. Peard <U+0001F41C>    Julian K. Peard
3 <U+0001F30A> Alexander K Miller <U+0001F30A> Alexander K Miller
4                                   John Mason         John Mason