How do I match accented characters with a regular expression?

Posted on 2024-12-02 22:12:01

I am running Ruby on Rails 3.0.10 and Ruby 1.9.2. I am using the following Regex in order to match names:

NAME_REGEX = /^[\w\s'"\-_&@!?()\[\]-]*$/u

validates :name,
  :presence   => true,
  :format     => {
    :with     => NAME_REGEX,
    :message  => "format is invalid"
  }

However, if I try to save words such as the following:

Oilalà
Pì
Rùby
...

# In short, words with accented characters

I get the validation error "Name format is invalid".

How can I change the regex above so that it also matches accented characters such as à, è, é, ì, ò, ù, ...?

Comments (2)

慈悲佛祖 2024-12-09 22:12:01

Instead of \w, use the POSIX bracket expression [:alpha:]:

"blåbær dèjá vu".scan /[[:alpha:]]+/  # => ["blåbær", "dèjá", "vu"]

"blåbær dèjá vu".scan /\w+/  # => ["bl", "b", "r", "d", "j", "vu"]

In your particular case, change the regex to this:

NAME_REGEX = /^[[:alpha:]\s'"\-_&@!?()\[\]-]*$/u

Note that this matches much more than just accented characters, since
[[:alpha:]] covers letters in general. That is a good thing. Make sure you
read this blog entry about common misconceptions regarding names in software
applications.
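
One thing to watch when swapping \w for [[:alpha:]]: \w also matched digits, while [[:alpha:]] does not. The sketch below is only an illustration, not part of the original answer. It uses \A and \z instead of ^ and $ (in Ruby, ^ and $ match at line boundaries, so this is an assumption that you want a whole-string match), and the constant and method names are made up for the example.

```ruby
# encoding: utf-8

# The pattern from the answer, anchored with \A and \z (assumed: whole-string match).
ALPHA_NAME_REGEX = /\A[[:alpha:]\s'"\-_&@!?()\[\]]*\z/u

# Hypothetical variant that also accepts digits, as \w did.
ALNUM_NAME_REGEX = /\A[[:alnum:]\s'"\-_&@!?()\[\]]*\z/u

# Returns true when the whole name consists of allowed characters.
def name_matches?(name, regex)
  !(name =~ regex).nil?
end

["Oilalà", "Pì", "Rùby", "R2-D2"].each do |name|
  alpha = name_matches?(name, ALPHA_NAME_REGEX)
  alnum = name_matches?(name, ALNUM_NAME_REGEX)
  puts "#{name}: alpha=#{alpha} alnum=#{alnum}"
end
```

The accented names pass both patterns, while "R2-D2" passes only the [[:alnum:]] variant. Pick whichever fits your idea of a valid name.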

十级心震 2024-12-09 22:12:01

One solution would, of course, be to simply list all of them and use them as you normally would, although I assume there are quite a lot of them.

If you are using UTF-8, you will find that such characters are often split into two parts: the "base" character itself, followed by the accent (U+0300 and U+0301, I believe), which is called a combining character. However, this is not always the case, since some characters can also be written with a single "hardcoded" (precomposed) character code, so you need to normalize the UTF-8 string to NFD form first.

Of course, you could also turn any string you have into UTF8 and then back into the original charset... but the overhead might become quite large if you are doing bulk operations.

EDIT: To answer your question specifically, the best solution is likely to normalize your strings into UTF-8 NFD form, and then simply add 0x0300 and 0x0301 to your list of acceptable characters, along with whatever other combining characters you want to allow (such as the dots in ä and ö or the ring in å; you can find them all in "charmap" in Windows, from 0x0300 upwards).
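
A minimal sketch of that approach, under a few assumptions: it uses String#unicode_normalize, which is only available from Ruby 2.2 onwards (so not on the 1.9.2 from the question; on Rails 3, I believe ActiveSupport's mb_chars.normalize(:d) played a similar role). The set of combining marks and the names NFD_NAME_REGEX and nfd_name_valid? are illustrative choices, not something from the original answer.

```ruby
# encoding: utf-8

# ASCII letters plus a few combining marks: U+0300 (grave), U+0301 (acute),
# U+0308 (diaeresis) and U+030A (ring above). Extend the list as needed.
NFD_NAME_REGEX = /\A[a-zA-Z\u0300\u0301\u0308\u030A\s'"\-_&@!?()\[\]]*\z/

# Decomposes the name into base characters + combining marks (NFD),
# then checks every character against the whitelist.
def nfd_name_valid?(name)
  decomposed = name.unicode_normalize(:nfd)
  !(decomposed =~ NFD_NAME_REGEX).nil?
end

# A precomposed "à" (U+00E0) becomes "a" followed by U+0300 after NFD.
p "\u00E0".unicode_normalize(:nfd).codepoints

["Oilalà", "Pì", "Rùby", "åäö", "blåbær"].each do |name|
  puts "#{name}: #{nfd_name_valid?(name)}"
end
```

"blåbær" is rejected here because æ has no decomposition into a base letter plus a combining mark, so a whitelist built this way still misses letters like æ or ø. That is one reason to prefer [[:alpha:]] when it is available.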
