Regular expression for names with special characters (Unicode)

Posted on 2024-11-06 14:34:15

Okay, I have read about regex all day now, and still don't understand it properly. What I'm trying to do is validate a name, but the functions I can find for this on the internet only use [a-zA-Z], leaving out characters that I need to accept too.

I basically need a regex that checks that the name is at least two words and that it does not contain numbers or special characters like !"#¤%&/()=... However, the words can contain characters like æ, é, Â and so on.

An example of an accepted name would be: "John Elkjærd" or "André Svenson"
A non-accepted name would be: "Hans", "H4nn3 Andersen" or "Martin Henriksen!"

If it matters, I use the JavaScript .match() function client side and want to use PHP's preg_replace() only "in negative" server side (removing non-matching characters).

Any help would be much appreciated.

Update:
Okay, thanks to Alix Axel's answer I have the important part down, the server-side one.

But as the page from LightWing's answer suggests, I'm unable to find anything about Unicode support for JavaScript, so I ended up with half a solution for the client side, just checking for at least two words and a minimum of 5 characters like this:

// match() returns null when there is no match, so fall back to an empty array
if ((name.match(/\S+/g) || []).length >= minWords && name.length >= 5) {
  // valid
}

An alternative would be to specify all the Unicode characters as suggested in shifty's answer, which I might end up doing in some form along with the solution above, but it is a bit impractical.

Comments (7)

沫尐诺 2024-11-13 14:34:15

Try the following regular expression:

^(?:[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s?)+$

In PHP this translates to:

if (preg_match('~^(?:[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s?)+$~u', $name) > 0)
{
    // valid
}

You should read it like this:

^   # start of subject
    (?:     # match this:
        [           # match a:
            \p{L}       # Unicode letter, or
            \p{Mn}      # Unicode accents, or
            \p{Pd}      # Unicode hyphens, or
            \'          # single quote, or
            \x{2019}    # single quote (alternative)
        ]+              # one or more times
        \s          # any kind of space
        [               # match a:
            \p{L}       # Unicode letter, or
            \p{Mn}      # Unicode accents, or
            \p{Pd}      # Unicode hyphens, or
            \'          # single quote, or
            \x{2019}    # single quote (alternative)
        ]+              # one or more times
        \s?         # any kind of space (0 or 1 times)
    )+      # one or more times
$   # end of subject

I honestly don't know how to port this to JavaScript, and I'm not even sure JavaScript supports Unicode properties, but in PHP PCRE this seems to work flawlessly @ IDEOne.com:

$names = array
(
    'Alix',
    'André Svenson',
    'H4nn3 Andersen',
    'Hans',
    'John Elkjærd',
    'Kristoffer la Cour',
    'Marco d\'Almeida',
    'Martin Henriksen!',
);

foreach ($names as $name)
{
    echo sprintf('%s is %s' . "\n", $name, (preg_match('~^(?:[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+\s?)+$~u', $name) > 0) ? 'valid' : 'invalid');
}

I'm sorry I can't help you regarding the Javascript part but probably someone here will.
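For the JavaScript side, engines that implement ES2018 accept the same Unicode property escapes once the regex carries the u flag. The following is only a minimal sketch under that assumption, not a tested port of the PHP code. isValidName is a hypothetical helper name, and the single quote is left unescaped because \' is rejected in u mode.

```javascript
// Assumption: the runtime implements ES2018 Unicode property escapes
// (\p{...} under the "u" flag). Older engines throw a SyntaxError here.
const NAME_PATTERN =
  /^(?:[\p{L}\p{Mn}\p{Pd}'\u2019]+\s[\p{L}\p{Mn}\p{Pd}'\u2019]+\s?)+$/u;

// Hypothetical helper: true when the name has at least two words made of
// letters, combining marks, dashes, or apostrophes.
function isValidName(name) {
  return NAME_PATTERN.test(name);
}

const names = [
  "Alix",
  "André Svenson",
  "H4nn3 Andersen",
  "Hans",
  "John Elkjærd",
  "Kristoffer la Cour",
  "Marco d'Almeida",
  "Martin Henriksen!",
];

for (const name of names) {
  console.log(`${name} is ${isValidName(name) ? "valid" : "invalid"}`);
}
```

Server-side validation with preg_match() should stay in place either way, since a client-side check can be bypassed.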


Validates:

  • John Elkjærd
  • André Svenson
  • Marco d'Almeida
  • Kristoffer la Cour

Invalidates:

  • Hans
  • H4nn3 Andersen
  • Martin Henriksen!

To replace invalid characters, though I'm not sure why you need this, you just need to change it slightly:

$name = preg_replace('~[^\p{L}\p{Mn}\p{Pd}\'\x{2019}\s]~u', '', $name);

Examples:

  • H4nn3 Andersen -> Hnn Andersen
  • Martin Henriksen! -> Martin Henriksen

Note that you always need to use the u modifier.
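The same "negative" replacement can be sketched in JavaScript, assuming an engine with ES2018 Unicode property escapes. stripInvalidNameChars is a hypothetical name used for illustration only.

```javascript
// Assumption: the engine supports \p{...} under the "u" flag (ES2018).
// Hypothetical helper: removes every character that is not a letter,
// combining mark, dash, apostrophe, or whitespace.
function stripInvalidNameChars(name) {
  return name.replace(/[^\p{L}\p{Mn}\p{Pd}'\u2019\s]/gu, "");
}

console.log(stripInvalidNameChars("H4nn3 Andersen"));    // Hnn Andersen
console.log(stripInvalidNameChars("Martin Henriksen!")); // Martin Henriksen
```

The g flag is what makes replace() remove every offending character rather than only the first one.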

著墨染雨君画夕 2024-11-13 14:34:15

Regarding JavaScript it is trickier, since JavaScript regex syntax did not support Unicode character properties until ES2018 added \p{...} under the u flag. For engines without that support, a pragmatic solution would be to match letters like this:

[a-zA-Z\xC0-\uFFFF]

This allows letters in all languages and excludes numbers and all the special (non-letter) characters commonly found on keyboards. It is imperfect because it also allows Unicode symbols that are not letters, e.g. emoticons, the snowman and so on. However, since these symbols are typically not available on keyboards, I don't think they will be entered by accident. So depending on your requirements it may be an acceptable solution.
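A short sketch of how that range behaves in practice, including the false positives mentioned above. LETTERISH is a name chosen here for illustration.

```javascript
// The range from the answer, anchored so the whole string must match.
// No "u" flag: the string is matched one UTF-16 code unit at a time.
const LETTERISH = /^[a-zA-Z\xC0-\uFFFF]+$/;

console.log(LETTERISH.test("Elkjærd")); // true: æ (U+00E6) is inside the range
console.log(LETTERISH.test("André"));   // true
console.log(LETTERISH.test("H4nn3"));   // false: digits are below \xC0
console.log(LETTERISH.test("Hans!"));   // false: "!" is below \xC0

// Non-letters that still pass because they sit at or above U+00C0:
console.log(LETTERISH.test("\u2603"));  // true: snowman
console.log(LETTERISH.test("\u00D7"));  // true: multiplication sign
```

Whether those false positives matter depends on how strict the client-side check needs to be, given that the server validates again.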

守不住的情 2024-11-13 14:34:15

Here's an optimization over the fantastic answer by @Alix above. It removes the need to define the character class twice, and allows for easier definition of any number of required words.

^(?:[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+(?:$|\s+)){2,}$

It can be broken down as follows:

^         # start
  (?:       # non-capturing group
    [         # match a:
      \p{L}     # Unicode letter, or
      \p{Mn}    # Unicode accents, or
      \p{Pd}    # Unicode hyphens, or
      \'        # single quote, or
      \x{2019}  # single quote (alternative)
    ]+        # one or more times
    (?:       # non-capturing group
      $         # either end-of-string
    |         # or
      \s+       # one or more spaces
    )         # end of group
  ){2,}     # two or more times
$         # end-of-string

Essentially, it is saying to find a word as defined by the character class, then either find one or more spaces or an end of a line. The {2,} at the end tells it that a minimum of two words must be found for a match to succeed. This ensures the OP's "Hans" example will not match.


Lastly, since I found this question while looking for a similar solution for Ruby, here is the regular expression as it can be used in Ruby 1.9+:

\A(?:[\p{L}\p{Mn}\p{Pd}\'\u2019]+(?:\Z|\s+)){2,}\Z

The primary changes are using \A and \Z for the beginning and end of the string (instead of the line) and Ruby's Unicode character notation.
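The "any number of required words" idea can be sketched in JavaScript by building the pattern from a word count. This assumes an engine with ES2018 Unicode property escapes. nameRegex and WORD are hypothetical names, not part of any library.

```javascript
// Assumption: the engine supports \p{...} under the "u" flag (ES2018).
// One "word": letters, combining marks, dashes, or apostrophes.
const WORD = "[\\p{L}\\p{Mn}\\p{Pd}'\\u2019]+";

// Hypothetical helper: builds a regex that requires at least minWords words,
// each followed by whitespace or the end of the string.
function nameRegex(minWords) {
  if (!Number.isInteger(minWords) || minWords < 1) {
    throw new RangeError("minWords must be a positive integer");
  }
  return new RegExp(`^(?:${WORD}(?:$|\\s+)){${minWords},}$`, "u");
}

const twoWords = nameRegex(2);
console.log(twoWords.test("John Elkjærd"));  // true
console.log(twoWords.test("Hans"));          // false

const threeWords = nameRegex(3);
console.log(threeWords.test("Kristoffer la Cour")); // true
console.log(threeWords.test("André Svenson"));      // false
```

Because every word must be followed by whitespace or the end of the string, the engine cannot split one word in two to reach the required count.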

两仪 2024-11-13 14:34:15

You can add the allowed special characters to the regex.

Example:

[a-zA-ZßöäüÖÄÜæé]+

EDIT:

Not the best solution, but this would give a result if there are at least two words.

[a-zA-ZßöäüÖÄÜæé]+\s[a-zA-ZßöäüÖÄÜæé]+

神魇的王 2024-11-13 14:34:15

When checking your input string you could

  • trim() it to remove leading/trailing whitespace
  • match against [^\w\s] to detect non-word/non-whitespace characters
  • match against \s+ to get the number of word separators, which equals the number of words − 1.

However, I'm not sure that the \w shorthand includes accented characters, though they should fall into the "word characters" category.
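A small sketch of these steps in JavaScript. In JavaScript, \w only covers [A-Za-z0-9_], so accented letters are reported as non-word characters while digits are not, which limits this approach for names. countWords is a hypothetical helper name.

```javascript
// Hypothetical helper: trims the input, then splits on runs of whitespace.
// n words are separated by n - 1 separators, so counting the pieces is simpler.
function countWords(input) {
  const trimmed = input.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

console.log(countWords("  John Elkjærd ")); // 2
console.log(countWords("Hans"));            // 1

// \w is ASCII-only in JavaScript, so "é" counts as a non-word character...
console.log(/[^\w\s]/.test("André Svenson"));  // true
// ...while digits are word characters and slip through.
console.log(/[^\w\s]/.test("H4nn3 Andersen")); // false
```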

伪装你 2024-11-13 14:34:15

This is the JS regex that I use for fancy names composed of at most 3 words (1 to 60 chars), separated by a space, single quote, or minus sign:

^([a-zA-Z\xC0-\uFFFF]{1,60}[ \-\']{0,1}){1,3}$
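A quick sketch of how this pattern behaves, with two caveats worth checking against your own requirements: it accepts a single word, and because the engine may split a run of letters across the three repetitions, the 60-character cap is not enforced per word. FANCY_NAME is a name chosen here for illustration.

```javascript
// The pattern from the answer, unchanged.
const FANCY_NAME = /^([a-zA-Z\xC0-\uFFFF]{1,60}[ \-\']{0,1}){1,3}$/;

console.log(FANCY_NAME.test("Marco d'Almeida"));    // true: three parts
console.log(FANCY_NAME.test("Kristoffer la Cour")); // true
console.log(FANCY_NAME.test("H4nn3 Andersen"));     // false: digits rejected

// Caveat 1: a single word passes, unlike the two-word rule in the question.
console.log(FANCY_NAME.test("Hans"));               // true

// Caveat 2: 61 letters with no separator still match (split as 60 + 1).
console.log(FANCY_NAME.test("a".repeat(61)));       // true
console.log(FANCY_NAME.test("a".repeat(181)));      // false
```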