Unicode regular expression to match line breaks?

Posted on 2024-10-07 00:05:18

I have a form from which I want to submit data to a database. The data is UTF-8. I am having trouble matching line breaks. The pattern I am using is something like this:

~^[\p{L}\p{M}\p{N} ]+$~u

This pattern works fine until the user puts a new line in the text box. I have tried using \p{Z} inside the class, but with no success. I also tried "s", but it didn't work.

Any help is much appreciated. Thanks!

遗心遗梦遗幸福 2024-10-14 00:05:18

A Unicode linebreak is either a carriage return immediately followed by a line feed, or else it is any character with the vertical whitespace property.

But it looks like you’re trying to match generic whitespace there. In Java, that would be

 [\u000A\u000B\u000C\u000D\u0020\u0085\u00A0\u1680\u180E\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u2028\u2029\u202F\u205F\u3000]

which can be shortened by using ranges to “only” this:

 [\u000A-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]

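This covers both horizontal whitespace (\h) and vertical whitespace (\v), which may or may not be the same as general whitespace (\s). Note that the class as written leaves out the tab, U+0009; start the first range at \u0009 if tabs should match as well.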
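As a rough illustration of how the ranged class above might be used from Java code, here is a minimal sketch. The class name `UnicodeWhitespaceDemo` and the helper `isAllWhitespace` are made up for this example and are not part of the answer.

```java
import java.util.regex.Pattern;

class UnicodeWhitespaceDemo {
    // The ranged class from above, repeated one or more times. Backslashes are
    // doubled so that the regex engine, not the Java compiler, interprets the
    // Unicode escapes.
    static final Pattern WHITESPACE = Pattern.compile(
        "[\\u000A-\\u000D\\u0020\\u0085\\u00A0\\u1680\\u180E"
            + "\\u2000-\\u200A\\u2028\\u2029\\u202F\\u205F\\u3000]+");

    // Hypothetical helper: true when the whole input consists of those characters.
    static boolean isAllWhitespace(String s) {
        return WHITESPACE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllWhitespace(" \r\n\u00A0\u3000")); // true
        System.out.println(isAllWhitespace("a b"));               // false
        // The tab (U+0009) is not in the class as written, so this is false.
        System.out.println(isAllWhitespace("\t"));
    }
}
```

Using `matches()` rather than `find()` means the whole input has to consist of these characters, which is usually what a validation check wants.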

It also looks like you’re trying to match alphanumerics.

Alphabetics alone are usually [\pL\pM\p{Nl}].
Numerics are less often all of \pN than they are either just \p{Nd} or sometimes [\p{Nd}\p{Nl}].
Identifier characters need connector punctuation and a bit more, so [\pL\pM\p{Nd}\p{Nl}\p{Pc}[\p{InEnclosedAlphanumerics}&&\p{So}]], if your regex engine supports those sorts of operations (Java's does). That's what \w works out to in Unicode-aware regex languages (of which Java, by default, is not one).
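To make the identifier class a little more concrete, the following sketch compiles it in Java and tries it on a few inputs. The class name `UnicodeWordDemo` and the helper `isWord` are illustrative only.

```java
import java.util.regex.Pattern;

class UnicodeWordDemo {
    // The identifier class from above. The nested class intersects (&&) the
    // Enclosed Alphanumerics block with category So, so circled letters are
    // accepted while circled digits (category No) are not.
    static final Pattern WORD = Pattern.compile(
        "[\\pL\\pM\\p{Nd}\\p{Nl}\\p{Pc}[\\p{InEnclosedAlphanumerics}&&\\p{So}]]+");

    // Hypothetical helper: true when every character is an identifier character.
    static boolean isWord(String s) {
        return WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isWord("caf\u00E9_1")); // true: letters, underscore, digit
        System.out.println(isWord("\u24B6"));      // true: U+24B6 is a circled letter (So)
        System.out.println(isWord("\u2460"));      // false: U+2460 is a circled digit (No)
        System.out.println(isWord("a-b"));         // false: hyphen is dash punctuation
    }
}
```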

In older versions of Perl, you would likely write a linebreak as

 (?:\r\n|\p{VertSpace})

although that’s now better written as

 (?:(?>\r\n)|\v)

which is exactly what

\R

matches.

Java is very clumsy at these things. There you must write a linebreak as

  (?:(?>\u000D\u000A)|[\u000A-\u000D\u0085\u2028\u2029])

which of course requires extra bbaacckkssllasshheess when written as a string.
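For illustration, this is roughly what the doubled backslashes look like in practice. The class `LinebreakDemo` and its helper names are invented for this sketch. Note also that `java.util.regex.Pattern` in Java 8 and later documents its own `\R` linebreak matcher, so the long form is mainly needed on older runtimes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinebreakDemo {
    // The linebreak pattern from above as a Java string literal: every regex
    // backslash is doubled. CR LF is tried first, atomically, so the pair
    // counts as one linebreak rather than two.
    static final String LINEBREAK =
        "(?:(?>\\u000D\\u000A)|[\\u000A-\\u000D\\u0085\\u2028\\u2029])";

    static final Pattern LINEBREAK_PATTERN = Pattern.compile(LINEBREAK);

    // Hypothetical helper: counts linebreak sequences in the input.
    static int countLinebreaks(String s) {
        Matcher m = LINEBREAK_PATTERN.matcher(s);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Hypothetical helper: splits on linebreaks, keeping trailing empty lines.
    static String[] splitLines(String s) {
        return LINEBREAK_PATTERN.split(s, -1);
    }

    public static void main(String[] args) {
        System.out.println(countLinebreaks("a\r\nb\nc\u2028d")); // 3
        System.out.println(splitLines("a\r\nb").length);         // 2
    }
}
```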

I give the other Java equivalents for the 14 common character-class regex escapes, written so that they work with Unicode, in this answer. You may have to use those in other Java-like regex languages that aren't sufficiently Unicode-aware.
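Coming back to the pattern in the question, one possible way to let line breaks through is to add the vertical-whitespace characters to the character class. The sketch below is an assumption-laden Java port: the question's pattern uses PCRE-style delimiters, and the names `FormTextDemo` and `isValid` are made up here.

```java
import java.util.regex.Pattern;

class FormTextDemo {
    // Letters, marks, numbers and a plain space, as in the question, plus the
    // vertical-whitespace characters so that CR, LF, CR LF and the other
    // Unicode line separators are accepted too.
    static final Pattern ALLOWED = Pattern.compile(
        "[\\p{L}\\p{M}\\p{N} \\u000A-\\u000D\\u0085\\u2028\\u2029]+");

    // Hypothetical validator: matches() requires the whole input to match,
    // which plays the role of the ^ and $ anchors in the original pattern.
    static boolean isValid(String input) {
        return ALLOWED.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("h\u00E9llo 123\r\nline two")); // true
        System.out.println(isValid("hello!"));                     // false
    }
}
```

Whether the extra characters belong in the class or should be normalized away before validation depends on how the stored text is used later.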
