Is there a way to match any Unicode alphabetic character?
I have some documents that went through OCR conversion from PDF into HTML. Because of that, they wound up with lots of random Unicode punctuation where the converter messed up (e.g. ellipses, etc.). They also correctly contain a bunch of non-English, but still alphabetic, characters, like é, Russian characters, etc.
Is there any way to make a regex that will match any Unicode alphabetic character (from the alphabets of any language)? Or one that will only match non-alphabetic characters? Either one would be really helpful and awesome. I'm using Perl, if that changes anything. Thanks!
Comments (2)
Check out Unicode character properties: http://www.regular-expressions.info/unicode.html#prop. I think what you are looking for is probably \p{L}, which will match any letter or ideograph. You may also want to include letters with marks on them, in which case you could use \p{L}\p{M}*.
In any case, all the different types of character properties are detailed in the first link.
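As a rough sketch of how this could look in Perl, assuming the properties meant here are \p{L} (any letter) and \p{M} (combining marks): the sample string and variable names below are made up for illustration, and the text is written with \x{...} escapes so the example does not depend on the source file's encoding.

```perl
use strict;
use warnings;

# Hypothetical sample: "Café … привет «naïve»?" spelled with escapes.
my $text = "Caf\x{e9} \x{2026} \x{43f}\x{440}\x{438}\x{432}\x{435}\x{442} "
         . "\x{ab}na\x{ef}ve\x{bb}?";

# \p{L} matches a single code point whose general category is Letter.
my @letters = $text =~ /\p{L}/g;

# \p{L}\p{M}* keeps a base letter together with any combining marks that
# follow it, e.g. "e" followed by U+0301 COMBINING ACUTE ACCENT.
my $decomposed = "e\x{301}";
my @units = $decomposed =~ /(\p{L}\p{M}*)/g;

# Negated class: delete everything that is not a letter, mark, or whitespace.
(my $letters_only = $text) =~ s/[^\p{L}\p{M}\s]+//g;

binmode STDOUT, ':encoding(UTF-8)';
print scalar(@letters), " letters found\n";
print scalar(@units), " letter-plus-marks unit(s) in the decomposed string\n";
print "$letters_only\n";
```

The capitalised form \P{L} is the complement of \p{L}, so it could serve for the "only match non-alphabetic characters" variant of the question.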
Edit: You may also want to look at this Stack Overflow answer discussing whether \w matches Unicode characters. It suggests that you could also use \p{Word} or \p{Alnum}: "Does \w match all alphanumeric characters defined in the Unicode standard?"
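To see how these classes differ in practice, here is a small Perl sketch. The classify helper and the three test characters are invented for this illustration. It only shows that \w and \p{Alnum} accept more than letters, which matters if the goal is strictly alphabetic matching.

```perl
use strict;
use warnings;

# Hypothetical helper: report which of three classes a single character matches.
sub classify {
    my ($ch) = @_;
    return join ' ',
        ($ch =~ /^\w$/        ? 'w'     : '-'),
        ($ch =~ /^\p{L}$/     ? 'L'     : '-'),
        ($ch =~ /^\p{Alnum}$/ ? 'Alnum' : '-');
}

my @samples = (
    [ 'Cyrillic zhe (U+0436)', "\x{436}" ],
    [ 'digit 7',               "7"       ],
    [ 'underscore',            "_"       ],
);

printf "%-22s %s\n", $_->[0], classify($_->[1]) for @samples;
```

In this sketch the digit matches \w and \p{Alnum} but not \p{L}, and the underscore matches only \w, so \p{L} is the narrowest of the three.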
Depending on which language you're using, the regular expression engine may or may not be Unicode aware. If it is, it may or may not know the \p{} property tokens. If it does, your answer is in "Unicode Characters and Properties" in Jan Goyvaerts' regex tutorial. You can use \p{Latin}, if supported, to detect everything that is (or isn't, of course) written in the Unicode Latin script.
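A minimal Perl sketch of script-based matching follows. The script_of helper and the sample words are hypothetical. It relies only on the \p{Latin} and \p{Cyrillic} script properties, which Perl's regex engine supports.

```perl
use strict;
use warnings;

# Hypothetical helper: name the script if every character belongs to it.
sub script_of {
    my ($word) = @_;
    return 'Latin'    if $word =~ /^\p{Latin}+$/;
    return 'Cyrillic' if $word =~ /^\p{Cyrillic}+$/;
    return 'other or mixed';
}

my @words = (
    "r\x{e9}sum\x{e9}",          # "résumé"
    "\x{43c}\x{438}\x{440}",     # "мир"
    "a\x{43c}",                  # Latin "a" followed by Cyrillic "м"
);

binmode STDOUT, ':encoding(UTF-8)';
print "$_: ", script_of($_), "\n" for @words;
```

Script properties cover letters only within one writing system, so for "any letter in any language" a general category such as \p{L} is likely the better fit. \P{Latin} gives the negated form mentioned above.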