How can I run a regex that tests text for characters in a particular alphabet or script?
I'd like to make a regex in Perl that will test a string for characters in a particular script. This would be something like:
$text =~ .*P{'Chinese'}.*
Is there a simple way of doing this? For English it's pretty easy: just test for [a-zA-Z]. But for a script like Chinese, or one of the Japanese scripts, I can't figure out any way of doing this short of writing out every character explicitly, which would make for some very ugly code. Ideas? I can't be the first/only person that's wanted to do this.
Comments (2)
There are two ways of doing that: by block (`\p{Block=...}`) and by script (`\p{Script=...}`). The latter is probably more natural.

I don't know much about Chinese languages, but I think you want `\p{Script=Han}` (aka `\p{Han}`) for Chinese.

Japanese uses three scripts:

- `\p{Script=Han}` aka `\p{Han}`
- `\p{Script=Hiragana}` aka `\p{Hiragana}` aka `\p{Hira}`
- `\p{Script=Katakana}` aka `\p{Katakana}` aka `\p{Kana}`

You could take a look at perluniprops to find the one you are looking for, or you could use `uniprops`* to find which properties match a specific character. To find out which characters are in a given property, you can use `unichars`*. (This is of limited usefulness since most CJK chars aren't named.)

* `uniprops` and `unichars` are available from the Unicode::Tussle distro.
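
To make the script properties above concrete, here is a minimal sketch of testing a string for Han, Hiragana, and Katakana characters. The helper name `scripts_in` and the sample strings are made up for illustration. It spells out `\p{Script=...}` rather than the short forms because, as far as I know, newer Perls (5.26 and later) treat the short form `\p{Han}` as a Script_Extensions test, which is slightly broader.

```perl
use strict;
use warnings;
use utf8;    # the string literals below are UTF-8 source text

# Hypothetical helper: returns the names of the scripts (out of the three
# used to write Japanese) that occur at least once in $text.
sub scripts_in {
    my ($text) = @_;
    my @found;
    push @found, 'Han'      if $text =~ /\p{Script=Han}/;
    push @found, 'Hiragana' if $text =~ /\p{Script=Hiragana}/;
    push @found, 'Katakana' if $text =~ /\p{Script=Katakana}/;
    return @found;
}

binmode STDOUT, ':encoding(UTF-8)';    # so the samples print cleanly
for my $sample ('hello', '漢字', 'ひらがな', 'カタカナ', '日本語のテキスト') {
    my @scripts = scripts_in($sample);
    printf "%s: %s\n", $sample,
        @scripts ? join(', ', @scripts) : 'none of the three';
}
```

Note that the text has to be decoded character data (hence `use utf8` for the literals); input read from a file or STDIN would need decoding first, or the properties would be tested against raw bytes.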

Look at perldoc perluniprops, which provides an exhaustive list of properties you can use with `\p`. You'll be interested in `\p{CJK_Unified_Ideographs}` and related properties such as `\p{CJK_Symbols_And_Punctuation}`. `\p{Hiragana}` and `\p{Katakana}` give you the kana. There is also a `\p{Script=...}` property for a number of scripts: `\p{Han}` and `\p{Script=Han}` match Han characters (Chinese), but there is no corresponding `\p{Script=Japanese}`, quite simply because Japanese has multiple scripts.
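
The difference between block and script properties is easier to see on a few code points. The following is only a sketch: the helper `classify` and the choice of sample characters are mine, and the block names are written in the explicit `\p{Block=...}` form to avoid any ambiguity with script names.

```perl
use strict;
use warnings;

# Code points are written as escapes so the example does not depend on
# the encoding of the source file.
my %samples = (
    'U+6F22' => "\x{6F22}",    # a common Han ideograph
    'U+3400' => "\x{3400}",    # Han ideograph from the Extension A block
    'U+3001' => "\x{3001}",    # ideographic comma
);

# Hypothetical helper: reports which of three properties a character has.
sub classify {
    my ($char) = @_;
    return {
        script_han  => ($char =~ /\p{Script=Han}/                       ? 1 : 0),
        block_cjk   => ($char =~ /\p{Block=CJK_Unified_Ideographs}/     ? 1 : 0),
        block_punct => ($char =~ /\p{Block=CJK_Symbols_And_Punctuation}/ ? 1 : 0),
    };
}

for my $name (sort keys %samples) {
    my $r = classify($samples{$name});
    printf "%s  Script=Han:%d  Block=CJK_Unified_Ideographs:%d  Block=CJK_Symbols_And_Punctuation:%d\n",
        $name, $r->{script_han}, $r->{block_cjk}, $r->{block_punct};
}
```

U+3400 is a Han character that sits outside the main CJK_Unified_Ideographs block, and U+3001 sits in a CJK block without being a Han character, which is one reason the script test tends to be the more natural fit for "is this Chinese text?".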