How can I write a regex to test text for characters in a particular alphabet or script?

Posted on 2024-12-19 01:25:45

I'd like to make a regex in Perl that will test a string for characters in a particular script. This would be something like:

$text =~ .*P{'Chinese'}.*

Is there a simple way of doing this? For English it's pretty easy: just test for [a-zA-Z]. But for a script like Chinese, or one of the Japanese scripts, I can't figure out any way of doing this short of writing out every character explicitly, which would make for some very ugly code. Ideas? I can't be the first or only person who has wanted to do this.

Comments (2)

星光不落少年眉 2024-12-26 01:25:46

There are two ways of doing that. By block (\p{Block=...}) and by script (\p{Script=...}). The latter is probably more natural.

I don't know much about Chinese languages, but I think you want \p{Script=Han} aka \p{Han} for Chinese.

Japanese uses three scripts:

  • Kanji: \p{Script=Han} aka \p{Han}
  • Hiragana: \p{Script=Hiragana} aka \p{Hiragana} aka \p{Hira}
  • Katakana: \p{Script=Katakana} aka \p{Katakana} aka \p{Kana}
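
As a minimal sketch of how the script properties above might be used in practice: the helper below reports which of the three scripts occur in a string. It assumes the text has already been decoded into Perl characters (here the literals are covered by `use utf8`), and the name `scripts_in` is made up for illustration.

```perl
use strict;
use warnings;
use utf8;    # the string literals below are UTF-8 source text

# Hypothetical helper: list which of Han, Hiragana, Katakana occur in $text.
sub scripts_in {
    my ($text) = @_;
    my @found;
    push @found, 'Han'      if $text =~ /\p{Script=Han}/;
    push @found, 'Hiragana' if $text =~ /\p{Script=Hiragana}/;
    push @found, 'Katakana' if $text =~ /\p{Script=Katakana}/;
    return @found;
}

binmode STDOUT, ':encoding(UTF-8)';
for my $text ('hello', '中文', 'ひらがな', 'カタカナ', '日本語のテキスト') {
    my @found = scripts_in($text);
    printf "%s: %s\n", $text, @found ? join(', ', @found) : 'none';
}
```

Text read from a file or socket would need decoding first (for example with an `:encoding(UTF-8)` layer), otherwise the properties are tested against raw bytes rather than characters.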

You could take a look at perluniprops to find the one you are looking for, or you could use uniprops* to find which properties match a specific character.

$ uniprops 4E2D
U+4E2D ‹中› \N{CJK UNIFIED IDEOGRAPH-4E2D}
    \w \pL \p{L_} \p{Lo}
    All Any Alnum Alpha Alphabetic Assigned InCJK_UnifiedIdeographs
    CJK_Unified_Ideographs L Lo Gr_Base Grapheme_Base Graph GrBase
    Han Hani ID_Continue IDC ID_Start IDS Ideo Ideographic Letter
    L_ Other_Letter Print UIdeo Unified_Ideograph Word XID_Continue
    XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph
    X_POSIX_Print X_POSIX_Word

To find out which characters are in a given property, you can use unichars*. (This is of limited usefulness since most CJK chars aren't named.)

$ unichars -au '\p{Han}'
 ⺀ U+2E80 CJK RADICAL REPEAT
 ⺁ U+2E81 CJK RADICAL CLIFF
 ⺂ U+2E82 CJK RADICAL SECOND ONE
 ⺃ U+2E83 CJK RADICAL SECOND TWO
 ⺄ U+2E84 CJK RADICAL SECOND THREE
 ⺅ U+2E85 CJK RADICAL PERSON
 ⺆ U+2E86 CJK RADICAL BOX
 ⺇ U+2E87 CJK RADICAL TABLE
 ⺈ U+2E88 CJK RADICAL KNIFE ONE
...

* — uniprops and unichars are available from the Unicode::Tussle distro.
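
If Unicode::Tussle is not installed, a rough substitute is to loop over code points and test each one against the property. This is only a sketch: `first_matches` is an illustrative name, not part of any module, and the example spells out `\p{Script=Han}` rather than relying on the short form.

```perl
use strict;
use warnings;

# Collect up to $limit code points in [$from, $to] whose character matches $re.
sub first_matches {
    my ($re, $from, $to, $limit) = @_;
    my @hits;
    for my $cp ($from .. $to) {
        next if $cp >= 0xD800 && $cp <= 0xDFFF;    # skip surrogates
        push @hits, $cp if chr($cp) =~ $re;
        last if @hits >= $limit;
    }
    return @hits;
}

# Print the first five Han-script code points in the Basic Multilingual Plane.
printf "U+%04X\n", $_ for first_matches(qr/\p{Script=Han}/, 0x0000, 0xFFFF, 5);
```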

长不大的小祸害 2024-12-26 01:25:45

Look at perldoc perluniprops, which provides an exhaustive list of properties you can use with \p. You’ll be interested in \p{CJK_Unified_Ideographs} and related properties such as \p{CJK_Symbols_And_Punctuation}. \p{Hiragana} and \p{Katakana} give you the kana. There is also a \p{Script=...} property for a number of scripts: \p{Han} and \p{Script=Han} match Han characters (Chinese), but there is no corresponding \p{Script=Japanese}, quite simply because Japanese has multiple scripts.
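
Note that \p{CJK_Unified_Ideographs} names a block, while \p{Han} names a script, and the two do not cover the same characters. The sketch below compares them on a few code points; `classify` is a hypothetical helper written for this illustration.

```perl
use strict;
use warnings;

# Report whether a code point is in the CJK Unified Ideographs block
# and whether it belongs to the Han script.
sub classify {
    my ($cp) = @_;
    my $ch = chr $cp;
    my $in_block  = $ch =~ /\p{Block=CJK_Unified_Ideographs}/ ? 1 : 0;
    my $in_script = $ch =~ /\p{Script=Han}/                   ? 1 : 0;
    return "block=$in_block script=$in_script";
}

# U+4E2D is an ideograph in the main block, U+2E80 is a CJK radical
# (Han script, different block), U+3042 is a hiragana letter.
for my $cp (0x4E2D, 0x2E80, 0x3042) {
    printf "U+%04X %s\n", $cp, classify($cp);
}
```

Matching by script picks up Han characters that live outside the main ideograph block, such as the radicals, which is one reason the script property tends to be the more natural choice for "does this text contain Chinese characters".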
