Regex to filter Japanese
I want to allow A-Z, a-z, and Japanese kanji, hiragana, and katakana and nothing else.
So far I've come up with this:
$pattern = '/[^\w\x{3041}-\x{3094}\x{30A1}-\x{30fA}\x{30fC}\x{4E00}-\x{9FFF}_\-]+/u';
$string = preg_replace($pattern, '', $string);
I'm not sure if this form of regex is PHP specific. I'm accepting a string in the URL and want to filter out quotes and other "dangerous" characters. The odd thing about the pattern above is that, with or without "\d", digits are never matched (and so never removed).
So the following does the same thing:
$pattern = '/[^\d\w\x{3041}-\x{3094}\x{30A1}-\x{30fA}\x{30fC}\x{4E00}-\x{9FFF}_\-]+/u';
I'm interested in any improvements or corrections - not being a regex wizard myself.
Comments (2)
In Unicode:
U+3040-U+309F: Hiragana, including a few old characters.
U+30A0-U+30FF: Katakana, including a few symbols.
However, U+4E00-U+9FFF is assigned to the CJK Unified Ideographs, not to Japanese alone.
CJK stands for Chinese, Japanese, and Korean.
You can describe the CJK range by code point, but you cannot describe a Japanese-only kanji range that way, because Chinese and Japanese characters are mixed within the CJK block rather than kept separate. The two languages share some characters, but many differ because each side evolved its own forms.
See the following site. The page is quite heavy, and your computer needs enough fonts installed to display it.
http://www.tamasoft.co.jp/en/general-info/unicode.html
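As a rough sketch of the point above, PCRE's Unicode script properties can stand in for the hand-written ranges. The helper name keep_letters_and_japanese is made up for illustration and is not from the question. Note that \p{Han} accepts any CJK ideograph, so Chinese-only characters pass as well, which is exactly the limitation described in this answer.

```php
<?php
// Hypothetical helper (not from the original post): remove everything except
// ASCII letters, hiragana, katakana, Han ideographs and the long-vowel mark.
function keep_letters_and_japanese(string $input): string
{
    // \p{Han} matches CJK ideographs in general, not only Japanese kanji.
    // U+30FC (the prolonged sound mark) is listed explicitly because its
    // Unicode script is Common rather than Katakana.
    $pattern = '/[^A-Za-z\p{Hiragana}\p{Katakana}\p{Han}\x{30FC}]+/u';

    // preg_replace returns null on failure (for example, invalid UTF-8 input).
    return preg_replace($pattern, '', $input) ?? '';
}

echo keep_letters_and_japanese("abc123 日本語ひらがなカタカナ'\"<>"), "\n";
```

Digits, spaces, quotes and angle brackets are all dropped here because none of them appear in the allowed set.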
\w includes digits; it's equivalent to [A-Za-z0-9_]. So either way you're allowing them.
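A small check of this, using ASCII-only input and made-up variable names, might look like the following. It compares the class with and without \d, then shows one way to actually exclude digits by spelling out the letters instead of using \w.

```php
<?php
// Illustrative patterns (simplified from the question): the only difference is \d.
$withoutD = '/[^\w\-]+/';
$withD    = '/[^\d\w\-]+/';
$input    = "abc_123-'\"<>";

// Both patterns strip the same characters, because \w already covers digits.
$same = preg_replace($withoutD, '', $input) === preg_replace($withD, '', $input);
var_dump($same); // bool(true)

// To drop digits too, list the allowed ASCII letters explicitly instead of \w.
$lettersOnly = preg_replace('/[^A-Za-z_\-]+/', '', $input);
echo $lettersOnly, "\n"; // abc_-
```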