How to find out the form of punctuation in UTF-8?
I have a set of characters like ".", "!", "?", ";" and " " (space), and a string which may or may not be UTF-8 (any language).
Is there an easy way to find out whether the string contains one of the characters in the set above?
For example:
这是一个在中国的字符串。
which translates to
This is a string in Chinese.
The dot character looks different in the first string. Is that a totally different character, or the UTF-8 counterpart of the dot?
Or maybe there's a list somewhere of Unicode punctuation character codes?
Comments (3)
In Unicode there are character properties (PHP Docs), for example Symbols, Letters and the like. You can search for any string of a specific class with preg_match (Docs) and the u modifier.
However, your string needs to be UTF-8 to do that.
You can test this on your own: I created a little script that tests for all properties via preg_match.
Related: PHP - Fast way to strip all characters not displayable in browser from utf8 string.
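The answerer's script is not reproduced on this page. As a rough sketch of the same idea, the following checks a single character against the one-letter general Unicode categories using preg_match with the u modifier. The function name unicode_classes and the category labels are assumptions made for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper (not the answerer's original script): returns the
// names of the general Unicode categories that the given character matches.
function unicode_classes(string $char): array
{
    // One-letter general category properties understood by PCRE.
    $classes = [
        'L' => 'Letter',
        'M' => 'Mark',
        'N' => 'Number',
        'P' => 'Punctuation',
        'S' => 'Symbol',
        'Z' => 'Separator',
        'C' => 'Other',
    ];

    $matched = [];
    foreach ($classes as $property => $name) {
        // The u modifier makes PCRE treat pattern and subject as UTF-8.
        if (preg_match('/^\p{' . $property . '}$/u', $char) === 1) {
            $matched[] = $name;
        }
    }
    return $matched;
}
```

Under these assumptions, both "." (U+002E) and "。" (U+3002) are reported as Punctuation, while a plain space is reported as Separator.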
Yes, 。 (U+3002, IDEOGRAPHIC FULL STOP) is a totally different character than . (U+002E, FULL STOP).
If you want to find out whether a string contains one of the listed characters, you can use a regular expression, for example with preg_match. It returns either 0 or 1 (or false on failure), and $match will contain the matched character. For this to work, it's important that your string in $str is properly encoded in UTF-8.
If you want to match any Unicode punctuation character, you can use the pattern \p{P}, which describes the Unicode character property, instead.
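The regular expressions from this answer did not survive on this page. A minimal sketch of what the two approaches might look like follows, assuming preg_match with the u modifier. The helper names find_listed_char and find_punctuation are hypothetical and chosen only for this example.

```php
<?php
// Hypothetical helper: looks for one of the characters listed in the
// question (. ! ? ; and space) and returns the first one found, or null.
function find_listed_char(string $str): ?string
{
    // Inside a character class, . ! ? ; need no escaping.
    if (preg_match('/[.!?; ]/u', $str, $match) === 1) {
        return $match[0];
    }
    // No match, or preg_match failed (for example on invalid UTF-8).
    return null;
}

// Hypothetical helper: looks for any Unicode punctuation character
// (property \p{P}) and returns the first one found, or null.
function find_punctuation(string $str): ?string
{
    if (preg_match('/\p{P}/u', $str, $match) === 1) {
        return $match[0];
    }
    return null;
}
```

With the Chinese example from the question, find_listed_char finds nothing because 。 is not the ASCII full stop, whereas find_punctuation returns 。 since U+3002 belongs to the punctuation category.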
You are not trying to transliterate, you are trying to translate!
UTF-8 is not a language; it is a Unicode character encoding that supports (virtually) all languages known in the world.
What you are trying to do is something like this:
That does not work with your Chinese example.