How to find out forms of punctuation in UTF-8?

Posted on 2024-12-08 10:48:48

I have a set of characters like

., !, ?, ;, (space)

and a string, which may or may not be UTF-8 (any language).

Is there an easy way to find out whether the string contains any of the characters in the set above?

For example:

这是一个在中国的字符串。

which translates to

This is a string in Chinese.

The dot character looks different in the first string. Is that a totally different character, or the UTF-8 counterpart of the dot?

Or maybe there's a list somewhere with Unicode punctuation character codes?

Comments (3)

沉默的熊 2024-12-15 10:48:48

In Unicode there are character properties (described in the PHP docs), for example symbols, letters and the like. You can search a string for characters of a specific class with preg_match and the u modifier.

echo preg_match('/\pP$/u', $str);

However, your string needs to be UTF-8 to do that.
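
A minimal sketch of that check follows. The helper name `endsWithPunctuation` is hypothetical and not part of the original answer. It relies on the fact that, with the `u` modifier, `preg_match` returns `false` (rather than `0`) when the subject is not valid UTF-8, and it assumes the source file itself is saved as UTF-8.

```php
<?php
// Hypothetical helper (not from the original answer): reports whether the
// last character of $str is in the Unicode Punctuation category (P).
// Returns null when $str is not valid UTF-8 and the check cannot be made.
function endsWithPunctuation(string $str): ?bool
{
    // \pP is shorthand for \p{P}; the u modifier enables UTF-8 mode.
    $result = preg_match('/\pP$/u', $str);
    if ($result === false) {
        return null; // preg_match failed, e.g. because of malformed UTF-8
    }
    return $result === 1;
}

var_dump(endsWithPunctuation("Test."));                     // bool(true)
var_dump(endsWithPunctuation("这是一个在中国的字符串。")); // bool(true)
var_dump(endsWithPunctuation("no punctuation"));            // bool(false)
var_dump(endsWithPunctuation("\xC3\x28"));                  // NULL (invalid UTF-8)
```

Returning `null` for invalid input is a design choice of this sketch; throwing an exception or converting the string first would be equally reasonable.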

You can test this yourself. I created a little script that tests for all properties via preg_match:

Looking for properties of last character in "Test.":
Found Punctuation (P).
Found Other punctuation (Po).

Looking for properties of last character in "这是一个在中国的字符串。":
Found Punctuation (P).
Found Other punctuation (Po).
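
The script itself is not reproduced on this page. As a rough sketch of how such a test could look (this is not the author's actual script), the snippet below loops over a hand-picked subset of property codes and reports which ones match the last character. The constant `PROPERTIES`, the helper `lastCharProperties`, and the selection of properties are assumptions made for illustration.

```php
<?php
// Illustrative only: a small subset of the Unicode general categories that
// PCRE supports, mapped to readable names. A full script would list more.
const PROPERTIES = [
    'L'  => 'Letter',
    'N'  => 'Number',
    'P'  => 'Punctuation',
    'Po' => 'Other punctuation',
    'Ps' => 'Open punctuation',
    'S'  => 'Symbol',
    'Z'  => 'Separator',
    'Zs' => 'Space separator',
];

// Hypothetical helper: returns the property codes matching the last character.
function lastCharProperties(string $str): array
{
    $found = [];
    foreach (PROPERTIES as $code => $name) {
        // \p{..} requires the u modifier; \z anchors at the very end.
        if (preg_match('/\p{' . $code . '}\z/u', $str) === 1) {
            $found[] = $code;
        }
    }
    return $found;
}

foreach (["Test.", "这是一个在中国的字符串。"] as $str) {
    echo "Looking for properties of last character in \"$str\":\n";
    foreach (lastCharProperties($str) as $code) {
        echo "Found " . PROPERTIES[$code] . " ($code).\n";
    }
    echo "\n";
}
```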

Related: PHP - Fast way to strip all characters not displayable in browser from utf8 string.

青衫负雪 2024-12-15 10:48:48

Yes, 。 (U+3002, IDEOGRAPHIC FULL STOP) is a totally different character than . (U+002E, FULL STOP). If you want to find out whether a string contains one of the listed characters, you can use regular expressions:

preg_match('/[.!?;。]/u', $str, $match)

This will return either 0 or 1, and $match will contain the matched character. For this to work, the string in $str must be properly encoded in UTF-8.
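
As a usage sketch, the following runs that pattern against both sentences from the question and prints the matched character with its UTF-8 bytes, which makes the difference between the two full stops visible. The helper name `firstListedChar` and the third sample string are assumptions, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: returns the first character of $str that is in the
// listed set, or null if none is found (or if preg_match fails).
function firstListedChar(string $str): ?string
{
    if (preg_match('/[.!?;。]/u', $str, $match) === 1) {
        return $match[0]; // the whole match is a single character
    }
    return null;
}

$samples = [
    "This is a string in chinese.",
    "这是一个在中国的字符串。",
    "no match here",
];

foreach ($samples as $str) {
    $char = firstListedChar($str);
    if ($char === null) {
        echo "no listed character found\n";
    } else {
        // bin2hex shows the UTF-8 bytes: 2e for U+002E, e38082 for U+3002.
        echo "matched '$char' (UTF-8 bytes: " . bin2hex($char) . ")\n";
    }
}
```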

If you want to match any Unicode punctuation character, you can use the pattern \p{P} to describe the Unicode character property instead:

/\p{P}/u
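
A short sketch of this pattern in use, collecting every punctuation character with `preg_match_all`. The helper name `punctuationIn` and the sample sentence are assumptions for illustration.

```php
<?php
// Hypothetical helper: collects every character in the Unicode Punctuation
// category (P) from a UTF-8 string, in order of appearance.
function punctuationIn(string $str): array
{
    // preg_match_all returns false on failure, e.g. for invalid UTF-8 input.
    if (preg_match_all('/\p{P}/u', $str, $matches) === false) {
        return [];
    }
    return $matches[0];
}

// Prints the ASCII comma and exclamation mark, then the fullwidth comma
// (U+FF0C) and the ideographic full stop (U+3002).
print_r(punctuationIn("Hello, world! 你好，世界。"));
```

Note that the space from the question's set is not punctuation in Unicode terms (it belongs to category Zs), so a pattern such as `/[\p{P} ]/u` would be needed to cover it as well.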

哑剧 2024-12-15 10:48:48

You are not trying to transliterate, you are trying to translate!

UTF-8 is not a language; it is an encoding of the Unicode character set, which covers (virtually) all languages known in the world.

What you are trying to do is something like this:

echo iconv("UTF-8", "ASCII//TRANSLIT//IGNORE",  "这是一个在中国的字符串。");
echo iconv("UTF-8", "ASCII//TRANSLIT//IGNORE",  "à è ò ù");

That does not work with your Chinese example.
