How to find out punctuation forms in UTF-8?

Posted on 2024-12-08 10:48:48

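I have a set of characters, for example

. ! ? ; (space)

and a string that may or may not be UTF-8 (in any language).

Is there an easy way to find out whether the string contains one of the characters in the set above?

For example:

This is a string in chinese.

which translates to

这是一个在中国的字符串。

The dot character at the end of the Chinese string looks different. Is it a completely different character, or the UTF-8 counterpart of the dot?

Or is there perhaps a list of Unicode punctuation character codes somewhere?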

沉默的熊 2024-12-15 10:48:48

Unicode defines character properties (see the PHP docs), for example symbols, letters and the like. You can test a string for characters of a specific class with preg_match and the u modifier.

echo preg_match('/\p{P}$/u', $str);

However, your string needs to be UTF-8 to do that.

You can test this on your own. I created a little script that tests the last character of a string for all properties via preg_match:

Looking for properties of last character in "Test.":
Found Punctuation (P).
Found Other punctuation (Po).

Looking for properties of last character in "这是一个在中国的字符串。":
Found Punctuation (P).
Found Other punctuation (Po).
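
The script itself is not included in the answer. A minimal sketch of what such a test might look like is given below; the property list and the function name lastCharProperties are assumptions made for illustration, and only a handful of the properties PCRE supports are listed.

```php
<?php
// Hypothetical, hand-picked subset of Unicode properties to test for.
$properties = [
    'L'  => 'Letter',
    'N'  => 'Number',
    'P'  => 'Punctuation',
    'Po' => 'Other punctuation',
    'Ps' => 'Open punctuation',
    'Pe' => 'Close punctuation',
    'S'  => 'Symbol',
    'Z'  => 'Separator',
];

// Returns a label for every listed property that the last character of $str has.
function lastCharProperties(string $str, array $properties): array
{
    $found = [];
    foreach ($properties as $code => $name) {
        // The u modifier makes PCRE treat pattern and subject as UTF-8.
        if (preg_match('/\p{' . $code . '}$/u', $str) === 1) {
            $found[] = "$name ($code)";
        }
    }
    return $found;
}

foreach (['Test.', '这是一个在中国的字符串。'] as $str) {
    echo "Looking for properties of last character in \"$str\":\n";
    foreach (lastCharProperties($str, $properties) as $label) {
        echo "Found $label.\n";
    }
    echo "\n";
}
```

Keeping the property list in an array makes it easy to add further categories, such as Pd (dash punctuation), without touching the matching logic.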

青衫负雪 2024-12-15 10:48:48

Yes, 。 (U+3002, IDEOGRAPHIC FULL STOP) is a totally different character than . (U+002E, FULL STOP). If you want to find out whether a string contains one of the listed characters, you can use regular expressions:

preg_match('/[.!?;。]/u', $str, $match)

This returns 1 if a match is found and 0 if not (or false on error, for example when the subject is not valid UTF-8), and $match[0] will contain the matched character. For this to work, the string in $str must be properly encoded in UTF-8.
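
Since the question says the string may or may not be UTF-8, it can help to tell "no match" apart from "invalid input". The sketch below wraps the same pattern, with the space from the question's list added, in a hypothetical helper named findListedPunctuation. It relies on preg_match returning false rather than 0 when the subject is not valid UTF-8 under the u modifier.

```php
<?php
// Hypothetical helper: returns the first listed character found, or null if none.
function findListedPunctuation(string $str): ?string
{
    // Character class: . ! ? ; space and the ideographic full stop (U+3002).
    $result = preg_match('/[.!?; 。]/u', $str, $match);

    // false signals an error, such as a subject that is not valid UTF-8.
    if ($result === false) {
        throw new InvalidArgumentException('String is not valid UTF-8');
    }

    return $result === 1 ? $match[0] : null;
}

var_dump(findListedPunctuation('这是一个在中国的字符串。'));
var_dump(findListedPunctuation('NoPunctuationHere'));
```

Comparing with === matters here, because a loose check such as if (!$result) would treat an encoding error the same way as a string without any of the characters.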

If you want to match any Unicode punctuation character, you can use the pattern \p{P} to describe the Unicode character property instead:

/\p{P}/u
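
To see which punctuation characters a string actually contains, the same pattern can be combined with preg_match_all. The following is only a sketch: the function name punctuationCodePoints is made up for illustration, and it assumes the iconv extension is available to convert each match to UTF-32BE so the code point can be read as an integer.

```php
<?php
// Hypothetical helper: maps each punctuation character in $str to its code point.
function punctuationCodePoints(string $str): array
{
    $result = [];
    // \p{P} matches any character in one of the Unicode punctuation categories.
    if (preg_match_all('/\p{P}/u', $str, $matches) > 0) {
        foreach ($matches[0] as $char) {
            // UTF-32BE stores the code point as a single big-endian 32-bit integer.
            $codePoint = unpack('N', iconv('UTF-8', 'UTF-32BE', $char))[1];
            $result[$char] = sprintf('U+%04X', $codePoint);
        }
    }
    return $result;
}

print_r(punctuationCodePoints('Test. 这是一个在中国的字符串。'));
```

For the two example strings this reports U+002E for the ASCII full stop and U+3002 for the ideographic one, which confirms that they are distinct characters.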

哑剧 2024-12-15 10:48:48

You are not trying to transliterate, you are trying to translate!

UTF-8 is not a language; it is an encoding of the Unicode character set, which supports (virtually) all languages known in the world.

What you are trying to do is something like this:

echo iconv("UTF-8", "ASCII//TRANSLIT//IGNORE",  "这是一个在中国的字符串。");
echo iconv("UTF-8", "ASCII//TRANSLIT//IGNORE",  "à è ò ù");

This does not work with your Chinese example, though.
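
Where iconv has no ASCII equivalent to offer, one alternative (not part of the answer above) is an explicit replacement table applied with strtr. The sketch below is an assumption-laden illustration: the map is hand-picked and incomplete, and the names PUNCTUATION_MAP and normalizePunctuation are hypothetical.

```php
<?php
// Hand-picked, incomplete map of CJK/full-width punctuation to ASCII (an assumption).
const PUNCTUATION_MAP = [
    '。' => '.',
    '！' => '!',
    '？' => '?',
    '；' => ';',
    '，' => ',',
];

// Replaces every mapped punctuation character and leaves all other text untouched.
function normalizePunctuation(string $str): string
{
    return strtr($str, PUNCTUATION_MAP);
}

echo normalizePunctuation('这是一个在中国的字符串。'), "\n";
```

After this normalization, a plain ASCII character class such as [.!?;] is enough to detect the punctuation, at the cost of maintaining the map by hand.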
