Python and character normalization

Posted on 2024-10-02 10:13:24

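Hello, I retrieve text-based UTF-8 data from an external source. It contains special characters such as u"ıöüç", and I want to normalize them to plain English letters, e.g. "ıöüç" -> "iouc". What is the best way to achieve this?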

兰花执着 2024-10-09 10:13:24

I recommend using the Unidecode module:

>>> from unidecode import unidecode
>>> unidecode(u'ıöüç')
'iouc'

Note that you pass it a Unicode string and get back a plain string: a byte string on Python 2 and a str on Python 3. Either way, the output contains only ASCII characters.

梦情居士 2024-10-09 10:13:24

It all depends on how far you want to go in transliterating the result. If you want to convert everything all the way to ASCII (αβγ to abg), then Unidecode is the way to go.

If you just want to remove accents from accented letters, you could decompose your string using normalization form NFKD (this converts the accented letter á to a plain letter a followed by U+0301 COMBINING ACUTE ACCENT) and then discard the accents, which belong to the Unicode general category Mn ("Mark, nonspacing").

import unicodedata

def remove_nonspacing_marks(s):
    "Decompose the unicode string s and remove non-spacing marks."
    return ''.join(c for c in unicodedata.normalize('NFKD', s)
                   if unicodedata.category(c) != 'Mn')
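
A quick check of the function above on the string from the question shows one limitation worth knowing: ö, ü and ç decompose into a base letter plus a combining mark, but the dotless ı (U+0131) has no Unicode decomposition, so it passes through unchanged. The sketch below repeats the function so that it runs on its own in Python 3 and adds a small hand-written fallback table. `FALLBACK` and `to_plain_letters` are hypothetical names used for illustration, and the table entries are an assumption about the mapping you want, not something unicodedata provides.

```python
import unicodedata

def remove_nonspacing_marks(s):
    """Decompose the string s and remove non-spacing marks (category Mn)."""
    return ''.join(c for c in unicodedata.normalize('NFKD', s)
                   if unicodedata.category(c) != 'Mn')

# Hypothetical fallback table for letters that have no decomposition.
# The choice of replacements is an assumption; extend it for your data.
FALLBACK = str.maketrans({'ı': 'i', 'ø': 'o', 'ß': 'ss'})

def to_plain_letters(s):
    """Strip accents, then apply the hand-written fallback table."""
    return remove_nonspacing_marks(s).translate(FALLBACK)

print(remove_nonspacing_marks('ıöüç'))  # ıouc (the dotless ı survives)
print(to_plain_letters('ıöüç'))         # iouc
```

If the fallback table starts to grow, that is usually a sign that full transliteration with Unidecode is the better fit.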

浅黛梨妆こ 2024-10-09 10:13:24

The simplest way I have found:

unicodedata.normalize('NFKD', s).encode("ascii", "ignore")
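
Two caveats with this one-liner, sketched below for Python 3: `encode` returns a `bytes` object rather than a string, and the "ignore" handler silently drops every character that is still non-ASCII after decomposition. `ascii_ignore` is a hypothetical helper name used only for this illustration.

```python
import unicodedata

def ascii_ignore(s):
    """NFKD-decompose s, drop everything non-ASCII, and return a str."""
    raw = unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')
    return raw.decode('ascii')  # bytes -> str on Python 3

print(ascii_ignore('öüç'))   # ouc
print(ascii_ignore('ıöüç'))  # ouc (the dotless ı is dropped, not mapped to i)
print(ascii_ignore('αβγ'))   # prints an empty line: nothing survives
```

So this approach works well for accented Latin letters, but it loses data for anything that does not decompose to an ASCII base letter.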

や三分注定 2024-10-09 10:13:24

import unicodedata
unicodedata.normalize('NFKD', s)  # s is the string to normalize

http://docs.python.org/library/unicodedata.html
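
For reference, `normalize` takes the name of a normalization form and the string to normalize. A minimal sketch of how the four forms differ on two sample characters (the sample strings are my own choice, not from the question):

```python
import unicodedata

composed = '\u00e9'   # é as a single code point
ligature = '\ufb01'   # ﬁ, LATIN SMALL LIGATURE FI

for form in ('NFC', 'NFD', 'NFKC', 'NFKD'):
    e = unicodedata.normalize(form, composed)
    fi = unicodedata.normalize(form, ligature)
    # Show the code points of é and the resulting ligature string per form.
    print(form, [hex(ord(c)) for c in e], repr(fi))
```

The decomposing forms (NFD, NFKD) split é into e plus U+0301. The canonical forms (NFC, NFD) leave the ligature alone, and only the compatibility forms (NFKC, NFKD) replace it with the letters f and i.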

About the author

云归处
