Normalizing unicode text to filenames, etc. in Python
Are there any standalone-ish solutions for normalizing international unicode text to safe ids and filenames in Python?

E.g. turn My International Text: åäö into my-international-text-aao.

plone.i18n does a really good job, but unfortunately it depends on zope.security and zope.publisher and some other packages, which makes it a fragile dependency.
5 Answers
What you want to do is also known as "slugifying" a string. Here's a possible solution:
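A sketch of such a slugify function, modeled on the Flask "Generating Slugs" snippet this answer cites as its source (the exact punctuation regex here is illustrative):

```python
import re
import unicodedata

# Runs of punctuation/whitespace that separate words in the slug.
_punct_re = re.compile(r"""[\t !"#$%&'()*\-/<=>?@\[\\\]^_`{|},.:;]+""")

def slugify(text, delim='-'):
    """Generate an ASCII-only slug."""
    result = []
    for word in _punct_re.split(text.lower()):
        # Decompose accented characters, then drop anything non-ASCII.
        word = unicodedata.normalize('NFKD', word)
        word = word.encode('ascii', 'ignore').decode('ascii')
        if word:
            result.append(word)
    return delim.join(result)
```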
Usage:
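For instance, with the sketch above:

```python
>>> slugify('My International Text: åäö')
'my-international-text-aao'
```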
You can also change the delimiter:
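```python
>>> slugify('My International Text: åäö', delim='_')
'my_international_text_aao'
```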
Source: Generating Slugs
For Python 3: pastebin.com/ft7Yb3KS (thanks @MrPoxipol).
The way to solve this problem is to make a decision on which characters are allowed (different systems have different rules for valid identifiers).
Once you decide on which characters are allowed, write an allowed() predicate and a dict subclass for use with str.translate:
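A minimal sketch of that idea; the makesafe name is illustrative, not from the original answer:

```python
def makesafe(text, allowed, substitute=None):
    """Strip every character the *allowed* predicate rejects,
    or replace it with *substitute* if one is given."""
    class D(dict):
        # str.translate looks each codepoint up in this mapping;
        # unseen codepoints land in __missing__, where the predicate
        # decides their fate (returning None deletes the character).
        def __missing__(self, key):
            return key if allowed(chr(key)) else substitute
    return text.translate(D())
```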
This function is very flexible. It lets you easily specify rules for deciding which text is kept and which text is either replaced or removed.
Here's a simple example using the rule, "only allow characters that are in the unicode category L":
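For example, assuming the makesafe sketch above:

```python
import unicodedata

def allowed(char):
    # Unicode letter categories all start with 'L' (Lu, Ll, Lt, Lm, Lo).
    return unicodedata.category(char).startswith('L')
```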
That code produces safe output as follows:
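With the sketches above and '_' as the substitute, the question's example string would come out as:

```python
>>> makesafe('My International Text: åäö', allowed, substitute='_')
'My_International_Text__åäö'
```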
The following will remove accents from whatever characters Unicode can decompose into combining pairs, discard any weird characters it can't, and nuke whitespace:
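A sketch matching that description (the function name is illustrative, and "nuking" whitespace is read here as collapsing runs of it into a single hyphen):

```python
import re
import unicodedata

def clean_name(text):
    # NFKD splits each accented character into base + combining mark.
    decomposed = unicodedata.normalize('NFKD', text)
    # Encoding to ASCII with 'ignore' drops the combining marks and
    # discards any weird character with no ASCII-compatible decomposition.
    ascii_only = decomposed.encode('ascii', 'ignore').decode('ascii')
    # Collapse runs of whitespace into a single hyphen.
    return re.sub(r'\s+', '-', ascii_only.strip())
```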
It doesn't try to get rid of characters disallowed on filesystems, but you can steal DANGEROUS_CHARS_REGEX from the file you linked for that.
I'll throw my own (partial) solution here too:
This does not do all you want, but gives a few nice tricks wrapped up in a convenience method:
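A sketch of that convenience method, using the deaccent name the explanation below refers to:

```python
import unicodedata

def deaccent(text):
    """Remove combining accents: deaccent('åäö') -> 'aao'."""
    # NFD splits each accented character into its base character
    # followed by the combining accent codepoint(s).
    norm = unicodedata.normalize('NFD', text)
    # Category 'Mn' (Mark, nonspacing) covers the combining accents.
    stripped = ''.join(ch for ch in norm if unicodedata.category(ch) != 'Mn')
    # Recompose what is left into canonical composed form.
    return unicodedata.normalize('NFC', stripped)
```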
unicodedata.normalize('NFD', some_unicode_string) does a decomposition of unicode characters; for example, it breaks 'ä' into the two unicode codepoints U+0061 and U+0308. The other method, unicodedata.category(char), returns the unicode character category for that particular char. Category Mn contains all combining accents, so deaccent removes all accents from the words. Note, though, that this is just a partial solution: it gets rid of accents, but you still need some sort of whitelist of characters you want to allow after this.
I'd go with one of the existing slug packages on PyPI: https://pypi.python.org/pypi?%3Aaction=search&term=slug

It's hard to come up with a scenario where one of these does not fit your needs.