Normalizing Unicode text to filenames and safe IDs in Python



Are there any standalone-ish solutions for normalizing international Unicode text to safe IDs and filenames in Python?

E.g. turning My International Text: åäö into my-international-text-aao.

plone.i18n does a really good job, but unfortunately it depends on zope.security, zope.publisher and some other packages, making it a fragile dependency.

Some operations that plone.i18n applies


5 answers

女中豪杰 2025-01-05 13:50:22


What you want to do is also known as "slugifying" a string. Here's a possible solution:

import re
from unicodedata import normalize

_punct_re = re.compile(r'[\t !"#$%&\'()*\-/<=>?@\[\\\]^_`{|},.:]+')

def slugify(text, delim=u'-'):
    """Generate a slightly worse ASCII-only slug (Python 2)."""
    result = []
    for word in _punct_re.split(text.lower()):
        # Decompose accented characters, then drop any non-ASCII bytes
        word = normalize('NFKD', word).encode('ascii', 'ignore')
        if word:
            result.append(word)
    return unicode(delim.join(result))

Usage:

>>> slugify(u'My International Text: åäö')
u'my-international-text-aao'

You can also change the delimiter:

>>> slugify(u'My International Text: åäö', delim='_')
u'my_international_text_aao'

Source: Generating Slugs

For Python 3: pastebin.com/ft7Yb3KS (thanks @MrPoxipol).
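
The pastebin above isn't reproduced here, but as a rough sketch, porting the function to Python 3 mostly means decoding the bytes that encode() returns (this adaptation is mine, not necessarily the pastebin's exact code):

import re
from unicodedata import normalize

_punct_re = re.compile(r'[\t !"#$%&\'()*\-/<=>?@\[\\\]^_`{|},.:]+')

def slugify(text, delim='-'):
    """Generate an ASCII-only slug."""
    result = []
    for word in _punct_re.split(text.lower()):
        # encode() returns bytes in Python 3, so decode back to str
        word = normalize('NFKD', word).encode('ascii', 'ignore').decode('ascii')
        if word:
            result.append(word)
    return delim.join(result)

>>> slugify('My International Text: åäö')
'my-international-text-aao'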

我家小可爱 2025-01-05 13:50:22


The way to solve this problem is to decide which characters are allowed (different systems have different rules for valid identifiers).

Once you decide which characters are allowed, write an allowed() predicate and a dict subclass for use with str.translate:

def makesafe(text, allowed, substitute=None):
    ''' Remove unallowed characters from *text*.
        If *substitute* is defined, replace each such
        character with the given substitute instead.
    '''
    class D(dict):
        def __getitem__(self, key):
            # str.translate looks up each character's code point here;
            # returning None (the default substitute) deletes the character.
            return key if allowed(chr(key)) else substitute
    return text.translate(D())

This function is very flexible. It lets you easily specify rules for deciding which text is kept and which text is either replaced or removed.

Here's a simple example using the rule "only allow characters that are in the Unicode category L":

import unicodedata

def allowed(character):
    return unicodedata.category(character).startswith('L')

print(makesafe('the*ides&of*march', allowed, '_'))
print(makesafe('the*ides&of*march', allowed))

That code produces safe output as follows:

the_ides_of_march
theidesofmarch
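
Because the rule is just a predicate, it is easy to widen. For instance, a sketch that also keeps decimal digits (this variant is an illustration, not part of the original answer):

import unicodedata

def allowed(character):
    cat = unicodedata.category(character)
    # Keep letters (categories L*) and decimal digits (category Nd)
    return cat.startswith('L') or cat == 'Nd'

print(makesafe('the ides of march, 44 BC', allowed, '_'))
# the_ides_of_march__44_BC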


The following will remove accents from whatever characters Unicode can decompose into combining pairs, discard any weird characters it can't, and nuke whitespace:

# encoding: utf-8
from unicodedata import normalize
import re

original = u'ľ š č ť ž ý á í é'
decomposed = normalize("NFKD", original)       # split base letters from accents
no_accent = ''.join(c for c in decomposed if ord(c) < 0x7f)  # keep ASCII only
no_spaces = re.sub(r'\s', '_', no_accent)      # replace whitespace

print no_spaces
# output: l_s_c_t_z_y_a_i_e

It doesn't try to get rid of characters disallowed on filesystems, but you can steal DANGEROUS_CHARS_REGEX from the file you linked for that.
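
The linked file isn't reproduced here, but as a rough, hypothetical stand-in, a pattern along these lines covers characters that Windows and POSIX filesystems commonly reject (the name and character set below are assumptions, not the linked file's actual regex):

import re

# Hypothetical approximation: ASCII control characters plus the
# characters Windows forbids in filenames.
DANGEROUS_CHARS_REGEX = re.compile(r'[<>:"/\\|?*\x00-\x1f]')

print(DANGEROUS_CHARS_REGEX.sub('', 'report: 2024/Q1?.txt'))
# report 2024Q1.txt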

放低过去 2025-01-05 13:50:22


I'll throw in my own (partial) solution here too:

import unicodedata

def deaccent(some_unicode_string):
    # Drop nonspacing marks (category Mn) left over after NFD decomposition
    return u''.join(c for c in unicodedata.normalize('NFD', some_unicode_string)
                    if unicodedata.category(c) != 'Mn')

This does not do all you want, but gives a few nice tricks wrapped up in a convenience method: unicodedata.normalize('NFD', some_unicode_string) does a decomposition of Unicode characters; for example, it breaks 'ä' into the two Unicode code points U+0061 and U+0308.

The other method, unicodedata.category(char), returns the Unicode character category for that particular char. Category Mn contains all combining accents, thus deaccent removes all accents from the words.

But note that this is just a partial solution: it only gets rid of accents. You still need some sort of whitelist of characters you want to allow after this.
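
For illustration, one way to bolt such a whitelist onto deaccent; the lowercase-ASCII-plus-hyphens rule below is an assumed convention chosen to match the slug format from the question, and to_slug is a hypothetical helper name:

import re
import unicodedata

def deaccent(some_unicode_string):
    return ''.join(c for c in unicodedata.normalize('NFD', some_unicode_string)
                   if unicodedata.category(c) != 'Mn')

def to_slug(text):
    # After stripping accents, keep only ASCII letters and digits,
    # collapsing everything else into single hyphens.
    text = deaccent(text).lower()
    return re.sub(r'[^a-z0-9]+', '-', text).strip('-')

print(to_slug('My International Text: åäö'))  # my-international-text-aao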

倾城花音 2025-01-05 13:50:22


I'd go with

https://pypi.python.org/pypi?%3Aaction=search&term=slug

It's hard to come up with a scenario where one of these does not fit your needs.
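
For example, the python-slugify package, one of the hits that search turns up, handles the question's example in a single call. A quick sketch, assuming that package is installed:

# pip install python-slugify
from slugify import slugify

print(slugify('My International Text: åäö'))
# my-international-text-aao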
