How do I make Django slugify work properly with Unicode strings?

What can I do to prevent the slugify filter from stripping out non-ASCII alphanumeric characters? (I'm using Django 1.0.2)

cnprog.com has Chinese characters in its question URLs, so I looked at their code. They are not using slugify in templates; instead, they call this method on the Question model to get permalinks:

def get_absolute_url(self):
    return '%s%s' % (reverse('question', args=[self.id]), self.title)

Are they slugifying the URLs or not?

Comments (8)

表情可笑 2024-07-23 02:15:22

There is a Python package called unidecode that I've adopted for the Askbot Q&A forum. It works well for Latin-based alphabets and even looks reasonable for Greek:

>>> import unidecode
>>> from unidecode import unidecode
>>> unidecode(u'διακριτικός')
'diakritikos'

It does something weird with Asian languages:

>>> unidecode(u'影師嗎')
'Ying Shi Ma '
>>> 

Does this make sense?

In Askbot we compute slugs like so:

from unidecode import unidecode
from django.template import defaultfilters
slug = defaultfilters.slugify(unidecode(input_text))
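For reference, here is what that pipeline yields for the CJK example above. The expected output is inferred from the unidecode transliteration shown earlier, so treat it as an assumption rather than a verified result:

from unidecode import unidecode
from django.template import defaultfilters

# '影師嗎' transliterates to 'Ying Shi Ma ' (see the session above);
# slugify then lowercases the result and joins the words with hyphens.
print(defaultfilters.slugify(unidecode(u'影師嗎')))  # expected: ying-shi-ma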

水中月 2024-07-23 02:15:22

With Django >= 1.9, django.utils.text.slugify has an allow_unicode parameter:

>>> slugify("你好 World", allow_unicode=True)
"你好-world"

If you use Django <= 1.8 (which you should not, since it has been unsupported since April 2018), you can pick up the code from Django 1.9.
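A minimal sketch of how that can be wired into a model (the Question model and title field here are just illustrative assumptions); note that SlugField also accepts allow_unicode=True, so the field validation agrees with the slug you generate:

from django.db import models
from django.utils.text import slugify

class Question(models.Model):
    title = models.CharField(max_length=255)
    slug = models.SlugField(max_length=255, allow_unicode=True, blank=True)

    def save(self, *args, **kwargs):
        if not self.slug:
            # Keeps non-ASCII letters and digits, e.g. '你好 World' -> '你好-world'.
            self.slug = slugify(self.title, allow_unicode=True)
        super().save(*args, **kwargs)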

冧九 2024-07-23 02:15:22

The Mozilla website team has been working on an implementation:
https://github.com/mozilla/unicode-slugify
sample code at
http://davedash.com/2011/03/24/how-we-slug-at-mozilla/
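A rough usage sketch, assuming the package is imported as slugify and exposes the only_ascii flag that the benchmark answer further down this page uses:

from slugify import slugify  # provided by the unicode-slugify package

# The default keeps Unicode word characters in the slug;
# only_ascii=True transliterates the result down to ASCII instead.
print(slugify(u'你好 World'))
print(slugify(u'你好 World', only_ascii=True))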

牵你的手,一向走下去 2024-07-23 02:15:22

Also, the Django version of slugify doesn't use the re.UNICODE flag, so it wouldn't even attempt to understand the meaning of \w or \s as it pertains to non-ASCII characters.

This custom version is working well for me:

import re

def u_slugify(txt):
    """A custom version of slugify that retains non-ASCII characters. The purpose of this
    function in the application is to make URLs more readable in a browser, so there are
    some added heuristics to retain as much of the title meaning as possible while
    excluding characters that are troublesome to read in URLs. For example, question marks
    will be seen in the browser URL as %3F and are therefore unreadable. Although non-ASCII
    characters will also be hex-encoded in the raw URL, most browsers will display them
    as human-readable glyphs in the address bar -- those should be kept in the slug."""
    txt = txt.strip()  # strip leading/trailing whitespace
    txt = re.sub(r'\s*-\s*', '-', txt, flags=re.UNICODE)  # remove spaces before and after dashes
    txt = re.sub(r'[\s/]', '_', txt, flags=re.UNICODE)  # replace remaining spaces and slashes with underscores
    txt = re.sub(r'(\d):(\d)', r'\1-\2', txt, flags=re.UNICODE)  # replace colons between digits with dashes
    txt = re.sub(r'"', "'", txt, flags=re.UNICODE)  # replace double quotes with single quotes
    txt = re.sub(r'[?,:!@#~`+=$%^&\\*()\[\]{}<>]', '', txt, flags=re.UNICODE)  # remove some characters altogether
    return txt

Note the last regex substitution. This is a workaround for a problem with the more robust expression r'\W', which seems to either strip out some non-ASCII characters or incorrectly re-encode them, as illustrated in the following Python interpreter session:

Python 2.5.1 (r251:54863, Jun 17 2009, 20:37:34) 
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> # Paste in a non-ascii string (simplified Chinese), taken from http://globallives.org/wiki/152/
>>> str = '您認識對全球社區感興趣的中國攝影師嗎'
>>> str
'\xe6\x82\xa8\xe8\xaa\x8d\xe8\xad\x98\xe5\xb0\x8d\xe5\x85\xa8\xe7\x90\x83\xe7\xa4\xbe\xe5\x8d\x80\xe6\x84\x9f\xe8\x88\x88\xe8\xb6\xa3\xe7\x9a\x84\xe4\xb8\xad\xe5\x9c\x8b\xe6\x94\x9d\xe5\xbd\xb1\xe5\xb8\xab\xe5\x97\x8e'
>>> print str
您認識對全球社區感興趣的中國攝影師嗎
>>> # Substitute all non-word characters with X
>>> re_str = re.sub('\W', 'X', str, re.UNICODE)
>>> re_str
'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX\xa3\xe7\x9a\x84\xe4\xb8\xad\xe5\x9c\x8b\xe6\x94\x9d\xe5\xbd\xb1\xe5\xb8\xab\xe5\x97\x8e'
>>> print re_str
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX?的中國攝影師嗎
>>> # Notice above that it retained the last 7 glyphs, ostensibly because they are word characters
>>> # And where did that question mark come from?
>>> 
>>> 
>>> # Now do the same with only the last three glyphs of the string
>>> str = '影師嗎'
>>> print str
影師嗎
>>> str
'\xe5\xbd\xb1\xe5\xb8\xab\xe5\x97\x8e'
>>> re.sub('\W','X',str,re.U)
'XXXXXXXXX'
>>> re.sub('\W','X',str)
'XXXXXXXXX'
>>> # Huh, now it seems to think those same characters are NOT word characters

I am unsure what the problem is above, but I'm guessing that it stems from "whatever is classified as alphanumeric in the Unicode character properties database," and how that is implemented. I have heard that Python 3.x places a high priority on better Unicode handling, so this may already be fixed. Or maybe it is correct Python behavior and I am misusing Unicode and/or the Chinese language.

For now, a work-around is to avoid character classes, and make substitutions based on explicitly defined character sets.
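For what it's worth, a quick check on Python 3 (where str is Unicode by default, and where flags must be passed via the flags= keyword because the fourth positional argument of re.sub is count, not flags, which may be part of what went wrong above) suggests that \w does treat those CJK glyphs as word characters:

import re

text = '影師嗎'
print(re.sub(r'\W', 'X', text))           # '影師嗎' -- nothing replaced, all word characters
print(re.sub(r'\W', 'X', 'foo! 影師嗎?'))  # 'fooXX影師嗎X' -- only the punctuation and space replaced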

魔法少女 2024-07-23 02:15:22

I'm afraid Django's definition of a slug means ASCII, though the Django docs don't explicitly state this. This is the source of the default slugify filter... you can see that the value is converted to ASCII, with the 'ignore' option in case of errors:

import unicodedata
value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore')
value = unicode(re.sub('[^\w\s-]', '', value).strip().lower())
return mark_safe(re.sub('[-\s]+', '-', value))

Based on that, I'd guess that cnprog.com is not using an official slugify function. You may wish to adapt the Django snippet above if you want different behaviour.

Having said that, though, the RFC for URLs does state that non-US-ASCII characters (or, more specifically, anything other than the alphanumerics and $-_.+!*'()) should be encoded using %hex notation. If you look at the actual raw GET request that your browser sends (say, using Firebug), you'll see that the Chinese characters are in fact encoded before being sent... the browser just makes them look pretty in the display. I suspect this is why slugify insists on ASCII only, FWIW.
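To see that encoding in action, here is a quick Python illustration of the percent-encoding a browser applies (nothing Django-specific, just the standard library):

from urllib.parse import quote, unquote

encoded = quote('影師嗎')
print(encoded)           # %E5%BD%B1%E5%B8%AB%E5%97%8E -- the same UTF-8 bytes shown above
print(unquote(encoded))  # 影師嗎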

能怎样 2024-07-23 02:15:22

You might want to look at:
https://github.com/un33k/django-uuslug

It will take care of both "U"s for you: U as in unique and U as in Unicode.

It will do the job for you hassle-free.
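A minimal sketch of the sort of model integration its README describes; the Article model and field names here are assumptions for illustration:

from django.db import models
from uuslug import uuslug

class Article(models.Model):
    title = models.CharField(max_length=255)
    slug = models.CharField(max_length=255)

    def save(self, *args, **kwargs):
        # uuslug derives a slug from the title and keeps it unique among
        # Article rows, appending a suffix when it would otherwise collide.
        self.slug = uuslug(self.title, instance=self)
        super().save(*args, **kwargs)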

回忆躺在深渊里 2024-07-23 02:15:22

This is what I use:

http://trac.django-fr.org/browser/site/trunk/djangofr/links/slughifi.py

SlugHiFi is a wrapper around the regular slugify, with the difference that it replaces national characters with their English-alphabet counterparts.

So instead of "Ą" you get "A", "Ł" becomes "L", and so on.
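A minimal sketch of that idea (not the actual slughifi code, which presumably ships a much larger mapping): translate the national characters to ASCII counterparts first, then hand the result to Django's regular slugify.

from django.template.defaultfilters import slugify

# Small example mapping for Polish characters only.
PL_TO_ASCII = str.maketrans('ąćęłńóśźżĄĆĘŁŃÓŚŹŻ', 'acelnoszzACELNOSZZ')

def slughifi_like(value):
    return slugify(value.translate(PL_TO_ASCII))

# slughifi_like('Łódź żółta') -> 'lodz-zolta'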

唠甜嗑 2024-07-23 02:15:22

I am interested in allowing only ASCII characters in the slug, which is why I benchmarked some of the available tools on the same string:

  • Unicode Slugify:

    In [5]: %timeit slugify('Παίζω τρέχω %^&*@# και γ%^(λώ la fd/o', only_ascii=True)
    37.8 µs ± 86.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    
    'paizo-trekho-kai-glo-la-fdo'
    
  • Django Uuslug:

    In [3]: %timeit slugify('Παίζω τρέχω %^&*@# και γ%^(λώ la fd/o')
    35.3 µs ± 303 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    
    'paizo-trekho-kai-g-lo-la-fd-o'
    
  • Awesome Slugify:

    In [3]: %timeit slugify('Παίζω τρέχω %^&*@# και γ%^(λώ la fd/o')
    47.1 µs ± 1.94 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    
    'Paizo-trekho-kai-g-lo-la-fd-o'
    
  • Python Slugify:

    In [3]: %timeit slugify('Παίζω τρέχω %^&*@# και γ%^(λώ la fd/o')
    24.6 µs ± 122 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    
    'paizo-trekho-kai-g-lo-la-fd-o'
    
  • django.utils.text.slugify with Unidecode:

    In [15]: %timeit slugify(unidecode('Παίζω τρέχω %^&*@# και γ%^(λώ la fd/o'))
    36.5 µs ± 89.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    
    'paizo-trekho-kai-glo-la-fdo'
    