Python does not sort Unicode correctly; strcoll doesn't help

Posted on 2024-09-13 03:35:46

I've got a problem with sorting lists using unicode collation in Python 2.5.1 and 2.6.5 on OSX, as well as on Linux.

import locale   
locale.setlocale(locale.LC_ALL, 'pl_PL.UTF-8')
print [i for i in sorted([u'a', u'z', u'ą'], cmp=locale.strcoll)]

Which should print:

[u'a', u'ą', u'z']

But instead prints out:

[u'a', u'z', u'ą']

Summing up: it looks as if strcoll is broken. I have tried it with various types of variables (e.g. non-unicode encoded strings).

What am I doing wrong?

Best regards,
Tomasz Kopczuk.

一页 2024-09-20 03:35:47

Apparently, the only way for sorting to work on all platforms is to use the ICU library with PyICU bindings (PyICU on PyPI).

On OS X: sudo port install py26-pyicu, minding bug described here: https://svn.macports.org/ticket/23429 (oh the joy of using macports).

PyICU's documentation is unfortunately severely lacking, but I managed to find out how it's done:

import PyICU
collator = PyICU.Collator.createInstance(PyICU.Locale('pl_PL.UTF-8'))
print [i for i in sorted([u'a', u'z', u'ą'], cmp=collator.compare)]

which gives:

[u'a', u'ą', u'z']

Another advantage (@bobince): it is thread-safe, so it remains usable when locales are set per request.

明媚如初 2024-09-20 03:35:47

Just to add to tkopczuk's investigation: This is definitely a gcc bug, at least for version 4.2.1 on OS X 10.6.4. It can be reproduced by calling C strcoll() directly as in this snippet.

EDIT: Still on the same system, I find that the problem is present for the UTF-8 versions of de_DE, fr_FR and pl_PL, but for the ISO-8859-1 versions of fr_FR and de_DE the sort order is correct. Unfortunately for the OP, ISO-8859-2 pl_PL is also buggy:

The order for Polish ISO-8859 is:
LATIN SMALL LETTER A
LATIN SMALL LETTER Z
LATIN SMALL LETTER A WITH OGONEK
The LC_COLLATE culture and encoding settings were pl_PL, ISO8859-2.

The order for Polish Unicode is:
LATIN SMALL LETTER A
LATIN SMALL LETTER Z
LATIN SMALL LETTER A WITH OGONEK
The LC_COLLATE culture and encoding settings were pl_PL, UTF8.

The order for German Unicode is:
LATIN SMALL LETTER A
LATIN SMALL LETTER Z
LATIN SMALL LETTER A WITH DIAERESIS
The LC_COLLATE culture and encoding settings were de_DE, UTF8.

The order for German ISO-8859 is:
LATIN SMALL LETTER A
LATIN SMALL LETTER A WITH DIAERESIS
LATIN SMALL LETTER Z
The LC_COLLATE culture and encoding settings were de_DE, ISO8859-1.

The order for French ISO-8859 is:
LATIN SMALL LETTER A
LATIN SMALL LETTER E WITH ACUTE
LATIN SMALL LETTER Z
The LC_COLLATE culture and encoding settings were fr_FR, ISO8859-1.

The order for French Unicode is:
LATIN SMALL LETTER A
LATIN SMALL LETTER Z
LATIN SMALL LETTER E WITH ACUTE
The LC_COLLATE culture and encoding settings were fr_FR, UTF8.
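
The same check can be reproduced from Python rather than C. What follows is only a minimal sketch, assuming Python 3: the helper name describe_order is hypothetical, and the locale names are candidates that may or may not be installed on a given system. It sorts with locale.strxfrm and reports Unicode character names in the same style as the output above.

```python
import locale
import unicodedata


def describe_order(locale_name, chars):
    """Sort chars under locale_name and return their Unicode names.

    Returns None when the locale cannot be set on this system.
    """
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    return [unicodedata.name(c) for c in sorted(chars, key=locale.strxfrm)]


# Candidate locales and one accented letter each; adjust the names to
# whatever `locale -a` lists on your system.
samples = {"pl_PL.UTF-8": "\u0105", "de_DE.UTF-8": "\u00e4", "fr_FR.UTF-8": "\u00e9"}

for name, accented in samples.items():
    names = describe_order(name, ["a", "z", accented])
    print(name, "->", names if names is not None else "locale not available")
```

If the accented letter is reported after LATIN SMALL LETTER Z, the platform's collation table for that locale is not language-aware.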
墨离汐 2024-09-20 03:35:47

Here is how I managed to sort Persian correctly without PyICU, using Python 3.x.

First set the locale (don't forget to import locale and platform):

import locale
import platform

if platform.system() == 'Linux':
    locale.setlocale(locale.LC_ALL, 'fa_IR.UTF-8')
elif platform.system() == 'Windows':
    locale.setlocale(locale.LC_ALL, 'Persian_Iran.1256')
else:
    pass  # add a branch for any other OS here

Then sort using key:

a = ['ا','ب','پ','ت','ث','ج','چ','ح','خ','د','ذ','ر','ز','ژ','س','ش','ص','ض','ط','ظ','ع','غ','ف','ق','ک','گ','ل','م','ن','و','ه','ي']

print(sorted(a,key=locale.strxfrm))

For a list of objects:

a = [{'id':"ا"},{'id':"ب"},{'id':"پ"},{'id':"ت"},{'id':"ث"},{'id':"ج"},{'id':"چ"},{'id':"ح"},{'id':"خ"},{'id':"د"},{'id':"ذ"},{'id':"ر"},{'id':"ز"},{'id':"ژ"},{'id':"س"},{'id':"ش"},{'id':"ص"},{'id':"ض"},{'id':"ط"},{'id':"ظ"},{'id':"ع"},{'id':"غ"},{'id':"ف"},{'id':"ق"},{'id':"ک"},{'id':"گ"},{'id':"ل"},{'id':"م"},{'id':"ن"},{'id':"و"},{'id':"ه"},{'id':"ي"}]

print(sorted(a, key=lambda x: locale.strxfrm(x['id'])))

Finally, you can reset the locale to the user's default:

locale.setlocale(locale.LC_ALL, '')
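
Note that setlocale(locale.LC_ALL, '') switches to the user's default locale, which is not necessarily the one that was active before. A sketch of one way to restore the previous setting exactly is shown below; the context manager name collation_locale is hypothetical. Keep in mind that setlocale changes the locale for the whole process, so this is not safe to use from several threads at once.

```python
import locale
from contextlib import contextmanager


@contextmanager
def collation_locale(name):
    """Temporarily switch LC_COLLATE to name, then restore the old value."""
    previous = locale.setlocale(locale.LC_COLLATE)  # query without changing
    locale.setlocale(locale.LC_COLLATE, name)
    try:
        yield
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


# 'C' is used here only because it exists everywhere; substitute
# 'fa_IR.UTF-8' (or the Windows name) where that locale is installed.
with collation_locale("C"):
    print(sorted(["b", "c", "a"], key=locale.strxfrm))
```

Only LC_COLLATE is changed here because collation is the only category that strxfrm depends on; the other categories (numbers, dates, messages) are left alone.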
随波逐流 2024-09-20 03:35:47

@gnibbler, using PyICU with the sorted() function does work in a Python 3 environment. After a little digging through the ICU API documentation and some experimentation, I came across the getSortKey() function:

import PyICU
collator = PyICU.Collator.createInstance(PyICU.Locale('de_DE.UTF-8'))
sorted(['a','b','c','ä'],key=collator.getSortKey)

which produces the desired collation:

['a', 'ä', 'b', 'c']

instead of the undesired collation:

sorted(['a','b','c','ä'])
['a', 'b', 'c', 'ä']
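
The reason plain sorted() puts 'ä' last is that Python compares strings by code point. A short illustration, assuming the precomposed form of the letter (U+00E4):

```python
words = ["a", "b", "c", "\u00e4"]  # "\u00e4" is the precomposed letter a-umlaut

# Plain sorted() compares strings by code point, not by alphabet position.
for ch in words:
    print(ch, ord(ch))

print(sorted(words))
```

The code point of 'ä' is 228, which is larger than that of any unaccented ASCII letter, so it sorts after 'c' (and after 'z').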
山色无中 2024-09-20 03:35:47
import locale
from functools import cmp_to_key
iterable = [u'a', u'z', u'ą']
sorted(iterable, key=cmp_to_key(locale.strcoll))  # locale-aware sort order

(Ref.: http://docs.python.org/3.3/library/functools.html)
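
To show what cmp_to_key does independently of the platform's locale data, here is a small sketch that wraps a hand-written comparison function. The ALPHABET string and the compare function are hypothetical and cover only a few letters; they stand in for locale.strcoll purely for illustration.

```python
from functools import cmp_to_key

# Hypothetical, abbreviated ordering used only for this illustration.
ALPHABET = "aąbcćdeęz"


def compare(left, right):
    """Old-style comparison: negative, zero or positive, like strcoll."""
    return ALPHABET.index(left) - ALPHABET.index(right)


# cmp_to_key turns the two-argument comparison into a key function.
print(sorted(["z", "ą", "a"], key=cmp_to_key(compare)))
```

With locale.strcoll in place of compare, the result still depends on whether the platform's collation table for the active locale is language-aware.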

站稳脚跟 2024-09-20 03:35:47

The natsort library has been available since 2012. It includes useful functions such as natsorted and humansorted. More importantly, they work on more than just lists. Code:

from natsort import natsorted, humansorted

lst = [u"a", u"z", u"ą"]
dct = {"ą": 1, "ż": 3, "Ż": 4, "b": 5}

lst_natsorted = natsorted(lst)
lst_humansorted = humansorted(lst)
dct_natsorted = dict(natsorted(dct.items()))
dct_humansorted = dict(humansorted(dct.items()))

print("List natsorted: ", lst_natsorted)
print("List humansorted: ", lst_humansorted, "\n")
print("Dictionary natsorted: ", dct_natsorted)
print("Dictionary humansorted: ", dct_humansorted)

Output:

List natsorted:  ['a', 'ą', 'z']
List humansorted:  ['a', 'ą', 'z']

Dictionary natsorted:  {'Ż': 4, 'ą': 1, 'b': 5, 'ż': 3}  
Dictionary humansorted:  {'ą': 1, 'b': 5, 'ż': 3, 'Ż': 4}

As you can see, the results differ when sorting dictionaries, but for the given list both results are correct.

By the way, this library is also great for sorting strings that contain numbers:

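from natsort import natsorted, humansorted

lst_mixed = ["a9", "a10", "a1", "c4", "c40", "c5"]

mixed_sorted = sorted(lst_mixed)
mixed_natsorted = natsorted(lst_mixed)
mixed_humansorted = humansorted(lst_mixed)

# Print each result so the three orderings can be compared.
print("List with mixed strings sorted: ", mixed_sorted)
print("List with mixed strings natsorted: ", mixed_natsorted)
print("List with mixed strings humansorted: ", mixed_humansorted)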

Output:

List with mixed strings sorted:  ['a1', 'a10', 'a9', 'c4', 'c40', 'c5']
List with mixed strings natsorted:  ['a1', 'a9', 'a10', 'c4', 'c5', 'c40']
List with mixed strings humansorted:  ['a1', 'a9', 'a10', 'c4', 'c5', 'c40']
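
For readers who cannot install natsort, the number-aware part of this behaviour can be sketched with the standard library alone. This is a simplified illustration of the idea, not natsort's actual implementation; the function name natural_key is hypothetical.

```python
import re


def natural_key(text):
    """Split text into text and integer chunks so that 'a9' sorts before 'a10'."""
    parts = re.split(r"(\d+)", text)
    # Odd positions hold the digit runs captured by the group.
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]


lst_mixed = ["a9", "a10", "a1", "c4", "c40", "c5"]
print(sorted(lst_mixed, key=natural_key))
```

Because re.split with a capturing group always alternates text and digit runs, text is only ever compared with text and integers with integers.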
我家小可爱 2024-09-20 03:35:47

On Ubuntu Lucid, sorting with cmp seems to work fine, but my output encoding is wrong.

>>> import locale   
>>> locale.setlocale(locale.LC_ALL, 'pl_PL.UTF-8')
'pl_PL.UTF-8'
>>> print [i for i in sorted([u'a', u'z', u'ą'], cmp=locale.strcoll)]
[u'a', u'\u0105', u'z']

Using key with locale.strxfrm does not work, unless I am missing something:

>>> print [i for i in sorted([u'a', u'z', u'ą'], key=locale.strxfrm)]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0105' in position 0: ordinal not in range(128)
情域 2024-09-20 03:35:47

This is an old question, but some clarification is needed. For locale-sensitive sorting in Python, two approaches are available. Which one you take depends on the operating system you are using.

The first approach is to use the built-in locale module. The result depends on your operating system and on which locales are available.

import locale
locale.setlocale(locale.LC_COLLATE, 'pl_PL.UTF-8')
test_list = ['a', 'z', 'ą']
sorted(test_list, key=locale.strxfrm)

On a Linux system that uses glibc, I get ['a', 'ą', 'z'].

On a Linux system that uses musl libc, or a Linux distro developed for embedded systems, I get ['a', 'z', 'ą'], i.e. locale-sensitive sorting is unsupported.

On a system based on BSD libc (like macOS), I get ['a', 'z', 'ą'].

On macOS, if you run the following command:

ls -al  /usr/share/locale/pl_PL/LC_COLLATE

you get /usr/share/locale/pl_PL/LC_COLLATE -> ../la_LN.US-ASCII/LC_COLLATE, i.e. the Polish collation table is symlinked to another collation table, which produces a language-insensitive sort. Other BSD libc derived systems behave similarly: priority was given to stable, locale-independent sorting in the filesystem.

The second approach, for systems with icu4c installed, is to use PyICU. ICU4C uses the Common Locale Data Repository (CLDR). CLDR locale data is more extensive than locale data in libc based implementations.

import icu
collator = icu.Collator.createInstance(icu.Locale('pl'))
sorted(test_list, key=collator.getSortKey)

Which gives ['a', 'ą', 'z'].

Locale data varies across implementations. This doesn't just affect sorting; it can be seen in other locale-sensitive operations as well.
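
The two approaches can be combined so that code prefers ICU where it is available and otherwise falls back to libc. The following is a sketch under that assumption; the function name make_sort_key and its parameters are hypothetical, and the fallback inherits all of the platform differences described above.

```python
import locale


def make_sort_key(icu_locale="pl", libc_locale="pl_PL.UTF-8"):
    """Return a key function for locale-aware sorting.

    Prefers PyICU when it is importable; otherwise falls back to
    locale.strxfrm, whose results depend on the platform's libc.
    """
    try:
        import icu  # PyICU; may not be installed
    except ImportError:
        icu = None
    if icu is not None:
        collator = icu.Collator.createInstance(icu.Locale(icu_locale))
        return collator.getSortKey
    try:
        locale.setlocale(locale.LC_COLLATE, libc_locale)
    except locale.Error:
        pass  # locale not installed: strxfrm uses whatever is active
    return locale.strxfrm


key = make_sort_key()
print(sorted(["z", "a", "ą"], key=key))
```

With PyICU installed, or on glibc with the pl_PL.UTF-8 locale available, 'ą' should sort between 'a' and 'z'; with the fallback on other systems it may still come last.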
