Sorting and comparing strings by locale in Haskell?
Is it possible to properly sort strings with national characters in Haskell (GHC)? In other words, to collate Chars correctly according to the current locale settings?
I have found only the ICU module, but it requires an extra library to be installed, because it isn't a standard part of Linux distributions. I would like a solution based on the POSIX C library (glibc or similar), so that there is no hassle with handling an additional dependency.
Recommended way: text-icu
The recommended way for robustly processing strings in a locale-sensitive manner is via text and text-icu, as you have seen. The text library is provided in the standard library set, the Haskell Platform.
An example, sorting Turkish strings: with text-icu, this appears to sort correctly by lexicographic ordering based on the locale, after correctly lower-casing the Turkish strings.
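A minimal sketch of what such a Turkish sorting example could look like with text-icu. It is not the answer's original listing: the sample words and the "tr_TR" locale name are assumptions chosen for illustration. It uses Locale, collator, collate and toLower from Data.Text.ICU, and needs the text and text-icu packages plus the ICU C library.

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Illustrative sketch only; the word list and locale name are assumptions.
import           Data.List     (sortBy)
import qualified Data.Text     as T
import qualified Data.Text.ICU as ICU
import qualified Data.Text.IO  as TIO

main :: IO ()
main = do
  let tr       = ICU.Locale "tr_TR"   -- assumed ICU locale identifier for Turkish
      coll     = ICU.collator tr      -- collation rules for that locale
      wordList = ["Şeker", "Zeytin", "Çay", "Ilık", "İnce", "Ödül"] :: [T.Text]
      -- Locale-aware lower-casing: under Turkish rules 'I' maps to dotless 'ı'
      -- and 'İ' maps to 'i'.
      lowered  = map (ICU.toLower tr) wordList
  -- Sort with the locale's collator instead of the default Ord instance on Text,
  -- which compares code points.
  mapM_ TIO.putStrLn (sortBy (ICU.collate coll) lowered)
```

The point of the design is that both the case conversion and the comparison take the locale as an explicit argument, so the result does not depend on the process environment.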
Not using the text-icu package
You've asked in your question to avoid solutions that use additional libraries, other than what POSIX provides. While text-icu is easily installable from Hackage (cabal install text-icu), it does depend on the ICU C library, which isn't available everywhere. Additionally, there is no POSIX alternative that is as robust or comprehensive. Finally, text-icu is the only package that correctly does conversions on multi-char characters.
Given this, though, the built-in Char and String types in Haskell, whose values represent Unicode, come with Data.Char, whose functions do Unicode case conversion in a locale-insensitive way, using the wchar_t functions defined by the Open Group. Additionally, we can do IO on Handles in a (text) locale-sensitive way. In fact, GHC will use your text locale by default for IO (e.g. UTF-8). For many problems, this will probably give the right answer. You just have to be aware that it will also be wrong in many cases, since it isn't possible to be correct without bulk processing of text and rich conversion and comparison support.
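A small sketch, using only base, of the two points above: Data.Char converts case one Char at a time without consulting the locale, while Handles pick up the locale's text encoding by default. The sample strings are assumptions chosen for illustration.

```haskell
import Data.Char (toLower, toUpper)
import System.IO (hGetEncoding, localeEncoding, stdout)

main :: IO ()
main = do
  -- Data.Char maps Char to Char, so 'ß' cannot expand to the two letters "SS".
  print (toUpper 'ß' == 'ß')              -- True
  -- The mapping ignores the locale: 'I' becomes 'i', never the Turkish dotless 'ı'.
  print (map toLower "ILIK" == "ilik")    -- True
  -- Handles, by contrast, default to the text encoding of the current locale.
  print localeEncoding
  hGetEncoding stdout >>= print
```

The last two lines print whatever encoding the environment selects, so their output varies from machine to machine. Note also that the default Ord instance on String compares code points, which is not a locale-aware collation.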