Sorting and comparing strings by locale in Haskell?

Posted on 2024-11-07 05:45:55

Is it possible to properly sort strings with national characters in Haskell (GHC)? In other words, to collate Chars correctly according to the current locale settings?

I have found only the ICU module, but it requires an extra library to be installed, because ICU isn't a standard part of Linux distributions. I would like a solution based on the POSIX C library (glibc or similar), so that there is no hassle with handling an additional dependency.

Comments (1)

临走之时 2024-11-14 05:45:55

Recommended way: text-icu

The recommended way to process strings robustly in a locale-sensitive manner is via the text and text-icu packages, as you have seen. The text library ships with the standard library set, the Haskell Platform.

An example, sorting Turkish strings:

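{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text.IO  as T
import qualified Data.Text.ICU as T
import           Data.List     (sortBy)

main :: IO ()
main = do
  let trLocale = T.Locale "tr-TR"
      str      = "ÇIİĞÖŞÜ"
      -- Alternate the Turkish lower-cased string with the original, ten items in total.
      strs     = take 10 (cycle [T.toLower trLocale str, str])

  -- Sort with the comparison from Data.Text.ICU, then print one string per line.
  mapM_ T.putStrLn (sortBy (T.compare [T.FoldCaseExcludeSpecialI]) strs)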

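After lower-casing the Turkish string correctly (I becomes ı, İ becomes i), this appears to sort the strings as expected, although T.compare itself orders by code unit under canonical equivalence rather than by a locale's collation rules: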

*Main> main
ÇIİĞÖŞÜ
ÇIİĞÖŞÜ
ÇIİĞÖŞÜ
ÇIİĞÖŞÜ
ÇIİĞÖŞÜ
çıiğöşü
çıiğöşü
çıiğöşü
çıiğöşü
çıiğöşü
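
For ordering that follows a locale's collation rules rather than code unit order, text-icu also exposes a collator. The following is a minimal sketch, assuming the collator and collate functions exported by Data.Text.ICU and an ICU build that carries Turkish collation data; sortForLocale and the word list are illustrative names and data, not part of the original answer.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import           Data.List     (sortBy)
import           Data.Text     (Text)
import qualified Data.Text.IO  as TIO
import qualified Data.Text.ICU as ICU

-- | Hypothetical helper: sort using the collation rules that ICU
-- associates with the given locale.
sortForLocale :: ICU.LocaleName -> [Text] -> [Text]
sortForLocale loc = sortBy (ICU.collate (ICU.collator loc))

main :: IO ()
main =
  -- Illustrative word list; the resulting order depends on the ICU data installed.
  mapM_ TIO.putStrLn
    (sortForLocale (ICU.Locale "tr-TR") ["şeker", "zaman", "su", "çay", "cam"])
```

Building the collator once and partially applying collate keeps the comparison pure, so it can be passed straight to sortBy.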

Not using the text-icu package

In your question you asked to avoid solutions that use libraries beyond what POSIX provides. While text-icu is easy to install from Hackage (cabal install text-icu), it does depend on the ICU C library, which isn't available everywhere. Additionally, there is no POSIX alternative that is as robust or comprehensive. Finally, text-icu is the only package that correctly handles case conversions where one character maps to several.
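
For reference, the POSIX C library does offer locale-aware collation through strcoll and strxfrm, and both can be reached from Haskell through the FFI. The following is a rough sketch under several assumptions, not a drop-in replacement for ICU: it hard-codes glibc's numeric value of LC_ALL, it assumes a suitable locale such as tr_TR.UTF-8 is installed and selected through the environment, and the names collationKey, sortByLocale and lcAll are made up for illustration.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

import qualified Data.ByteString       as B
import           Data.List             (sortOn)
import           Foreign.C.String      (CString, withCString)
import           Foreign.C.Types       (CChar, CInt (..), CSize (..))
import           Foreign.Marshal.Array (allocaArray)
import           Foreign.Ptr           (Ptr, nullPtr)

foreign import ccall unsafe "locale.h setlocale"
  c_setlocale :: CInt -> CString -> IO CString

foreign import ccall unsafe "string.h strxfrm"
  c_strxfrm :: Ptr CChar -> CString -> CSize -> IO CSize

-- Assumption: glibc's value of LC_ALL. Other C libraries use other numbers.
lcAll :: CInt
lcAll = 6

-- | Hypothetical helper: the strxfrm sort key of a string under the
-- process locale. Comparing two keys bytewise mirrors strcoll.
collationKey :: String -> IO B.ByteString
collationKey s = withCString s $ \src -> do
  n <- c_strxfrm nullPtr src 0            -- first call only measures the key
  let size = fromIntegral n + 1           -- leave room for the trailing NUL
  allocaArray size $ \dst -> do
    _ <- c_strxfrm dst src (fromIntegral size)
    B.packCString dst

-- | Hypothetical helper: sort strings by their collation keys.
sortByLocale :: [String] -> IO [String]
sortByLocale xs = do
  keys <- mapM collationKey xs
  return (map snd (sortOn fst (zip keys xs)))

main :: IO ()
main = do
  -- An empty locale name asks the C library to read LANG / LC_* from the environment.
  _ <- withCString "" (c_setlocale lcAll)
  sortByLocale ["şeker", "zaman", "su", "çay", "cam"] >>= mapM_ putStrLn
```

Computing keys with strxfrm up front, instead of calling strcoll inside the comparison, keeps the sort itself pure. Under the plain C locale the keys reduce to byte order, and non-ASCII input may fail to marshal, so the sketch is only meaningful with a real locale selected, for example by running the program with LC_ALL=tr_TR.UTF-8.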

That said, the built-in Char and String types in Haskell represent Unicode, and Data.Char provides functions that perform Unicode case conversion in a locale-insensitive way, using the wchar_t functions defined by the Open Group. Additionally, we can do IO on Handles in a (text) locale-sensitive way.

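import System.IO  (hSetEncoding, mkTextEncoding, stdout)
import Data.Char  (toLower)
import Data.List  (sort)

main :: IO ()
main = do
    -- Force UTF-8 on stdout regardless of the environment's locale.
    t <- mkTextEncoding "UTF-8"
    hSetEncoding stdout t

    let str      = "ÇIİĞÖŞÜ"
        -- toLower maps each Char on its own, with no knowledge of the locale.
        strs     = take 10 (cycle [map toLower str, str])

    -- sort uses the derived Ord on String, i.e. code point order.
    mapM_ putStrLn (sort strs)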

In fact, GHC uses your locale's text encoding (e.g. UTF-8) for IO by default. For many problems this will probably give the right answer. Be aware, though, that it will also be wrong in many cases, since correct results require bulk processing of text together with rich conversion and comparison support.

*Main> main
ÇIİĞÖŞÜ
ÇIİĞÖŞÜ
ÇIİĞÖŞÜ
ÇIİĞÖŞÜ
ÇIİĞÖŞÜ
çiiğöşü
çiiğöşü
çiiğöşü
çiiğöşü
çiiğöşü
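
The two outputs differ only in how I and İ are lower-cased. The following minimal sketch shows where the locale-insensitive toLower from Data.Char diverges from the Turkish rules; toLowerTr is a hypothetical helper written for illustration, not a function from base, and it covers only this one pair of special cases.

```haskell
import Data.Char (toLower)

-- | Hypothetical helper: lower-case one character with the Turkish rules
-- for dotted and dotless I, deferring to 'toLower' for everything else.
toLowerTr :: Char -> Char
toLowerTr 'I' = 'ı'        -- U+0049 becomes U+0131 (dotless i)
toLowerTr 'İ' = 'i'        -- U+0130 becomes U+0069
toLowerTr c   = toLower c

main :: IO ()
main = do
  putStrLn (map toLower   "ÇIİĞÖŞÜ")  -- locale-insensitive: çiiğöşü
  putStrLn (map toLowerTr "ÇIİĞÖŞÜ")  -- Turkish-aware:      çıiğöşü
```

A per-character patch like this works for a single known language, but it does not generalize to conversions that depend on context or expand to several characters, which is where a library such as text-icu is still needed.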
