Culture-aware grouping of strings by initial letter

Posted on 2024-10-28 00:55:40

I am trying to group a sorted list of strings by their initial letter. Let's say this is the list:

azaroth 
älgkebab 
orgel 
ölkorv

When the list is sorted according to sv-SE, this is the sort order:

azaroth 
orgel 
älgkebab 
ölkorv

Which means the grouping by initial letter would be

A
  azaroth
O
  orgel
Ä
  älgkebab
Ö 
  ölkorv

This makes sense, and this is also how you'd find it grouped in a phone book in a country which uses sv-SE.

When the list is sorted according to en-US, this is the sort order:

älgkebab 
azaroth 
ölkorv
orgel 

Now comes the interesting part. This means the grouping by initial letter would be

AÄ
  älgkebab
  azaroth
OÖ
  ölkorv
  orgel

This is because, for all practical purposes, "a" and "ä" were treated as the same letter during the sort, and so were "o" and "ö", so for this purpose each pair counts as the same initial. This is AFAIK how you'd find it grouped in a phone book in a country which uses en-US.

My question is, how can I achieve this grouping programmatically, when it varies by culture? Or in other words, how do you know which letters are being treated as "being the same" when sorting a list in a specific culture?

I haven't found a way to make a StringComparer return 0 for "a" vs "ä", for example.

I have a solution that seems to work, which does this:

if (
    cultureInfo.CompareInfo.GetSortKey("a").KeyData[1] ==
    cultureInfo.CompareInfo.GetSortKey("ä").KeyData[1]
) // same initial (this will return false for sv-SE and true for en-US)

The problem is, I have no idea whether it works for every culture, or even what the second element of the KeyData array of the SortKey actually is. The page on MSDN is rather vague, and probably purposefully so. So I'd rather have a more reliable solution.

Comments (2)

时间海 2024-11-04 00:55:40

When you compare a and ä directly, the result is -1 in both sv-SE and en-US, because even en-US uses the umlaut as a tie-breaker so that two words that are identical except for the umlaut always sort in a consistent order. But you can still figure out whether the two letters are otherwise sorted the same: append one character to the first and a different, differently sorted character to the second, and compare them. Then swap the appended characters and compare again. If the two results differ, the original characters are sorted as the same letter.

Example:

sv-SE:
"a0" < "ä1"
"a1" < "ä0"
en-US:
"a0" < "ä1"
"a1" > "ä0"
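
The four comparisons above can be reproduced with a small helper. This is only a sketch: `SwapTest` and `SortAsSameLetter` are hypothetical names chosen for illustration, and it assumes that the digits 0 and 1 sort differently from each other in the culture being tested. The actual results depend on the collation data available on the machine, so the demo prints them instead of assuming them.

```csharp
using System;
using System.Globalization;

static class SwapTest
{
    // Hypothetical helper: true when swapping the appended digits flips the
    // comparison result, meaning the digits (not c1 and c2) decided the order.
    public static bool SortAsSameLetter(CompareInfo compareInfo, char c1, char c2)
    {
        int first = Math.Sign(compareInfo.Compare(c1 + "0", c2 + "1"));
        int second = Math.Sign(compareInfo.Compare(c1 + "1", c2 + "0"));
        // first == 0 would mean the digits were ignored, so the test is inconclusive.
        return first != 0 && first == -second;
    }

    public static void Demo()
    {
        foreach (string name in new[] { "sv-SE", "en-US" })
        {
            CompareInfo ci = CultureInfo.GetCultureInfo(name).CompareInfo;
            Console.WriteLine($"{name}: a and ä same letter? {SortAsSameLetter(ci, 'a', 'ä')}");
        }
    }
}
```

Calling `SwapTest.Demo()` prints one line per culture, which makes it easy to check the table above against a particular runtime.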

Thus, in sv-SE, 'a' < 'ä', but in en-US 'a' == 'ä'. Below is a class that groups a list of strings according to these rules. But it doesn't work properly for some cultures, because their sort order is more complex. For example, in Czech, ch is considered a separate letter, sorted after h. I have no idea how you would fix that.

Also, the code uses 0 and 1 as the characters to append. If there are some cultures where these characters don't affect the sort, it wouldn't work.

using System;
using System.Collections.Generic;
using System.Linq;

class Grouper
{
    StringComparer m_comparer;

    public Grouper(StringComparer comparer)
    {
        m_comparer = comparer;
    }

    public List<Tuple<string, List<string>>> Group(IEnumerable<string> strings)
    {
        List<Tuple<string, List<string>>> result =
            new List<Tuple<string, List<string>>>();

        var sorted = strings.OrderBy(s => s, m_comparer);

        string previous = null;

        List<char> currentGroupName = null;
        List<string> currentGroup = null;

        foreach (var s in sorted)
        {
            char sInitial = ToUpper(s[0]);
            if (currentGroup == null || !AreEqual(s[0], previous[0]))
            {
                if (currentGroup != null)
                    result.Add(Tuple.Create(
                        SortGroupName(currentGroupName),
                        currentGroup));
                currentGroupName = new List<char> { sInitial };
                currentGroup = new List<string> { s };
            }
            else
            {
                if (!currentGroupName.Contains(sInitial))
                    currentGroupName.Add(sInitial);
                currentGroup.Add(s);
            }

            previous = s;
        }

        if (currentGroup != null)
            result.Add(Tuple.Create(SortGroupName(currentGroupName), currentGroup));

        return result;
    }

    string SortGroupName(List<char> chars)
    {
        return new string(chars.OrderBy(c => c.ToString(), m_comparer).ToArray());
    }

    bool AreEqual(char c1, char c2)
    {
        return Math.Sign(m_comparer.Compare(c1 + "0", c2 + "1")) ==
            -Math.Sign(m_comparer.Compare(c1 + "1", c2 + "0"));
    }

    char ToUpper(char c)
    {
        return c.ToString().ToUpper()[0];
    }
}

Also, this class is far from production quality; for example, it doesn't handle nulls or empty strings.
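
As a possible alternative to appending digits, .NET's `CompareOptions.IgnoreNonSpace` asks the comparison to ignore nonspacing combining characters such as diacritics. The sketch below assumes (this is not confirmed in the thread) that a culture's collation data still keeps letters it regards as distinct, such as ä in sv-SE, apart under that option, so comparing only the first character with it tells you whether two strings share an initial. `InitialGrouper`, `SameInitial` and `Group` are hypothetical names; the code looks at a single UTF-16 code unit, so it does not solve the Czech ch case either.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

static class InitialGrouper
{
    // Ignore diacritics and case when deciding whether two initials match.
    const CompareOptions Options = CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase;

    // Compares only the first character of each (non-empty) string.
    public static bool SameInitial(CompareInfo ci, string a, string b)
    {
        return ci.Compare(a, 0, 1, b, 0, 1, Options) == 0;
    }

    // Sorts with the culture's rules, then starts a new group whenever the
    // initial differs from the initial of the current group's first entry.
    public static List<List<string>> Group(IEnumerable<string> strings, CultureInfo culture)
    {
        CompareInfo ci = culture.CompareInfo;
        var groups = new List<List<string>>();
        var sorted = strings
            .Where(s => !string.IsNullOrEmpty(s)) // skip nulls and empty strings
            .OrderBy(s => s, StringComparer.Create(culture, ignoreCase: true));

        foreach (string s in sorted)
        {
            if (groups.Count > 0 && SameInitial(ci, groups[groups.Count - 1][0], s))
                groups[groups.Count - 1].Add(s);
            else
                groups.Add(new List<string> { s });
        }
        return groups;
    }
}
```

For example, `InitialGrouper.Group(words, CultureInfo.GetCultureInfo("sv-SE"))` returns the groups in sort order; the group heading can then be built from the distinct initials in each group, as `SortGroupName` does in the class above.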

梦明 2024-11-04 00:55:40

It's likely an implementation-internal value, similar to a constant. The value itself doesn't matter, only how it compares to other related values.

This is similar to (for example) the EOF value in C. While common C libraries define it as -1, the standard only requires it to be a negative int, so the actual value MAY vary, and end-developer code should only compare against it, never rely on its numeric value.
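
In that spirit, a sort key can be compared as a whole without looking inside `KeyData`. The following is a minimal sketch, not a verified answer to the question: `SortKeyInitials` and `SameInitial` are hypothetical names, and the choice of `CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase` is an assumption about which differences should be ignored.

```csharp
using System.Globalization;

static class SortKeyInitials
{
    // Builds a sort key for the first character of each (non-empty) string and
    // compares the keys as opaque values via SortKey.Compare.
    public static bool SameInitial(CultureInfo culture, string a, string b)
    {
        const CompareOptions options = CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase;
        SortKey keyA = culture.CompareInfo.GetSortKey(a.Substring(0, 1), options);
        SortKey keyB = culture.CompareInfo.GetSortKey(b.Substring(0, 1), options);
        return SortKey.Compare(keyA, keyB) == 0;
    }
}
```

Because only the result of `SortKey.Compare` is used, the code does not depend on how a particular runtime lays out the bytes of the key.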
