C++像人类一样进行字符串排序?

发布于 2024-08-31 05:58:56 字数 1960 浏览 3 评论 0 原文

我想按照人类对字母数字字符串进行排序的方式对其进行排序。即,“A2”位于“A10”之前,“a”当然位于“Z”之前!有没有什么办法可以不写一个迷你解析器呢?理想情况下,它还会将“A1B1”放在“A1B10”之前。我看到问题 “自然(人类字母数字)排序Microsoft SQL 2005” 有一个可能的答案,但它使用各种库函数,“使用 IComparer 对人类字符串进行排序”

以下是当前失败的测试用例:

#include <set>
#include <iterator>
#include <iostream>
#include <vector>
#include <cassert>

template <typename T>
struct LexicographicSort {
  inline bool operator() (const T& lhs, const T& rhs) const{
    std::ostringstream s1,s2;
    s1 << toLower(lhs); s2 << toLower(rhs);
    bool less = s1.str() < s2.str();
    //Answer: bool less = doj::alphanum_less<std::string>()(s1.str(), s2.str());
    std::cout<<s1.str()<<" "<<s2.str()<<" "<<less<<"\n";
    return less;
  }

  inline std::string toLower(const std::string& str) const {
    std::string newString("");
    for (std::string::const_iterator charIt = str.begin();
         charIt!=str.end();++charIt) {
          newString.push_back(std::tolower(*charIt));
        }
        return newString;
      }
};


int main(void) {
  const std::string reference[5] = {"ab","B","c1","c2","c10"};
  std::vector<std::string> referenceStrings(&(reference[0]), &(reference[5]));

  //Insert in reverse order so we know they get sorted
  std::set<std::string,LexicographicSort<std::string> > strings(referenceStrings.rbegin(), referenceStrings.rend());

  std::cout<<"Items:\n";
  std::copy(strings.begin(), strings.end(), std::ostream_iterator<std::string>(std::cout, "\n"));
  std::vector<std::string> sortedStrings(strings.begin(), strings.end());
  assert(sortedStrings == referenceStrings);
}

I would like to sort alphanumeric strings the way a human being would sort them. I.e., "A2" comes before "A10", and "a" certainly comes before "Z"! Is there any way to do with without writing a mini-parser? Ideally it would also put "A1B1" before "A1B10". I see the question "Natural (human alpha-numeric) sort in Microsoft SQL 2005" with a possible answer, but it uses various library functions, as does "Sorting Strings for Humans with IComparer".

Below is a test case that currently fails:

#include <set>
#include <iterator>
#include <iostream>
#include <vector>
#include <cassert>

template <typename T>
struct LexicographicSort {
  inline bool operator() (const T& lhs, const T& rhs) const{
    std::ostringstream s1,s2;
    s1 << toLower(lhs); s2 << toLower(rhs);
    bool less = s1.str() < s2.str();
    //Answer: bool less = doj::alphanum_less<std::string>()(s1.str(), s2.str());
    std::cout<<s1.str()<<" "<<s2.str()<<" "<<less<<"\n";
    return less;
  }

  inline std::string toLower(const std::string& str) const {
    std::string newString("");
    for (std::string::const_iterator charIt = str.begin();
         charIt!=str.end();++charIt) {
          newString.push_back(std::tolower(*charIt));
        }
        return newString;
      }
};


int main(void) {
  const std::string reference[5] = {"ab","B","c1","c2","c10"};
  std::vector<std::string> referenceStrings(&(reference[0]), &(reference[5]));

  //Insert in reverse order so we know they get sorted
  std::set<std::string,LexicographicSort<std::string> > strings(referenceStrings.rbegin(), referenceStrings.rend());

  std::cout<<"Items:\n";
  std::copy(strings.begin(), strings.end(), std::ostream_iterator<std::string>(std::cout, "\n"));
  std::vector<std::string> sortedStrings(strings.begin(), strings.end());
  assert(sortedStrings == referenceStrings);
}

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

杀手六號 2024-09-07 05:58:56

有什么方法可以不写迷你解析器吗?

让别人来做吗?

我正在使用这个实现:http://www.davekoelle.com/alphanum.html,我也对其进行了修改以支持 wchar_t 。

Is there any way to do with without writing a mini-parser?

Let someone else do that?

I'm using this implementation: http://www.davekoelle.com/alphanum.html, I've modified it to support wchar_t, too.

明媚如初 2024-09-07 05:58:56

这实际上取决于“解析器”的含义。如果您想避免编写解析器,我认为您应该利用库函数。

  • 将字符串视为统一为字母、数字或“其他”的子序列序列。
  • 使用 isalnum 获取每个字符串的下一个字母数字序列,并回溯检查 +-(如果它是数字)。就地使用 strtold 查找数字子序列的结尾。
  • 如果一个是数字,一个是字母,则带有数字子序列的字符串位于第一位。
  • 如果一个字符串已用完字符,则该字符串排在第一位。
  • 使用 strcoll 比较当前语言环境中的字母子序列。
  • 使用 strtold 比较当前语言环境中的数字子序列。
  • 重复此操作,直至完成一根或两根弦。
  • strcmp 打破联系。

该算法在比较超出 long double 精度的数字字符串时存在一些弱点。

It really depends what you mean by "parser." If you want to avoid writing a parser, I would think you should avail yourself of library functions.

  • Treat the string as a sequence of subsequences which are uniformly alphabetic, numeric, or "other."
  • Get the next alphanumeric sequence of each string using isalnum and backtrack-checking for + or - if it is a number. Use strtold in-place to find the end of a numeric subsequence.
  • If one is numeric and one is alphabetic, the string with the numeric subsequence comes first.
  • If one string has run out of characters, it comes first.
  • Use strcoll to compare alphabetic subsequences within the current locale.
  • Use strtold to compare numeric subsequences within the current locale.
  • Repeat until finished with one or both strings.
  • Break ties with strcmp.

This algorithm has something of a weakness in comparing numeric strings which exceed the precision of long double.

南薇 2024-09-07 05:58:56

有没有什么方法可以在不编写迷你解析器的情况下做到这一点?我认为答案是否定的。但编写解析器并不那么困难。不久前我不得不这样做来对我们公司的股票编号进行排序。基本上只需扫描数字并将其转换为数组即可。检查每个字符的“类型”:字母、数字,也许您还有其他需要特殊处理的字符。就像我必须特殊对待连字符一样,因为我们希望 ABC 在 AB-A 之前排序。然后开始剥字符。只要它们与第一个字符的类型相同,它们就会进入同一个存储桶。一旦类型发生变化,您就开始将它们放入不同的存储桶中。然后你还需要一个比较函数来逐个桶进行比较。当两个桶都是 alpha 时,您只需进行正常的 alpha 比较。当两者都是数字时,将两者都转换为整数并进行整数比较,或者将较短的部分填充到较长的长度或等效的长度。当它们是不同类型时,您需要一个规则来进行比较,例如 AA 是在 A-1 之前还是之后?

这不是一项微不足道的工作,您必须为可能出现的所有奇怪情况制定规则,但我认为您可以在几个小时内将其组合在一起。

Is there any way to do it without writing a mini parser? I would think the answer is no. But writing a parser isn't that tough. I had to do this a while ago to sort our company's stock numbers. Basically just scan the number and turn it into an array. Check the "type" of every character: alpha, number, maybe you have others you need to deal with special. Like I had to treat hyphens special because we wanted A-B-C to sort before AB-A. Then start peeling off characters. As long as they are the same type as the first character, they go into the same bucket. Once the type changes, you start putting them in a different bucket. Then you also need a compare function that compares bucket-by-bucket. When both buckets are alpha, you just do a normal alpha compare. When both are digits, convert both to integer and do an integer compare, or pad the shorter to the length of the longer or something equivalent. When they're different types, you'll need a rule for how those compare, like does A-A come before or after A-1 ?

It's not a trivial job and you have to come up with rules for all the odd cases that may arise, but I would think you could get it together in a few hours of work.

蓝咒 2024-09-07 05:58:56

如果不进行任何解析,就无法比较人类书写的数字(首先去除前导零的高值)和作为同一字符串一部分的普通字符。

不过,解析不需要非常复杂。一个简单的哈希表,用于处理区分大小写和剥离特殊字符('A'='a'=1,'B'='b'='2,...或'A'=1,'a' =2,'B'=3,..., '-'=0(strip)),将字符串重新映射到哈希值数组,然后截断数字情况(如果遇到数字并且最后一个字符是数字,将最后一个数字乘以 10,然后加上当前值)。

从那里开始,按正常方式排序。

Without any parsing, there's no way to compare human written numbers (high values first with leading zeroes stripped) and normal characters as part of the same string.

The parsing doesn't need to be terribly complex though. A simple hash table to deal with things like case sensitivity and stripping special characters ('A'='a'=1,'B'='b'='2,... or 'A'=1,'a'=2,'B'=3,..., '-'=0(strip)), remap your string to an array of the hashed values, then truncate number cases (if a number is encountered and the last character was a number, multiply the last number by ten and add the current value to it).

From there, sort as normal.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文