How to compare unicode strings containing non-English characters to sort alphabetically?

Posted on 2024-11-29 23:30:49

I am trying to sort arrays/lists/whatever of data based upon the Unicode string values in them, which contain non-English characters. I want them sorted correctly alphabetically.

I have written a lot of code (D2010, Win XP), which I thought was pretty solid for future internationalisation, but it is not. It's all using the unicodestring (string) data type, and up until now I have just been putting English characters into the Unicode strings.

It seems I have to own up to making a very serious Unicode mistake. I talked to my German friend and tried out some German ß's (ß is 'ss' and should come after S and before T in the alphabet) and ö's etc. (note the umlaut), and none of my sorting algorithms work anymore. The results are very mixed up. Garbage.

Since then I have been reading up extensively and have learnt a lot of unpleasant things about Unicode collation. Things are looking grim, much grimmer than I ever expected; I have seriously messed this up. I hope I am missing something and things are not actually quite as grim as they appear at present. I have been tinkering around with Windows API calls (RtlCompareUnicodeString) with no success (protection faults); I could not get it to work. The problem with API calls, I learnt, is that they change on various newer Windows platforms. Also, with Delphi going cross-platform soon, with Linux later, and my app being client-server, I need to be concerned about this. But tbh, with the situation being what it is (bad), I would be grateful for any forward progress, i.e. even something Win API specific.

Is using the Win API function RtlCompareUnicodeString the obvious solution? If so I should really try again with that, but tbh I have been taken aback by all of the issues involved with Unicode collation and I am not clear at all what I should be doing to compare these strings this way anyway.

I learnt of the IBM ICU C++ open-source project; there is a Delphi wrapper for it, albeit for an older version of ICU. It seems a very comprehensive solution which is platform independent. Surely I cannot be looking at creating a Delphi wrapper for this (or updating the existing one) to get a good solution for Unicode collation?

I would be extremely glad to hear advice at two levels :-

A) A Windows-specific, non-portable solution. I would be glad of that at the moment, forget the client-server ramifications!
B) A more portable solution which is immune from the various XP/Vista/Win7 variations of Unicode API functions, therefore putting me in good stead for XE2 Mac support and future Linux support, not to mention the client-server complications.

Btw I don't really want to be doing 'make-do' solutions, scanning strings prior to comparison and replacing certain tricky characters etc., which I have read about. I gave the German example above, but that's just an example. I want to get it working for all (or at least most, Far East, Russian) languages; I don't want to do workarounds for a specific language or two. I also do not need any advice on the sorting algorithms, they are fine. It's just the string comparison bit that's wrong.

I hope I am missing/doing something stupid, this all looks to be a headache.

Thank you.


EDIT: Rudy, here is how I was trying to call RtlCompareUnicodeString. Sorry for the delay, I have been having a horrible time with this.

program Project26

{$APPTYPE CONSOLE}

uses
  SysUtils;


var
  a,b:ansistring;

  k,l:string;
  x,y:widestring;
  r:integer;

procedure RtlInitUnicodeString(
  DestinationString:pstring;
  SourceString:pwidechar) stdcall; external 'NTDLL';

function RtlCompareUnicodeString(
  String1:pstring;
  String2:pstring;
  CaseInSensitive:boolean
  ):integer stdcall; external 'NTDLL';


begin

  x:='wef';
  y:='fsd';

  RtlInitUnicodeString(@k, pwidechar(x));
  RtlInitUnicodeString(@l, pwidechar(y));

  r:=RtlCompareUnicodeString(@k,@l,false);

  writeln(r);
  readln;

end.

I realise this is most likely wrong. I am not used to calling API functions directly, so this is my best guess.
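For comparison, here is a minimal sketch of how the two NTDLL declarations could look if they follow the documented native `UNICODE_STRING` layout. The record and program names (`TUnicodeStringRec`, `RtlCompareSketch`) are made up for illustration. One plausible cause of the protection faults is that `pstring` points at a Delphi string variable (a single pointer), while `RtlInitUnicodeString` writes a whole structure into the destination. As far as its documentation describes, `RtlCompareUnicodeString` compares code units ordinally, so even a working call would probably not give locale-aware collation.

```pascal
program RtlCompareSketch;

{$APPTYPE CONSOLE}

type
  // Assumed mirror of the native UNICODE_STRING structure:
  // two 16-bit byte counts followed by a pointer to the UTF-16 buffer.
  TUnicodeStringRec = record
    ByteLength: Word;     // length in bytes, not characters
    MaxByteLength: Word;  // buffer capacity in bytes
    Buffer: PWideChar;
  end;
  PUnicodeStringRec = ^TUnicodeStringRec;

procedure RtlInitUnicodeString(DestinationString: PUnicodeStringRec;
  SourceString: PWideChar); stdcall; external 'ntdll.dll';

function RtlCompareUnicodeString(String1, String2: PUnicodeStringRec;
  CaseInSensitive: Boolean): Integer; stdcall; external 'ntdll.dll';

var
  x, y: string;
  k, l: TUnicodeStringRec;  // real records, not Delphi strings
  r: Integer;
begin
  x := 'wef';
  y := 'fsd';

  // Fill each record so that it refers to the characters of x and y.
  RtlInitUnicodeString(@k, PWideChar(x));
  RtlInitUnicodeString(@l, PWideChar(y));

  // Negative, zero or positive, like a conventional compare function.
  r := RtlCompareUnicodeString(@k, @l, False);

  Writeln(r);
  Readln;
end.
```

The records only borrow the buffers of `x` and `y`, so those strings have to stay alive and unchanged while `k` and `l` are in use.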

About your CompareStringEx API function: that looked really good, but it is available on Vista+ only and I'm using XP. CompareString is on XP, but that's not Unicode!

To recap, the basic task afoot, is to compare two strings, and to do so based on the character sort order specified in the current windows locale.

Can anyone say for sure whether AnsiCompareText should do this or not? It doesn't work for me, but others have said it should, and other things I have read suggest it should.

This is what I get with 31 test strings when using AnsiCompareText when in German Locale (space delimited - no strings contain spaces) :-

  • arß Asß asß aßs no nö ö ön oo öö oöo öoö öp pö ss SS ßaß ßbß sß Sßa
    Sßb ßß ssss SSSS ßßß ssßß SSßß ßz ßzß z zzz

EDIT 2.

I am still keen to hear whether I should expect AnsiCompareText to work using the locale info, as lkessler has said so, and lkessler has also posted about these subjects before and seems to have been through this before.

However, following on from Rudy's advice I have also been checking out CompareStringW - which shares the same documentation with CompareString, so it is NOT non-unicode as I have stated earlier.

Even if AnsiCompareText is not going to work, although I think it should, the Win32 API function CompareStringW should indeed work. Now I have defined my API function, and I can call it, and I get a result and no error... but I get the same result every time regardless of the input strings! It returns 1 every time, which means less than. Here's my code:

var
  k,l:string;

function CompareStringW(
  Locale:integer;
  dwCmpFlags:longword;
  lpString1:pstring;
  cchCount1:integer;
  lpString2:pstring;
  cchCount2:integer
  ):integer stdcall; external 'Kernel32.dll';

begin;

  k:='zzz';
  l:='xxx';

  writeln(length(k));
  r:=comparestringw(LOCALE_USER_DEFAULT,0,@k,3,@l,3);

  writeln(r); // result is 1=less than, 2=equal, 3=greater than
  readln;

end;

I feel I am getting somewhere now after much pain. Would be glad to know about AnsiCompareText, and what I am doing wrong with the above CompareStringW api call. Thank you.


EDIT 3

Firstly, I fixed the api call to CompareStringW myself, I was passing in @mystring when I should do PString(mystring). Now it all works correctly.

r:=comparestringw(LOCALE_USER_DEFAULT,0,pstring(k),-1,pstring(l),-1);
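A hedged sketch of the same call using the declaration that ships in Delphi's `Windows` unit, which takes `PWideChar` rather than `pstring`. The helper name `LocaleCompare` and the local constant are illustrative, not part of the original post. Passing `@k` hands the API the address of the string variable itself rather than the address of its characters, which would explain why the result never changed with the input.

```pascal
program CompareStringSketch;

{$APPTYPE CONSOLE}

uses
  Windows, SysUtils;

// Hypothetical helper: maps the 1/2/3 result of CompareStringW onto the
// usual negative/zero/positive convention used by sort routines.
function LocaleCompare(const A, B: string): Integer;
const
  CompareEqual = 2;  // value the API returns for "equal"
var
  R: Integer;
begin
  // PWideChar(A) points at the characters; @A would point at the variable.
  R := CompareStringW(LOCALE_USER_DEFAULT, 0,
    PWideChar(A), Length(A), PWideChar(B), Length(B));
  if R = 0 then
    RaiseLastOSError;  // 0 signals failure, not an ordering
  Result := R - CompareEqual;
end;

begin
  // Positive here, because 'zzz' sorts after 'xxx'.
  Writeln(LocaleCompare('zzz', 'xxx'));
  Readln;
end.
```

Passing explicit lengths instead of -1 keeps the call correct even if a string happens to contain an embedded #0 character.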

Now, you can imagine my dismay when I still got the same sort result as I did right at the beginning...

  • arß asß aßs Asß no nö ö ön oo öö oöo öoö öp pö ss SS ßaß ßbß sß Sßa
    Sßb ßß ssss SSSS ßßß ssßß SSßß ßz ßzß z zzz

You may also imagine my EXTREME dismay, not to mention simultaneous joy, when I realised the sort order IS CORRECT, and IT WAS CORRECT RIGHT BACK IN THE BEGINNING! It makes me sick to say it, but there was never any problem in the first place; this is all down to my lack of German knowledge. I believed the sort was wrong, since you can see above that strings start with S, then later they start with ß, then s again and back to ß and so on. Well, I can't speak German, but I could still clearly see that they were not sorted correctly. My German friend told me ß comes after S and before T... I WAS WRONG! What is happening is that the string functions (both AnsiCompareText and the WinAPI CompareStringW) are SUBSTITUTING every 'ß' with 'ss', and every 'ö' with a normal 'o'... so if I take the results above and do a search and replace as described I get...

  • arss asss asss Asss no no o on oo oo ooo ooo op po ss SS ssass ssbss
    sss Sssa Sssb ssss ssss SSSS ssssss ssssss SSssss ssz sszss z zzz

Looks pretty correct to me! And it always was.
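The search-and-replace check described above can be sketched as a tiny program. This is only an illustration of the mental model: `FoldGerman` is a made-up helper, and the real comparison routines most likely work with collation weights rather than literal text substitution. The characters are written as code points (`#$00DF` for ß, `#$00F6` for ö) so the sketch does not depend on the source file's encoding.

```pascal
program FoldSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper, for illustration only: mimics how the sorted list
// reads once every sharp s is viewed as 'ss' and every o-umlaut as 'o'.
function FoldGerman(const S: string): string;
begin
  Result := StringReplace(S, #$00DF, 'ss', [rfReplaceAll]);      // sharp s
  Result := StringReplace(Result, #$00F6, 'o', [rfReplaceAll]);  // o-umlaut
end;

begin
  // 's' followed by sharp s folds to 'sss', which sits between 'ss' and 'ssss'.
  Writeln(FoldGerman('s'#$00DF));
  Writeln(CompareText('ss', FoldGerman('s'#$00DF)) < 0);
  Writeln(CompareText(FoldGerman('s'#$00DF), 'ssss') < 0);
  Readln;
end.
```

This is a way to sanity-check a result by eye, not a replacement for the locale-aware comparison itself.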

I am extremely grateful for all the advice given, and extremely sorry to have wasted your time like this. Those German ß's got me all confused; there was never anything wrong with the built-in Delphi function or anything else. It just looked like there was. I made the mistake of combining them with normal 's' in my test data; any other letter would not have created this illusion of un-sortedness! The squiggly ß's have made me look a fool! ßs!

Rudy and lkessler were both especially helpful, ty both. I have to accept lkessler's answer as most correct, sorry Rudy.

Comments (4)

做个ˇ局外人 2024-12-06 23:30:49

You said you had problems calling Windows API functions yourself. Could you post the code, so people here can see why it failed? It is not as hard as it may seem, but it does require some care. ISTM that RtlCompareUnicodeString() is too low level.

I found a few solutions:

Non-portable

You could use the Windows API function CompareStringEx. This will compare using Unicode specific collation types. You can specify how you want this done (see link). It does require wide strings, i.e. PWideChar pointers to them. If you have problems calling it, give a holler and I'll try to add some demo code.

More or less portable

To make this more or less portable, you could write a function that compares two strings and use conditional defines to choose the different comparison APIs for the platform.
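A minimal sketch of that idea, assuming a single wrapper unit. The unit name `PortableCompare` and the function name `CollateCompare` are hypothetical. The non-Windows branch simply falls back to the RTL's `AnsiCompareStr`, on the assumption that the RTL maps it to a locale-aware comparison on each target platform.

```pascal
unit PortableCompare;

interface

// Hypothetical wrapper: returns <0, 0 or >0 according to the collation
// rules of the current user's locale.
function CollateCompare(const A, B: string): Integer;

implementation

uses
{$IFDEF MSWINDOWS}
  Windows,
{$ENDIF}
  SysUtils;

function CollateCompare(const A, B: string): Integer;
begin
{$IFDEF MSWINDOWS}
  // Windows branch: CompareStringW returns 1, 2 or 3, so subtracting 2
  // gives the usual sign convention. A return of 0 means the call failed
  // and is not checked in this sketch.
  Result := CompareStringW(LOCALE_USER_DEFAULT, 0,
    PWideChar(A), Length(A), PWideChar(B), Length(B)) - 2;
{$ELSE}
  // Other platforms: assumed fallback to the RTL's locale-aware compare.
  Result := AnsiCompareStr(A, B);
{$ENDIF}
end;

end.
```

Keeping the platform switch inside one function means the sorting code elsewhere never has to know which API did the comparison.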

┾廆蒐ゝ 2024-12-06 23:30:49

Try using CompareStr for case sensitive, or CompareText for case insensitive if you want your sorts exactly the same in any locale.

And use AnsiCompareStr for case sensitive, or AnsiCompareText for case insensitive if you want your sorts to be specific to the locale of the user.

See: How can I get TStringList to sort differently in Delphi for a lot more information on this.
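As a small sketch of the locale-specific option, the snippet below sorts a `TStringList` through `CustomSort` with a callback that delegates to `AnsiCompareText`. The callback name `LocaleSort` and the sample strings are illustrative only; ß is written as `#$00DF` to stay independent of the source file's encoding.

```pascal
program SortSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Callback in the shape TStringList.CustomSort expects; it compares two
// entries case-insensitively using the user's locale.
function LocaleSort(List: TStringList; Index1, Index2: Integer): Integer;
begin
  Result := AnsiCompareText(List[Index1], List[Index2]);
end;

var
  Items: TStringList;
  I: Integer;
begin
  Items := TStringList.Create;
  try
    Items.Add('zzz');
    Items.Add('ss');
    Items.Add('s'#$00DF);  // 's' followed by sharp s
    Items.Add('no');
    Items.CustomSort(LocaleSort);
    // The console may not render non-ASCII characters correctly.
    for I := 0 to Items.Count - 1 do
      Writeln(Items[I]);
  finally
    Items.Free;
  end;
  Readln;
end.
```

Swapping `AnsiCompareText` for `CompareText` in the callback would give the locale-independent ordering the answer mentions instead.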

空心空情空意 2024-12-06 23:30:49

In Unicode the numeric order of the characters is certainly not the sorting sequence. AnsiCompareText as mentioned by HeartWare does take locale specifics into consideration when comparing characters, but, as you found out, does nothing wrt the sorting order. What you are looking for is called the collation sequence of a language, which specifies the alphabetic sorting order for a language taking diacritics etc into consideration. They were sort of implied in the old Ansi Code pages, though those didn't account for sorting difference between languages using the same character set either.

I checked the D2010 docs. Apart from some TIB* components I didn't find any links. C++ Builder does seem to have a compare function that takes collation into account, but that's not much use in Delphi. There you will probably have to use some Windows API functions directly.

Docs:

The 'Sorting "Collate" all out' article is by Michael Kaplan, someone who has great in-depth knowledge of all things Unicode and all intricacies of various languages. His blog has been invaluable to me when porting from D2006 to D2009.

許願樹丅啲祈禱 2024-12-06 23:30:49

Have you tried AnsiCompareText ? Even though it is called "Ansi", I believe it calls on to an OS-specific Unicode-able comparison routine...

It should also make you safe from cross-platform dependencies (provided that Embarcadero supplies a compatible version in the various OS's they target).

I do not know how well the comparison works with the various strange Unicode ways to encode strings, but try it out and let us know the result...
