Case folding UTF-8 without knowing the language
I'm trying to evaluate different strategies for case-insensitive UTF-8 string comparison.
I've read some material from the Unicode Consortium, experimented with ICU and tried to come up with various quality-of-implementation alternatives.
On multiple occasions I've seen texts distinguish between Simple Case Mapping and Full Case Mapping, and I want to make sure I understand the difference entirely.
As I read it, Simple Case Mapping is "context-free", i.e. it doesn't need to know what language the payload is in. This will give approximate results, due to the Turkic "I/ı/İ/i" debacle.
Full Case Mapping, on the other hand, needs to know the language of the payload to be able to perform the mapping. With that extra information, it can take special measures to cover cases where "Kim" as a Turkic string should become "KİM" in upper case, but "Kim" as an English string should become "KIM" in upper case.
Have I got that right?
Are there other examples of "multi-faceted" code points that fold differently for different languages?
Thanks!
UPDATE: One of the sources describing simple case mapping as language-independent is ICU's documentation. I interpreted that as Unicode truth, but maybe it's just a statement about the implementation?
Comments (2)
No, a "full case mapping" is a casing where one codepoint needs to be replaced by more than one new codepoint. A simple case mapping is a single-codepoint substitution.
If you want to implement this yourself then the Unicode CaseFolding.txt file is crucial to get this right. Note the status field code "T", specifically there to handle the Turkish I problem.
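To make the status codes concrete, here is a minimal Python sketch of how a folding table could be built from CaseFolding.txt. It is only an illustration under some assumptions: `EXCERPT` is a hand-copied handful of entries rather than the full file, and the names `build_table` and `fold` are made up for this example. It treats C + S as the simple folding, C + F as the full folding, and layers the T entries on top when Turkic behaviour is wanted.

```python
# A few entries in CaseFolding.txt format: code; status; mapping; # name
# (hand-copied excerpt for illustration, not the complete data file)
EXCERPT = """\
0049; C; 0069; # LATIN CAPITAL LETTER I
0049; T; 0131; # LATIN CAPITAL LETTER I
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
"""


def build_table(text, statuses):
    """Map each source character to its folded string for the given status codes."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        code, status, mapping = [field.strip() for field in line.split(";")[:3]]
        if status in statuses:
            folded = "".join(chr(int(cp, 16)) for cp in mapping.split())
            table[chr(int(code, 16))] = folded
    return table


SIMPLE = build_table(EXCERPT, {"C", "S"})  # always one code point to one code point
FULL = build_table(EXCERPT, {"C", "F"})    # may expand to several code points
# Turkic tailoring: the T entries override the default ones.
TURKIC_FULL = {**FULL, **build_table(EXCERPT, {"T"})}


def fold(s, table):
    """Fold a string character by character; unmapped characters pass through."""
    return "".join(table.get(ch, ch) for ch in s)


print(fold("I\u00df", SIMPLE))       # simple folding keeps the sharp s
print(fold("I\u00df", FULL))         # full folding expands it to "ss"
print(fold("I\u0130", TURKIC_FULL))  # Turkic: I -> dotless i, dotted capital I -> i
```

Because the excerpt only covers these four code points, every other character passes through unchanged; a real implementation would load the whole file.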
Well... The consonant combination "SS" would down-case to "ss" for most Western languages, but in German it might become the special letter "ß". That is only a "might"; there are quite involved usage rules to consider.
I don't think this directly affects collation order, though (any Germans are of course welcome to correct me), so maybe it's a moot point.
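As a quick illustration of why "ß" is awkward, here is a small Python 3 sketch using only the built-in string methods, which apply the default, language-independent Unicode mappings. The helper name `caseless_equal` is made up for this example, and the comparison ignores normalization, which a careful implementation would also need to handle.

```python
def caseless_equal(a, b):
    """Compare two strings using Python's built-in full case folding."""
    return a.casefold() == b.casefold()


word = "Stra\u00dfe"              # "Straße"
upper = word.upper()              # the sharp s expands to "SS" when upper-cased
round_trip = upper.lower()        # lower-casing cannot know an "ß" was there

print(upper)                      # STRASSE
print(round_trip)                 # strasse
print(caseless_equal(word, upper))
```

Upper-casing and then lower-casing does not give back the original word, which is why comparisons are better done on folded strings than on strings converted to one case and back.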