How to normalize code-page data to Unicode Form C when the diacritic precedes the base character and the accent is not in combining form
I would like to be able to say "Normalize this string by forcing diacritic accents into their combining form".
Details:
My code is being developed in C# but I don't believe the issue to be language specific.
There are two problems with my data: (1) the diacritic precedes the base character (it needs to follow the base character in Unicode forms D or KD), and (2) the accent diacritic in my data is a Greek Tonos (U+0384), but it needs to be the combining form (U+0301) in order to normalize.
I would like to do this programmatically. I would think that this type of operation is well known, but I did not find support for it in the C# Globalization methods (there are normalization methods, but there is no way to force the diacritic accents into their combining form).
Comments (1)
I do not think that the C# Globalization methods can help you here. The issue, as you pointed out, is that U+0384 is not a combining character. It is a spacing character in its own right. This can also be seen from its compatibility decomposition (to U+0020 U+0301). The data set most likely comes from a source that displays the tonos as a diacritic on the next character. This is not "proper" according to the Unicode spec, so you'll have to convert the data yourself. I have run into a similar issue with the apostrophe; sometimes applications use the right quotation mark instead.
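To see why the built-in normalization does not help here, a small check like the following can be used. It is only a sketch: `TonosInspection`, `Hex` and `NormalizedHex` are hypothetical helper names, and the only framework call relied on is `string.Normalize(NormalizationForm)`.

```csharp
using System;
using System.Linq;
using System.Text;

static class TonosInspection
{
    // Formats each UTF-16 code unit as U+XXXX (sufficient for BMP text such as Greek).
    public static string Hex(string s) =>
        string.Join(" ", s.Select(c => $"U+{(int)c:X4}"));

    // Applies the given normalization form and returns the resulting code units as hex.
    public static string NormalizedHex(string s, NormalizationForm form) =>
        Hex(s.Normalize(form));
}
```

For the input `"\u0384\u03B1"` (tonos followed by alpha), `NormalizedHex(..., NormalizationForm.FormC)` leaves the text as `U+0384 U+03B1`, while `NormalizationForm.FormKD` gives `U+0020 U+0301 U+03B1`: the accent ends up attached to a space in front of the letter rather than to the letter itself.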
The data conversion is not hard, I'm sure you can code that up.
I would write a stateful converter and stream the data through it. When U+0384 is detected, it is not emitted. Instead you switch to the "tonos" state and emit U+0301 after the NEXT character. There are error conditions to handle (runs of U+0384, end of data while in the "tonos" state).
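A minimal sketch of such a converter might look like this. The class name `TonosConverter`, the method name `Convert`, and the choice to throw `FormatException` for the two error conditions are assumptions made for illustration; another policy (dropping or passing through the stray tonos) may suit your data better. It also assumes the base character is a single code point with no other combining marks already attached.

```csharp
using System;
using System.Text;

static class TonosConverter
{
    const char SpacingTonos = '\u0384';   // GREEK TONOS (spacing, not combining)
    const char CombiningAcute = '\u0301'; // COMBINING ACUTE ACCENT

    // Moves every leading U+0384 behind the character that follows it, as U+0301.
    public static string Convert(string input)
    {
        var output = new StringBuilder(input.Length);
        bool pendingTonos = false; // true while in the "tonos" state

        foreach (char c in input)
        {
            if (c == SpacingTonos)
            {
                // Error condition 1: two tonos marks with no base character between them.
                if (pendingTonos)
                    throw new FormatException("Run of U+0384 without a base character.");
                pendingTonos = true; // swallow the tonos; emit U+0301 later
                continue;
            }

            output.Append(c);

            // Emit the combining accent once the base character is complete
            // (for a surrogate pair, wait until the low surrogate has been written).
            if (pendingTonos && !char.IsHighSurrogate(c))
            {
                output.Append(CombiningAcute);
                pendingTonos = false;
            }
        }

        // Error condition 2: the data ended while still in the "tonos" state.
        if (pendingTonos)
            throw new FormatException("Input ends with a dangling U+0384.");

        return output.ToString();
    }
}
```

With this sketch, `TonosConverter.Convert("\u0384\u03B1")` returns `"\u03B1\u0301"`, which is text that the standard normalization forms can work with.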
The converted data can then be normalized with the usual APIs.
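In C#, the usual API for this step is `string.Normalize`. As a small illustration (the wrapper name `NormalizeDemo.ToFormC` is hypothetical), once the accent is a combining U+0301 that follows its base letter, Form C composes the pair into the precomposed character:

```csharp
using System;
using System.Text;

static class NormalizeDemo
{
    // Composes base characters and combining marks into precomposed characters (Form C).
    public static string ToFormC(string s) => s.Normalize(NormalizationForm.FormC);
}
```

For example, `NormalizeDemo.ToFormC("\u03B1\u0301")` yields `"\u03AC"` (GREEK SMALL LETTER ALPHA WITH TONOS).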
Good Luck.