Can .NET convert Unicode to ASCII to remove "smart quotes", etc.?
Some of our users use e-mail clients that can't cope with Unicode, even when the encoding, etc. are properly set in the mail headers.
I'd like to 'normalise' the content they're receiving. The biggest problem we have is users copy'n'pasting content from Microsoft Word into our web application, which then forwards that content by e-mail - including fractions, smart quotes, and all the other extended Unicode characters that Word helpfully inserts for you.
I'm guessing there is no definitive solution for this, but before I sit down and start writing great big lookup tables, is there some built-in method that'll get me started?
There are basically three phases involved.
First, stripping accents from otherwise-normal letters - solution to this is here
This paragraph contains “smart quotes” and áccénts and ½ of the problem is fractions
goes to
This paragraph contains “smart quotes” and accents and ½ of the problem is fractions
Second, replacing single Unicode characters with their ASCII equivalent, to give:
This paragraph contains "smart quotes" and accents and ½ of the problem is fractions
This is the part where I'm hoping there's a solution before I implement my own.
Finally, replacing specific characters with a suitable ASCII sequence - ½ to 1/2, and so on - which I'm pretty sure isn't natively supported by any kind of Unicode magic, but somebody might have written a suitable lookup table I can re-use.
Any ideas?
Comments (4)
Thank you all for some very useful answers. I realize the actual question isn't "How can I convert ANY Unicode character into its ASCII fallback" - the question is "How can I convert the Unicode characters my customers are complaining about into their ASCII fallbacks?"
In other words - we don't need a general-purpose solution; we need a solution that'll work 99% of the time, for English-speaking customers pasting English-language content from Word and other websites into our application. To that end, I analyzed eight years' worth of messages sent through our system looking for characters that aren't representable in ASCII encoding, using this test:
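The test snippet itself is not reproduced in this copy of the answer. As a minimal sketch of what such a check could look like (the class and method names `AsciiCheck`, `IsAscii`, and `NonAsciiChars` are hypothetical, not the author's), one option is to round-trip each character through `Encoding.ASCII` and see whether it survives:

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper, not the original author's code.
public static class AsciiCheck
{
    // True when the character survives a round trip through ASCII encoding.
    // Encoding.ASCII substitutes '?' for anything outside U+0000..U+007F.
    public static bool IsAscii(char c)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(new[] { c });
        return bytes.Length == 1 && (char)bytes[0] == c;
    }

    // Distinct non-ASCII characters in a message, in order of first appearance.
    // Useful for collecting the set that needs a manual replacement.
    public static char[] NonAsciiChars(string text) =>
        text.Where(c => !IsAscii(c)).Distinct().ToArray();
}
```

Running `NonAsciiChars` over a message archive and merging the results would give the kind of "unrepresentable" set described above.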
I've then been through the resulting set of unrepresentable characters and manually assigned an appropriate replacement string. The whole lot is bundled up in an extension method, so you can call myString.Asciify() to convert your string into a reasonable ASCII-encoding approximation.
Note that there are some rather odd fallbacks in there - like this one:
That's because one of our users has some program that converts open/close smart-quotes into ² and ³ (like: he said ²hello³) and nobody has ever used them to represent exponentiation, so this will probably work quite nicely for us, but YMMV.
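The author's actual table is not shown in this copy. The following is only a sketch of how an `Asciify()` extension method backed by a lookup table might be structured. The entries are illustrative assumptions, and the `²`/`³` mappings are inferred from the description above, not taken from the original code:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class StringExtensions
{
    // Illustrative entries only; a real table would come from analysing your own data.
    private static readonly Dictionary<char, string> Fallbacks = new Dictionary<char, string>
    {
        { '\u201C', "\"" },   // left double quotation mark
        { '\u201D', "\"" },   // right double quotation mark
        { '\u2018', "'" },    // left single quotation mark
        { '\u2019', "'" },    // right single quotation mark
        { '\u2013', "-" },    // en dash
        { '\u2026', "..." },  // horizontal ellipsis
        { '\u00BC', "1/4" },  // vulgar fraction one quarter
        { '\u00BD', "1/2" },  // vulgar fraction one half
        { '\u00BE', "3/4" },  // vulgar fraction three quarters
        { '\u00B2', "\"" },   // superscript two, treated as an opening quote (assumption)
        { '\u00B3', "\"" },   // superscript three, treated as a closing quote (assumption)
    };

    // Copies ASCII characters through, substitutes known characters from the
    // table, and falls back to '?' for anything else.
    public static string Asciify(this string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (c <= '\u007F') sb.Append(c);
            else if (Fallbacks.TryGetValue(c, out string replacement)) sb.Append(replacement);
            else sb.Append('?');
        }
        return sb.ToString();
    }
}
```

Falling back to `'?'` mirrors what `Encoding.ASCII` does by default; logging the unmapped characters instead would make it easier to grow the table over time.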
I had some problems with this myself, whilst using a list of strings originally built in Word. I have found that using a simple "String".replace(current char/string, new char/string) command works perfectly. The exact code I used was for smart quotes, or to be exact: left ", right ", left ', and right ', and is as follows:
I hope this helps anyone out there still having this problem!
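The snippet this answer refers to is not included in this copy. A minimal sketch of the approach in C#, assuming the four characters meant are U+201C, U+201D, U+2018, and U+2019 (the `SmartQuotes.Straighten` name is hypothetical), could chain `String.Replace` calls:

```csharp
using System;

// Hypothetical helper illustrating the chained-Replace approach.
public static class SmartQuotes
{
    // Replaces the four common "smart" quotation marks with straight ASCII quotes.
    public static string Straighten(string text) =>
        text.Replace('\u201C', '"')    // left double quotation mark
            .Replace('\u201D', '"')    // right double quotation mark
            .Replace('\u2018', '\'')   // left single quotation mark
            .Replace('\u2019', '\'');  // right single quotation mark
}
```

Each `Replace` call allocates a new string, which is fine for a handful of characters but becomes wasteful if the list of replacements grows large.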
The first thing I'd try is to convert the text to NFKD normalization form, with the Normalize method on strings. This suggestion is mentioned in the answer to the question you linked, but I recommend using NFKD instead of NFD because NFKD will remove unwanted typographical distinctions (e.g., NBSP → space, or ℂ → C).
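As a small illustration of that suggestion, `String.Normalize` accepts `NormalizationForm.FormKD`. The wrapper class and method name below are hypothetical and only there to make the example self-contained:

```csharp
using System;
using System.Text;

// Hypothetical wrapper around String.Normalize for illustration.
public static class CompatibilityNormalizer
{
    // Applies Unicode compatibility decomposition (NFKD).
    public static string ToNfkd(string text) =>
        text.Normalize(NormalizationForm.FormKD);
}
```

Note that NFKD on its own does not produce ASCII: an accented letter such as á becomes a base letter followed by a combining mark (U+0301), and ½ decomposes to 1, U+2044 (fraction slash), 2. A further pass is still needed to drop the combining marks and map leftovers like the fraction slash.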
You might also be able to make generic replacements by Unicode category. For example, Pd's can be replaced by -, Nd's can be replaced by the corresponding 0-9 digit, and Mn's can be replaced with the empty string (to remove accents).
You could try using the data from the Unidecode program, or CLDR.
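The category-based replacements can be sketched with `CharUnicodeInfo`, which exposes the general category and decimal digit value of a character. The class and method names here are hypothetical, and the sketch works per UTF-16 `char`, so characters outside the Basic Multilingual Plane pass through unchanged:

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper illustrating replacement by Unicode general category.
public static class CategoryFallback
{
    public static string ReplaceByCategory(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            switch (CharUnicodeInfo.GetUnicodeCategory(c))
            {
                case UnicodeCategory.DashPunctuation:     // Pd: any dash becomes '-'
                    sb.Append('-');
                    break;
                case UnicodeCategory.DecimalDigitNumber:  // Nd: map to the ASCII digit 0-9
                    sb.Append((char)('0' + CharUnicodeInfo.GetDecimalDigitValue(c)));
                    break;
                case UnicodeCategory.NonSpacingMark:      // Mn: drop combining accents
                    break;
                default:                                  // everything else is left alone
                    sb.Append(c);
                    break;
            }
        }
        return sb.ToString();
    }
}
```

Dropping Mn characters only removes accents when the text has been decomposed first (for example with NFD or NFKD), since a precomposed letter such as é is a single character in category Ll.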
Edit: There's a huge substitution chart here.
You should never try to convert Unicode to ASCII because you will end up creating more problems than you solve.
It's like trying to fit 1,114,112 codepoints (Unicode 6.0) into just 128 characters.
Do you think you will succeed?
BTW, there are lots of quotation marks in Unicode, not only the ones you mentioned. Also, if you want to do the conversion anyway, remember that the conversions will depend on the locale.
Check ICU, which contains the most complete Unicode conversion routines.