Can .NET convert Unicode to ASCII, to remove "smart quotes" and the like?

Posted 2024-11-10 10:06:09


Some of our users use e-mail clients that can't cope with Unicode, even when the encoding, etc. are properly set in the mail headers.

I'd like to 'normalise' the content they're receiving. The biggest problem we have is users copy'n'pasting content from Microsoft Word into our web application, which then forwards that content by e-mail - including fractions, smart quotes, and all the other extended Unicode characters that Word helpfully inserts for you.

I'm guessing there is no definitive solution for this, but before I sit down and start writing great big lookup tables, is there some built-in method that'll get me started?

There are basically three phases involved.

First, stripping accents from otherwise-normal letters - solution to this is here

This paragraph contains “smart quotes” and áccénts and ½ of the problem is fractions

goes to

This paragraph contains “smart quotes” and accents and ½ of the problem is fractions

Second, replacing single Unicode characters with their ASCII equivalent, to give:

This paragraph contains "smart quotes" and accents and ½ of the problem is fractions

This is the part where I'm hoping there's a solution before I implement my own. Finally, replacing specific characters with a suitable ASCII sequence - ½ to 1/2, and so on - which I'm pretty sure isn't natively supported by any kind of Unicode magic, but somebody might have written a suitable lookup table I can re-use.

Any ideas?


Comments (4)

羞稚 2024-11-17 10:06:09


Thank you all for some very useful answers. I realize the actual question isn't "How can I convert ANY Unicode character into its ASCII fallback" - the question is "How can I convert the Unicode characters my customers are complaining about into their ASCII fallbacks"?

In other words - we don't need a general-purpose solution; we need a solution that'll work 99% of the time, for English-speaking customers pasting English-language content from Word and other websites into our application. To that end, I analyzed eight years' worth of messages sent through our system looking for characters that aren't representable in ASCII encoding, using this test:

///<summary>Determine whether the supplied character can be
///represented in ASCII encoding.</summary>
bool IsAscii(char inputChar) {
    // Encoding.ASCII maps anything above U+007F to '?', so the round-trip
    // only matches when the character was ASCII to begin with.
    var asciiChar = (char)(Encoding.ASCII.GetBytes(inputChar.ToString())[0]);
    return (asciiChar == inputChar);
}
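The frequency counts in the comments of the replacement table below came out of exactly this kind of scan. As a minimal sketch of how such an audit might look (the `Audit` name and the sample `messages` array are mine - in practice you'd feed in your archived mail):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class NonAsciiAudit {
    // Tally every character in the messages that can't be represented in ASCII.
    public static Dictionary<char, int> Audit(IEnumerable<string> messages) {
        var counts = new Dictionary<char, int>();
        foreach (var ch in messages.SelectMany(m => m))
            if (ch > 127)
                counts[ch] = counts.TryGetValue(ch, out var n) ? n + 1 : 1;
        return counts;
    }

    static void Main() {
        // Stand-in sample; in practice, load eight years of archived messages here.
        var counts = Audit(new[] { "He said “hello” – twice…", "½ price – “special”" });

        // Most frequent offenders first - these are the ones worth mapping by hand.
        foreach (var pair in counts.OrderByDescending(p => p.Value))
            Console.WriteLine($"U+{(int)pair.Key:X4} '{pair.Key}' x {pair.Value}");
    }
}
```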

I've then been through the resulting set of unrepresentable characters and manually assigned an appropriate replacement string. The whole lot is bundled up in an extension method, so you can call myString.Asciify() to convert your string into a reasonable ASCII-encoding approximation.

public static class StringExtensions {
    private static readonly Dictionary<char, string> Replacements = new Dictionary<char, string>();
    /// <summary>Returns the specified string with characters not representable in ASCII codepage 437 converted to a suitable representative equivalent.  Yes, this is lossy.</summary>
    /// <param name="s">A string.</param>
    /// <returns>The supplied string, with smart quotes, fractions, accents and punctuation marks 'normalized' to ASCII equivalents.</returns>
    /// <remarks>This method is lossy. It's a bit of a hack that we use to get clean ASCII text for sending to downlevel e-mail clients.</remarks>
    public static string Asciify(this string s) {
        return (String.Join(String.Empty, s.Select(c => Asciify(c)).ToArray()));
    }

    private static string Asciify(char x) {
        return Replacements.ContainsKey(x) ? (Replacements[x]) : (x.ToString());
    }

    static StringExtensions() {
        Replacements['’'] = "'"; // 75151 occurrences
        Replacements['–'] = "-"; // 23018 occurrences
        Replacements['‘'] = "'"; // 9783 occurrences
        Replacements['”'] = "\""; // 6938 occurrences
        Replacements['“'] = "\""; // 6165 occurrences
        Replacements['…'] = "..."; // 5547 occurrences
        Replacements['£'] = "GBP"; // 3993 occurrences
        Replacements['•'] = "*"; // 2371 occurrences
        Replacements[' '] = " "; // 1529 occurrences
        Replacements['é'] = "e"; // 878 occurrences
        Replacements['ï'] = "i"; // 328 occurrences
        Replacements['´'] = "'"; // 226 occurrences
        Replacements['—'] = "-"; // 133 occurrences
        Replacements['·'] = "*"; // 132 occurrences
        Replacements['„'] = "\""; // 102 occurrences
        Replacements['€'] = "EUR"; // 95 occurrences
        Replacements['®'] = "(R)"; // 91 occurrences
        Replacements['¹'] = "(1)"; // 80 occurrences
        Replacements['«'] = "\""; // 79 occurrences
        Replacements['è'] = "e"; // 79 occurrences
        Replacements['á'] = "a"; // 55 occurrences
        Replacements['™'] = "TM"; // 54 occurrences
        Replacements['»'] = "\""; // 52 occurrences
        Replacements['ç'] = "c"; // 52 occurrences
        Replacements['½'] = "1/2"; // 48 occurrences
        Replacements['­'] = "-"; // 39 occurrences
        Replacements['°'] = " degrees "; // 33 occurrences
        Replacements['ä'] = "a"; // 33 occurrences
        Replacements['É'] = "E"; // 31 occurrences
        Replacements['‚'] = ","; // 31 occurrences
        Replacements['ü'] = "u"; // 30 occurrences
        Replacements['í'] = "i"; // 28 occurrences
        Replacements['ë'] = "e"; // 26 occurrences
        Replacements['ö'] = "o"; // 19 occurrences
        Replacements['à'] = "a"; // 19 occurrences
        Replacements['¬'] = " "; // 17 occurrences
        Replacements['ó'] = "o"; // 15 occurrences
        Replacements['â'] = "a"; // 13 occurrences
        Replacements['ñ'] = "n"; // 13 occurrences
        Replacements['ô'] = "o"; // 10 occurrences
        Replacements['¨'] = ""; // 10 occurrences
        Replacements['å'] = "a"; // 8 occurrences
        Replacements['ã'] = "a"; // 8 occurrences
        Replacements['ˆ'] = ""; // 8 occurrences
        Replacements['©'] = "(c)"; // 6 occurrences
        Replacements['Ä'] = "A"; // 6 occurrences
        Replacements['Ï'] = "I"; // 5 occurrences
        Replacements['ò'] = "o"; // 5 occurrences
        Replacements['ê'] = "e"; // 5 occurrences
        Replacements['î'] = "i"; // 5 occurrences
        Replacements['Ü'] = "U"; // 5 occurrences
        Replacements['Á'] = "A"; // 5 occurrences
        Replacements['ß'] = "ss"; // 4 occurrences
        Replacements['¾'] = "3/4"; // 4 occurrences
        Replacements['È'] = "E"; // 4 occurrences
        Replacements['¼'] = "1/4"; // 3 occurrences
        Replacements['†'] = "+"; // 3 occurrences
        Replacements['³'] = "'"; // 3 occurrences
        Replacements['²'] = "'"; // 3 occurrences
        Replacements['Ø'] = "O"; // 2 occurrences
        Replacements['¸'] = ","; // 2 occurrences
        Replacements['Ë'] = "E"; // 2 occurrences
        Replacements['ú'] = "u"; // 2 occurrences
        Replacements['Ö'] = "O"; // 2 occurrences
        Replacements['û'] = "u"; // 2 occurrences
        Replacements['Ú'] = "U"; // 2 occurrences
        Replacements['Œ'] = "Oe"; // 2 occurrences
        Replacements['º'] = "?"; // 1 occurrences
        Replacements['‰'] = "0/00"; // 1 occurrences
        Replacements['Å'] = "A"; // 1 occurrences
        Replacements['ø'] = "o"; // 1 occurrences
        Replacements['˜'] = "~"; // 1 occurrences
        Replacements['æ'] = "ae"; // 1 occurrences
        Replacements['ù'] = "u"; // 1 occurrences
        Replacements['‹'] = "<"; // 1 occurrences
        Replacements['±'] = "+/-"; // 1 occurrences
    }
}

Note that there are some rather odd fallbacks in there - like this one:

Replacements['³'] = "'"; // 3 occurrences
Replacements['²'] = "'"; // 3 occurrences

That's because one of our users has some program that converts open/close smart-quotes into ² and ³ (like: he said ²hello³), and nobody has ever used them to represent exponentiation, so this will probably work quite nicely for us, but YMMV.

指尖微凉心微凉 2024-11-17 10:06:09


I had some problems with this myself, whilst using a list of strings originally built in Word. I've found that a simple "String".Replace(current char/string, new char/string) call works perfectly. The exact code I used handles the smart quotes - to be exact: left ", right ", left ', and right ' - as follows:

StringName = StringName.Replace(ChrW(8216), "'")     ' Replaces any left ' with a normal '
StringName = StringName.Replace(ChrW(8217), "'")     ' Replaces any right ' with a normal '
StringName = StringName.Replace(ChrW(8220), """")    ' Replace any left " with a normal "
StringName = StringName.Replace(ChrW(8221), """")    ' Replace any right " with a normal "

I hope this helps anyone out there still having this problem!
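For C# projects (the language used elsewhere in this thread), the same four replacements might be sketched like this, using the characters' code points (U+2018/U+2019/U+201C/U+201D correspond to ChrW(8216) through ChrW(8221) above); the `Straighten` name is just illustrative:

```csharp
using System;

class SmartQuoteFix {
    public static string Straighten(string s) =>
        s.Replace('\u2018', '\'')   // left single smart quote  -> '
         .Replace('\u2019', '\'')   // right single smart quote -> '
         .Replace('\u201C', '"')    // left double smart quote  -> "
         .Replace('\u201D', '"');   // right double smart quote -> "

    static void Main() =>
        Console.WriteLine(Straighten("he said \u201Chello\u201D"));
}
```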

岁月蹉跎了容颜 2024-11-17 10:06:09


is there some built-in method that'll get me started?

The first thing I'd try is to convert the text to NFKD normalization form, using the String.Normalize method. This suggestion is mentioned in the answers to the question you linked, but I recommend using NFKD instead of NFD because NFKD will also remove unwanted typographical distinctions (e.g., NBSP → space, or ℂ → C).
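A sketch of that approach (the `StripMarks` name is mine): FormKD decomposition, then dropping the combining marks the decomposition leaves behind. One caveat: compatibility decomposition turns ½ into 1⁄2 with a fraction slash (U+2044), which is still non-ASCII, so fractions need their own replacement pass afterwards.

```csharp
using System;
using System.Globalization;
using System.Text;

class NfkdDemo {
    public static string StripMarks(string input) {
        // FormKD applies compatibility decomposition: é -> e + U+0301, NBSP -> space, ℂ -> C.
        var decomposed = input.Normalize(NormalizationForm.FormKD);
        var sb = new StringBuilder();
        foreach (var ch in decomposed) {
            // Drop the combining accents (category Mn) left behind by decomposition.
            if (CharUnicodeInfo.GetUnicodeCategory(ch) != UnicodeCategory.NonSpacingMark)
                sb.Append(ch);
        }
        return sb.ToString();
    }

    static void Main() => Console.WriteLine(StripMarks("áccénts"));  // prints "accents"
}
```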

You might also be able to make generic replacements by Unicode category. For example, Pd (dash punctuation) characters can be replaced by -, Nd (decimal digit) characters can be replaced by the corresponding 0-9 digit, and Mn (non-spacing mark) characters can be replaced with the empty string (to remove accents).
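A hypothetical sketch of that category-based approach (the mappings chosen here are just the ones suggested above):

```csharp
using System;
using System.Globalization;
using System.Text;

class CategoryFallback {
    public static string ByCategory(string input) {
        var sb = new StringBuilder();
        foreach (var ch in input) {
            switch (CharUnicodeInfo.GetUnicodeCategory(ch)) {
                case UnicodeCategory.DashPunctuation:    // Pd: en dash, em dash, etc.
                    sb.Append('-');
                    break;
                case UnicodeCategory.DecimalDigitNumber: // Nd: digits from any script
                    sb.Append(CharUnicodeInfo.GetDecimalDigitValue(ch));
                    break;
                case UnicodeCategory.NonSpacingMark:     // Mn: combining accents
                    break;                               // drop entirely
                default:
                    sb.Append(ch);
                    break;
            }
        }
        return sb.ToString();
    }
}
```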

but somebody might have written a suitable lookup table I can re-use.

You could try using the data from the Unidecode program, or CLDR.

Edit: There's a huge substitution chart here.

不及他 2024-11-17 10:06:09

你永远不应该尝试将 Unicode 转换为 ASCII,因为你最终会遇到比解决更多的问题。

这就像试图将 1,114,112 个代码点 (Unicode 6.0) 放入 128 个字符中。

你认为你会成功吗?

顺便说一句,Unicode 中有很多引号,不仅是您提到的那些,而且如果您无论如何都想进行转换,请记住转换将取决于区域设置。

检查 ICU - 其中包含最完整的 Unicode 转换例程。

You should never try to convert Unicode to ASCII, because you will end up creating more problems than you solve.

It's like trying to fit 1,114,112 codepoints (Unicode 6.0) into just 128 characters.

Do you think you will succeed?

BTW, there are lots of quotation marks in Unicode, not only the ones you mentioned. And if you want to do the conversion anyway, remember that the conversion will depend on the locale.

Check ICU - it contains the most complete Unicode conversion routines.
