How do I convert (transliterate) a string from UTF-8 to ASCII (single byte) in C#?

Posted 2024-07-12 11:46:05

I have a string object

"with multiple characters and even special characters"

I am trying to use

UTF8Encoding utf8 = new UTF8Encoding();
ASCIIEncoding ascii = new ASCIIEncoding();

objects to convert that string to ASCII. Could someone shed some light on this simple task, which is haunting my afternoon?

EDIT 1:
What we are trying to accomplish is getting rid of special characters, such as some of the special Windows apostrophes. The code that I posted below as an answer does not take care of that. Basically,

O’Brian will become O?Brian, where ’ is one of the special apostrophes.
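
The `?` in that result comes from .NET's default ASCII encoding, which substitutes a question mark for every character it cannot represent. Here is a minimal sketch of that behaviour; the class and method names are made up for illustration and are not part of the original question:

```csharp
using System.Text;

public static class AsciiFallbackDemo
{
    // Round-trips a string through the default ASCII encoding.
    // Encoding.ASCII uses a replacement fallback of "?", so every
    // character outside the 7-bit range comes back as a question mark.
    public static string RoundTrip(string input)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(input);
        return Encoding.ASCII.GetString(bytes);
    }
}
```

Calling `AsciiFallbackDemo.RoundTrip("O\u2019Brian")` returns `O?Brian`, which is exactly the symptom described above.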

Comments (6)

木落 2024-07-19 11:46:05

This was in response to your other question, which looks like it has been deleted, but the point still stands.

Looks like a classic Unicode to ASCII issue. The trick would be to find where it's happening.

.NET works fine with Unicode, assuming it's told it's Unicode to begin with (or left at the default).

My guess is that your receiving app can't handle it. So, I'd probably use an ASCII encoding with an EncoderReplacementFallback of String.Empty:

using System.IO;
using System.Text;

string inputString = GetInput();
// Encoding.ASCII is read-only, so request an ASCII encoding whose encoder
// fallback replaces unencodable characters with an empty string.
Encoding ascii = Encoding.GetEncoding(
    "us-ascii",
    new EncoderReplacementFallback(string.Empty),
    new DecoderExceptionFallback());

byte[] bAsciiString = ascii.GetBytes(inputString);

// Do something with the bytes...
// can write to a file as is
File.WriteAllBytes(FILE_NAME, bAsciiString);
// or turn back into a "clean" string
string cleanString = ascii.GetString(bAsciiString);
// since the offending characters have been removed, the default encoding decodes the bytes the same way
Assert.AreEqual(cleanString, Encoding.Default.GetString(bAsciiString));

Of course, in the old days, we'd just loop through and remove any chars greater than 127... well, those of us in the US at least. ;)
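
For reference, the "old days" loop mentioned above could look roughly like this. It is only a sketch, and `AsciiFilter` and `StripNonAscii` are hypothetical names chosen for the example:

```csharp
using System.Text;

public static class AsciiFilter
{
    // Copies only the characters in the 7-bit ASCII range (0..127)
    // and silently drops everything else.
    public static string StripNonAscii(string input)
    {
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (c <= 127)
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

Like the empty-string fallback, this drops the offending characters rather than transliterating them, so `O’Brian` ends up as `OBrian`.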

反差帅 2024-07-19 11:46:05

I was able to figure it out. In case anyone wants to know, here is the code that worked for me:

ASCIIEncoding ascii = new ASCIIEncoding();
byte[] byteArray = Encoding.UTF8.GetBytes(sOriginal);
byte[] asciiArray = Encoding.Convert(Encoding.UTF8, Encoding.ASCII, byteArray);
string finalString = ascii.GetString(asciiArray);

Let me know if there is a simpler way of doing it.

只为守护你 2024-07-19 11:46:05

For anyone who likes Extension methods, this one does the trick for us.

using System.Text;

namespace System
{
    public static class StringExtension
    {
        private static readonly ASCIIEncoding asciiEncoding = new ASCIIEncoding();

        public static string ToAscii(this string dirty)
        {
            byte[] bytes = asciiEncoding.GetBytes(dirty);
            string clean = asciiEncoding.GetString(bytes);
            return clean;
        }
    }
}

(System namespace so it's available pretty much automatically for all of our strings.)

不爱素颜 2024-07-19 11:46:05

Based on Mark's answer above (and Geo's comment), I created a two-line version that removes all non-ASCII characters from a string. Provided for people searching for this answer (as I did).

using System.Text;

// Create an ASCII encoding whose encoder fallback replaces
// unencodable characters with an empty string (i.e. drops them)
var asciiEncoding = Encoding.GetEncoding("us-ascii",
    new EncoderReplacementFallback(string.Empty),
    new DecoderExceptionFallback());

string cleanString = asciiEncoding.GetString(asciiEncoding.GetBytes(dirtyString));
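
To make the effect concrete, the two lines above can be wrapped in a small helper. This is a sketch only; `AsciiCleaner` and `RemoveNonAscii` are illustrative names that do not appear in the answer:

```csharp
using System.Text;

public static class AsciiCleaner
{
    // ASCII encoding that drops unencodable characters instead of
    // replacing them with '?'.
    private static readonly Encoding StrictAscii = Encoding.GetEncoding(
        "us-ascii",
        new EncoderReplacementFallback(string.Empty),
        new DecoderExceptionFallback());

    // Encodes to ASCII bytes and decodes back, losing every non-ASCII character.
    public static string RemoveNonAscii(string dirty)
    {
        return StrictAscii.GetString(StrictAscii.GetBytes(dirty));
    }
}
```

With this helper, `AsciiCleaner.RemoveNonAscii("O\u2019Brian")` gives `OBrian`: the curly apostrophe is removed rather than converted to a straight one, so this approach cleans the string but does not transliterate it.
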
梦晓ヶ微光ヅ倾城 2024-07-19 11:46:05

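If you want an 8-bit representation of characters, as used by many legacy encodings, this may help you.

You must change the variable targetEncoding to whatever encoding you want.

// On .NET Core / .NET 5+, code-page encodings such as 874 are only available after
// calling Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
Encoding targetEncoding = Encoding.GetEncoding(874); // Your target encoding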
Encoding utf8 = Encoding.UTF8;

var stringBytes = utf8.GetBytes(Name);
var stringTargetBytes = Encoding.Convert(utf8, targetEncoding, stringBytes);
var ascii8BitRepresentAsCsString = Encoding.GetEncoding("Latin1").GetString(stringTargetBytes);
半葬歌 2024-07-19 11:46:05

Here is code to transliterate Unicode chars to their closest ASCII version where possible. It removes or fixes accents, macrons, typesetters' colons, dashes, curly quotes, apostrophes, invisible spaces, and other bad chars.

This is useful if you need to feed data into another system that does not support Unicode. The code is fast because it uses a StringBuilder and a simple loop (in a test, processing an 8,000-char string 10,000 times took 1.1 sec).

Address:123 East Tāmaki – Tāmaki“ ” GötheФ€ O’Briens ‘hello’ he said!

outputs ->

Address:123 East Tamaki - Tamaki" " Gothe O'Briens 'hello' he said!

    // Requires: using System.Globalization; using System.Text;
    /// <summary>
    /// Transliterate all unicode chars to their closest ascii version
    /// Remove/fix accents, maori macrons, typesetters colons, dashes, curly quotes, apostrophes, dashes, invisible spaces, and other bad chars
    /// 1. remove accents but keep the letters
    /// 2. fix punctuation to the closest ascii punctuation
    /// 3. remove any remaining non ascii chars
    /// 4. also remove any invisible control chars
    /// Option: remove line breaks or keep them
    /// </summary>
    /// <example>"CHASSIS NO.:LC0CE4CB3N0345426 East Tāmaki – East Tāmaki“ ” GötheФ€ O’Briens ‘hello’ he said!" outputs "CHASSIS NO.:LC0CE4CB3N0345426 East Tamaki - East Tamaki" " Gothe O'Briens 'hello' he said!"</example>
    public static string CleanUnicodeTransliterateToAscii(string text, bool removeLineBreaks) {
        if (text == null) return null;

        // decomposes accented letters into the letter and the diacritic, fixes wacky punctuation to closest common punctuation
        text = text.Normalize(NormalizationForm.FormKD);

        // loop all chars after converting all punctuation to the closest (fix curly quotes etc)
        var stringBuilder = new StringBuilder();
        foreach (var c in text) {
            var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
            if (c == '\r' || c == '\n') {
                if (removeLineBreaks) {
                    // skip
                } else {
                    stringBuilder.Append(c);
                }
            } else if (unicodeCategory == UnicodeCategory.Control) {
                // control char - skip
            } else if (unicodeCategory == UnicodeCategory.NonSpacingMark) {
                // diacritic mark/accent - skip             
            } else if (c == '‘' || c == '’') {
                // single curly quote or apostrophe add apostrophe
                stringBuilder.Append("'");
            } else if (unicodeCategory == UnicodeCategory.InitialQuotePunctuation || unicodeCategory == UnicodeCategory.FinalQuotePunctuation) {
                // any other quote add a normal straight quote
                stringBuilder.Append("\"");
            } else if (unicodeCategory == UnicodeCategory.DashPunctuation) {
                stringBuilder.Append("-");
            } else if (unicodeCategory == UnicodeCategory.SpaceSeparator) {
                // add a normal space
                stringBuilder.Append(" ");
            } else if (c > 127) {
                // skip any remaining non ascii chars (ASCII ends at 127)
            } else {
                stringBuilder.Append(c);
            }
        }
        text = stringBuilder.ToString();
        return text;
    }
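
The accent-removal step of the method above can also be looked at in isolation. Here is a minimal sketch of just that step, using the same `Normalize(NormalizationForm.FormKD)` call and `NonSpacingMark` check; `DiacriticsDemo` and `StripDiacritics` are hypothetical names introduced for the example:

```csharp
using System.Globalization;
using System.Text;

public static class DiacriticsDemo
{
    // Decomposes accented letters into base letter + combining mark,
    // then drops the combining marks and keeps everything else.
    public static string StripDiacritics(string input)
    {
        string decomposed = input.Normalize(NormalizationForm.FormKD);
        var sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

For example, `DiacriticsDemo.StripDiacritics("T\u0101maki G\u00f6the")` yields `Tamaki Gothe`. Unlike the full method, this sketch leaves curly quotes, dashes, and other non-ASCII symbols untouched.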