Handling unencodable characters in C#

Posted on 2025-02-10 01:27:14

Given an input string and an encoding, I want to process each character in the input string as follows:

  • If the codepoint can be encoded, then encode it;

  • If not, output (the encoding of) the string &#xUUUU; where UUUU is the hex value of the Unicode codepoint.

I've read through the .NET documentation for Encoder and EncoderFallback, and I can see how to get notified when an unencodable character is found, but I can't see any way to output something that actually depends on the particular character in question.

Any ideas?

Looking a bit deeper (thanks @JosefZ), I see that the description of the EncoderFallback class says it supports three mechanisms, including:

Best-fit fallback, which maps valid Unicode characters that cannot be encoded to an approximate equivalent. For example, a best-fit fallback handler for the ASCIIEncoding class might map Æ (U+00C6) to AE (U+0041 + U+0045). A best-fit fallback handler might also be implemented to transliterate one alphabet (such as Cyrillic) to another (such as Latin or Roman). The .NET Framework does not provide any public best-fit fallback implementations.

which would appear to be the one I am after: so I have to work out how to write my own implementation of EncoderFallback?

在巴黎塔顶看东京樱花 2025-02-17 01:27:14

You can use the following EncoderFallback and EncoderFallbackBuffer implementations to do what you want:

using System.Globalization;
using System.Text;

public class HexFallback : EncoderFallback
{
    // Longest possible replacement is "&#x10FFFF;" (10 chars) for a supplementary code point
    public override int MaxCharCount { get { return 10; } }
    public override EncoderFallbackBuffer CreateFallbackBuffer() { return new HexFallbackBuffer(); }
}

public class HexFallbackBuffer : EncoderFallbackBuffer
{
    string _replacement = string.Empty;   // text emitted in place of the unencodable char
    int _position;                        // index of the next replacement char to hand out

    public override bool Fallback(char charUnknown, int index)
    {
        return Start(charUnknown);   // single UTF-16 code unit
    }

    public override bool Fallback(char charUnknownHigh, char charUnknownLow, int index)
    {
        // combine the surrogate pair so one reference names the real code point
        return Start(char.ConvertToUtf32(charUnknownHigh, charUnknownLow));
    }

    bool Start(int codePoint)
    {
        // at least four uppercase hex digits, e.g. &#x00C6; or &#x1F600;
        _replacement = "&#x" + codePoint.ToString("X4", CultureInfo.InvariantCulture) + ";";
        _position = 0;
        return true;   // true = we have replacement chars to offer
    }

    // number of replacement chars not yet handed out (0 when idle)
    public override int Remaining { get { return _replacement.Length - _position; } }

    public override void Reset()
    {
        _replacement = string.Empty;
        _position = 0;
    }

    public override bool MovePrevious()   // step back one char if the encoder has to retry
    {
        if (_position == 0)
            return false;
        _position -= 1;
        return true;
    }

    public override char GetNextChar()
    {
        // (char)0 tells the encoder that the replacement is exhausted
        return _position < _replacement.Length ? _replacement[_position++] : (char)0;
    }
}
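
The buffer above only makes sense once you know how an encoder calls it, so here is a minimal sketch of that pull protocol, driven by hand. It uses the built-in EncoderReplacementFallback rather than HexFallback so that it compiles on its own. `FallbackProtocolDemo` and `Drain` are names made up for this illustration, and real encoders do more bookkeeping than this (for example, calling MovePrevious when the output buffer fills up).

```csharp
using System.Text;

public static class FallbackProtocolDemo
{
    // Hypothetical helper: plays the encoder's role for one unencodable char.
    public static string Drain(EncoderFallbackBuffer buffer, char unknown)
    {
        var result = new StringBuilder();

        // Step 1: report the char that could not be encoded.
        // A return value of false means there is nothing to substitute.
        if (!buffer.Fallback(unknown, 0))
            return string.Empty;

        // Step 2: pull replacement chars until the buffer signals the end with '\0'.
        char next;
        while ((next = buffer.GetNextChar()) != '\0')
            result.Append(next);

        return result.ToString();
    }
}
```

Calling `FallbackProtocolDemo.Drain(new EncoderReplacementFallback("?!").CreateFallbackBuffer(), 'Æ')` returns "?!". Passing a HexFallbackBuffer instead should yield the character reference for whichever char was reported.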


You use it like this:

var encoder = Encoding.ASCII.GetEncoder();
encoder.Fallback = new HexFallback();   // replaces the default '?' substitution

var str = "Æ";
var buffer = new byte[1000];   // large enough for this short demo string

// flush: true, because this is the only block of input
var length = encoder.GetBytes(str.ToCharArray(), 0, str.Length, buffer, 0, true);

// write out encoded string; prints &#x00C6;
Console.WriteLine(Encoding.ASCII.GetString(buffer, 0, length));
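
If you want something to check the fallback's output against, a slower approach without a custom fallback is sketched below. It is a different technique from the answer above: it encodes one code point at a time with the built-in exception fallback and substitutes the reference whenever encoding throws. `NcrReference` and `Encode` are hypothetical names, and the sketch assumes a stateless target encoding in which the characters of "&#x...;" are themselves encodable (true for ASCII and its supersets).

```csharp
using System.Globalization;
using System.IO;
using System.Text;

public static class NcrReference
{
    // Hypothetical helper: encodes text code point by code point and writes
    // &#xUUUU; for every code point the target encoding rejects.
    public static byte[] Encode(string text, string encodingName)
    {
        // Strict variant of the encoding: throws instead of substituting '?'.
        Encoding strict = Encoding.GetEncoding(
            encodingName,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);

        var output = new MemoryStream();
        for (int i = 0; i < text.Length; i++)
        {
            // Treat a valid surrogate pair as a single code point.
            bool isPair = char.IsSurrogatePair(text, i);
            string unit = text.Substring(i, isPair ? 2 : 1);

            byte[] bytes;
            try
            {
                bytes = strict.GetBytes(unit);
            }
            catch (EncoderFallbackException)
            {
                int codePoint = isPair ? char.ConvertToUtf32(text, i) : text[i];
                string reference =
                    "&#x" + codePoint.ToString("X4", CultureInfo.InvariantCulture) + ";";
                bytes = strict.GetBytes(reference);
            }

            output.Write(bytes, 0, bytes.Length);
            if (isPair) i++;   // skip the low surrogate already consumed
        }
        return output.ToArray();
    }
}
```

For example, `Encoding.ASCII.GetString(NcrReference.Encode("aÆ", "us-ascii"))` gives "a&#x00C6;". Relying on exceptions per character is costly, so this is better suited to tests than to production encoding paths.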
