Text decoding problem

Posted on 2024-08-27 21:44:05

So given this input string:

=?ISO-8859-1?Q?TEST=2C_This_Is_A_Test_of_Some_Encoding=AE?=

And this function:

private string DecodeSubject(string input)
{
    StringBuilder sb = new StringBuilder();
    MatchCollection matches = Regex.Matches(inputText.Text, @"=\?(?<encoding>[\S]+)\?.\?(?<data>[\S]+[=]*)\?=");
    foreach (Match m in matches)
    {
        string encoding = m.Groups["encoding"].Value;
        string data = m.Groups["data"].Value;

        Encoding enc = Encoding.GetEncoding(encoding.ToLower());
        if (enc == Encoding.UTF8)
        {
            byte[] d = Convert.FromBase64String(data);
            sb.Append(Encoding.ASCII.GetString(d));
        }
        else
        {
            byte[] bytes = Encoding.Default.GetBytes(data);
            string decoded = enc.GetString(bytes);
            sb.Append(decoded);
        }
    }

    return sb.ToString();
}

The result is the same as the data extracted from the input string. What am I doing wrong that keeps this text from being decoded properly?

UPDATE

So I have this code for decoding quoted-printable:

public string DecodeQuotedPrintable(string encoded)
{
    byte[] buffer = new byte[1];
    return Regex.Replace(encoded, "=(\r\n?|\n)|=([A-F0-9]{2})", delegate(Match m)
    {
        if (byte.TryParse(m.Groups[2].Value, NumberStyles.HexNumber, CultureInfo.InvariantCulture, out buffer[0]))
        {
            return Encoding.ASCII.GetString(buffer);
        }
        else
        {
            return string.Empty;
        }
    });
}

That just leaves the underscores. Do I manually convert those to spaces (Replace("_", " ")), or is there something else I need to do to handle them?

Comments (2)

隔岸观火 2024-09-03 21:44:05

It looks like you don't fully understand the format of the input line. See RFC 2047: http://www.ietf.org/rfc/rfc2047.txt
The format is: encoded-word = "=?" charset "?" encoding "?" encoded-text "?="

So you have to:

  1. Extract the charset (the Encoding, in .NET terms). It is not limited to UTF8 or Default.
  2. Extract the encoding: either B for base64 or Q for quoted-printable (your case!).
  3. Decode the encoded text to bytes, then decode those bytes to a string using the charset.
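
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the asker's code and not a complete RFC 2047 implementation: the class name `Rfc2047Sketch` and the method names `Decode` and `DecodeQ` are hypothetical, and the sketch assumes well-formed input. It does not, for example, drop the whitespace between adjacent encoded-words as the RFC requires. In the Q encoding an underscore always stands for a space (byte 0x20), which is why it is handled at the byte level before the charset is applied.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; the names are not from the original post.
public static class Rfc2047Sketch
{
    // Mirrors: "=?" charset "?" encoding "?" encoded-text "?="
    private static readonly Regex EncodedWord = new Regex(
        @"=\?(?<charset>[^?\s]+)\?(?<encoding>[BbQq])\?(?<text>[^?\s]*)\?=");

    public static string Decode(string input)
    {
        return EncodedWord.Replace(input, m =>
        {
            // Step 1: the charset selects the .NET Encoding used for the final string.
            Encoding charset = Encoding.GetEncoding(m.Groups["charset"].Value);
            string text = m.Groups["text"].Value;

            // Step 2: the encoding letter selects how the raw bytes are recovered.
            bool isBase64 = m.Groups["encoding"].Value.ToUpperInvariant() == "B";
            byte[] bytes = isBase64 ? Convert.FromBase64String(text) : DecodeQ(text);

            // Step 3: bytes to string, using the declared charset.
            return charset.GetString(bytes);
        });
    }

    // Q encoding: "_" is a space, "=XX" is a hex-encoded byte, anything else is literal.
    private static byte[] DecodeQ(string text)
    {
        var bytes = new List<byte>(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            byte value;
            if (c == '_')
            {
                bytes.Add(0x20);
            }
            else if (c == '=' && i + 2 < text.Length &&
                     byte.TryParse(text.Substring(i + 1, 2), NumberStyles.HexNumber,
                                   CultureInfo.InvariantCulture, out value))
            {
                bytes.Add(value);
                i += 2; // skip the two hex digits just consumed
            }
            else
            {
                bytes.Add((byte)c);
            }
        }
        return bytes.ToArray();
    }
}
```

Calling `Rfc2047Sketch.Decode` on the input string from the question should return the subject with the comma, the spaces, and the trailing ® restored. Collecting bytes first and applying the charset once per encoded-word also keeps multi-byte charsets such as UTF-8 intact, which a per-escape string replacement would not.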

远昼 2024-09-03 21:44:05

  1. The function isn't even trying to decode the quoted-printable encoded content (the hex codes and underscores). You need to add that.
  2. It's handling the encoding wrong (UTF-8 gets decoded with Encoding.ASCII for some bizarre reason).
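
To make the second point concrete, here is a small sketch of what happens when decoded bytes go through `Encoding.ASCII` instead of the declared charset. The class name `CharsetMismatchDemo` and its methods are hypothetical, and the two-byte sample is an assumption chosen to match the `=AE` escape in the question. The same concern applies to the `DecodeQuotedPrintable` update in the question, which also converts each byte with `Encoding.ASCII`.

```csharp
using System;
using System.Text;

// Hypothetical demo class; not part of the original post.
public static class CharsetMismatchDemo
{
    // The bytes that "A=AE" stands for once the Q escape is resolved.
    private static readonly byte[] Raw = { 0x41, 0xAE };

    // ASCII is a 7-bit encoding, so the 0xAE byte cannot survive this call.
    public static string AsAscii()
    {
        return Encoding.ASCII.GetString(Raw);
    }

    // The charset declared in the encoded-word maps 0xAE to U+00AE.
    public static string AsLatin1()
    {
        return Encoding.GetEncoding("ISO-8859-1").GetString(Raw);
    }

    public static void Main()
    {
        Console.WriteLine(AsAscii() == AsLatin1()); // False
        Console.WriteLine((int)AsLatin1()[1]);      // 174, i.e. U+00AE
    }
}
```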