How to read an ANSI-encoded file containing special characters

Posted 2024-08-04 10:53:42

I'm writing a TFS check-in policy that checks whether our source files contain our file header.

My problem is that our file header contains a special character, "©", and unfortunately some of our source files are encoded in ANSI.
So when I read these files in the policy, the string looks like this: "Copyright � 2009".

string content = File.ReadAllText(pendingChange.LocalItem);

往日情怀 2024-08-11 10:53:42

Use Encoding.Default:

string content = File.ReadAllText(pendingChange.LocalItem, Encoding.Default);

Be aware, however, that this reads the file using the system default encoding, which may not match the file's actual encoding. There is no single encoding called ANSI; when people say "the ANSI encoding" they usually mean Windows code page 1252 or whatever code page their machine happens to use.

Your code will be more robust if you can find out the exact encoding used.
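If the files are known to be Windows-1252, naming that encoding explicitly might look like the minimal sketch below. The class and method names are hypothetical, and the `RegisterProvider` call assumes a .NET Core / .NET 5+ runtime, where code page encodings are not registered by default and `Encoding.Default` is UTF-8 rather than the ANSI code page. On .NET Framework, code page 1252 is built in and that line can be dropped.

```csharp
using System.IO;
using System.Text;

public static class AnsiReadSketch
{
    // Hypothetical helper, not part of the original answer:
    // reads a file as Windows-1252 regardless of the machine's default code page.
    public static string ReadAsWindows1252(string path)
    {
        // Assumption: running on .NET Core / .NET 5+, where code page
        // encodings must be registered before GetEncoding(1252) succeeds.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding cp1252 = Encoding.GetEncoding(1252);
        return File.ReadAllText(path, cp1252);
    }
}
```

With this, the byte 0xA9 in the file decodes to "©" on any machine, instead of depending on the local settings.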

夢归不見 2024-08-11 10:53:42

If you are going to have such policies, it would seem sensible to also have a team-agreed standard encoding. To be honest, I can't see why any team would use an encoding other than "Unicode (UTF-8 with signature) - Codepage 65001" (except perhaps for ASPX pages with significant non-Latin static content, and even then I can't see how using UTF-8 would be a big deal).

Assuming you still want to allow mixed encodings, you next need a way to determine which encoding a file was saved in, so that you know which encoding to pass to ReadAllText. It's not easy to determine this from the file; however, using Encoding.Default is likely to work OK, since you most likely have just two encodings to deal with: the VS one (UTF-8 with signature) and a common ANSI encoding used by your machines (probably Windows-1252).

Hence using

 string content = File.ReadAllText(pendingChange.LocalItem, Encoding.Default);

will work (as I see Jon has already posted). This works because when the UTF-8 BOM (which is what VS means by the term "signature") is present at the start of the file, the supplied encoding parameter is ignored and UTF-8 is used anyway. Hence where the file is saved as UTF-8 with a BOM you get correct results, and where ANSI is used you are most likely to get correct results as well.
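The BOM behaviour can be checked with a small round trip. This is only a sketch with hypothetical names: it writes text with a UTF-8 BOM to a temp file and then reads it back while deliberately passing a single-byte encoding (ISO-8859-1 here, chosen because it is available on every .NET runtime).

```csharp
using System.IO;
using System.Text;

public static class BomDetectionSketch
{
    // Hypothetical helper: writes `text` as UTF-8 with a BOM, then reads it
    // back while passing ISO-8859-1 as the encoding argument.
    public static string RoundTripWithBom(string text)
    {
        string path = Path.GetTempFileName();
        try
        {
            // UTF8Encoding(true) emits the BOM ("signature") at the start of the file.
            File.WriteAllText(path, text, new UTF8Encoding(true));

            // File.ReadAllText detects the BOM and decodes as UTF-8,
            // so the ISO-8859-1 argument only acts as a fallback.
            return File.ReadAllText(path, Encoding.GetEncoding("ISO-8859-1"));
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

If the returned string equals the input, the supplied encoding was indeed overridden by the BOM. Files saved as UTF-8 without a BOM do not get this treatment and would be decoded with the fallback.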

BTW, if you are processing file headers, wouldn't ReadAllLines make things easier?
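A header check built on ReadAllLines could be sketched roughly as follows. The method name, the `marker` text and the `maxLines` limit are assumptions made for illustration; the real policy would define its own rule for what counts as a header.

```csharp
using System.IO;
using System.Linq;
using System.Text;

public static class HeaderCheckSketch
{
    // Hypothetical rule: the header marker must appear within the first
    // `maxLines` lines of the file.
    public static bool HasHeader(string path, Encoding fallbackEncoding, string marker, int maxLines)
    {
        // ReadAllLines also honours a BOM if one is present; otherwise
        // it decodes with the fallback encoding passed in.
        return File.ReadAllLines(path, fallbackEncoding)
                   .Take(maxLines)
                   .Any(line => line.Contains(marker));
    }
}
```

A call such as `HasHeader(pendingChange.LocalItem, Encoding.Default, "Copyright ©", 5)` would then replace the manual search through one large string.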

淡笑忘祈一世凡恋 2024-08-11 10:53:42

I know this is an old question, but I ran into a similar situation and found the accepted answer to be cutting some corners (no disrespect to Jon Skeet's pragmatic short answer, but I'll flesh it out a little more)...

The RTF specs (http://www.biblioscape.com/rtf15_spec.htm#Heading6) state that the header will contain the encoding directly after {\rtf:

 \ansi  ANSI (the default)
 \mac   Apple Macintosh
 \pc    IBM PC code page 437 
 \pca   IBM PC code page 850, used by IBM Personal System/2 (not implemented in version 1 of Microsoft Word for OS/2)

According to Wikipedia the "ANSI character set has no well-defined meaning"

For the default ANSI you have the choice of these partially incompatible encodings:

using System.Text;
...
string content = File.ReadAllText(filename, Encoding.GetEncoding("ISO-8859-1"));
or
string content = File.ReadAllText(filename, Encoding.GetEncoding("Windows-1252"));

Using WordPad on Windows 10 to save a file with a euro sign (0x80 in Windows-1252; ISO-8859-1 has no euro sign, and 0xA4 is the euro only in ISO-8859-15) revealed the following:

The header stated the exact encoding after \ansi

{\rtf1\ansi\ansicpg1252\deff0\nouicompat\deflang1043{ ...

And the character was not stored as a raw byte; instead it was wrapped in an RTF escape: \'80

according to the specs:

\'hh : A hexadecimal value, based on the specified character set (may
be used to identify 8-bit values).

I guess the best thing to do is to read the header; if the file starts with {\rtf1\ansi\ansicpg1252, go with Windows-1252.
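Reading the declared code page out of the header line could be sketched like this. The helper name is hypothetical, and the sketch only looks for the `\ansicpgN` control word shown in the WordPad output above; it does not attempt to parse RTF in general.

```csharp
using System.Text.RegularExpressions;

public static class RtfHeaderSketch
{
    // Hypothetical helper: returns the code page number declared via
    // \ansicpgN in the header line, or null if no such control word is found.
    public static int? TryGetAnsiCodePage(string headerLine)
    {
        if (headerLine == null) return null;

        // Matches a literal backslash, "ansicpg", then one or more digits.
        Match m = Regex.Match(headerLine, @"\\ansicpg(\d+)");

        int codePage;
        return m.Success && int.TryParse(m.Groups[1].Value, out codePage)
            ? codePage
            : (int?)null;
    }
}
```

The returned number can be passed to `Encoding.GetEncoding(int)`, falling back to a default when the result is null.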

But to make things more complicated, the specs also state that there can be mixed encodings... search for '\upr'...

I guess there is no definitive answer. The easiest way to go in your case may be to search (in the un-decoded raw byte array) for all the variations of the encoded copyright sign that you may encounter in your source base.
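A raw-byte search along those lines might look like the sketch below. It assumes only two variants matter: the UTF-8 form (0xC2 0xA9) and the single-byte form 0xA9 shared by Windows-1252 and ISO-8859-1. The names and return values are made up for illustration, and other encodings in a real code base would need their own patterns.

```csharp
public static class RawCopyrightSearchSketch
{
    private static readonly byte[] Utf8Sign = { 0xC2, 0xA9 };   // "©" in UTF-8
    private static readonly byte[] SingleByteSign = { 0xA9 };   // "©" in Windows-1252 / ISO-8859-1

    // Returns "utf-8", "single-byte", or null when no copyright sign is found.
    // The UTF-8 pattern is tested first, because its second byte would
    // otherwise also match the single-byte pattern.
    public static string DetectVariant(byte[] data)
    {
        if (IndexOf(data, Utf8Sign) >= 0) return "utf-8";
        if (IndexOf(data, SingleByteSign) >= 0) return "single-byte";
        return null;
    }

    // Naive search for the first occurrence of `pattern` in `data`.
    private static int IndexOf(byte[] data, byte[] pattern)
    {
        for (int i = 0; i + pattern.Length <= data.Length; i++)
        {
            int j = 0;
            while (j < pattern.Length && data[i + j] == pattern[j]) j++;
            if (j == pattern.Length) return i;
        }
        return -1;
    }
}
```

Called as `DetectVariant(File.ReadAllBytes(path))`, this avoids decoding altogether. One caveat: a lone 0xA9 can also occur as a continuation byte of some other UTF-8 character, so the single-byte result is a heuristic rather than proof.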

In my case I finally decided to cut a few corners as well, but add a small percentage of defensive coding. All files I have seen so far were Windows-1252 so I common-case-optimised for that.

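    // Requires: using System.Text;
    // Default to Windows-1252 (the common case); undecodable bytes become replacement characters.
    Encoding encoding = Encoding.GetEncoding("Windows-1252", EncoderFallback.ReplacementFallback, DecoderFallback.ReplacementFallback);
    
    using (System.IO.StreamReader reader = new System.IO.StreamReader(filename, encoding)) {
        // ReadLine returns null for an empty file; treat that as an empty header.
        string header = reader.ReadLine() ?? string.Empty;
        if (!header.Contains("cpg1252")) {
            // Test \pca before \pc, because "\pc" is a prefix of "\pca".
            if (header.Contains("\\pca"))
                encoding = Encoding.GetEncoding(850, EncoderFallback.ReplacementFallback, DecoderFallback.ReplacementFallback);
            else if (header.Contains("\\pc"))
                encoding = Encoding.GetEncoding(437, EncoderFallback.ReplacementFallback, DecoderFallback.ReplacementFallback);
            else
                // Anything else (including \mac) falls back to ISO-8859-1.
                encoding = Encoding.GetEncoding("ISO-8859-1", EncoderFallback.ReplacementFallback, DecoderFallback.ReplacementFallback);
        }
    }
    
    // Re-read the whole file with the encoding chosen from the header.
    string content = System.IO.File.ReadAllText(filename, encoding);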
