How to read an ANSI-encoded file containing special characters
I'm writing a TFS check-in policy that checks whether our source files contain our file header.
My problem is that our file header contains the special character "©", and unfortunately some of our source files are encoded in ANSI.
So if I read these files in the policy, the string looks like this: "Copyright � 2009".
string content = File.ReadAllText(pendingChange.LocalItem);
I tried to change the encoding of the string, but that did not help. So how can I read these files so that I get the correct string "Copyright © 2009"?
Use Encoding.Default when reading the file. You should be aware, however, that this reads the file using the system default encoding, which may not be the same as the encoding of the file. There's no single encoding called ANSI, but usually when people talk about "the ANSI encoding" they mean Windows code page 1252 or whatever their box happens to use.
Your code will be more robust if you can find out the exact encoding used.
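As a minimal sketch of the two options in this answer, the helper below reads a file either with Encoding.Default or with an explicitly named code page. The class and method names are hypothetical, and Windows-1252 is only an assumption about what "ANSI" means for these files. Note that Encoding.Default is the system ANSI code page on .NET Framework but is UTF-8 on .NET Core and .NET 5+, where code page 1252 is also unavailable until the code-pages provider is registered.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper class; the name and methods are illustrative only.
public static class AnsiFileReader
{
    // Reads with the process default encoding, as suggested above.
    // On .NET Framework this is the system ANSI code page; on .NET Core
    // and .NET 5+ Encoding.Default is UTF-8, so it would not help there.
    public static string ReadWithDefault(string path)
    {
        return File.ReadAllText(path, Encoding.Default);
    }

    // Reads with an explicitly named code page (Windows-1252 is assumed).
    public static string ReadAsWindows1252(string path)
    {
        // Needed on .NET Core and .NET 5+, where code page 1252 is not
        // available by default. On .NET Framework 1252 is built in, and
        // this type may require the System.Text.Encoding.CodePages package.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return File.ReadAllText(path, Encoding.GetEncoding(1252));
    }
}
```

In the check-in policy this would be called as AnsiFileReader.ReadAsWindows1252(pendingChange.LocalItem). Naming the code page makes the result independent of the regional settings of the machine that runs the policy.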
It would seem sensible, if you are going to have such policies, to also have a team-agreed standard encoding. To be honest, I can't see why any team would use an encoding other than "Unicode (UTF-8 with signature) - Codepage 65001" (except perhaps for ASPX pages with significant non-Latin static content, but even then I can't see how it would be a big deal to use UTF-8).
Assuming you still want to allow mixed encodings, you next need a way to determine which encoding a file was saved in, so that you know which encoding to pass to ReadAllText. It's not easy to determine this from the file itself; however, using Encoding.Default is likely to work OK, since you most likely have just two encodings to deal with: the VS default (UTF-8 with signature) and a common ANSI encoding used by your machines (probably Windows-1252).
Hence passing Encoding.Default to ReadAllText will work (as I see Jon has already posted). This works because when the UTF-8 BOM (which is what VS means by the term "signature") is present at the start of the file, the supplied encoding parameter is ignored and UTF-8 is used anyway. Hence where the file is saved using UTF-8 you get correct results, and where ANSI is used you are most likely also to get correct results.
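The BOM behaviour described here can be observed with a small sketch. The helper below is hypothetical (its name and signature are not from the original posts): it reads a file with a caller-supplied fallback encoding, lets a byte order mark override that fallback, and reports which encoding was actually used. On .NET Framework the fallback would typically be Encoding.Default.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper; illustrates BOM detection with a fallback encoding.
public static class EncodingProbe
{
    // Returns the decoded text and reports the encoding that was used.
    public static string ReadAndReport(string path, Encoding fallback, out Encoding used)
    {
        // The third argument enables detection from byte order marks;
        // a BOM at the start of the file overrides the fallback.
        using (var reader = new StreamReader(path, fallback, true))
        {
            string text = reader.ReadToEnd();
            // CurrentEncoding is only meaningful after the first read.
            used = reader.CurrentEncoding;
            return text;
        }
    }
}
```

For a file saved as UTF-8 with signature, the reported encoding is UTF-8 regardless of the fallback. A UTF-8 file saved without a BOM is not detected and would be decoded with the fallback, so the "©" would still come out wrong in that case.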
BTW, if you are processing file headers, wouldn't ReadAllLines make things easier?
I know this is an old question, but I ran into a similar situation and found the accepted answer to be cutting some corners (no disrespect to Jon Skeet's pragmatic short answer, but I'll flesh it out a little more)...
The specs (http://www.biblioscape.com/rtf15_spec.htm#Heading6) state that the header will contain the encoding directly after
{\rtf:
According to Wikipedia the "ANSI character set has no well-defined meaning"
For the default ANSI you have the choice of these partially incompatible encodings:
Using WordPad on Windows 10 to save a file with a euro sign (0x80 in Windows-1252 but 0xA4 in ISO-8859-15; ISO-8859-1 has no euro sign) revealed the following:
The header stated the exact encoding after
\ansi
And the encoding was not directly used, instead it was wrapped in RTF encoding:
\'80
according to the specs:
I guess the best thing to do is to read the header: if the file starts with
{\rtf1\ansi\ansicpg1252
then go for Windows-1252.
But to make things more complicated, the specs also state that there can be mixed encodings... search for '\upr'...
I guess there is no definitive answer, the easiest way to go in your case may be to search (in the un-decoded raw byte array) for all the variations of the encoded copyright signs that you may encounter in your source base.
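A rough sketch of that raw-byte search follows. Everything in it is an assumption for illustration: the class name, the "Copyright " prefix taken from the question, and the list of candidate encodings of "©" (0xA9 in Windows-1252 and ISO-8859-1, 0xC2 0xA9 in UTF-8). Other encodings such as UTF-16 would need further entries and a differently encoded prefix.

```csharp
using System.Text;

// Hypothetical helper; searches undecoded bytes for a copyright header.
public static class RawHeaderCheck
{
    // Candidate byte sequences for the copyright sign (assumed list).
    private static readonly byte[][] CopyrightSigns =
    {
        new byte[] { 0xC2, 0xA9 }, // UTF-8
        new byte[] { 0xA9 },       // Windows-1252 and ISO-8859-1
    };

    // True if "Copyright " followed by any known encoding of the sign occurs.
    public static bool ContainsCopyrightLine(byte[] data)
    {
        byte[] prefix = Encoding.ASCII.GetBytes("Copyright ");
        foreach (byte[] sign in CopyrightSigns)
        {
            byte[] needle = new byte[prefix.Length + sign.Length];
            prefix.CopyTo(needle, 0);
            sign.CopyTo(needle, prefix.Length);
            if (IndexOf(data, needle) >= 0)
            {
                return true;
            }
        }
        return false;
    }

    // Naive byte-sequence search; adequate for small source files.
    private static int IndexOf(byte[] haystack, byte[] needle)
    {
        for (int i = 0; i + needle.Length <= haystack.Length; i++)
        {
            int j = 0;
            while (j < needle.Length && haystack[i + j] == needle[j])
            {
                j++;
            }
            if (j == needle.Length)
            {
                return i;
            }
        }
        return -1;
    }
}
```

The bytes would come from File.ReadAllBytes(pendingChange.LocalItem), so no decoding step is involved and no encoding has to be guessed up front.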
In my case I finally decided to cut a few corners as well, but to add a small amount of defensive coding. All files I have seen so far were Windows-1252, so I optimised for that common case.