StreamWriter and UTF-8 Byte Order Marks
I'm having an issue with StreamWriter and byte order marks. The documentation seems to state that the Encoding.UTF8 encoding has byte order marks enabled, but when files are being written, some have the marks while others don't.
I'm creating the stream writer in the following way:
this.Writer = new StreamWriter(this.Stream, System.Text.Encoding.UTF8);
Any ideas on what could be happening would be appreciated.
As someone has already pointed out, calling the constructor without the encoding argument does the trick.
However, if you want to be explicit, construct the encoding yourself.
To disable the BOM, the key is to construct the writer with new UTF8Encoding(false) instead of just Encoding.UTF8. This is the same as calling StreamWriter without the encoding argument; internally it does the same thing.
To enable the BOM, use new UTF8Encoding(true) instead.
Update: Since Windows 10 v1903, when saving as UTF-8 in notepad.exe, the BOM is an opt-in feature.
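The difference between these encodings can be checked without touching the file system. The following is a minimal sketch, not code from the original answer: the class name BomDemo and the helpers WriteWith and HasUtf8Bom are made up for illustration, and a MemoryStream stands in for the real stream.

```csharp
using System;
using System.IO;
using System.Text;

static class BomDemo
{
    // Hypothetical helper: writes a short string through a StreamWriter and
    // returns the raw bytes. Pass null to use the StreamWriter(Stream)
    // constructor, which takes no encoding argument.
    public static byte[] WriteWith(Encoding encoding)
    {
        var stream = new MemoryStream();
        using (var writer = encoding == null
            ? new StreamWriter(stream)
            : new StreamWriter(stream, encoding))
        {
            writer.Write("hi");
        }
        // MemoryStream.ToArray still works after the stream has been closed.
        return stream.ToArray();
    }

    // Hypothetical helper: true if the buffer starts with the UTF-8 BOM (EF BB BF).
    public static bool HasUtf8Bom(byte[] bytes)
    {
        return bytes.Length >= 3
            && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF;
    }

    static void Main()
    {
        Console.WriteLine(HasUtf8Bom(WriteWith(new UTF8Encoding(false)))); // False
        Console.WriteLine(HasUtf8Bom(WriteWith(new UTF8Encoding(true))));  // True
        Console.WriteLine(HasUtf8Bom(WriteWith(Encoding.UTF8)));           // True
        Console.WriteLine(HasUtf8Bom(WriteWith(null)));                    // False
    }
}
```

Only the encoding instance changes between the four calls, so any difference in the output comes from the encoding's BOM setting.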
The issue is that you are using the static UTF8 property on the Encoding class.
When the GetPreamble method is called on the Encoding instance returned by the UTF8 property, it returns the byte order mark (a three-byte array), which is written to the stream before any other content (assuming a new stream).
You can avoid this by creating the instance of the UTF8Encoding class yourself, using its parameterless constructor. An instance created this way does not provide a byte order mark.
This means that the call to GetPreamble will return an empty array, and therefore no BOM will be written to the underlying stream.
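To see what GetPreamble returns in each case, a small sketch along these lines may help. The class name PreambleDemo and the helper PreambleHex are illustrative assumptions, not part of the original answer.

```csharp
using System;
using System.Text;

static class PreambleDemo
{
    // Hypothetical helper: the encoding's preamble as a hex string,
    // or an empty string when the encoding has no preamble.
    public static string PreambleHex(Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetPreamble());
    }

    static void Main()
    {
        // Static property: the instance is configured to emit the BOM.
        Console.WriteLine("[" + PreambleHex(Encoding.UTF8) + "]");      // [EF-BB-BF]
        // Parameterless constructor: the preamble is empty.
        Console.WriteLine("[" + PreambleHex(new UTF8Encoding()) + "]"); // []
    }
}
```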
My answer is based on HelloSam's, which contains all the necessary information.
However, I believe what the OP is asking is how to make sure that the BOM is emitted into the file.
So instead of passing false to the UTF8Encoding constructor, you need to pass true.
To check, write one file with each setting, open the resulting files in a hex editor, and see which one contains the BOM and which doesn't.
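The answer's original code is not preserved here. As a rough sketch of such a comparison, the following writes two files and prints their bytes as hex, so no hex editor is needed. The class name BomFiles, the helper WriteAndDump, and the file names are assumptions made for illustration.

```csharp
using System;
using System.IO;
using System.Text;

static class BomFiles
{
    // Hypothetical helper: writes "abc" to the given path as UTF-8, with or
    // without a BOM, and returns the bytes of the resulting file as hex.
    public static string WriteAndDump(string path, bool emitBom)
    {
        // append: false creates the file or truncates an existing one.
        using (var writer = new StreamWriter(path, false, new UTF8Encoding(emitBom)))
        {
            writer.Write("abc");
        }
        return BitConverter.ToString(File.ReadAllBytes(path));
    }

    static void Main()
    {
        string dir = Path.GetTempPath();
        Console.WriteLine(WriteAndDump(Path.Combine(dir, "with-bom.txt"), true));     // EF-BB-BF-61-62-63
        Console.WriteLine(WriteAndDump(Path.Combine(dir, "without-bom.txt"), false)); // 61-62-63
    }
}
```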
The only time I've seen that constructor not add the UTF-8 BOM is when the stream is not at position 0 when you call it. For example, if something has already been written to the stream before the StreamWriter is created, the BOM isn't written.
As others have said, if you're using the StreamWriter(stream) constructor without specifying the encoding, then you won't see the BOM.
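The position-dependent behavior can be reproduced with a sketch like the one below. It is not the original answer's code: the class name PositionDemo and the helper WriteAfterPrefix are hypothetical, and a MemoryStream is assumed in place of a file.

```csharp
using System;
using System.IO;
using System.Text;

static class PositionDemo
{
    // Hypothetical helper: writes prefixBytes space characters directly to the
    // stream, then writes "hi" through a StreamWriter created with Encoding.UTF8.
    public static string WriteAfterPrefix(int prefixBytes)
    {
        var stream = new MemoryStream();
        for (int i = 0; i < prefixBytes; i++)
        {
            stream.WriteByte(0x20);
        }
        // The writer is created after the stream position has (possibly) moved.
        using (var writer = new StreamWriter(stream, Encoding.UTF8))
        {
            writer.Write("hi");
        }
        return BitConverter.ToString(stream.ToArray());
    }

    static void Main()
    {
        Console.WriteLine(WriteAfterPrefix(0)); // EF-BB-BF-68-69
        Console.WriteLine(WriteAfterPrefix(1)); // 20-68-69
    }
}
```

With the same encoding in both calls, the BOM only appears when the stream was at position 0 at the time the writer was constructed.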
Do you use the same StreamWriter constructor for every file? The documentation describes different BOM behavior depending on the constructor.
I was in a similar situation a while ago. I ended up using the Stream.Write method instead of the StreamWriter, and wrote the result of Encoding.GetPreamble() before writing the result of Encoding.GetBytes(stringToWrite).
I found this answer useful (thanks to @Philipp Grathwohl and @Nik), but in my case I'm using a FileStream to accomplish the task, so my code generates the BOM on the FileStream itself.
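The poster's code is not preserved here. The following is a sketch of the approach described in the last two answers, under assumptions: the preamble is written to a FileStream by hand, followed by the encoded text. The class name ManualBom and the helper WriteWithBom are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;

static class ManualBom
{
    // Hypothetical helper: writes the UTF-8 preamble and then the encoded
    // text directly to a FileStream, without using a StreamWriter.
    public static void WriteWithBom(string path, string text)
    {
        Encoding encoding = new UTF8Encoding(true);
        using (FileStream stream = File.Create(path))
        {
            // GetPreamble returns the BOM bytes for this encoding instance.
            byte[] preamble = encoding.GetPreamble();
            stream.Write(preamble, 0, preamble.Length);

            // GetBytes encodes only the text; it does not include the preamble.
            byte[] body = encoding.GetBytes(text);
            stream.Write(body, 0, body.Length);
        }
    }

    static void Main()
    {
        string path = Path.Combine(Path.GetTempPath(), "manual-bom.txt");
        WriteWithBom(path, "abc");
        Console.WriteLine(BitConverter.ToString(File.ReadAllBytes(path))); // EF-BB-BF-61-62-63
    }
}
```

Because the BOM is written explicitly, the result does not depend on which StreamWriter constructor would have been used.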
It seems that if the file already existed and didn't contain a BOM, then it won't contain a BOM when overwritten. In other words, StreamWriter preserves the BOM (or its absence) when overwriting a file.
Could you please show a situation where it doesn't produce it? The only case I can find where the preamble isn't present is when nothing is ever written to the writer (Jim Mischel seems to have found another one, which is logical and more likely to be your problem; see his answer).
My test code:
After reading the source code of StreamWriter, I found that you need to make sure you are creating a new file; only then will the byte order mark be added to the file.
https://github.com/dotnet/runtime/blob/6ef4b2e7aba70c514d85c2b43eac1616216bea55/src/libraries/System.Private.CoreLib/src/System/IO/StreamWriter.cs#L267
Code in the Flush method
https://github.com/dotnet/runtime/blob/6ef4b2e7aba70c514d85c2b43eac1616216bea55/src/libraries/System.Private.CoreLib/src/System/IO/StreamWriter.cs#L129
Code that sets the value of _haveWrittenPreamble
Using Encoding.Default instead of Encoding.UTF8 solved my problem.
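When a FileStream is not used and no encoding is specified, StreamWriter writes UTF-8 without a BOM. If the text contains only ASCII characters, the bytes are identical to ANSI, so editors report the file as ANSI; once there is a non-English character, the file shows up as UTF-8 without BOM.
Passing Encoding.UTF8 will create and write the file with a BOM. An existing file without a BOM will have a BOM added when it is overwritten. The false argument is the append parameter, meaning the file is overwritten rather than appended to.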
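The difference between overwriting and appending can be checked with a sketch like this. It assumes the StreamWriter(path, append, encoding) constructor is the one meant above; the class name AppendVsOverwrite and the helper Run are made up for illustration.

```csharp
using System;
using System.IO;
using System.Text;

static class AppendVsOverwrite
{
    // Hypothetical helper: starts from an existing one-byte file without a BOM,
    // then writes "b" with Encoding.UTF8 in append or overwrite mode and
    // returns the resulting file content as hex.
    public static string Run(string path, bool append)
    {
        File.WriteAllBytes(path, new byte[] { 0x61 }); // "a", no BOM
        using (var writer = new StreamWriter(path, append, Encoding.UTF8))
        {
            writer.Write("b");
        }
        return BitConverter.ToString(File.ReadAllBytes(path));
    }

    static void Main()
    {
        string path = Path.Combine(Path.GetTempPath(), "append-vs-overwrite.txt");
        Console.WriteLine(Run(path, false)); // EF-BB-BF-62 (overwritten, BOM added)
        Console.WriteLine(Run(path, true));  // 61-62 (appended, no BOM inserted)
    }
}
```

In append mode the writer starts at the end of the existing content rather than at position 0, which is consistent with the position-based behavior described in the earlier answers.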