Unicode-to-string conversion leaves a leading byte order mark
In my .NET 3.5 C# application I'm converting a Unicode-encoded byte array to a string.
The byte array is as follows:
{255, 254, 85, 0, 83, 0, 69, 0}
Using Encoding.Unicode.GetString(var), I convert the byte array to a string, which returns:
{65279 '', 85 'U', 83 'S', 69 'E'}
The leading character, 65279, seems to be a Zero Width No-Break Space, which is used as a byte order mark in Unicode encodings, and its appearance is causing problems in the rest of my application.
Currently the workaround I'm using is var.Trim(new char[]{'\uFEFF','\u200B'});, which works just fine.
But the question really is, shouldn't GetString take care of removing the byte order mark? Or am I doing something wrong when converting the byte array?
Comments (1)
No, GetString() should not be removing the BOM. The BOM is actually a perfectly valid Unicode character (selected specifically because if it appears in the middle of a Unicode file, e.g. if the file was the result of concatenating multiple Unicode files, it won't affect the rendered text) and must be decoded along with all the other characters in the byte[].
The only code that ought to be interpreting and filtering out the BOM is code that understands the data is coming from some persistent storage, e.g. StreamReader. And note that it will do that only if you don't disable that behavior.
All that GetString() should do is interpret the actual encoded characters and convert them to the text they represent (of course, in C# strings are stored internally as UTF-16, so there's very little to that conversion when the original data is already in UTF-16 :) ).
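If you do want the BOM gone, here is a rough sketch of two ways to do it with the byte array from the question. The class name BomDemo and the helpers DecodeWithoutBom and DecodeWithReader are made up for illustration and are not part of the framework; only Encoding.GetPreamble, Encoding.GetString, MemoryStream and StreamReader are real APIs. The first helper assumes the BOM, if present, matches the preamble of the encoding you pass in.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomDemo
{
    // Hypothetical helper: skips the encoding's preamble (BOM) if the data
    // starts with it, then decodes the remaining bytes.
    public static string DecodeWithoutBom(byte[] data, Encoding encoding)
    {
        byte[] preamble = encoding.GetPreamble();
        int offset = 0;

        if (preamble.Length > 0 && data.Length >= preamble.Length)
        {
            bool matches = true;
            for (int i = 0; i < preamble.Length; i++)
            {
                if (data[i] != preamble[i])
                {
                    matches = false;
                    break;
                }
            }
            if (matches)
            {
                offset = preamble.Length;
            }
        }

        return encoding.GetString(data, offset, data.Length - offset);
    }

    // Hypothetical helper: treats the bytes as a stream and lets
    // StreamReader consume the BOM, as the answer suggests.
    public static string DecodeWithReader(byte[] data, Encoding encoding)
    {
        using (MemoryStream stream = new MemoryStream(data))
        using (StreamReader reader = new StreamReader(stream, encoding, true))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        byte[] data = { 255, 254, 85, 0, 83, 0, 69, 0 };

        // GetString decodes every character, including U+FEFF.
        string raw = Encoding.Unicode.GetString(data);
        Console.WriteLine(raw.Length);                                // 4
        Console.WriteLine((int)raw[0]);                               // 65279

        Console.WriteLine(DecodeWithoutBom(data, Encoding.Unicode));  // USE
        Console.WriteLine(DecodeWithReader(data, Encoding.Unicode));  // USE
    }
}
```

Unlike the Trim workaround, both helpers only touch a BOM at the very start of the data, so a U+FEFF or U+200B at the end of the text is left alone.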