Strange characters render correctly in Notepad but appear as control characters elsewhere
I have a .csv list of businesses. The file has some strange characters in it. For example, in the field Stocktonon-Tees, the first hyphen, between Stockton and on, seems to be a character with the value 6 rather than a hyphen, which has the value 45. Stack Overflow will probably sanitize this so you can't see it, so here is a pastebin:
http://pastebin.com/NuyyaQy9
Can anyone explain why this could be? Is it some encoding issue that I have missed? Or a corruption in the dataset?
Comments (1)
Yes, it's almost certainly an encoding issue. A file just consists of binary data - it's how you interpret that binary data that matters. It sounds like Notepad is guessing at the originally-intended encoding, but whatever else you're using isn't.
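To make the point about interpretation concrete, here is a minimal Python sketch. The byte 0x96 is a hypothetical stand-in chosen for illustration and is not taken from the asker's file. It shows the same bytes turning into a dash under one encoding and into a control character under another.

```python
import unicodedata

# Hypothetical bytes, NOT from the asker's file: 0x96 sits where the hyphen should be.
raw = b"Stockton\x96on-Tees"

# Windows-1252 maps 0x96 to U+2013 (EN DASH), which displays as a dash.
as_cp1252 = raw.decode("cp1252")

# ISO-8859-1 (Latin-1) maps 0x96 to U+0096, which is a C1 control character.
as_latin1 = raw.decode("latin-1")

for label, text in (("cp1252", as_cp1252), ("latin-1", as_latin1)):
    ch = text[8]  # the character between "Stockton" and "on"
    print(label, hex(ord(ch)), unicodedata.category(ch))
```

The bytes never change; only the decoding table does. The category `Pd` means dash punctuation and `Cc` means control character.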
Unfortunately you haven't said anything about what software is trying to read the file or what wrote it in the first place - but you should look at what encoding Notepad thinks it is, and work from there.
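If you want to check the raw bytes yourself rather than rely on what an editor displays, something like the following Python sketch may help. The helper names `sniff_bom` and `suspect_bytes` are made up for this example, and the sample bytes are an assumption based on the value 6 described in the question. It only looks for UTF-8 and UTF-16 byte-order marks, so it is a starting point and not a full encoding detector.

```python
import codecs

def sniff_bom(data: bytes):
    """Return a codec name implied by a leading byte-order mark, or None."""
    boms = [
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    return None

def suspect_bytes(data: bytes):
    """List (offset, value) for bytes that are neither printable ASCII nor tab/CR/LF."""
    allowed_controls = {0x09, 0x0A, 0x0D}
    return [
        (offset, value)
        for offset, value in enumerate(data)
        if not (0x20 <= value <= 0x7E or value in allowed_controls)
    ]

# Hypothetical sample: byte value 6 between "Stockton" and "on", as the question describes.
sample = b"Stockton\x06on-Tees\r\n"

print(sniff_bom(sample))      # no byte-order mark in this sample
print(suspect_bytes(sample))  # offsets and values of the odd bytes
```

For a real file, read it in binary mode, for example `open(path, "rb").read()` with your own `path`, and pass the result to both helpers.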
If it's your code that wrote the file out, and you get to decide the encoding, I'd recommend UTF-8 as a good general-purpose, platform-portable encoding.
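As a rough illustration of choosing the encoding explicitly, here is a small Python sketch that writes a CSV as UTF-8 and reads it back. The rows and the file name are invented for the example, and Python is only an assumption since the question does not say what wrote the file.

```python
import csv
import os
import tempfile

# Made-up rows for illustration; "\u2013" is an en dash.
rows = [["name", "town"], ["Example Ltd", "Stockton\u2013on-Tees"]]

# Write into a temporary directory so the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "businesses.csv")

# Name the encoding explicitly instead of relying on the platform default.
with open(path, "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerows(rows)

# Read it back with the same encoding; the text should survive unchanged.
with open(path, "r", encoding="utf-8", newline="") as f:
    read_back = list(csv.reader(f))

print(read_back == rows)
```

The key point is that the writer and the reader agree on the encoding. Some Windows tools recognise UTF-8 more reliably when the file starts with a byte-order mark; in Python that variant is available as the `utf-8-sig` codec.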