Should source code be saved in UTF-8 format?
How important is it to save your source code in UTF-8 format?
Eclipse on Windows uses the CP1252 character encoding by default. With CP1252, characters can be saved as bytes that are not valid UTF-8, and I have seen this happen when text is copied and pasted from a Word document into a comment.
The reason I ask is that, out of habit, I set up the Maven encoding as UTF-8, and recently it has caught a few unmappable-character errors.
(update) Please add your reasons for doing so. Are there any common gotchas I should know about?
(update) "What is your goal?" To find the best practice, so that when I am asked why we should use UTF-8 I have a good answer. Right now I don't.
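The unmappable-character errors described in the question typically come from bytes that are valid in CP1252 but malformed in UTF-8, such as the curly quotes Word inserts. Below is a minimal sketch of that failure mode. The class name `StrictUtf8Check` and the sample bytes are assumptions made for illustration (0x93 and 0x94 are the CP1252 curly double quotes); this is not how Maven or javac performs the check.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reports whether a byte sequence is well-formed UTF-8.
class StrictUtf8Check {

    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A comment pasted from Word and saved as CP1252: curly quotes around "hi".
        byte[] cp1252Comment = {(byte) 0x93, 'h', 'i', (byte) 0x94};
        // The same text with the curly quotes encoded as UTF-8.
        byte[] utf8Comment = "\u201chi\u201d".getBytes(StandardCharsets.UTF_8);

        System.out.println("CP1252 bytes valid as UTF-8: " + isValidUtf8(cp1252Comment));
        System.out.println("UTF-8 bytes valid as UTF-8:  " + isValidUtf8(utf8Comment));
    }
}
```

A check like this could be run over source files before switching a project's declared encoding, to find the files that need converting first.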
What is your goal? Balance your needs against the pros and cons of this choice.
UTF-8 Pros
[…] \uHHHH escaping
UTF-8 Cons
[…] \uHHHH […] increases risk of character corruption
ASCII Pros
ASCII Cons
Note: ASCII is 7-bit, not "extended" and not to be confused with Windows-1252, ISO 8859-1, or anything else.
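To make the trade-off between escaping and 7-bit ASCII concrete, here is a small sketch. The class `EscapeDemo` and its helper are hypothetical names. The listing itself is written with an escape so that it stays pure ASCII; in a source file saved and compiled as UTF-8, the character could be typed directly instead.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class EscapeDemo {

    // True when every character of the text fits in 7-bit US-ASCII.
    static boolean isPureAscii(String text) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // "cafe" with an acute accent on the last letter, written as an escape.
        String word = "caf\u00e9";

        System.out.println(word.length());                                      // 4 characters
        System.out.println(word.getBytes(StandardCharsets.UTF_8).length);       // 5 bytes
        System.out.println(word.getBytes(StandardCharsets.ISO_8859_1).length);  // 4 bytes
        System.out.println(isPureAscii(word));                                  // false
        System.out.println(isPureAscii("cafe"));                                // true
    }
}
```

The byte counts differ because the accented letter takes two bytes in UTF-8 and one in ISO 8859-1, and it has no representation at all in 7-bit ASCII.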
It is important that you are at least consistent with the encoding you use, to avoid red herrings. So not X here, Y there, and Z elsewhere. Save source code in encoding X. Set code input to encoding X. Set code output to encoding X. Set character-based FTP transfer to encoding X. Et cetera.
Nowadays UTF-8 is a good choice, as it covers every character the human world is aware of and is supported pretty much everywhere. So yes, I would set the workspace encoding to it as well. That is how I use it, too.
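The consistency point can be sketched in a few lines: text survives when the writer and the reader name the same charset, and it is corrupted when they disagree. The class and method names below are hypothetical, and ISO 8859-1 stands in for "some other encoding" only because every Java platform is guaranteed to ship it.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class, for illustration only.
class ConsistentEncoding {

    // Encode with one charset and decode with another, as happens when
    // two steps in a pipeline disagree about the encoding.
    static String recode(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    // Write and read a temporary file, naming the charset explicitly at both ends.
    static String roundTripThroughFile(String text, Charset charset) throws IOException {
        Path tmp = Files.createTempFile("encoding-demo", ".txt");
        try {
            Files.write(tmp, text.getBytes(charset));
            return new String(Files.readAllBytes(tmp), charset);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        // One non-ASCII letter, written as an escape to keep this listing pure ASCII.
        String text = "na\u00efve";

        String consistent = recode(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        String mismatched = recode(text, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        System.out.println(text.equals(consistent));  // true
        System.out.println(text.equals(mismatched));  // false
        System.out.println(text.equals(roundTripThroughFile(text, StandardCharsets.UTF_8)));  // true
    }
}
```

Passing the charset explicitly at every read and write, rather than relying on the platform default, is one way to keep "encoding X" the same across machines.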
Eclipse's default of using the platform default encoding is a poor decision, IMHO. I found it necessary to change the default to UTF-8 shortly after installing it, because some of my existing source files used it (probably from snippets copied and pasted from web pages).
The Java language and API specs require UTF-8 support, so you're definitely okay as far as the standard tools go, and it has been a long time since I've seen a decent editor that did not support UTF-8.
Even in projects that use JNI, your C sources will normally be in US-ASCII, which is a subset of UTF-8, so having both open in the same IDE will not be a problem.
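Both claims can be checked from Java itself. The sketch below uses a hypothetical class name and a sample line of C chosen for illustration: it compares the US-ASCII and UTF-8 bytes of the same text and asks the runtime whether UTF-8 is available.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, for illustration only.
class AsciiSubsetDemo {

    // True when the text encodes to identical bytes in US-ASCII and UTF-8.
    static boolean sameBytesInAsciiAndUtf8(String text) {
        return Arrays.equals(
                text.getBytes(StandardCharsets.US_ASCII),
                text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String cSourceLine = "int main(void) { return 0; }";

        // Pure ASCII text is byte-for-byte identical in both encodings.
        System.out.println(sameBytesInAsciiAndUtf8(cSourceLine));  // true
        // Every Java platform is required to support UTF-8.
        System.out.println(Charset.isSupported("UTF-8"));          // true
    }
}
```

This is why an ASCII-only C file opens correctly in a workspace configured for UTF-8: there is nothing to convert.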
Yes, unless your compiler/interpreter is not able to work with UTF-8 files, it is definitely the way to go.
I don't think there's really a straight yes or no answer to this question. I would say that the following guidelines should be used to pick an encoding format, in order of priority listed (highest to lowest):
1) Pick an encoding your tool chain supports. This is a lot easier than it used to be. Even in recent memory a lot of compilers and languages basically only supported ASCII, which more or less forced developers into coding in Western European languages. These days, many of the newer languages support other encodings, and almost all decent editors and IDEs support a tremendously long list of encodings. Still... there are just enough holdouts that you need to double check before you settle on an encoding.
2) Pick an encoding that supports as many of the alphabets you wish to use as possible. I place this as a secondary priority because, frankly, if your tools don't support it, it doesn't really matter whether you like the encoding better or not.
UTF-8 is an excellent choice in many circumstances in today's world. It's an ugly, inelegant format, but it solves a whole host of problems (namely dealing with legacy code) that break other encodings, and it seems to be becoming more and more the de facto standard among character encodings. It supports every major alphabet, darn near every editor on the planet supports it now, and a whole host of languages and compilers support it, too. But as I mentioned above, there are just enough legacy holdouts that you need to double-check your tool chain from end to end before you settle on it definitively.
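One link in the tool chain that is easy to check is the JVM's own default charset, which is what encoding-unaware APIs fall back on. The sketch below uses a hypothetical class name and only inspects the runtime; it says nothing about what the editor or the build is configured to use. For the compiler, javac accepts an `-encoding` option to state the source encoding explicitly.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class ToolchainEncodingCheck {

    // True when the JVM's default charset is UTF-8.
    static boolean defaultIsUtf8() {
        return StandardCharsets.UTF_8.equals(Charset.defaultCharset());
    }

    public static void main(String[] args) {
        // The charset used when no charset is named explicitly.
        System.out.println("Default charset: " + Charset.defaultCharset());
        // The system property commonly used to report or override that default.
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
        System.out.println("Default is UTF-8: " + defaultIsUtf8());
    }
}
```

Running this on each developer machine and on the build server is a quick way to spot a holdout before the choice of encoding is finalized.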