Java: Multi-platform string encoding problem
I have an odd situation that I haven't figured out how to handle. We have developers working on multiple platforms; the primary platform is Linux, but we also have people working on OS X and Windows.
We have a set of tests that all build and run fine on Linux. But when we try to run them on OS X, they fail. The failing assert is testing that two strings are equal, but there is one character that doesn't seem to be the same character in the Mac environment. I am fairly certain that this is simply because the file is encoded in a certain way and the expected string value, which is hard-coded, is encoded differently. I was able to fix some other encoding issues by setting the JVM file.encoding through MAVEN_OPTS, but I have been stumped by this problem up to this point.
The structure looks something like this:
some.xml --> xslt --> object
assertEquals("expected value", object.valueToTest());
Any insights on how to rectify this mismatch? Or even why it would be occurring in the first place?
The header on the xml file says it is encoded in UTF-8, but it is possible that the file might be encoded differently on the file system. Is there a way for me to check what the actual encoding is?
Mostly, what Pete Kirkham said.
Don't set file.encoding through MAVEN_OPTS; it is not supported and may have unintended side-effects.
The correct way to specify source file encoding is in the pom.xml files.
This ensures that the compiler will decode the source files consistently on all platforms and is equivalent to using
javac -encoding X ...
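A minimal sketch of that pom.xml configuration, using the standard Maven property that the compiler plugin picks up:

```xml
<!-- In pom.xml: makes maven-compiler-plugin pass -encoding UTF-8 to javac,
     so source files are decoded the same way on Linux, OS X, and Windows -->
<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
```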
More on encoding in source files here.
The usual reason it occurs is that someone is using one of the old String <-> bytes conversions that doesn't take a parameter to specify the encoding.
It's not impossible that it's an encoding issue in the source file (though I've only moved between Windows and Linux, so I've never seen it), but you should be using a Unicode escape for any code point above U+007F.
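A short sketch of both points: the no-argument String/byte conversion depends on the platform's default charset, while the charset-taking overload and a \uXXXX escape behave identically everywhere. The class name is just for illustration:

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        // "café" written with a Unicode escape: safe in any source-file encoding
        String s = "caf\u00e9";

        // Platform-dependent: uses file.encoding, so the bytes can differ per OS
        byte[] platformBytes = s.getBytes();

        // Explicit charset: identical bytes on every platform
        byte[] utf8Bytes = s.getBytes(StandardCharsets.UTF_8);

        // é is a single char but two bytes in UTF-8
        System.out.println(s.length());        // prints 4
        System.out.println(utf8Bytes.length);  // prints 5
    }
}
```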
If the other platform is reading the character using a different encoding, you might see a failure like this.
How is the character represented in the file? You might try escaping any Unicode within string constants using \uXXXX notation.
This page also provides another clue as to why this may not be working. The default encoding on the Mac is "MacRoman", which is not a subset of UTF-8. Therefore, as you suspected, the character is likely being interpreted differently.
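This misinterpretation is easy to reproduce: decode UTF-8 bytes with the MacRoman charset and the resulting string no longer matches the original, which is exactly how an assertEquals fails. The x-MacRoman charset ships with most full JDKs, but its availability on a given JVM is an assumption here, so the sketch guards for it:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        // é encoded as UTF-8 is the two bytes 0xC3 0xA9
        byte[] utf8 = "\u00e9".getBytes(StandardCharsets.UTF_8);

        if (Charset.isSupported("x-MacRoman")) {
            // Decoding UTF-8 bytes as MacRoman maps each byte to a different
            // character, so the round-tripped string no longer equals the original
            String mangled = new String(utf8, Charset.forName("x-MacRoman"));
            System.out.println("\u00e9".equals(mangled)); // prints false
        }
    }
}
```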
If the XML file starts with
<?xml ... encoding="UTF-8"?>
then you can be fairly confident that it's encoded as UTF-8 on the file system. Otherwise, open it in an editor that lets you see what the raw bytes are, e.g. in Emacs, M-x find-file-literally.
Alternatively, your Java source code might have a funny byte in a string literal that is represented differently in different encodings. I think the compiler reads source code using the default platform encoding. To get around this portability issue, you can code any non-ASCII character using \uXXXX notation. This is fine for native English speakers but can be a bit tiresome for everyone else!
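If you'd rather stay in Java than switch to Emacs, a hex dump of the first bytes answers the "what is the actual encoding?" question directly. The file name is taken from the question's example; substitute your own:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class InspectBytes {
    public static void main(String[] args) throws IOException {
        // Path from the question's example pipeline; point it at your file
        Path xml = Path.of("some.xml");
        if (!Files.exists(xml)) {
            System.out.println("some.xml not found");
            return;
        }
        byte[] bytes = Files.readAllBytes(xml);
        // Dump the first 32 bytes as hex; & 0xFF avoids sign extension
        for (int i = 0; i < Math.min(bytes.length, 32); i++) {
            System.out.printf("%02X ", bytes[i] & 0xFF);
        }
        System.out.println();
        // A UTF-8 é shows up as C3 A9; in MacRoman it would be a single 8E byte
    }
}
```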
EDIT: Off topic, but this reminded me of a curious file I found at work in a test case. It was an XML file encoded as ASCII/UTF-8, but the encoding tag said "UTF-16". It would look normal in simple editors like Notepad that don't take account of the XML encoding directive, but would look bizarre in smart editors that read the file as UTF-16.