Java Unicode problems (I think)
I'm new to Java, so bear with me if I say anything stupid! I'm having a few problems, which I think are Unicode-related.
I'm using Scanner to read in tokenised commands from a text file, saved with UTF-8 encoding. Basically I want to first check that the command isn't equal to either "command1" or "command2" (I do something else in these cases), then otherwise read in a character. If the token isn't a single character, I'm going to output an error.
Here's my code:
    public static void main(String[] args) throws FileNotFoundException {
        Scanner scanner = new Scanner(new File(args[0]));
        while (scanner.hasNext()) {
            String command = scanner.next();
            if (command.equals("command1")) {
                System.out.println("command: command1");
                // do something
            } else if (command.equals("command2")) {
                System.out.println("command: command2");
                // do something
            } else {
                if (command.length() == 1) {
                    char c = command.charAt(0);
                    System.out.println("character: " + c);
                    // do something with c
                } else {
                    System.err.println("error (string was " + command
                            + " with length " + command.length() + ")");
                }
            }
        }
    }
And the contents of the text file whose filename I'm passing in args[0] for testing:
command1
x
y
command2
z
└
command1
╒
═
Expected output is:
command: command1
character: x
character: y
command: command2
character: z
character: └
command: command1
character: ╒
character: ═
Actual output is:
command: command1
character: x
character: y
command: command2
character: z
error (string was └ with length 3)
command: command1
error (string was ╒ with length 3)
error (string was ═ with length 3)
As you can see, Java is treating each of the non-ASCII characters as a 3-character string. Strangely, if I copy/paste one of the characters from the terminal output into a System.out.println("└".length()) statement, it correctly prints 1.
Any ideas on where I'm going wrong?
Thanks
Comments (2)
When you open a file in Java without specifying an encoding, the encoding is taken from the file.encoding system property. This is almost never set to what you want (if you're like me, you always want UTF-8). To fix this, explicitly specify the character set when you create your Scanner, for example by passing "UTF-8" as the second argument to the Scanner constructor.
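The snippet that originally followed this answer did not survive on this page, so here is a minimal sketch of what the fix could look like. It uses the Scanner(File, String) constructor from the standard library; the class name Utf8ScannerExample and the helper readTokens are made up for illustration, and try-with-resources assumes Java 7 or later.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical example class, not part of the original question.
public class Utf8ScannerExample {

    // Reads whitespace-separated tokens from the file, decoding it as UTF-8
    // instead of relying on the platform default charset.
    static List<String> readTokens(File file) throws IOException {
        List<String> tokens = new ArrayList<>();
        try (Scanner scanner = new Scanner(file, "UTF-8")) {
            while (scanner.hasNext()) {
                tokens.add(scanner.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java Utf8ScannerExample <file>");
            return;
        }
        for (String token : readTokens(new File(args[0]))) {
            System.out.println(token + " has length " + token.length());
        }
    }
}
```

With the charset given explicitly, a token such as └ should be decoded as a single char regardless of how file.encoding is set on the machine, so the length check in the question's loop should behave the same on every platform.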
I suspect your problem does indeed come from an encoding mismatch.
Have you tried passing a charset to the Scanner's constructor?
Your code works perfectly on my system (Arch Linux 64-bit, Java 6.0.30), where the default locale uses UTF-8. If you are running Windows, your default encoding may be windows-1252 (CP1252), and the Scanner would then use it to decode the file.
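To see why a mismatch would produce exactly the "length 3" from the question, here is a small sketch that encodes each test character as UTF-8 and then decodes the bytes with windows-1252. That windows-1252 was the asker's actual default is an assumption (the thread never confirms it), and the class name EncodingMismatchDemo and the helper misdecode are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original question.
public class EncodingMismatchDemo {

    // Encodes the text as UTF-8, then decodes those bytes with another
    // charset, imitating a reader that guesses the file encoding wrongly.
    static String misdecode(String text, Charset wrongCharset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, wrongCharset);
    }

    public static void main(String[] args) {
        // Assumption: the platform default was a single-byte charset
        // such as windows-1252.
        Charset cp1252 = Charset.forName("windows-1252");
        String[] samples = {"x", "\u2514", "\u2552", "\u2550"}; // x, └, ╒, ═
        for (String sample : samples) {
            String decoded = misdecode(sample, cp1252);
            System.out.println("UTF-8 length " + sample.length()
                    + ", misdecoded length " + decoded.length());
        }
    }
}
```

Each box-drawing character takes three bytes in UTF-8, and a single-byte charset turns every byte into one char, so the misdecoded strings have length 3 while the ASCII "x" stays at length 1. That matches the pattern in the actual output above.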