Error reading a UTF-8 file in Java

Posted 2024-09-11 04:00:23

I am trying to read some sentences from a file that contains Unicode characters. It does print out a string, but for some reason the Unicode characters come out garbled.

This is the code I have:

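import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public static String readSentence(String resourceName) {
    InputStream refStream = ClassLoader.getSystemResourceAsStream(resourceName);
    if (refStream == null) {
        // getSystemResourceAsStream returns null when the resource is missing
        throw new RuntimeException("Resource not found: " + resourceName);
    }
    // try-with-resources closes the reader and the underlying stream
    try (BufferedReader br = new BufferedReader(
            new InputStreamReader(refStream, StandardCharsets.UTF_8))) {
        String sentence = br.readLine();
        // readLine() returns null if the resource is empty
        return sentence == null ? "" : sentence.trim();
    } catch (IOException e) {
        // keep the original exception as the cause
        throw new RuntimeException("Cannot read sentence: " + resourceName, e);
    }
}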

Comments (3)

嘦怹 2024-09-18 04:00:23

The problem is probably in the way the string is being output.

I suggest you confirm that you are reading the Unicode characters correctly by doing something like this:

// Note: a char is a UTF-16 code unit; characters outside the BMP show up as two values.
for (char ch : sentence.toCharArray()) {
    System.err.println("char '" + ch + "' is Unicode code point " + ((int) ch));
}

Then check whether the Unicode code points are correct for the characters that are being garbled. If they are correct, the problem is on the output side; if not, it is on the input side.
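As a variation on that check, here is a minimal sketch that prints each code point in the usual `U+XXXX` notation, which is easier to compare against a Unicode chart. The class name `CodePointDump`, the method `describe`, and the sample string are made up for illustration; the sample uses a `\u` escape so that the result does not depend on the encoding of the source file itself.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointDump {
    // Returns each code point of the string formatted as "U+XXXX".
    // String.codePoints() combines surrogate pairs into one code point.
    static List<String> describe(String s) {
        List<String> out = new ArrayList<>();
        s.codePoints().forEach(cp -> out.add(String.format("U+%04X", cp)));
        return out;
    }

    public static void main(String[] args) {
        String sentence = "h\u00e9llo"; // hypothetical sample: "h", e-acute, "llo"
        // Printing to stderr as ASCII-only text avoids any console encoding issues.
        System.err.println(describe(sentence));
    }
}
```

If a character that should be a single code point such as U+00E9 shows up as two values like U+00C3 U+00A9, that usually suggests UTF-8 bytes were decoded with a single-byte charset somewhere on the input side.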

小ぇ时光︴ 2024-09-18 04:00:23

First, you could create the InputStreamReader as

new InputStreamReader(refStream, "UTF-8")

Also, you should verify that the resource really contains UTF-8 content.
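One possible way to do that verification is to decode the raw bytes with a strict decoder that reports malformed input instead of silently substituting replacement characters. This is only a sketch: the class name `Utf8Check` and the method `isValidUtf8` are hypothetical, and you would need to read the resource's bytes yourself before calling it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Returns true if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown for invalid or truncated byte sequences.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xC3, (byte) 0xA9}; // e-acute encoded as UTF-8
        byte[] latin1 = {(byte) 0xE9};            // e-acute encoded as ISO-8859-1
        System.err.println(isValidUtf8(utf8));
        System.err.println(isValidUtf8(latin1));
    }
}
```

Keep in mind that passing this check does not prove the file was meant to be UTF-8 (pure ASCII passes too), but failing it is a strong hint that the file was saved in another encoding.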

烟织青萝梦 2024-09-18 04:00:23

One of the most annoying reasons could be... your IDE settings.

If your IDE's default console encoding is something like latin1, you can struggle for a long time with different variations of the Java code, and nothing will help until you set the IDE's console encoding correctly.
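To see what the Java side is doing, a small sketch like the following prints the JVM's default encoding and writes through a `PrintStream` that is explicitly set to UTF-8. The class name `Utf8Console` and the helper `utf8` are made up for illustration. This only controls the bytes Java emits; the IDE console or terminal still has to be configured to interpret them as UTF-8.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8Console {
    // Wraps a stream so text is encoded as UTF-8 regardless of the platform default.
    static PrintStream utf8(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is required of every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Shows the default charset the JVM picked up from its environment.
        System.out.println("file.encoding = " + System.getProperty("file.encoding"));
        PrintStream out = utf8(System.out);
        out.println("h\u00e9llo"); // hypothetical sample containing e-acute
    }
}
```

If the sample prints correctly through the UTF-8 stream but not through plain `System.out`, the mismatch is between the JVM default encoding and the console, not in the file-reading code.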
