Wrong default system encoding detected in a Java application

Posted on 2025-01-14 18:20:04

What is the problem?

I've noticed a strange problem: Java reports different default file encodings when running on the same machine and OS (Windows 10). If I run my Gradle application from a console, Charset.defaultCharset() returns Windows-1250. When I run it from IntelliJ (also as a Gradle app), it returns Windows-1252.

Even stranger, when I run it on a different computer with Windows 11, the results are reversed: Windows-1252 when running from a console and Windows-1250 in IntelliJ.

As far as I know, the correct system encoding for my OS (the Polish version of Windows 10/11) should always be Windows-1250.

I use AdoptOpenJDK 16, Gradle 7.0 and IJ 2021.3.2.
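To compare the two launch environments, a small diagnostic like the one below could be run once from the console and once from IntelliJ. It is only a sketch: the class name EncodingReport is made up for illustration, sun.jnu.encoding is an internal, unsupported property, and native.encoding only exists on JDK 17 and later, so on JDK 16 it is expected to print null.

```java
import java.nio.charset.Charset;

public class EncodingReport {
    // Collects the encoding-related values this JVM exposes.
    // Properties that are not set on the running JDK show up as "null".
    static String report() {
        StringBuilder sb = new StringBuilder();
        sb.append("Charset.defaultCharset() = ")
          .append(Charset.defaultCharset())
          .append('\n');
        String[] keys = {"file.encoding", "sun.jnu.encoding", "native.encoding"};
        for (String key : keys) {
            sb.append(key).append(" = ").append(System.getProperty(key)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(report());
    }
}
```

If the two runs differ in file.encoding, something in the launch path (for example a -Dfile.encoding JVM argument) is probably overriding the value the JVM would otherwise pick up from the OS.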

Why is it important in my case?

My Java application executes external Python scripts and communicates with the Python processes created by ProcessBuilder via Process.getInputStream()/getOutputStream(). When I send data containing non-ASCII characters through that stream, the characters are replaced with ? and read as such on the Python side. For example, on the Java side I send a line like this:

try (var inputWriter = new BufferedWriter(new OutputStreamWriter(scriptProcess.getOutputStream()))) {
    inputWriter.write("Właściciel");
}

On the Python side I receive the data like this:

from sys import stdin

# Collect every line sent by the Java process, without trailing whitespace
inputBuffer = []
for line in stdin:
    inputBuffer.append(line.rstrip())

When I print inputBuffer or write it to a file, it shows W?a?ciciel. It's worth noting that this behavior doesn't depend on the encoding of the input string itself: "Właściciel" can be read from a UTF-8, Windows-1250, or Windows-1252 file and the problem remains the same.
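One plausible explanation for the question marks is that the OutputStreamWriter without an explicit charset encodes with the JVM default, and Windows-1252 has no ł or ś, so the encoder substitutes ?. The sketch below reproduces that substitution in isolation; the class name LossyEncodingDemo is hypothetical, and the string is written with Unicode escapes so the example does not depend on the source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LossyEncodingDemo {
    // Encodes the text with the given charset and decodes it again.
    // Characters the charset cannot represent come back as '?'.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String word = "W\u0142a\u015Bciciel"; // "Właściciel"
        // windows-1252 cannot encode U+0142 or U+015B: prints W?a?ciciel
        System.out.println(roundTrip(word, Charset.forName("windows-1252")));
        // UTF-8 can encode every character, so nothing is lost here
        System.out.println(roundTrip(word, StandardCharsets.UTF_8).equals(word));
    }
}
```

This would also fit the observation that the encoding of the file the string was read from makes no difference: once the text is a Java String, only the charset used when writing to the process stream matters.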

If I force the correct encoding by passing it as a parameter to the Writer:

var writer = new OutputStreamWriter(scriptProcess.getOutputStream(), "Windows-1250");

...then it works and the question marks disappear. But I feel that hardcoding the "system encoding" is not a good solution, because it will break if someone runs my app on Windows with other regional settings (e.g. with the English language, where the default encoding is UTF-8).

So my question is: is there another way to determine the effective system encoding, or to set up communication between the processes that is independent of the system encoding and regional settings?
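One direction that might avoid the system encoding altogether is to fix the pipe encoding on both ends instead of detecting it. The sketch below writes UTF-8 on the Java side and sets the PYTHONIOENCODING environment variable so the Python child decodes stdin as UTF-8. The class name Utf8Pipe, the helper names, and the script name script.py are assumptions for illustration, not part of my actual application.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class Utf8Pipe {
    // Builds a ProcessBuilder whose Python child is asked to use UTF-8
    // for stdin/stdout/stderr, regardless of the Windows code page.
    static ProcessBuilder utf8Python(String... command) {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.environment().put("PYTHONIOENCODING", "utf-8");
        return pb;
    }

    // Writes the text as UTF-8 bytes and closes the stream, which
    // signals end-of-input to the child process.
    static void send(OutputStream out, String text) {
        try (Writer w = new BufferedWriter(
                new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        // "python" and "script.py" are placeholders for the real command.
        Process p = utf8Python("python", "script.py")
                .redirectOutput(ProcessBuilder.Redirect.INHERIT)
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        send(p.getOutputStream(), "W\u0142a\u015Bciciel\n");
        p.waitFor();
    }
}
```

With this setup neither side consults Charset.defaultCharset(), so the result should not depend on whether the app is started from a console or from IntelliJ. Data read back from the child via getInputStream() would need the same explicit UTF-8 decoding.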
