Japanese character encoding in Java

Posted on 2024-12-08 20:05:40

Here's my problem. I'm using Java Apache POI to read an Excel (.xls or .xlsx) file and display its contents. There are some Japanese characters in the spreadsheet, and all of them come out as "???" in my output. I tried Shift-JIS, UTF-8 and many other encodings, but none of them work.
Here's my encoding code:

public String encoding(String str) throws UnsupportedEncodingException{
  String Encoding = "Shift_JIS";
  return this.changeCharset(str, Encoding);
}
public String changeCharset(String str, String newCharset) throws UnsupportedEncodingException {
  if (str != null) {
    byte[] bs = str.getBytes();
    return new String(bs, newCharset);
  }
  return null;
}

I pass every string I get to encoding(str). But when I print the return value, it's still something like "???" (as shown below) rather than Japanese characters (Hiragana, Katakana or Kanji).

title-jp=???

Can anyone help me with this? Thank you so much.

顾冷 2024-12-15 20:05:40

Your changeCharset method seems strange. String objects in Java are best thought of as not having a specific character set. They use Unicode and so can represent all characters, not only one regional subset. Your method says: turn the string into bytes using my system's default character set (whatever that may be), and then try to interpret those bytes using some other character set (specified in newCharset), which therefore probably won't work. If you convert to bytes in an encoding, you should read those bytes with the same encoding.
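To make the mismatch concrete, here is a minimal sketch. The class name CharsetRoundTrip and the helper reinterpret are hypothetical names introduced for illustration, and US-ASCII stands in for a system default charset that cannot represent Japanese (an assumption; the asker's actual default is unknown). The sample text is written with Unicode escapes so the source file's own encoding does not matter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetRoundTrip {
    // Hypothetical helper: encode with one charset, decode with another.
    // This is the same pattern as changeCharset, with both charsets explicit.
    static String reinterpret(String s, Charset encodeWith, Charset decodeWith) {
        return new String(s.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        // Three kanji (U+65E5 U+672C U+8A9E), kept as escapes to stay ASCII-only.
        String text = "\u65E5\u672C\u8A9E";
        Charset sjis = Charset.forName("Shift_JIS");

        // Same charset in both directions: the text survives unchanged.
        System.out.println(reinterpret(text, sjis, sjis).equals(text)); // true

        // US-ASCII cannot encode kanji, so getBytes substitutes a '?' byte for
        // each character. Decoding those bytes as Shift_JIS cannot recover them.
        System.out.println(reinterpret(text, StandardCharsets.US_ASCII, sjis)); // ???
    }
}
```

If the default charset on the asker's machine behaves like US-ASCII here, the Japanese characters are already lost at the getBytes() step, which would match the "???" output in the question.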

Update:

To convert a String to Shift-JIS (a regional encoding commonly used in Japan) you can say:

byte[] jis = str.getBytes("Shift_JIS");

If you write those bytes into a file, and then open the file in Notepad on a Windows computer where the regional settings are all Japan-centric, Notepad will display it in Japanese (having nothing else to go on, it will assume the text is in the system's local encoding).

However, you could equally well save it as UTF-8 (prefixed with the 3-byte UTF-8 byte order mark, EF BB BF) and Notepad will also display it as Japanese. Shift-JIS is only one way of representing Japanese text as bytes.
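As a rough sketch of the two options just described, the following writes the same string once as Shift-JIS and once as UTF-8 with a byte order mark. The class and method names (JapaneseFileWriter, writeShiftJis, writeUtf8WithBom) and the temp-file names are hypothetical, chosen only for illustration.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class JapaneseFileWriter {
    // The UTF-8 byte order mark: EF BB BF.
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Writes the text as raw Shift_JIS bytes, with no marker of any kind.
    static void writeShiftJis(Path file, String text) throws IOException {
        Files.write(file, text.getBytes(Charset.forName("Shift_JIS")));
    }

    // Writes the BOM first, then the text as UTF-8 bytes.
    static void writeUtf8WithBom(Path file, String text) throws IOException {
        try (OutputStream out = Files.newOutputStream(file)) {
            out.write(UTF8_BOM);
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
    }

    public static void main(String[] args) throws IOException {
        String text = "\u65E5\u672C\u8A9E"; // three kanji, as Unicode escapes
        Path sjisFile = Files.createTempFile("title-jp-sjis", ".txt");
        Path utf8File = Files.createTempFile("title-jp-utf8", ".txt");
        writeShiftJis(sjisFile, text);
        writeUtf8WithBom(utf8File, text);
        System.out.println(sjisFile + ": " + Files.size(sjisFile) + " bytes");
        System.out.println(utf8File + ": " + Files.size(utf8File) + " bytes");
    }
}
```

The two files hold different bytes for the same three characters (two bytes per kanji in Shift-JIS, three per kanji plus the BOM in UTF-8), which is the point of the answer: the String is the same, only its byte representation differs.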

陪我终i 2024-12-15 20:05:40

I suspect you shouldn't be doing this in the first place. If it really is Apache POI's fault, then you'll need to get the original raw bytes from the data, not just use the system default encoding.

On the other hand, I think it's entirely likely that Apache POI has managed to do the right thing, and it's just an output problem. I suggest you dump the original string you've got (removing your encoding method entirely) in terms of its Unicode code points, e.g.

 // Print each UTF-16 code unit of the string as a hex value
 for (int i = 0; i < text.length(); i++) {
     System.out.println("U+" + Integer.toHexString(text.charAt(i)));
 }

Then check those Unicode values against the ones at the Unicode web site.
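If the dumped values do turn out to be the right Japanese code points, the remaining suspect is the output step. The sketch below shows one way to check that, under stated assumptions: CodePointDump and toCodePoints are hypothetical names, and the choice of UTF-8 for the output stream is a guess that only helps if the terminal or file viewer actually interprets output as UTF-8.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.util.stream.Collectors;

class CodePointDump {
    // Formats every code point as U+XXXX. Unlike a charAt loop, codePoints()
    // reports a surrogate pair as a single code point.
    static String toCodePoints(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String text = "\u65E5\u672C\u8A9E"; // stand-in for a cell value read from the sheet

        // The dump is pure ASCII, so it prints correctly on any console.
        System.out.println(toCodePoints(text));

        // Assumption: the console expects UTF-8. Wrapping System.out with an
        // explicit encoding avoids depending on the platform default charset.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println("title-jp=" + text);
    }
}
```

If the first line shows the expected code points but the second still shows "???" or garbage, the string itself is fine and the console or viewer encoding is what needs adjusting.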

About the author

稚然
