How do I read non-English text in Java? It is shown in the wrong encoding

Posted 2024-08-14 04:36:52 · 223 characters · 2 views · 0 comments

I use Apache HttpClient. When I try to read a site, all non-English content is represented wrongly.

It is actually represented in windows-1252, but it should be in UTF-8. How can I fix this?

I tried InputStreamReader(inputStream, Charset.forName("UTF-8")), but it didn't help (the wrong symbols turned into ????????).

Comments (3)

穿透光 2024-08-21 04:36:52

If the file is in Windows-1252, then telling it to use UTF-8 isn't going to work. Give it Windows-1252 as the charset name, and then you can read the correct data. Knowing what format data should be in isn't nearly as useful as knowing what format it's actually in :)

It's up to you whether you then rewrite it in UTF-8...
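
To illustrate the point, here is a minimal sketch in plain Java. The class name DecodeWithActualCharset, the helper readAll, and the sample bytes are all made up for this example; it assumes the bytes really are Windows-1252, which is the premise of this answer and not something the code can verify.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of HttpClient.
final class DecodeWithActualCharset {

    /** Reads the whole stream as text using the given charset. */
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "café" as Windows-1252 bytes: the é is the single byte 0xE9.
        byte[] cp1252Bytes = {'c', 'a', 'f', (byte) 0xE9};

        // Decoding with the charset the data is actually in gives the right text.
        String right = readAll(new ByteArrayInputStream(cp1252Bytes),
                Charset.forName("windows-1252"));

        // Decoding as UTF-8 cannot work: a lone 0xE9 is not valid UTF-8,
        // so the decoder substitutes a replacement character.
        String wrong = readAll(new ByteArrayInputStream(cp1252Bytes),
                StandardCharsets.UTF_8);

        System.out.println(right.equals("caf\u00E9"));
        System.out.println(wrong.equals("caf\u00E9"));
    }
}
```

Once the text is a correctly decoded String, writing it back out with an OutputStreamWriter and StandardCharsets.UTF_8 would be one way to do the optional rewrite to UTF-8.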

瀟灑尐姊 2024-08-21 04:36:52

Finding the correct character encoding can be a bit of a nightmare. Depending on what the content of your site is, the following might be useful. One thing I've done in the past is rely on a class that will use multiple methods for determining the correct character encoding:

The XmlReader class from the ROME project uses the UTF byte order mark and/or the XML declaration to determine the correct encoding.

So you could use the following construct:

new BufferedReader(new XmlReader(inputStream))

to get to the content.
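
For readers who cannot add ROME as a dependency, the byte-order-mark part of that idea can be sketched in plain Java. This is not ROME's implementation: the class BomSniffer and its sniff method are hypothetical names, the sketch ignores XML declarations entirely, and it only recognises the UTF-8 and UTF-16 marks.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: a simplified stand-in for BOM detection only.
final class BomSniffer {

    /**
     * Returns the charset implied by a leading byte order mark,
     * or the fallback when no known mark is present.
     */
    static Charset sniff(byte[] head, Charset fallback) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        // Note: FF FE is also the start of a UTF-32LE mark; not handled here.
        if (startsWith(head, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;
        }
        return fallback;
    }

    /** True when data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        byte[] withoutBom = {'h', 'i'};
        System.out.println(sniff(withBom, StandardCharsets.ISO_8859_1));
        System.out.println(sniff(withoutBom, StandardCharsets.ISO_8859_1));
    }
}
```

Many web pages carry no byte order mark at all, so a check like this only helps when one happens to be present; the fallback parameter is there for every other case.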

我很坚强 2024-08-21 04:36:52

If the page declares an encoding in its "Content-Type" header, HttpClient will honor it. If not, it will assume Latin-1, not Windows-1252. Are you sure you are getting Windows-1252? You can check the encoding like this:

String encoding = method.getResponseCharSet();

If you know the response really is UTF-8 but the header doesn't specify it, you can force it to be read as UTF-8 like this:

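byte[] body = method.getResponseBody();
// Needs import java.nio.charset.StandardCharsets (Java 7+); avoids the checked UnsupportedEncodingException.
String response = new String(body, StandardCharsets.UTF_8);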
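
The "honor the header, otherwise fall back" logic described above can be sketched without HttpClient at all. The class ContentTypeCharset and its methods are hypothetical names made up for this example, and the parsing is deliberately simple; it is not how HttpClient itself parses the header. The Latin-1 fallback mirrors the default mentioned in this answer.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, independent of any HTTP library.
final class ContentTypeCharset {

    /** Extracts the charset parameter from a Content-Type value, or returns the fallback. */
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    /** Decodes the body with the declared charset, assuming Latin-1 when none is declared. */
    static String decode(byte[] body, String contentType) {
        Charset charset = fromContentType(contentType, StandardCharsets.ISO_8859_1);
        return new String(body, charset);
    }

    public static void main(String[] args) {
        // "é" encoded as UTF-8 is the two bytes 0xC3 0xA9.
        byte[] body = {(byte) 0xC3, (byte) 0xA9};
        System.out.println(decode(body, "text/html; charset=UTF-8").length());
        System.out.println(decode(body, "text/html").length());
    }
}
```

With the charset declared, the two bytes decode to one character; without it, the Latin-1 fallback turns them into two unrelated characters, which is the kind of garbling described in the question.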