
How do I read non-English text in Java? It comes out in the wrong encoding

Posted 2024-08-14 04:36:52

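I am using Apache HttpClient. When I try to "read a website", all non-English content comes out garbled.

The text is actually being decoded as windows-1252, but it should be decoded as UTF-8. How can I fix this?

I tried InputStreamReader(inputStream, Charset.forName("UTF-8")), but it did not help (the wrong symbols just turned into ?????????).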


穿透光 2024-08-21 04:36:52

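If the file is actually in Windows-1252, telling Java to use UTF-8 will not work. Give it Windows-1252 as the charset name and it will read the data correctly. Knowing what format the data should be in is less useful than knowing what format it is actually in :)

Whether you then rewrite it as UTF-8 is up to you...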
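To illustrate the point about decoding with the charset the bytes are really in, here is a minimal sketch using only the JDK. The class name CharsetMismatchDemo and the helper decode are hypothetical, and the sample bytes are made up for the demonstration; it also assumes the runtime ships the windows-1252 charset, which common JDK builds do.

```java
import java.nio.charset.Charset;

final class CharsetMismatchDemo {
    // Hypothetical helper: decodes raw bytes with the named charset.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "caf" followed by e-acute as windows-1252 encodes it: the single byte 0xE9.
        byte[] raw = {'c', 'a', 'f', (byte) 0xE9};

        // Decoding with the charset the bytes are actually in recovers the text.
        String right = decode(raw, "windows-1252");

        // 0xE9 on its own is not valid UTF-8, so the decoder substitutes
        // the replacement character U+FFFD for it.
        String wrong = decode(raw, "UTF-8");

        System.out.println(right.length() + " chars, last is U+"
                + Integer.toHexString(right.charAt(right.length() - 1)).toUpperCase());
        System.out.println("UTF-8 decode contains U+FFFD: " + wrong.contains("\uFFFD"));
    }
}
```

The replacement character is often rendered as "?" by consoles and fonts that cannot display it, which may be what the question's ????????? output was showing.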


瀟灑尐姊 2024-08-21 04:36:52

Finding the correct character encoding can be a bit of a nightmare. Depending on what the content of your site is, the following might be useful. One thing I've done in the past is rely on a class that uses multiple methods to determine the correct character encoding:

The XmlReader from the rome project will use the UTF byte order mark and/or XML declarations to determine the correct encoding.

So you could use the following construct:

new BufferedReader(new XmlReader(inputStream))

to get to the content.
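For readers who want to see what byte order mark detection involves, here is a rough JDK-only sketch. It is not how Rome's XmlReader is implemented; the class BomSniffer and its methods open and firstLine are hypothetical names. It only recognises the UTF-8 and UTF-16 marks, ignores XML declarations and UTF-32, and uses InputStream.readNBytes, so it assumes Java 9 or later.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomSniffer {
    // Picks the charset from a leading byte order mark, or uses the fallback
    // when no mark is present. The mark itself is consumed, not returned.
    static BufferedReader open(InputStream in, Charset fallback) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = pin.readNBytes(head, 0, 3); // number of bytes read, 0 for an empty stream

        Charset cs = fallback;
        int bomLen = 0;
        if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            cs = StandardCharsets.UTF_8;
            bomLen = 3;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            cs = StandardCharsets.UTF_16BE;
            bomLen = 2;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            cs = StandardCharsets.UTF_16LE;
            bomLen = 2;
        }

        // Give back every sniffed byte that was not part of the mark.
        if (n > bomLen) {
            pin.unread(head, bomLen, n - bomLen);
        }
        return new BufferedReader(new InputStreamReader(pin, cs));
    }

    // Convenience wrapper for in-memory data; returns the first line.
    static String firstLine(byte[] raw, Charset fallback) {
        try (BufferedReader reader = open(new ByteArrayInputStream(raw), fallback)) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // UTF-8 mark followed by the two-byte UTF-8 encoding of e-acute.
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, (byte) 0xC3, (byte) 0xA9};
        String line = firstLine(withBom, StandardCharsets.ISO_8859_1);
        System.out.println("decoded " + line.length() + " char(s)");
    }
}
```

Many web pages carry no byte order mark at all, so a fallback charset still has to come from somewhere, such as the HTTP headers.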


我很坚强 2024-08-21 04:36:52

If the page specifies an encoding in its "Content-Type" header, HttpClient will honor it. If not, it will assume Latin-1, not Windows-1252. Are you sure you are getting Windows-1252? You can check the encoding like this:

String encoding = method.getResponseCharSet();

If you know the response really uses UTF-8 but the header did not specify it, you can force it to be read as UTF-8 like this:

byte[] body = method.getResponseBody();
String response = new String(body, "UTF-8");
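The header-then-fallback behaviour described above can be sketched with the JDK alone. This is not HttpClient's own code: the class ContentTypeCharset and its methods fromContentType and decodeBody are hypothetical, and the parsing is deliberately simple (it looks for a charset= parameter and nothing else). The Latin-1 fallback mirrors the default mentioned in the answer.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class ContentTypeCharset {
    // Returns the charset named in a Content-Type value such as
    // "text/html; charset=UTF-8", or the fallback if none is usable.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String param = part.trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim();
                // The value may be quoted, e.g. charset="utf-8".
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both malformed and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    // Decodes a response body, falling back to Latin-1 when the header is silent.
    static String decodeBody(byte[] body, String contentType) {
        return new String(body, fromContentType(contentType, StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        System.out.println(fromContentType("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1));
        System.out.println(fromContentType("text/html", StandardCharsets.ISO_8859_1));
    }
}
```

If the server sends UTF-8 without declaring it, this sketch would still pick the fallback, which is the case where forcing UTF-8 on the raw bytes is the practical fix.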


About the author: 清泪尽
