How do I read non-English text in Java? It comes out in the wrong encoding
I use Apache HttpClient. When I try to read a site, all non-English content comes out garbled.
It appears to be decoded as windows-1252, but it should be UTF-8. How can I fix this?
I tried InputStreamReader(inputStream, Charset.forName("UTF-8")), but it didn't help (the wrong symbols just turned into ????????).
Comments (3)
If the file is in Windows-1252, then telling it to use UTF-8 isn't going to work. Give it Windows-1252 as the charset name, and then you can read the correct data. Knowing what format data should be in isn't nearly as useful as knowing what format it's actually in :)
It's up to you whether you then rewrite it in UTF-8...
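As a minimal sketch of that point, using only the JDK: the same bytes are decoded once with the charset they were actually written in and once with the charset they "should" be in. The class name, method name, and sample bytes are made up for illustration. The U+FFFD replacement character that the wrong decode produces is usually what shows up as "?" once it is printed somewhere that cannot display it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeWithActualCharset {

    // Decode raw bytes using the charset they were actually written in.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "café" as windows-1252 bytes: 'é' is the single byte 0xE9.
        byte[] raw = {'c', 'a', 'f', (byte) 0xE9};

        String right = decode(raw, "windows-1252");
        String wrong = decode(raw, "UTF-8");

        // A lone 0xE9 is not valid UTF-8, so the decoder substitutes U+FFFD.
        System.out.println(right.equals("caf\u00E9"));
        System.out.println(wrong.indexOf('\uFFFD') >= 0);

        // Optional second step: re-encode the correctly decoded text as UTF-8.
        byte[] utf8 = right.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length);
    }
}
```

Once the text has been decoded correctly into a String, writing it back out with StandardCharsets.UTF_8 is the "rewrite it in UTF-8" step.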
Finding the correct character encoding can be a bit of a nightmare. Depending on what the content of your site is, the following might be useful. One thing I've done in the past is rely on a class that uses several methods to determine the correct character encoding:
The XmlReader from the ROME project uses the UTF byte order mark and/or the XML declaration to determine the correct encoding.
So you could wrap the response's input stream in an XmlReader to get to the content.
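The byte-order-mark part of this idea can be sketched with the JDK alone. This is not ROME's XmlReader and does not reproduce its logic (it ignores XML declarations and UTF-32 entirely). BomSniffer, bomCharset, and open are hypothetical names chosen for this illustration, and the fallback charset is whatever the caller believes the content is in.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomSniffer {

    // Returns the charset named by a BOM in the first len bytes, or null if none.
    // UTF-32 BOMs are not handled in this sketch.
    static Charset bomCharset(byte[] head, int len) {
        if (len >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (len >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (len >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }

    // Wraps raw in a Reader, using the BOM's charset if present, else fallback.
    static Reader open(InputStream raw, Charset fallback) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) break;
            n += r;
        }
        Charset found = bomCharset(head, n);
        int bomLen = found == null ? 0 : (found.equals(StandardCharsets.UTF_8) ? 3 : 2);
        // Push back everything after the BOM so the Reader sees only content.
        if (n > bomLen) in.unread(head, bomLen, n - bomLen);
        return new InputStreamReader(in, found != null ? found : fallback);
    }

    public static void main(String[] args) throws IOException {
        // UTF-8 BOM followed by the ASCII text "hi".
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        Reader r = open(new ByteArrayInputStream(data), StandardCharsets.ISO_8859_1);
        System.out.println((char) r.read());
    }
}
```

Many web pages carry no BOM at all, so a fallback (or a further check such as the XML declaration, which XmlReader is described as using) is still needed.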
If the page declares an encoding in its "Content-Type" header, HttpClient will honor it. If not, it will assume Latin-1, not Windows-1252. Are you sure you are getting Windows-1252? It is worth checking which charset the response actually reports.
If you know the response really is UTF-8 but the header didn't specify it, you can force HttpClient to decode it as UTF-8.
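The decision described here (use the header's charset if present, otherwise fall back to a default you choose) can be sketched with the JDK alone. This is not HttpClient's API. ContentTypeCharset, charsetOf, and decodeBody are hypothetical names, and the header values are made-up samples. Assuming HttpClient 4.x is in use, EntityUtils.toString(entity, "UTF-8") plays a similar role, since its second argument is the default applied when the entity declares no charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class ContentTypeCharset {

    // Extracts the charset parameter from a Content-Type value,
    // or returns fallback if it is missing or not a usable charset name.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) return fallback;
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // illegal or unsupported charset name
                }
            }
        }
        return fallback;
    }

    // Decodes the body with the header's charset, else with the fallback.
    static String decodeBody(byte[] body, String contentType, Charset fallback) {
        return new String(body, charsetOf(contentType, fallback));
    }

    public static void main(String[] args) {
        // Check what the header reports; Latin-1 stands in for the default.
        System.out.println(charsetOf("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1));
        System.out.println(charsetOf("text/html", StandardCharsets.ISO_8859_1));

        // Header has no charset, but we know the body is UTF-8: force it.
        byte[] body = "\u65E5\u672C".getBytes(StandardCharsets.UTF_8);
        String text = decodeBody(body, "text/html", StandardCharsets.UTF_8);
        System.out.println(text.equals("\u65E5\u672C"));
    }
}
```

Passing UTF-8 as the fallback only changes the outcome when the header is silent; a header that names a charset still takes precedence.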