How to get non-Latin characters from a website?
I am trying to fetch data from latata.pl/pl.php and display all the characters (Polish, ISO-8859-2):
    final URL url = new URL("http://latata.pl/pl.php");
    final URLConnection urlConnection = url.openConnection();
    final BufferedReader in = new BufferedReader(new InputStreamReader(
            urlConnection.getInputStream()));
    String inputLine;
    while ((inputLine = in.readLine()) != null) {
        System.out.println(inputLine);
    }
    in.close();
It doesn't work. :( Any ideas?
Comments (5)
InputStreamReader has multiple constructors, and in a case like this you can (should, really have to) specify the encoding in one of them.
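As a minimal sketch of that suggestion, the question's read loop can pass a charset to the InputStreamReader constructor. The class name ExplicitCharsetRead and the helper readAll are made up for illustration, ISO-8859-2 is assumed from the question rather than confirmed by the server, and a byte array stands in for the network stream so the example runs offline.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

final class ExplicitCharsetRead {

    // Decodes every line of the stream with the given charset
    // instead of relying on the platform default encoding.
    static String readAll(InputStream stream, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(stream, charset))) {
            String line;
            while ((line = in.readLine()) != null) {
                text.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Assumption taken from the question: the page is encoded as ISO-8859-2.
        Charset latin2 = Charset.forName("ISO-8859-2");
        // Stand-in for urlConnection.getInputStream(): three bytes in ISO-8859-2.
        byte[] body = {(byte) 0xB1, (byte) 0xEA, (byte) 0xB3};
        System.out.print(readAll(new ByteArrayInputStream(body), latin2));
    }
}
```

In the question's code, urlConnection.getInputStream() would take the place of the byte array. Whether the console then shows the characters correctly also depends on the encoding System.out uses.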
Your InputStreamReader will attempt to convert the bytes coming back over the TCP connection using your platform default encoding (most likely UTF-8 or one of the horrible Windows ones). You should specify an encoding explicitly. Assuming the web server is doing a good job, you can find the correct encoding in one of the HTTP headers (I forget which one). Or you can just assume it's ISO-8859-2, but that might break later.
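The header in question is Content-Type, as the other answers in this thread point out. A rough sketch of pulling the charset out of its value, with a fallback when none is declared, might look like the following. The class ContentTypeCharset and the method charsetOf are hypothetical names, and the parsing is deliberately simplistic rather than a full implementation of the header grammar.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ContentTypeCharset {

    // Returns the charset named in a Content-Type value,
    // or the fallback if none is declared or the name is unusable.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive check for a "charset=" parameter.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // illegal or unsupported charset name
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-2", StandardCharsets.ISO_8859_1));
        System.out.println(charsetOf("text/html", StandardCharsets.ISO_8859_1));
    }
}
```

With the question's code, the value returned by urlConnection.getContentType() could be passed to charsetOf, and the result handed to the InputStreamReader constructor. For a page like the one in the question, which declares no charset, the fallback is what ends up being used, so it has to be chosen with care.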
This is too long for a comment, but who set up that webpage? You? From what I can see, it doesn't look correct.
Here's what you get back:
The HTML is simply:
And that's how your page appears in a browser. Is there a valid reason why no charset is specified in that HTML page?
The output of your PHP script pl.php is faulty. It sets the HTTP header Content-Type: text/html without a declared charset. Without a declared charset, the client has to assume ISO-8859-1 according to the HTTP specification. Interpreted as ISO-8859-1, the body that is sent reads ±ê³ó¿¡Ê£¯¬. If the response were declared as Content-Type: text/html; charset=ISO-8859-2, the bytes sent by the PHP script would represent ąęłóżĄĘŁŻŹ.
You can check this with a simple code fragment that converts the text wrongly decoded as ISO-8859-1 to ISO-8859-2:
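A minimal sketch of such a check, not necessarily the fragment the answer had in mind: the class name RecodeLatin1ToLatin2 and the method recode are hypothetical, and the hard-coded string stands in for the page body as it looks when read with ISO-8859-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class RecodeLatin1ToLatin2 {

    // Recovers the raw bytes from a string that was decoded as ISO-8859-1,
    // then decodes those same bytes as ISO-8859-2.
    static String recode(String wronglyDecoded) {
        byte[] raw = wronglyDecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, Charset.forName("ISO-8859-2"));
    }

    public static void main(String[] args) {
        // The body as seen through ISO-8859-1, written with escapes
        // so that the encoding of this source file does not matter.
        String body = "\u00B1\u00EA\u00B3\u00F3\u00BF\u00A1\u00CA\u00A3\u00AF\u00AC";
        System.out.println(recode(body));
    }
}
```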
The output will be ąęłóżĄĘŁŻŹ, which are Polish characters. As a quick fix, set the charset in your PHP script so that it outputs Content-Type: text/html; charset=ISO-8859-2 as the HTTP header. But you should think about switching to UTF-8 encoded output anyway.
As someone has already stated, no charset is specified for the response. Forcing the response document to be viewed as ISO-8859-2 (typically used in Central Europe) results in legitimate Polish characters being displayed, so I assume this is the encoding actually being used. Since no encoding has been specified, ISO-8859-1 will be assumed, as this is the default.
The response headers need to include Content-Type: text/html; charset=ISO-8859-2 for the character code points to be interpreted correctly. This charset will be used when constructing the response InputStream.