How do I get non-Latin characters from a website?

Posted 2024-10-19 23:38:17 · 481 characters · 7 views · 0 comments

I am trying to fetch data from latata.pl/pl.php and display all of its characters (Polish, ISO-8859-2):

    final URL url = new URL("http://latata.pl/pl.php");
    final URLConnection urlConnection = url.openConnection();
    final BufferedReader in = new BufferedReader(new InputStreamReader(
            urlConnection.getInputStream()));
    String inputLine;

    while ((inputLine = in.readLine()) != null) {
        System.out.println(inputLine);
    }
    in.close();

It doesn't work. :( Any ideas?

Comments (5)

小忆控 2024-10-26 23:38:17

InputStreamReader has several constructors, and in a case like this you can (and really should) specify the encoding through one of them.
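To illustrate the constructor mentioned above, here is a minimal sketch that decodes a byte stream with an explicitly chosen charset. The class name `ExplicitCharsetDemo`, the helper `readFirstLine`, and the in-memory byte array are assumptions made for illustration: the array stands in for the 10-byte response body so the example runs without a network connection, and it assumes the JVM provides the ISO-8859-2 charset, which is not one of the charsets every Java platform is required to support.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

final class ExplicitCharsetDemo {

    // Reads the first line of the stream, decoding its bytes with the given charset
    // instead of the platform default.
    static String readFirstLine(InputStream in, Charset charset) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed stand-in for the response body (ten Polish letters in ISO-8859-2).
        byte[] body = {(byte) 0xB1, (byte) 0xEA, (byte) 0xB3, (byte) 0xF3, (byte) 0xBF,
                       (byte) 0xA1, (byte) 0xCA, (byte) 0xA3, (byte) 0xAF, (byte) 0xAC};
        Charset latin2 = Charset.forName("ISO-8859-2");
        System.out.println(readFirstLine(new ByteArrayInputStream(body), latin2));
    }
}
```

In the question's code, the equivalent change would be to pass `urlConnection.getInputStream()` and the charset to the same `InputStreamReader` constructor. Whether the console then shows the Polish letters correctly also depends on the console's own encoding.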

ぶ宁プ宁ぶ 2024-10-26 23:38:17

Your InputStreamReader will be attempting to convert the bytes coming back over the TCP connection using your platform default encoding (which is most likely UTF-8 or one of the horrible Windows ones). You should explicitly specify an encoding.

Assuming the web server is doing a good job, you can find the correct encoding in one of the HTTP headers (it is the charset parameter of Content-Type). Or you can just assume it's ISO-8859-2, but that might break later.
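As a rough sketch of the header-based approach, the helper below pulls the `charset` parameter out of a `Content-Type` value and falls back to a caller-supplied default when none is declared. The class name `ContentTypeCharset` and the method `charsetOf` are hypothetical, and the parsing is deliberately simple (it splits on semicolons and does not handle every quoting rule the HTTP grammar allows).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class ContentTypeCharset {

    // Returns the charset named in a Content-Type value, or the fallback if none is usable.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        // Parameters follow the media type, separated by semicolons.
        for (String part : contentType.split(";")) {
            String param = part.trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return fallback; // unknown or malformed charset name
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset latin1 = StandardCharsets.ISO_8859_1;
        System.out.println(charsetOf("text/html; charset=ISO-8859-2", latin1));
        System.out.println(charsetOf("text/html", latin1));
    }
}
```

With a live connection, the header value could be obtained from `urlConnection.getContentType()` and the resulting charset handed to the `InputStreamReader`. For the page in the question the header carries no charset, so the fallback is what decides the outcome.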

寂寞花火° 2024-10-26 23:38:17

This is too long for a comment, but who set up that webpage? You? From what I can see, it doesn't look correct.

Here's what you get back:

$ telnet latata.pl 80
Trying 91.205.74.65...
Connected to latata.pl.
Escape character is '^]'.
GET /pl.php HTTP/1.0
Host: latata.pl

HTTP/1.1 200 OK
Date: Sun, 27 Feb 2011 13:49:19 GMT
Server: Apache/2
X-Powered-By: PHP/5.2.16
Vary: Accept-Encoding,User-Agent
Content-Length: 10
Connection: close
Content-Type: text/html

����ʣ��Connection closed by foreign host.

The HTML is simply:

<html>
<head></head>
<body>±ê³ó¿¡Ê£¯¬</body>
</html>

And that's how your page appears in a browser. Is there a valid reason why no charset is specified in that HTML page?
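Because the terminal shows the body only as replacement characters, it can help to look at the byte values themselves. The sketch below is one way to do that in Java; the class name `HexDump` and the helper `toHex` are hypothetical. It assumes the browser rendered the body as ISO-8859-1, so encoding the displayed text back with that charset recovers the bytes the server sent.

```java
import java.nio.charset.StandardCharsets;

final class HexDump {

    // Formats each byte as two uppercase hex digits, separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The body as displayed in the browser, written with Unicode escapes
        // so the source file's own encoding does not matter: ±ê³ó¿¡Ê£¯¬
        String shown = "\u00B1\u00EA\u00B3\u00F3\u00BF\u00A1\u00CA\u00A3\u00AF\u00AC";
        byte[] body = shown.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(toHex(body)); // prints "B1 EA B3 F3 BF A1 CA A3 AF AC"
    }
}
```

Ten bytes come out, which matches the Content-Length: 10 header in the response.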

凉城 2024-10-26 23:38:17

The output of your PHP script pl.php is faulty. It sends the HTTP header Content-Type: text/html without a declared charset. Without a declared charset, the client has to assume ISO-8859-1, according to the HTTP specification. Interpreted as ISO-8859-1, the body that is sent reads ±ê³ó¿¡Ê£¯¬.

The bytes sent by the PHP script would represent ąęłóżĄĘŁŻŹ if the response were declared as

Content-Type: text/html; charset=ISO-8859-2

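You can check this with a simple code fragment that turns the wrongly decoded ISO-8859-1 text back into bytes and decodes them as ISO-8859-2:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Recover the original bytes from the mis-decoded text, then decode them as ISO-8859-2.
final String test = "±ê³ó¿¡Ê£¯¬";
String repaired = new String(test.getBytes(StandardCharsets.ISO_8859_1), Charset.forName("ISO-8859-2"));
System.out.println(repaired);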

The output will be ąęłóżĄĘŁŻŹ, which are Polish characters.

As a quick fix, have your PHP script send Content-Type: text/html; charset=ISO-8859-2 as an HTTP header.

But you should consider switching to UTF-8 encoded output anyway.

无所谓啦 2024-10-26 23:38:17

As someone has already stated, no charset is specified for the response. Forcing the response document to be viewed as ISO-8859-2 (typically used in Central Europe) results in legitimate Polish characters being displayed, so I assume this is the encoding actually being used. Since no encoding has been specified, ISO-8859-1 will be assumed, as this is the default.

The response needs to include the header Content-Type: text/html; charset=ISO-8859-2 for the bytes to be interpreted correctly. A client can then use this charset when decoding the response InputStream.
