How do I get non-Latin characters from a website?

Posted 2024-10-19 23:38:17

I am trying to fetch data from latata.pl/pl.php and display all of the characters correctly (Polish, ISO-8859-2):

    final URL url = new URL("http://latata.pl/pl.php");
    final URLConnection urlConnection = url.openConnection();
    final BufferedReader in = new BufferedReader(new InputStreamReader(
            urlConnection.getInputStream()));
    String inputLine;

    while ((inputLine = in.readLine()) != null) {
        System.out.println(inputLine);
    }
    in.close();

It doesn't work. :( Any ideas?


小忆控 2024-10-26 23:38:17

InputStreamReader has several constructors, and you can (in this case, should/must) specify the encoding in one of them.
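A minimal sketch of that suggestion, assuming the page really is ISO-8859-2 as the question states. The class name ExplicitCharsetReader and the firstLine helper are made up for illustration, and a ByteArrayInputStream stands in for urlConnection.getInputStream() so that the example runs offline. The sample bytes are the letters ąęłóżĄĘŁŻŹ encoded as ISO-8859-2. Note that ISO-8859-2 is not among the charsets every Java platform is required to support, although common JDKs ship it.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

class ExplicitCharsetReader {

    // Hypothetical helper: reads the first line of the stream,
    // decoding the bytes with the given charset.
    static String firstLine(InputStream in, Charset charset) throws IOException {
        // The two-argument constructor is the important part: without the
        // charset argument the platform default encoding would be used.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // Ten sample bytes: ten Polish letters encoded as ISO-8859-2.
        byte[] body = {(byte) 0xB1, (byte) 0xEA, (byte) 0xB3, (byte) 0xF3, (byte) 0xBF,
                       (byte) 0xA1, (byte) 0xCA, (byte) 0xA3, (byte) 0xAF, (byte) 0xAC};
        String line = firstLine(new ByteArrayInputStream(body), Charset.forName("ISO-8859-2"));
        System.out.println(line);
    }
}
```

InputStreamReader also has a constructor that takes the charset name as a String; the Charset overload is used here because it does not throw a checked UnsupportedEncodingException.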


ぶ宁プ宁ぶ 2024-10-26 23:38:17

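Your InputStreamReader will try to convert the bytes coming back over the TCP connection using your platform's default encoding (most likely UTF-8 or one of the horrible Windows encodings). You should specify the encoding explicitly.

Assuming the web server is well behaved, you can find the correct encoding in one of the HTTP headers (the charset parameter of Content-Type). Or you can simply assume it is ISO-8859-2, but that may break later.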
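Reading the encoding from the header could look roughly like the sketch below. The class ContentTypeCharset and the method charsetOf are hypothetical names, and the parsing is deliberately simple: it handles the common "type/subtype; charset=name" form and falls back to a caller-supplied charset for anything else.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;
import java.util.Locale;

class ContentTypeCharset {

    // Hypothetical helper: extracts the charset parameter from a Content-Type
    // value such as "text/html; charset=ISO-8859-2", or returns the fallback.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        // Parameters follow the media type, separated by semicolons.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim();
                // The value may be quoted, as in charset="utf-8".
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
                    return fallback; // malformed or unknown charset name
                }
            }
        }
        return fallback; // header present, but no charset parameter
    }

    public static void main(String[] args) {
        Charset fallback = Charset.forName("ISO-8859-2");
        System.out.println(charsetOf("text/html; charset=UTF-8", fallback));
        System.out.println(charsetOf("text/html", fallback));
    }
}
```

With the code from the question, the first argument would be urlConnection.getContentType(), and the returned Charset would be passed to the InputStreamReader constructor.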


寂寞花火° 2024-10-26 23:38:17

This is too long for a comment, but who set up that web page? You? From what I can see, it doesn't look correct.

Here's what you get back:

$ telnet latata.pl 80
Trying 91.205.74.65...
Connected to latata.pl.
Escape character is '^]'.
GET /pl.php HTTP/1.0
Host: latata.pl

HTTP/1.1 200 OK
Date: Sun, 27 Feb 2011 13:49:19 GMT
Server: Apache/2
X-Powered-By: PHP/5.2.16
Vary: Accept-Encoding,User-Agent
Content-Length: 10
Connection: close
Content-Type: text/html

����ʣ��Connection closed by foreign host.

The HTML is simply:

<html>
<head></head>
<body>±ê³ó¿¡Ê£¯¬</body>
</html>

And that is how your page appears in a browser. Is there a valid reason why no charset is specified in that HTML page?
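To see why the same response looks different in the terminal and in the browser, the sketch below decodes one byte sequence with three charsets. The ten bytes are an assumption: they are reconstructed from the body ±ê³ó¿¡Ê£¯¬ as the browser displayed it, on the premise that the browser decoded it as ISO-8859-1, which is consistent with Content-Length: 10. The class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SameBytesThreeCharsets {

    // Assumed response body: ten bytes, one per displayed character.
    static final byte[] BODY = {(byte) 0xB1, (byte) 0xEA, (byte) 0xB3, (byte) 0xF3, (byte) 0xBF,
                                (byte) 0xA1, (byte) 0xCA, (byte) 0xA3, (byte) 0xAF, (byte) 0xAC};

    // Decodes the body with the given charset. The String constructor
    // substitutes the replacement character U+FFFD for malformed input.
    static String decode(Charset charset) {
        return new String(BODY, charset);
    }

    public static void main(String[] args) {
        // Mostly replacement characters, much like the telnet dump.
        System.out.println("UTF-8:      " + decode(StandardCharsets.UTF_8));
        // What the browser showed.
        System.out.println("ISO-8859-1: " + decode(StandardCharsets.ISO_8859_1));
        // Polish letters, which is presumably what the page author intended.
        System.out.println("ISO-8859-2: " + decode(Charset.forName("ISO-8859-2")));
    }
}
```

The bytes never change; only the charset chosen by the reader does, which is why the missing declaration matters.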


凉城 2024-10-26 23:38:17

The output of your PHP script pl.php is faulty. It sends the HTTP header Content-Type: text/html without a declared charset. Without a declared charset, the client has to assume ISO-8859-1, according to the HTTP/1.1 specification (RFC 2616). Interpreted as ISO-8859-1, the body that was sent reads ±ê³ó¿¡Ê£¯¬.

The bytes sent by the PHP script would represent ąęłóżĄĘŁŻŹ if the response were declared as

Content-Type: text/html; charset=ISO-8859-2

You can check this with a simple code fragment that takes the text wrongly decoded as ISO-8859-1 and decodes its bytes as ISO-8859-2 instead:

// The body as it looks when (wrongly) decoded as ISO-8859-1.
final String test = "±ê³ó¿¡Ê£¯¬";
// Recover the raw bytes, then decode them as ISO-8859-2.
final String repaired = new String(test.getBytes("ISO-8859-1"), "ISO-8859-2");
System.out.println(repaired);

The output will be ąęłóżĄĘŁŻŹ, which are Polish characters.

As a quick fix, set the charset in your PHP script so that it sends Content-Type: text/html; charset=ISO-8859-2 as the HTTP header.

But you should consider switching to UTF-8 encoded output anyway.
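One way to illustrate the case for UTF-8 is a round trip: encode the Polish letters and decode them again with the same charset. The sketch below is illustrative only; the class Utf8RoundTrip and the roundTrip helper are hypothetical names. It relies on the documented behaviour of String.getBytes(Charset), which replaces characters the charset cannot represent with that charset's replacement bytes ('?' for ISO-8859-1).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf8RoundTrip {

    // Encodes the text to bytes and decodes it again with the same charset.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // The ten Polish letters from the answer, written as Unicode escapes
        // so that the encoding of this source file does not matter.
        String polish = "\u0105\u0119\u0142\u00F3\u017C\u0104\u0118\u0141\u017B\u0179";

        // UTF-8 can represent every character, so nothing is lost.
        System.out.println(polish.equals(roundTrip(polish, StandardCharsets.UTF_8)));

        // ISO-8859-1 lacks most of these letters; they come back as '?'.
        System.out.println(roundTrip(polish, StandardCharsets.ISO_8859_1));
    }
}
```

ISO-8859-2 also survives the round trip for these particular letters, but UTF-8 does so for any text, which is the practical argument for switching.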


无所谓啦 2024-10-26 23:38:17

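As someone has already pointed out, no character encoding is specified for the response. Forcing the response document to be treated as ISO-8859-2 (commonly used in Central Europe) yields legitimate Polish characters, so I assume that is the encoding actually used. Because no encoding is specified, ISO-8859-1 is assumed, since that is the default.

The response needs to include the header Content-Type: text/html; charset=ISO-8859-2 for the character code points to be interpreted correctly. That charset is then the one to use when wrapping the response InputStream in a reader; URLConnection hands back raw bytes and does not apply it for you.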
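Putting the answers together, a corrected version of the question's code might look like the sketch below. It assumes the server keeps sending ISO-8859-2 bytes without declaring them, so the charset is hard-coded. The class PolishPageReader and its methods are illustrative names, not part of any library. Run without arguments it decodes a small built-in sample; given a URL as the first argument it fetches that page.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;

class PolishPageReader {

    // Reads every line from the stream, decoding the bytes with the given charset.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    // Fetches a page and decodes it with the given charset, regardless of
    // what the server does or does not declare.
    static String fetch(String address, Charset charset) throws IOException {
        URLConnection connection = new URL(address).openConnection();
        return readAll(connection.getInputStream(), charset);
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the page is ISO-8859-2, as stated in the question.
        Charset latin2 = Charset.forName("ISO-8859-2");
        if (args.length > 0) {
            // For example: java PolishPageReader http://latata.pl/pl.php
            System.out.print(fetch(args[0], latin2));
        } else {
            // Offline sample: two bytes that are Polish letters in ISO-8859-2.
            byte[] sample = {(byte) 0xB1, (byte) 0xEA};
            System.out.print(readAll(new ByteArrayInputStream(sample), latin2));
        }
    }
}
```

Keep in mind that System.out encodes text with its own charset, so a console that cannot display Polish letters may still show garbage even when the String in memory is correct.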


About the author

筑梦
