How do I get the encoding of a page before downloading it?

Posted 2024-11-05 19:38:47

I need to get the encoding of a web page (UTF-8, ISO-8859-1, etc.) before I download it, because I will use that encoding to convert the downloaded InputStream to a String.

I am using HttpURLConnection, which has a method called getContentEncoding, but it only returns a value if the server sends one.

In some cases the encoding is in the charset attribute (HTML 4?), in others it is in the encoding attribute (XHTML), and I presume there are other forms that I don't know about.

Is there a class that does this, or what is the way to do it?

Comments (3)

掩饰不了的爱 2024-11-12 19:38:47

The HTTP 1.1 specification indicates that Content-Type "should" be used to indicate the content, and that responses that do not include this header should be treated as "application/octet-stream" -- in other words, a sequence of bytes rather than characters. The use of "should" indicates that it's recommended practice, but some servers may not follow it.

So, your first step is to look for this header. If it's not present, don't apply any character-set decoding to the content. In the case of XML, assuming that you pass the stream on to a parser, this will just work: either the stream will be UTF-8 encoded, or the prolog will specify the encoding. And you should always pass streams directly to an XML parser, without attempting to convert them to a string first.

If there is a Content-Type header, and it specifies a character set, then you're free to decode according to that character set. The spec (RFC 2616) also talks about how to deal with a missing character set: for any text content type, you should assume that it is encoded using ISO-8859-1. Note that the later revision of HTTP/1.1 (RFC 7231) dropped this default and leaves it to the media type definition.

So that's the next step: if there's a character set, or if it's a text media type, apply the decoding.

Otherwise, leave the stream alone.
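
The steps above can be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the class name ContentTypeCharset and the method choose are made up for this example, the header parsing is deliberately simple (it does not handle every quoting or whitespace variant), and the ISO-8859-1 fallback follows the RFC 2616 rule described above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper, not part of any library.
final class ContentTypeCharset {

    /**
     * Returns the charset to decode with, or empty if the body should be
     * left as raw bytes. contentType is the raw header value and may be null.
     */
    static Optional<Charset> choose(String contentType) {
        if (contentType == null || contentType.trim().isEmpty()) {
            return Optional.empty();               // no header: treat as bytes
        }
        String[] parts = contentType.split(";");
        String mediaType = parts[0].trim().toLowerCase(Locale.ROOT);

        // Look for a charset=... parameter after the media type.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim();
                name = name.replace("\"", "");     // the value may be quoted
                try {
                    return Optional.of(Charset.forName(name));
                } catch (IllegalArgumentException e) {
                    break;                         // unknown or malformed name
                }
            }
        }
        // text/* without a usable charset: the RFC 2616 default.
        if (mediaType.startsWith("text/")) {
            return Optional.of(StandardCharsets.ISO_8859_1);
        }
        return Optional.empty();                   // otherwise leave the stream alone
    }

    public static void main(String[] args) {
        System.out.println(choose("text/html; charset=UTF-8"));
        System.out.println(choose("text/plain"));
        System.out.println(choose("image/png"));
        System.out.println(choose(null));
    }
}
```

If a charset comes back, the response stream can be wrapped with new InputStreamReader(conn.getInputStream(), charset) to read characters; if the result is empty, the bytes are best handed on untouched (for example to an XML parser).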

烟沫凡尘 2024-11-12 19:38:47

Perhaps you could try issuing a HEAD request to fetch the HTTP headers before attempting to fully process the page? HttpURLConnection has setRequestMethod, where you can specify HEAD.

With a HEAD request, the server is supposed to return all headers but without the message body. You can then try parsing the Content-Type header value. Example headers returned by a server:

HTTP/1.1 200 OK
Date: Mon, 23 May 2005 22:38:34 GMT
Server: Apache/1.3.3.7 (Unix)  (Red-Hat/Linux)
Last-Modified: Wed, 08 Jan 2003 23:11:55 GMT
Etag: "3f80f-1b6-3e1cb03b"
Accept-Ranges: bytes
Content-Length: 438
Connection: close
Content-Type: text/html; charset=UTF-8

The following snippet should give you an idea of how to iterate and read the headers returned in a HEAD request.

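// Assumes conn is an HttpURLConnection on which setRequestMethod("HEAD") was called.
// Index 0 is typically the status line (its key is null), so start at 1.
int i = 1;
String hKey;
// Print every header field until the keys run out.
while ((hKey = conn.getHeaderFieldKey(i)) != null) {
    String hVal = conn.getHeaderField(i);
    System.out.println(hKey + "=" + hVal);
    i++;
}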
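
For completeness, here is a minimal sketch of issuing the HEAD request itself and reading only the Content-Type header. The class name HeadProbe, the method names, the timeout values, and the example.com URL are all placeholders chosen for this illustration. Note that getContentType returns the Content-Type header (which is where the charset lives), whereas getContentEncoding returns the Content-Encoding header (for example gzip), which is not the character set.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;

// Hypothetical helper, not part of any library.
final class HeadProbe {

    /** Opens a connection configured for HEAD; nothing is sent yet. */
    static HttpURLConnection openHead(URL url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("HEAD");   // ask for headers only, no body
        conn.setConnectTimeout(5000);    // arbitrary illustrative timeouts (ms)
        conn.setReadTimeout(5000);
        return conn;
    }

    /** Sends the HEAD request and returns the raw Content-Type value, or null if absent. */
    static String fetchContentType(URL url) throws IOException {
        HttpURLConnection conn = openHead(url);
        try {
            conn.getResponseCode();        // forces the request to be sent
            return conn.getContentType();  // value of the Content-Type header
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL; pass a real one as the first argument.
        URL url = URI.create(args.length > 0 ? args[0] : "http://example.com/").toURL();
        System.out.println("Content-Type: " + fetchContentType(url));
    }
}
```

The trade-off is one extra round trip per page. Since a GET response carries the same headers, reading getContentType on the GET connection before consuming its stream gives the same information without the additional request.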

我乃一代侩神 2024-11-12 19:38:47

There is no guarantee that you can do this without inspecting the document.

The HTML 4.01 spec details how to specify the encoding via the Content-Type HTTP header and/or the meta elements within the document.

In the case of XHTML served with Content-Type: application/xhtml+xml the encoding must be discovered from the document.
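
Inspecting the document could look something like the sketch below, which scans the first bytes of a page for an XML declaration or a meta charset. It is an approximation under several assumptions: the class DeclaredEncodingSniffer and its method are invented for this example, the regular expressions are a rough match rather than a real HTML or XML parser, and the approach only works for ASCII-compatible encodings (UTF-16, for instance, would need a byte-order-mark check first).

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class DeclaredEncodingSniffer {

    // Matches: <?xml version="1.0" encoding="..."?>
    private static final Pattern XML_DECL = Pattern.compile(
            "<\\?xml[^>]*\\bencoding\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Matches: <meta charset="..."> and
    //          <meta http-equiv="Content-Type" content="text/html; charset=...">
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*\\bcharset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the encoding name declared in the leading bytes of a document, if any. */
    static Optional<String> sniff(byte[] head) {
        // ISO-8859-1 maps every byte to exactly one char, so this decode never fails
        // and leaves the ASCII markup we are looking for intact.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher xml = XML_DECL.matcher(text);
        if (xml.find()) {
            return Optional.of(xml.group(1));
        }
        Matcher meta = META_CHARSET.matcher(text);
        if (meta.find()) {
            return Optional.of(meta.group(1));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        byte[] xhtml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><html/>"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] html = "<html><head><meta charset=\"UTF-8\"></head></html>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(xhtml));
        System.out.println(sniff(html));
    }
}
```

In practice the leading bytes would be buffered (for example with a BufferedInputStream using mark and reset) so that the stream can be re-read from the start with the detected charset. A declared name should still be validated with Charset.forName before use, because documents sometimes declare an encoding that is wrong or unsupported.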
