How can I get a page's encoding before downloading it?
I need to get the encoding of a web page (UTF-8, ISO-8859-1, etc.) before I download it, because I will use that encoding to convert the downloaded InputStream to a String.
I am using HttpURLConnection, which has a method called getContentEncoding, but it only returns a value if the server sends one.
In some cases the encoding is in the charset attribute (HTML4?), in others it is in the encoding attribute (XHTML), and I presume there are other forms that I don't know about.
Is there a class that does this, or what is the way to do it?
Comments (3)
The HTTP 1.1 specification indicates that Content-Type "should" be used to indicate the content, and that responses that do not include this header should be treated as "application/octet-stream": in other words, a sequence of bytes rather than characters. The use of "should" indicates that it's recommended practice, but some servers may not follow it.
So, your first step is to look for this header. If it's not present, don't apply any character-set decoding to the content. In the case of XML, assuming that you pass the stream on to a parser, this will just work: either the stream will be UTF-8 encoded, or the prologue will specify the encoding. And you should always pass streams directly to an XML parser, without attempting to convert them to a string first.
If there is a Content-Type header, and it specifies a character set, then you're free to decode according to that character set. The spec also talks about how to deal with a missing character set: for any text content type, you should assume that it is encoded using ISO-8859-1.
So that's the next step: if there's a character set, or if it's a text media type, apply the decoding. Otherwise, leave the stream alone.
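The decision steps above can be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the class name ContentTypeCharset and the method charsetFor are made up here, the header value is assumed to come from something like HttpURLConnection.getContentType(), and an unknown or malformed charset name is treated the same as a missing one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper: not part of any library.
class ContentTypeCharset {

    /**
     * Returns the charset to decode the body with, or null when the body
     * should be left as raw bytes (no header, or a non-text type without charset).
     */
    static Charset charsetFor(String contentType) {
        if (contentType == null) {
            return null; // no Content-Type: treat as application/octet-stream
        }
        String[] parts = contentType.split(";");
        String mediaType = parts[0].trim().toLowerCase(Locale.ROOT);

        // Look for an explicit charset parameter, e.g. "text/html; charset=UTF-8".
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq > 0 && param.substring(0, eq).trim().equalsIgnoreCase("charset")) {
                String name = param.substring(eq + 1).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Assumption: an illegal or unsupported name counts as "no charset given".
                }
            }
        }

        // No usable charset: text/* falls back to ISO-8859-1, everything else stays bytes.
        if (mediaType.startsWith("text/")) {
            return StandardCharsets.ISO_8859_1;
        }
        return null;
    }
}
```

A caller would pass the result to new InputStreamReader(stream, charset) when it is non-null, and hand the raw stream to a parser otherwise.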
Perhaps you could try issuing a HEAD request to fetch the HTTP headers before attempting to fully process the page? HttpURLConnection has setRequestMethod, where you can specify HEAD.
With a HEAD request, the server is supposed to return all headers but without the message body. You can try parsing the Content-Type header value. Example headers returned from server would be:
The following snippet should give you an idea of how to iterate and read the headers returned in a HEAD request.
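A minimal sketch of that idea, assuming Java's HttpURLConnection. The class and method names (HeadRequestHeaders, fetchHeaders, describe, firstValue) are invented for illustration, and the address passed to fetchHeaders is whatever page you intend to download.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.List;
import java.util.Map;

// Hypothetical helper: not part of any library.
class HeadRequestHeaders {

    /** Issues a HEAD request and returns the response headers (no body is transferred). */
    static Map<String, List<String>> fetchHeaders(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setRequestMethod("HEAD");
        try {
            conn.getResponseCode(); // forces the request to be sent
            return conn.getHeaderFields();
        } finally {
            conn.disconnect();
        }
    }

    /** Formats the headers one per line; the status line is stored under a null key. */
    static String describe(Map<String, List<String>> headers) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, List<String>> e : headers.entrySet()) {
            String name = e.getKey() == null ? "(status)" : e.getKey();
            sb.append(name).append(": ")
              .append(String.join(", ", e.getValue())).append('\n');
        }
        return sb.toString();
    }

    /** Case-insensitive lookup of a header's first value, or null if absent. */
    static String firstValue(Map<String, List<String>> headers, String name) {
        for (Map.Entry<String, List<String>> e : headers.entrySet()) {
            // equalsIgnoreCase(null) is false, so the status line is skipped safely.
            if (name.equalsIgnoreCase(e.getKey()) && !e.getValue().isEmpty()) {
                return e.getValue().get(0);
            }
        }
        return null;
    }
}
```

Calling firstValue(fetchHeaders(address), "Content-Type") then yields the header value to parse for a charset parameter, or null when the server did not send one.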
There is no guarantee that you can do this without inspecting the document.
The HTML 4.01 spec details how to specify the encoding via the Content-Type HTTP header and/or the meta elements within the document.
In the case of XHTML served with Content-Type: application/xhtml+xml, the encoding must be discovered from the document.
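Inspecting the document could look roughly like the sketch below. It is a simplification, not a conforming implementation of any spec's detection algorithm: it assumes the first bytes of the page are in an ASCII-compatible encoding, decodes them as ISO-8859-1 so that every byte maps to a character, and then searches for an XML declaration or a meta element with regular expressions. The names DocumentCharsetSniffer and sniff are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: not part of any library.
class DocumentCharsetSniffer {

    // Matches the encoding in <?xml version="1.0" encoding="..."?>
    private static final Pattern XML_DECL = Pattern.compile(
            "<\\?xml[^>]*?encoding\\s*=\\s*[\"']([A-Za-z0-9._:-]+)[\"']");

    // Matches charset=... inside a meta tag, covering both
    // <meta charset="..."> and <meta http-equiv=... content="text/html; charset=...">
    private static final Pattern META = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    /**
     * Returns the declared encoding name found in the first bytes of a
     * document, or null if none is declared there.
     */
    static String sniff(byte[] head) {
        // ISO-8859-1 maps each byte to one char, so decoding never fails.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = XML_DECL.matcher(text);
        if (m.find()) {
            return m.group(1);
        }
        m = META.matcher(text);
        if (m.find()) {
            return m.group(1);
        }
        return null;
    }
}
```

In practice you would read only the first kilobyte or so into a buffer, call sniff on it, and then decode the whole stream (buffer included) with the charset it reports.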