HttpWebRequest: Receiving the response with the correct encoding

Posted on 2024-07-14 12:08:07

I'm currently downloading an HTML page, using the following code:

Try
    Dim req As System.Net.HttpWebRequest = DirectCast(WebRequest.Create(URL), HttpWebRequest)
    req.Method = "GET"
    Dim resp As Net.HttpWebResponse = DirectCast(req.GetResponse(), Net.HttpWebResponse)
    Dim stIn As IO.StreamReader = New IO.StreamReader(resp.GetResponseStream())
    Dim strResponse As String = stIn.ReadToEnd

    ''Clean up
    stIn.Close()
    stIn.Dispose()
    resp.Close()

    Return strResponse

Catch ex As Exception
    Return ""
End Try

This works well for most pages, but for some (e.g. www.gap.com) the response comes back incorrectly decoded.
On gap.com, for example, I get "’" as "?".
Not to mention what happens if I try to load google.cn...

What am I missing here to get .NET to handle the encoding correctly?

My worst fear is that I'll actually have to read the meta tag inside the HTML that specifies the encoding, and then re-read (re-encode?) the whole stream.

Any pointers will be greatly appreciated.


UPDATE:

Thanks to John Saunders' reply, I'm a bit closer.
The HttpWebResponse.ContentEncoding property always seems to come back empty. However, HttpWebResponse.CharacterSet seems useful, and with this code I'm getting closer:

Dim resp As Net.HttpWebResponse = DirectCast(req.GetResponse(), Net.HttpWebResponse)
Dim respEncoding As Encoding = Encoding.GetEncoding(resp.CharacterSet)
Dim stIn As IO.StreamReader = New IO.StreamReader(resp.GetResponseStream(), respEncoding)

Now Google.cn comes in perfectly, with all the Chinese characters.
However, Gap.com is still coming in wrong.

For Gap.com, HttpWebResponse.CharacterSet is ISO-8859-1, the Encoding I'm getting through GetEncoding is {System.Text.Latin1Encoding}, which says "ISO-8859-1" in its body name, AND the Content-Type META tag in the HTML specifies "charset=ISO-8859-1".

Am I still doing something wrong?
Or is GAP doing something wrong?

Comments (3)

彼岸花ソ最美的依靠 2024-07-21 12:08:07

I believe that the HttpWebResponse has a ContentEncoding property. Use it in the constructor of your StreamReader.
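
A caveat on the suggestion above: in .NET, ContentEncoding reports the Content-Encoding header (for example gzip), while the charset declared in Content-Type is exposed through CharacterSet. The following is only a minimal sketch of passing an encoding to the StreamReader constructor; the ReadBody helper and the UTF-8 fallback are assumptions for illustration, not part of the original code.

```vbnet
Imports System.IO
Imports System.Net
Imports System.Text

Module ResponseReader
    ' Hypothetical helper: decode the response body with the charset
    ' declared in the Content-Type header, if there is one.
    Function ReadBody(resp As HttpWebResponse) As String
        ' Assumed fallback when the server declares no charset.
        Dim enc As Encoding = Encoding.UTF8
        If Not String.IsNullOrEmpty(resp.CharacterSet) Then
            Try
                enc = Encoding.GetEncoding(resp.CharacterSet)
            Catch ex As ArgumentException
                ' Unknown charset name: keep the fallback.
            End Try
        End If

        ' Using disposes the reader (and the underlying stream) even on errors.
        Using reader As New StreamReader(resp.GetResponseStream(), enc)
            Return reader.ReadToEnd()
        End Using
    End Function
End Module
```

This only helps when the declared charset is accurate; it does nothing for a page that declares one encoding and sends another.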

初心未许 2024-07-21 12:08:07

Gap's site is wrong. The specific problem is that their page claims an encoding of Latin1 (ISO-8859-1), while the page uses byte #146 (0x92), which is not a printable character in ISO-8859-1 (that range holds C1 control codes).

That byte is, however, valid in the Windows CP-1252 encoding (a superset of the printable characters in ISO-8859-1). In CP-1252, character code #146 is used for the right single quotation mark (’). You'll see this as an apostrophe in "You’ll find Petites and small sizes" in today's text on the Gap.com home page.

You can read http://en.wikipedia.org/wiki/Windows-1252 for more details. Turns out this kind of thing is a common problem on web pages where the content was originally saved in the CP-1252 encoding (e.g. copy/pasted from Word).

Moral of the story here: always store internationalized text as Unicode in your database, and always emit HTML as UTF-8 from your web server!
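
The difference described above can be reproduced without touching the network. The sketch below decodes the same six bytes with both encodings; LenientEncoding is a hypothetical helper that treats a declared ISO-8859-1 as Windows-1252, which is one possible workaround rather than anything the original poster used. It assumes the .NET Framework; on .NET Core and .NET 5+, code page 1252 is only available after registering CodePagesEncodingProvider.Instance with Encoding.RegisterProvider.

```vbnet
Imports System
Imports System.Text

Module Cp1252Demo
    ' Hypothetical helper: if a page declares ISO-8859-1 (code page 28591),
    ' decode it as Windows-1252 instead; leave other encodings alone.
    Function LenientEncoding(declared As Encoding) As Encoding
        If declared.CodePage = 28591 Then
            ' Assumes code page 1252 is available (see the note above).
            Return Encoding.GetEncoding(1252)
        End If
        Return declared
    End Function

    Sub Main()
        ' "You", byte 146 (0x92), "ll"
        Dim raw As Byte() = {&H59, &H6F, &H75, &H92, &H6C, &H6C}
        Dim latin1 As Encoding = Encoding.GetEncoding("ISO-8859-1")

        Dim strict As String = latin1.GetString(raw)
        Dim lenient As String = LenientEncoding(latin1).GetString(raw)

        ' ISO-8859-1 maps byte 0x92 to the control character U+0092.
        Console.WriteLine(AscW(strict(3)))   ' 146
        ' Windows-1252 maps byte 0x92 to U+2019, the right single quotation mark.
        Console.WriteLine(AscW(lenient(3)))  ' 8217
    End Sub
End Module
```

Because Windows-1252 only differs from ISO-8859-1 in the 0x80 to 0x9F range, this substitution should not change any text that was valid, printable ISO-8859-1 to begin with.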

后来的我们 2024-07-21 12:08:07

Daniel,
Some pages do not even return a value in CharacterSet, so this approach is not entirely reliable.
Sometimes not even browsers are able to "guess" which encoding to use, so I don't think you can achieve 100% encoding recognition.

In my particular case, as I deal with Spanish and Portuguese pages, I use the UTF7 encoding and it is working fine for me (áéíóúñÑêã... etc.).

Maybe you can first load a table of CharacterSet codes and their corresponding Encoding. If the CharacterSet is empty, you can fall back to a default encoding.
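
The table-plus-default idea might look roughly like the sketch below. The ResolveEncoding function, the Aliases table and its two entries are hypothetical examples, not an exhaustive or authoritative mapping; anything the table does not cover is handed to Encoding.GetEncoding, and the caller's default is used when the name is empty or unknown.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

Module CharsetTable
    ' Hypothetical alias table (example entries only), matched case-insensitively.
    Private ReadOnly Aliases As New Dictionary(Of String, String)(StringComparer.OrdinalIgnoreCase) From {
        {"utf8", "utf-8"},
        {"latin1", "iso-8859-1"}
    }

    ' Returns the Encoding for a charset name, or fallback when the
    ' name is empty or not recognised.
    Function ResolveEncoding(charsetName As String, fallback As Encoding) As Encoding
        If String.IsNullOrWhiteSpace(charsetName) Then Return fallback

        ' Some servers wrap the charset value in quotes.
        Dim name As String = charsetName.Trim(" "c, """"c, "'"c)
        If Aliases.ContainsKey(name) Then name = Aliases(name)

        Try
            Return Encoding.GetEncoding(name)
        Catch ex As ArgumentException
            ' GetEncoding rejects names it does not know.
            Return fallback
        End Try
    End Function

    Sub Main()
        ' Empty charset: the fallback is returned.
        Console.WriteLine(ResolveEncoding("", Encoding.UTF8).WebName)
        ' Quoted alias: resolved through the table.
        Console.WriteLine(ResolveEncoding("""UTF8""", Encoding.ASCII).WebName)
        ' Unknown name: the fallback is returned.
        Console.WriteLine(ResolveEncoding("no-such-charset", Encoding.UTF8).WebName)
    End Sub
End Module
```

A call such as ResolveEncoding(resp.CharacterSet, Encoding.UTF8) would then replace the direct Encoding.GetEncoding call from the question; which default is appropriate depends on the sites being fetched.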

The detectEncodingFromByteOrderMarks parameter of the StreamReader constructor may help a little, as it automatically detects some encodings from the very first bytes.
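
As a small illustration of that parameter, the sketch below feeds a StreamReader an in-memory buffer that starts with a UTF-8 byte order mark while deliberately passing ISO-8859-1 as the encoding. DecodeWithBomDetection is a hypothetical helper name, and the sample bytes are made up for the demonstration.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module BomDemo
    ' Hypothetical helper: decode raw bytes, letting a byte order mark
    ' (if present) override the supplied fallback encoding.
    Function DecodeWithBomDetection(raw As Byte(), fallback As Encoding) As String
        Using ms As New MemoryStream(raw)
            ' Third argument is detectEncodingFromByteOrderMarks.
            Using reader As New StreamReader(ms, fallback, True)
                Return reader.ReadToEnd()
            End Using
        End Using
    End Function

    Sub Main()
        ' UTF-8 BOM (EF BB BF) followed by "é" encoded as UTF-8 (C3 A9).
        Dim raw As Byte() = {&HEF, &HBB, &HBF, &HC3, &HA9}
        Dim text As String = DecodeWithBomDetection(raw, Encoding.GetEncoding("ISO-8859-1"))

        ' The BOM wins over the ISO-8859-1 fallback, so the result is the single character é.
        Console.WriteLine(text.Length)      ' 1
        Console.WriteLine(AscW(text(0)))    ' 233
    End Sub
End Module
```

Many HTML pages are served without any byte order mark, in which case the fallback encoding is used unchanged, so this is a complement to the charset lookup rather than a replacement for it.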
