HttpWebRequest: Receiving the response with the correct encoding

Posted on 2024-07-14 12:08:07

I'm currently downloading an HTML page, using the following code:

Try
    Dim req As System.Net.HttpWebRequest = DirectCast(WebRequest.Create(URL), HttpWebRequest)
    req.Method = "GET"
    Dim resp As Net.HttpWebResponse = DirectCast(req.GetResponse(), Net.HttpWebResponse)
    Dim stIn As IO.StreamReader = New IO.StreamReader(resp.GetResponseStream())
    Dim strResponse As String = stIn.ReadToEnd

    ''Clean up
    stIn.Close()
    stIn.Dispose()
    resp.Close()

    Return strResponse

Catch ex As Exception
    Return ""
End Try

This works well for most pages, but for some (e.g. www.gap.com) the response comes back incorrectly decoded.
On gap.com, for example, "’" comes through as "?".
Not to mention what happens when I try to load google.cn...

What am I missing to get .NET to decode this correctly?

My worst fear is that I'll actually have to read the meta tag inside the HTML that specifies the encoding, and then re-read (re-encode?) the whole stream.
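
If it does come to that, the stream need not be read twice: the bytes can be buffered once and decoded after the charset is known. A minimal sketch, assuming .NET Framework 4.0 or later (for Stream.CopyTo). The module and helper names (MetaCharsetSketch, ReadAllBytes, SniffMetaCharset) are hypothetical, and the regular expression is a rough heuristic, not a full HTML parser:

```vb
Imports System.IO
Imports System.Net
Imports System.Text
Imports System.Text.RegularExpressions

Module MetaCharsetSketch
    ' Hypothetical helper: copies the response body into memory once,
    ' so the network stream never has to be requested again.
    Function ReadAllBytes(resp As HttpWebResponse) As Byte()
        Using source As Stream = resp.GetResponseStream(), buffer As New MemoryStream()
            source.CopyTo(buffer)
            Return buffer.ToArray()
        End Using
    End Function

    ' Hypothetical helper: looks for charset=... inside a <meta> tag near the
    ' top of the page. Returns Nothing when no such tag is found.
    Function SniffMetaCharset(raw As Byte()) As String
        ' ASCII is enough to locate the tag; bytes above 127 simply become "?".
        Dim head As String = Encoding.ASCII.GetString(raw, 0, Math.Min(raw.Length, 4096))
        Dim m As Match = Regex.Match(head,
            "<meta[^>]*charset\s*=\s*[""']?([A-Za-z0-9_\-]+)",
            RegexOptions.IgnoreCase)
        Return If(m.Success, m.Groups(1).Value, Nothing)
    End Function
End Module
```

With these, `Encoding.GetEncoding(SniffMetaCharset(raw)).GetString(raw)` decodes the buffered bytes, provided the sniffed name is not Nothing and is one that .NET recognizes.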

Any pointers will be greatly appreciated.

UPDATE:

Thanks to John Saunders' reply, I'm a bit closer.
The HttpWebResponse.ContentEncoding property always seems to come back empty. However, HttpWebResponse.CharacterSet seems useful, and with this code I'm getting closer:

Dim resp As Net.HttpWebResponse = DirectCast(req.GetResponse(), Net.HttpWebResponse)
Dim respEncoding As Encoding = Encoding.GetEncoding(resp.CharacterSet)
Dim stIn As IO.StreamReader = New IO.StreamReader(resp.GetResponseStream(), respEncoding)

Now google.cn comes in perfectly, with all the Chinese characters.
However, gap.com is still coming in wrong.
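
One caveat with the CharacterSet lookup: Encoding.GetEncoding throws an ArgumentException when the name is empty or not recognized. A guarded variant might look like the sketch below. The names GuardedCharsetSketch, ResolveEncoding and DownloadHtml are hypothetical, and falling back to UTF-8 is an assumption chosen for illustration, not something the post prescribes:

```vb
Imports System.IO
Imports System.Net
Imports System.Text

Module GuardedCharsetSketch
    ' Hypothetical helper: maps a charset name to an Encoding.
    ' Assumption: UTF-8 is an acceptable fallback when the name is missing or unknown.
    Function ResolveEncoding(charset As String) As Encoding
        If String.IsNullOrEmpty(charset) Then Return Encoding.UTF8
        Try
            Return Encoding.GetEncoding(charset.Trim())
        Catch ex As ArgumentException
            Return Encoding.UTF8
        End Try
    End Function

    ' Downloads a page and decodes it with the charset announced by the server.
    Function DownloadHtml(url As String) As String
        Dim req As HttpWebRequest = DirectCast(WebRequest.Create(url), HttpWebRequest)
        req.Method = "GET"
        ' Using blocks release the response and reader even if ReadToEnd throws.
        Using resp As HttpWebResponse = DirectCast(req.GetResponse(), HttpWebResponse)
            Using reader As New StreamReader(resp.GetResponseStream(), ResolveEncoding(resp.CharacterSet))
                Return reader.ReadToEnd()
            End Using
        End Using
    End Function
End Module
```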

For gap.com, HttpWebResponse.CharacterSet is ISO-8859-1, the Encoding I get from GetEncoding is {System.Text.Latin1Encoding}, which reports "ISO-8859-1" as its body name, and the Content-Type META tag in the HTML specifies "charset=ISO-8859-1".

Am I still doing something wrong?
Or is GAP doing something wrong?
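
One thing worth checking here: ISO-8859-1 only covers U+0000 to U+00FF, so it has no byte for "’" (U+2019), whereas Windows-1252 maps that character to byte 0x92. Whether gap.com actually sends Windows-1252 bytes under an ISO-8859-1 label is an assumption, not something confirmed in this post. The sketch below only shows how the same byte decodes under each encoding. The module name is hypothetical, and on .NET Core or later code page 1252 is only available after registering CodePagesEncodingProvider:

```vb
Imports System.Text

Module Latin1VersusCp1252
    Sub Main()
        ' A single byte that means different things in the two encodings.
        Dim raw As Byte() = {&H92}

        Dim latin1 As String = Encoding.GetEncoding("ISO-8859-1").GetString(raw)
        Dim cp1252 As String = Encoding.GetEncoding(1252).GetString(raw)

        ' ISO-8859-1 yields the control character U+0092;
        ' Windows-1252 yields U+2019, the right single quotation mark.
        Console.WriteLine("ISO-8859-1:   U+{0:X4}", AscW(latin1(0)))
        Console.WriteLine("Windows-1252: U+{0:X4}", AscW(cp1252(0)))
    End Sub
End Module
```

If the raw bytes of the page do contain 0x92 where the quote should be, decoding with code page 1252 instead of ISO-8859-1 would be the thing to try.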
