Do HTTP requests over TCP drop data?

Posted on 2024-11-30 17:09:14

I am making a DownloadString function to retrieve HTML data (since WebClient lacks quite a bit of speed =/).

Here's what I have so far:

    public static string DownloadString(string url)
    {
        TcpClient client = new TcpClient();
        client.Client.ReceiveTimeout = 5;
        string dns = UrlToDNS(url);
        byte[] buffer = new byte[51200];
        client.Client.Connect(dns, 80);
        string getVal = url.Substring(url.IndexOf(dns) + dns.Length);
        string HTTPHeader = "GET " + getVal + " HTTP/1.1\nHost: " + dns + "\nConnection: close\nUser-Agent: Pastebin API 0.1\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\nAccept-Charset: ISO-8859-1,UTF-8;q=0.7,*;q=0.7\nCache-Control: no-cache\nAccept-Language: en;q=0.7,en-us;q=0.3\n\n";
        client.Client.Send(s2b(HTTPHeader));
        client.Client.Receive(buffer);
        return b2s(buffer);
    }

    private static string b2s(byte[] ba)
    {
        string ret = "";
        foreach (byte b in ba)
            ret += Convert.ToChar(b);
        return ret;
    }

(s2b is not shown; it should not matter here, since the HTTP server returns OK.)

However, when I run the code (with http://www.google.com/ as a test), it seems that some of the data is dropped or never read:

HTTP/1.1 200 OK
Date: Sat, 20 Aug 2011 15:18:28 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
Set-Cookie: PREF=ID=3714446c9ffb56bf:FF=0:TM=1313853508:LM=1313853508:S=mu1XpTcwqFTwgwJM; expires=Mon, 19-Aug-2013 15:18:28 GMT; path=/; domain=.google.com
Set-Cookie: NID=50=B8YKlYj7eK84obqC5YO10AKF9jJNcQ5w4NkzidRL9of0Sc24EpbWeP-w7HVfm-eBCfE2NX2QMZAfEBpsqsgjhWqylFUIXU-bs6ObkLQbXJ59sa_daivfBLYJkQvq_WH; expires=Sun, 19-Feb-2012 15:18:28 GMT; path=/; domain=.google.com; HttpOnly
Server: gws
X-XSS-Protection: 1; mode=block
Connection: close

<!doctype html><html><head><meta http-equiv="content-type" content="text/html; charset=ISO-8859-1"><meta name="description" content="Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."><meta name="robots" content="noodp"><title>Google</title><script>window.google={kEI:"RNBPTvPcI5C_gQeywpHfBg",getEI:function(a){var b;while(a&&!(a.getAttribute&&(b=a.getAttribute("eid"))))a=a.parentNode;return b||google.kEI},kEXPI:"28936,29049,29774,30465,30542,31760",kCSI:{e

To add another complication, the amount of data dropped varies from run to run. I haven't gotten consistent results: sometimes only a small amount is lost, and sometimes (as in the example) a much larger amount.

Any ideas on what is causing it? (Or is there a better way to retrieve the source of a webpage without WebClient?)

(Also, ignore the fact that the input and output data haven't been sanitized.)

年华零落成诗 2024-12-07 17:09:14

You should use WebClient.DownloadString. I very much doubt that this method is what is slow and causing your performance problems.

But if you want to reinvent the wheel, here's a cleaner approach:

    using System;
    using System.IO;
    using System.Net.Sockets;

    class Program
    {
        static void Main()
        {
            using (var client = new TcpClient("www.google.com", 80))
            using (var stream = client.GetStream())
            using (var writer = new StreamWriter(stream))
            using (var reader = new StreamReader(stream))
            {
                writer.AutoFlush = true;
                // HTTP requires CRLF line endings; WriteLine would otherwise
                // use the platform's default newline.
                writer.NewLine = "\r\n";

                // Send request headers
                writer.WriteLine("GET / HTTP/1.1");
                writer.WriteLine("Host: www.google.com:80");
                writer.WriteLine("User-Agent: Pastebin API 0.1");
                writer.WriteLine("Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
                writer.WriteLine("Accept-Charset: ISO-8859-1,UTF-8;q=0.7,*;q=0.7");
                writer.WriteLine("Cache-Control: no-cache");
                writer.WriteLine("Accept-Language: en;q=0.7,en-us;q=0.3");
                writer.WriteLine("Connection: close");
                // A single empty line terminates the header block
                writer.WriteLine();

                // Read the response until the server closes the connection
                Console.WriteLine(reader.ReadToEnd());
            }
        }
    }

Obviously this code doesn't follow HTTP redirects from the server. It is very basic; much more would be required to match the functionality you get from WebClient.DownloadString.
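As for why the code in the question loses a varying amount of data: TCP is a byte stream, so a single Receive call returns only the bytes that have arrived so far, not the whole response. With ReceiveTimeout set to 5 (the value is in milliseconds), how much has arrived by then differs from run to run, and b2s then converts the entire 51200-byte buffer, including the unused zero bytes. A minimal sketch of a read loop that keeps reading until the server closes the connection is below. The names ResponseReader, ReadAll, and Decode are made up for illustration, and the ISO-8859-1 decoding is an assumption based on the charset in the sample response.

```csharp
using System;
using System.IO;
using System.Text;

static class ResponseReader
{
    // Reads from the stream until Read returns 0, i.e. end of stream.
    // For a NetworkStream this means the remote side closed the connection,
    // which is what "Connection: close" asks the server to do.
    public static byte[] ReadAll(Stream stream)
    {
        var chunk = new byte[8192];
        using (var collected = new MemoryStream())
        {
            int read;
            while ((read = stream.Read(chunk, 0, chunk.Length)) > 0)
            {
                // Only the first `read` bytes of chunk are valid on each pass.
                collected.Write(chunk, 0, read);
            }
            return collected.ToArray();
        }
    }

    // Decodes the raw bytes as ISO-8859-1 (assumed here from the sample
    // response's Content-Type header; a real client would inspect it).
    public static string Decode(byte[] bytes)
    {
        return Encoding.GetEncoding("ISO-8859-1").GetString(bytes);
    }
}
```

With the TcpClient from the answer above, this could be called as `ResponseReader.Decode(ResponseReader.ReadAll(client.GetStream()))` once the request has been sent. The result still contains the response headers followed by the body, and the sketch does not handle chunked transfer encoding.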
