Can an HTTP request over TCP drop data?

Posted on 2024-11-30 17:09:14

I am writing a DownloadString function to retrieve HTML data (since WebClient is noticeably slow for me).

Here's what I have so far:

    public static string DownloadString(string url)
    {
        TcpClient client = new TcpClient();
        client.Client.ReceiveTimeout = 5;
        string dns = UrlToDNS(url);
        byte[] buffer = new byte[51200];
        client.Client.Connect(dns, 80);
        string getVal = url.Substring(url.IndexOf(dns) + dns.Length);
        string HTTPHeader = "GET " + getVal + " HTTP/1.1\nHost: " + dns + "\nConnection: close\nUser-Agent: Pastebin API 0.1\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\nAccept-Charset: ISO-8859-1,UTF-8;q=0.7,*;q=0.7\nCache-Control: no-cache\nAccept-Language: en;q=0.7,en-us;q=0.3\n\n";
        client.Client.Send(s2b(HTTPHeader));
        client.Client.Receive(buffer);
        return b2s(buffer);
    }

    private static string b2s(byte[] ba)
    {
        string ret = "";
        foreach (byte b in ba)
            ret += Convert.ToChar(b);
        return ret;
    }

(s2b is not shown; it evidently works, since the HTTP server returns OK.)

However, when I run the code (with http://www.google.com/ as a test), it seems that some of the data is dropped or never read:

HTTP/1.1 200 OK
Date: Sat, 20 Aug 2011 15:18:28 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
Set-Cookie: PREF=ID=3714446c9ffb56bf:FF=0:TM=1313853508:LM=1313853508:S=mu1XpTcwqFTwgwJM; expires=Mon, 19-Aug-2013 15:18:28 GMT; path=/; domain=.google.com
Set-Cookie: NID=50=B8YKlYj7eK84obqC5YO10AKF9jJNcQ5w4NkzidRL9of0Sc24EpbWeP-w7HVfm-eBCfE2NX2QMZAfEBpsqsgjhWqylFUIXU-bs6ObkLQbXJ59sa_daivfBLYJkQvq_WH; expires=Sun, 19-Feb-2012 15:18:28 GMT; path=/; domain=.google.com; HttpOnly
Server: gws
X-XSS-Protection: 1; mode=block
Connection: close

<!doctype html><html><head><meta http-equiv="content-type" content="text/html; charset=ISO-8859-1"><meta name="description" content="Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."><meta name="robots" content="noodp"><title>Google</title><script>window.google={kEI:"RNBPTvPcI5C_gQeywpHfBg",getEI:function(a){var b;while(a&&!(a.getAttribute&&(b=a.getAttribute("eid"))))a=a.parentNode;return b||google.kEI},kEXPI:"28936,29049,29774,30465,30542,31760",kCSI:{e

To complicate matters, the amount of data dropped varies from run to run. I haven't gotten consistent results: sometimes only a small amount is lost, and sometimes (as in the example above) a much larger amount.

Any ideas on what is causing this? (Or is there a better way to retrieve the source of a web page without WebClient?)

(Also, please ignore the fact that the input and output data haven't been sanitized.)

Comments (2)

年华零落成诗 2024-12-07 17:09:14

You should use WebClient.DownloadString. I very much doubt that this method is what is slow and causing your performance problems.

But if you want to reinvent the wheel, here's a cleaner approach:

using System;
using System.IO;
using System.Net.Sockets;

class Program
{
    static void Main()
    {
        using (var client = new TcpClient("www.google.com", 80))
        using (var stream = client.GetStream())
        using (var writer = new StreamWriter(stream))
        using (var reader = new StreamReader(stream))
        {
            writer.AutoFlush = true;
            // HTTP header lines end in CRLF, whatever the platform default is
            writer.NewLine = "\r\n";

            // Send request headers
            writer.WriteLine("GET / HTTP/1.1");
            writer.WriteLine("Host: www.google.com:80");
            writer.WriteLine("User-Agent: Pastebin API 0.1");
            writer.WriteLine("Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
            writer.WriteLine("Accept-Charset: ISO-8859-1,UTF-8;q=0.7,*;q=0.7");
            writer.WriteLine("Cache-Control: no-cache");
            writer.WriteLine("Accept-Language: en;q=0.7,en-us;q=0.3");
            writer.WriteLine("Connection: close");
            // A single empty line terminates the header block
            writer.WriteLine();

            // Read the response until the server closes the connection
            Console.WriteLine(reader.ReadToEnd());
        }
    }
}

Obviously this code doesn't follow HTTP redirects from the server. It is very basic. Much more will be required to get all the functionality you would get from a WebClient.DownloadString.
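
To give an idea of what following a redirect would involve, here is a minimal sketch that pulls the redirect target out of a raw response string. The class and method names (HttpResponseParser, GetRedirectLocation) are made up for illustration, and the sketch assumes the whole header block is already in memory and uses CRLF line endings. It does not handle relative Location values or folded headers.

```csharp
using System;

static class HttpResponseParser
{
    // Hypothetical helper: returns the Location header of a 3xx response,
    // or null if the response is not a redirect (or has no Location header).
    public static string GetRedirectLocation(string rawResponse)
    {
        // The header block ends at the first empty line.
        int headerEnd = rawResponse.IndexOf("\r\n\r\n", StringComparison.Ordinal);
        string head = headerEnd >= 0 ? rawResponse.Substring(0, headerEnd) : rawResponse;
        string[] lines = head.Split(new[] { "\r\n" }, StringSplitOptions.None);

        // Status line looks like: HTTP/1.1 301 Moved Permanently
        string[] statusParts = lines[0].Split(' ');
        int status;
        if (statusParts.Length < 2 || !int.TryParse(statusParts[1], out status))
            return null;
        if (status < 300 || status > 399)
            return null;

        // Header names are case-insensitive.
        for (int i = 1; i < lines.Length; i++)
        {
            int colon = lines[i].IndexOf(':');
            if (colon > 0 &&
                lines[i].Substring(0, colon).Trim().Equals("Location", StringComparison.OrdinalIgnoreCase))
            {
                return lines[i].Substring(colon + 1).Trim();
            }
        }
        return null;
    }
}
```

If this returns a non-null value, the caller would have to open a new connection to that URL and repeat the request, with some limit on the number of hops. That is part of the extra work WebClient already does for you.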

眼眸 2024-12-07 17:09:14

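Socket.Receive() only returns the data that is currently available. If not all of the page has arrived yet, it returns only part of it, which is why the amount you get varies from run to run.

If you want to receive all the data, you need to call Receive() in a loop until it returns 0. A return value of 0 means the server has closed the connection (which it will do here, since you send Connection: close), so all data has been read.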
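
A minimal sketch of such a loop, assuming the server closes the connection when it is done. The names SocketReader and ReceiveAll are made up for illustration. Note that the loop uses the return value of Receive() to copy only the bytes actually read, rather than converting the whole buffer.

```csharp
using System;
using System.IO;
using System.Net.Sockets;

static class SocketReader
{
    // Hypothetical helper: keeps reading until the remote side closes the connection.
    public static byte[] ReceiveAll(Socket socket)
    {
        var chunk = new byte[8192];
        using (var collected = new MemoryStream())
        {
            int read;
            // Receive returns the number of bytes read into chunk;
            // 0 means the peer has shut down and nothing more will arrive.
            while ((read = socket.Receive(chunk)) > 0)
            {
                // Copy only the bytes that were actually received.
                collected.Write(chunk, 0, read);
            }
            return collected.ToArray();
        }
    }
}
```

In the question's code this would replace the single Receive call, for example: return Encoding.GetEncoding("ISO-8859-1").GetString(SocketReader.ReceiveAll(client.Client)); (with using System.Text). Also be aware that ReceiveTimeout is in milliseconds, so a value of 5 will likely make Receive throw a SocketException before a slow server has answered.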
