Does an HTTP request over TCP drop data?
I am writing a DownloadString function to retrieve HTML data (since WebClient lacks quite a bit of speed =/).
Here's what I have so far:
public static string DownloadString(string url)
{
    TcpClient client = new TcpClient();
    client.Client.ReceiveTimeout = 5;
    string dns = UrlToDNS(url);
    byte[] buffer = new byte[51200];
    client.Client.Connect(dns, 80);
    string getVal = url.Substring(url.IndexOf(dns) + dns.Length);
    string HTTPHeader = "GET " + getVal + " HTTP/1.1\nHost: " + dns + "\nConnection: close\nUser-Agent: Pastebin API 0.1\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\nAccept-Charset: ISO-8859-1,UTF-8;q=0.7,*;q=0.7\nCache-Control: no-cache\nAccept-Language: en;q=0.7,en-us;q=0.3\n\n";
    client.Client.Send(s2b(HTTPHeader));
    client.Client.Receive(buffer);
    return b2s(buffer);
}

private static string b2s(byte[] ba)
{
    string ret = "";
    foreach (byte b in ba)
        ret += Convert.ToChar(b);
    return ret;
}
(s2b is not shown; it isn't needed here, since the HTTP server returns OK.)
However, when I run the code (with http://www.google.com/ as a test), it seems that some of the data is dropped or not read:
HTTP/1.1 200 OK
Date: Sat, 20 Aug 2011 15:18:28 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
Set-Cookie: PREF=ID=3714446c9ffb56bf:FF=0:TM=1313853508:LM=1313853508:S=mu1XpTcwqFTwgwJM; expires=Mon, 19-Aug-2013 15:18:28 GMT; path=/; domain=.google.com
Set-Cookie: NID=50=B8YKlYj7eK84obqC5YO10AKF9jJNcQ5w4NkzidRL9of0Sc24EpbWeP-w7HVfm-eBCfE2NX2QMZAfEBpsqsgjhWqylFUIXU-bs6ObkLQbXJ59sa_daivfBLYJkQvq_WH; expires=Sun, 19-Feb-2012 15:18:28 GMT; path=/; domain=.google.com; HttpOnly
Server: gws
X-XSS-Protection: 1; mode=block
Connection: close
<!doctype html><html><head><meta http-equiv="content-type" content="text/html; charset=ISO-8859-1"><meta name="description" content="Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."><meta name="robots" content="noodp"><title>Google</title><script>window.google={kEI:"RNBPTvPcI5C_gQeywpHfBg",getEI:function(a){var b;while(a&&!(a.getAttribute&&(b=a.getAttribute("eid"))))a=a.parentNode;return b||google.kEI},kEXPI:"28936,29049,29774,30465,30542,31760",kCSI:{e
To add another complication, it seems to drop a different amount of data each time. I haven't gotten consistent results for how much is lost: sometimes only a small amount, and sometimes (as in the example) a larger amount.
Any ideas on what is causing this? (Or is there a better way to retrieve the source of a web page without WebClient?)
(Also, please ignore the fact that the input and output data haven't been sanitized.)
You should use WebClient.DownloadString. I very much doubt that this method is what is slow and causing your performance problems. But if you want to reinvent the wheel, here's a cleaner approach:
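A minimal sketch of what such an approach could look like, not necessarily the answerer's original code. It assumes a plain http:// URL, and the SimpleHttp class name is hypothetical. It lets System.Uri split the URL instead of a hand-written UrlToDNS, sends CRLF line endings as HTTP requires, and reads the response through a StreamReader until the server closes the connection. The request uses HTTP/1.0 as a deliberate simplification so the server does not reply with chunked transfer encoding.

```csharp
using System;
using System.IO;
using System.Net.Sockets;
using System.Text;

public static class SimpleHttp
{
    // Sends a bare-bones GET request and returns the raw response
    // (status line, headers, and body) as a single string.
    public static string DownloadString(string url)
    {
        Uri uri = new Uri(url);

        using (TcpClient client = new TcpClient(uri.Host, uri.Port))
        using (NetworkStream stream = client.GetStream())
        {
            // HTTP/1.0 plus "Connection: close": the server signals the end
            // of the response by closing the connection.
            string request =
                "GET " + uri.PathAndQuery + " HTTP/1.0\r\n" +
                "Host: " + uri.Authority + "\r\n" +
                "Connection: close\r\n" +
                "\r\n";
            byte[] requestBytes = Encoding.ASCII.GetBytes(request);
            stream.Write(requestBytes, 0, requestBytes.Length);

            // Assumption: the response is ISO-8859-1, as in the question's
            // example; real code would honor the Content-Type charset.
            Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
            using (StreamReader reader = new StreamReader(stream, latin1))
            {
                // ReadToEnd keeps reading until the stream ends, so a
                // response that arrives in several pieces is not truncated.
                return reader.ReadToEnd();
            }
        }
    }
}
```

Calling SimpleHttp.DownloadString("http://www.google.com/") returns the headers and the body together; splitting them at the first blank line is left to the caller.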
Obviously this code doesn't follow HTTP redirects from the server. It is very basic. Much more would be required to get all the functionality you get from WebClient.DownloadString.

Socket.Receive() only returns the data that is currently available. If not all of the page's data has arrived yet, it returns only part of it. If you want to receive all the data, you need to call Receive() in a loop until it returns 0, because that means the remote side has closed the connection and all data has been read.
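The loop described above might be sketched as follows. The SocketReader class and its method names are hypothetical, and the sketch assumes the request was sent with "Connection: close" (as in the question), so that the server closes the connection when the response is complete. It also assumes a receive timeout generous enough that Receive does not throw while waiting for the next piece; the 5 ms timeout in the question is likely too short for that.

```csharp
using System;
using System.IO;
using System.Net.Sockets;
using System.Text;

public static class SocketReader
{
    // Reads from a connected socket until Receive returns 0 (the peer has
    // closed the connection), keeping only the bytes actually received.
    public static byte[] ReceiveAll(Socket socket)
    {
        byte[] buffer = new byte[8192];
        using (MemoryStream received = new MemoryStream())
        {
            int count;
            while ((count = socket.Receive(buffer)) > 0)
            {
                // Receive may fill only part of the buffer, so copy
                // exactly 'count' bytes rather than the whole array.
                received.Write(buffer, 0, count);
            }
            return received.ToArray();
        }
    }

    // Decodes the received bytes as ISO-8859-1, which maps each byte to one
    // character, much like the question's b2s helper.
    public static string ReceiveAllText(Socket socket)
    {
        return Encoding.GetEncoding("ISO-8859-1").GetString(ReceiveAll(socket));
    }
}
```

In the question's code, the single Receive call and b2s(buffer) could then be replaced by SocketReader.ReceiveAllText(client.Client). That also avoids converting the unused tail of the 51200-byte buffer into the returned string.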