Really weird HTTP client using TcpClient in C#
I am implementing a simple HTTP client that just connects to a web server and gets its default homepage. Here it is, and it works nicely:
using System;
using System.Net.Sockets;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            TcpClient tc = new TcpClient();
            tc.Connect("www.google.com", 80);
            using (NetworkStream ns = tc.GetStream())
            {
                System.IO.StreamWriter sw = new System.IO.StreamWriter(ns);
                System.IO.StreamReader sr = new System.IO.StreamReader(ns);
                string req = "";
                req += "GET / HTTP/1.0\r\n";
                req += "Host: www.google.com\r\n";
                req += "\r\n";
                sw.Write(req);
                sw.Flush();
                Console.WriteLine("[reading...]");
                Console.WriteLine(sr.ReadToEnd());
            }
            tc.Close();
            Console.WriteLine("[done!]");
            Console.ReadKey();
        }
    }
}
When I delete the line below from the code above, the program blocks on sr.ReadToEnd:
    req += "Host: www.google.com\r\n";
I even replaced sr.ReadToEnd with sr.Read, but it cannot read anything. I used Wireshark to see what is happening:
[Screenshot of the packets captured with Wireshark: http://www.imagechicken.com/uploads/1252514718052893500.jpg]
As you can see, after my GET request Google doesn't respond, and the request is retransmitted again and again. It seems we HAVE TO specify the Host part in the HTTP request. The weird part is that WE DON'T. I used telnet to send this request and got a response from Google. I also captured the request sent by telnet, and it was exactly the same as my request.
I tried many other websites (e.g. Yahoo, Microsoft), but the result is the same.
So, does the delay in telnet cause the web server to act differently (because in telnet we actually type the characters instead of sending them together in one packet)?
Another weird problem is that when I change HTTP/1.0 to HTTP/1.1, the program always blocks on the sr.ReadToEnd line. I guess that's because the web server doesn't close the connection.
One solution is to use Read (or ReadLine) and ns.DataAvailable to read the response. But then I cannot be sure that I have read all of the response. How can I read the response and be sure there are no more bytes left in the response to an HTTP/1.1 request?
Note:
As W3 says, "the Host request-header field MUST accompany all HTTP/1.1 requests" (and I did that for my HTTP/1.1 requests). But I haven't seen such a requirement for HTTP/1.0. Also, sending a request without a Host header using telnet works without any problem.
Update:
The Push flag is set to 1 in the TCP segment. I have also tried netsh winsock reset to reset my TCP/IP stack. There are no firewalls or anti-virus programs on the test computer. The packets are actually sent, because Wireshark installed on another computer can capture them.
I have also tried some other requests. For instance:
    string req = "";
    req += "GET / HTTP/1.0\r\n";
    req += "s df slkjfd sdf/ s/fd \\sdf/\\\\dsfdsf \r\n";
    req += "qwretyuiopasdfghjkl\r\n";
    req += "Host: www.google.com\r\n";
    req += "\r\n";
In all kinds of requests, if I omit the Host: part, the web server doesn't respond; with a Host: part, even an invalid request (like the one above) gets a response (400 Bad Request).
nos says the Host: part is not required on his machine, which makes the situation even weirder.
Comments (5)
This pertains to using TcpClient.
I know this post is old. I am providing this information just in case anyone else comes across this. Consider this answer a supplement to all of the above answers.
The HTTP Host header is required by some servers because they are set up to host more than one domain per IP address. As a general rule, always send the Host header. A good server will reply with "Not Found". Some servers won't reply at all.
When the call to read data from the stream blocks, it's usually because the server is waiting for more data to be sent. This is typically the case when the HTTP 1.1 spec is not followed closely. To demonstrate this, try omitting the final CR LF sequence and then read data from the stream - the call to read will wait until either the client times out or the server gives up waiting by terminating the connection.
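As a rough illustration of the two points above (always send Host, and always finish the header section with an empty CR LF line), here is a minimal sketch of a request builder. The class name HttpRequestText and the method BuildGet are made up for this example, and the Connection: close header is my own addition, assuming you want the server to close the connection once the response has been sent.

```csharp
using System.Text;

// Hypothetical helper, not part of the original question's code.
public static class HttpRequestText
{
    // Builds a minimal GET request. Every header line ends with CR LF, and
    // the final empty line tells the server the header section is complete.
    public static string BuildGet(string host, string path)
    {
        var sb = new StringBuilder();
        sb.Append("GET ").Append(path).Append(" HTTP/1.1\r\n");
        sb.Append("Host: ").Append(host).Append("\r\n"); // always send Host
        sb.Append("Connection: close\r\n");              // assumption: ask the server to close when done
        sb.Append("\r\n");                               // blank line ends the headers
        return sb.ToString();
    }
}
```

With the question's code this would be used as sw.Write(HttpRequestText.BuildGet("www.google.com", "/")); followed by sw.Flush();. If the last "\r\n" is left out, the server keeps waiting for more header lines and the read on the client side blocks.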
I hope this sheds a bit of light...
I found one question in all that:
How can I read the response and be sure I have read all of it for an HTTP/1.1 request?
And that is a question I can answer!
All the methods you're using here are synchronous, which are easy to use but not even slightly reliable. You'll see problems as soon as you have a sizable response and only get part of it.
To implement a TcpClient connection most robustly, you should use all asynchronous methods and callbacks. The relevant methods are as follows:
1) Create the connection with TcpClient.BeginConnect(...) with the callback calling TcpClient.EndConnect(...)
2) Send a request with TcpClient.GetStream().BeginWrite(...) with the callback calling TcpClient.GetStream().EndWrite(...)
3) Receive a response with TcpClient.GetStream().BeginRead(...) with the callback calling TcpClient.GetStream().EndRead(...), appending the result to a StringBuilder buffer, and then calling TcpClient.GetStream().BeginRead(...) again (with the same callback) until a response of 0 bytes is received.
It's that final step (repeatedly calling BeginRead until 0 bytes are read) that solves the problem of fetching the response, the whole response, and nothing but the response. So help us TCP.
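The read-until-zero loop from step 3 can be sketched with async/await, which is the newer equivalent of the BeginRead/EndRead pair. This is only a sketch under some assumptions: the names StreamDrain and ReadToCloseAsync are invented for the example, the bytes are collected in a MemoryStream and decoded once at the end (instead of appending to a StringBuilder per read), and UTF-8 is assumed for decoding. Note that a read only returns 0 bytes once the server has closed its side of the connection, so this loop ends for HTTP/1.0 responses or when the request carries Connection: close.

```csharp
using System.IO;
using System.Text;
using System.Threading.Tasks;

// Hypothetical helper illustrating "keep reading until 0 bytes come back".
public static class StreamDrain
{
    public static async Task<string> ReadToCloseAsync(Stream stream)
    {
        var buffer = new byte[4096];
        using (var collected = new MemoryStream())
        {
            int n;
            // ReadAsync returns 0 only when the remote side has closed the stream.
            while ((n = await stream.ReadAsync(buffer, 0, buffer.Length)) > 0)
            {
                collected.Write(buffer, 0, n);
            }
            // Decode once at the end so multi-byte characters are never split.
            return Encoding.UTF8.GetString(collected.ToArray());
        }
    }
}
```

In the question's program this would be called with the NetworkStream from tc.GetStream(), for example string response = await StreamDrain.ReadToCloseAsync(ns); inside an async method.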
Hope that helps!
I suggest you try your code against a standard, well-tested, widely accepted web server installed on your own local machine, such as Apache HTTPD or IIS.
Configure your web server to respond without the Host header (e.g. a default web application in IIS) and see if all goes well.
The bottom line is that you can't really tell what goes on behind the scenes, since you don't control web sites / web applications like Google, Yahoo, etc.
For instance, a web site administrator can configure the site so that there's no default application for incoming TCP connections on port 80, using HTTP protocol.
But he/she may want to configure a default telnet application when connecting via TCP port 23, using TELNET protocol.
I believe ReadToEnd waits until the connection is closed. However, the connection doesn't appear to close. You should read from the stream continuously instead. Then it will work as you expect.
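One way to read continuously and still know when to stop, when the server keeps an HTTP/1.1 connection open, is to read the header lines first and then read exactly as many body bytes as the Content-Length header announces. The sketch below assumes the response carries a Content-Length header; responses using chunked transfer coding are not handled. The names HttpResponseReader and ReadBody are invented for this example, and ASCII decoding of the body is an assumption.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper: reads one response body using Content-Length.
public static class HttpResponseReader
{
    // Reads one CR LF terminated line byte by byte, so that no body bytes
    // are consumed ahead of time (a StreamReader would buffer them).
    static string ReadLine(Stream s)
    {
        var sb = new StringBuilder();
        int b;
        while ((b = s.ReadByte()) != -1)
        {
            if (b == '\n') break;
            if (b != '\r') sb.Append((char)b);
        }
        return sb.ToString();
    }

    public static string ReadBody(Stream s)
    {
        ReadLine(s); // status line, e.g. "HTTP/1.1 200 OK"
        int contentLength = -1;
        string line;
        while ((line = ReadLine(s)).Length > 0) // empty line ends the headers
        {
            int colon = line.IndexOf(':');
            if (colon > 0 && line.Substring(0, colon).Trim()
                    .Equals("Content-Length", StringComparison.OrdinalIgnoreCase))
            {
                contentLength = int.Parse(line.Substring(colon + 1).Trim());
            }
        }
        if (contentLength < 0)
            throw new NotSupportedException("No Content-Length header (chunked responses are not handled here).");

        var body = new byte[contentLength];
        int read = 0;
        while (read < contentLength) // Read may return fewer bytes than requested
        {
            int n = s.Read(body, read, contentLength - read);
            if (n == 0) throw new EndOfStreamException("Connection closed before the full body arrived.");
            read += n;
        }
        return Encoding.ASCII.GetString(body);
    }
}
```

Called as HttpResponseReader.ReadBody(ns) on the NetworkStream, this returns as soon as the announced number of bytes has arrived, without waiting for the server to close the connection.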
Try using System.Net.WebClient instead of using System.Net.Sockets.TcpClient directly.
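A minimal sketch of what that could look like is below. The wrapper class PageFetcher and its Fetch method are made up for this example; WebClient.DownloadString is the actual framework call and takes care of the Host header, response framing and decoding for you. On newer .NET versions WebClient is marked obsolete in favour of HttpClient, so treat this as an illustration rather than a recommendation.

```csharp
using System.Net;

// Hypothetical wrapper around WebClient.DownloadString.
public static class PageFetcher
{
    public static string Fetch(string url)
    {
        using (var client = new WebClient())
        {
            // Sends the request and returns the whole response body as a string.
            return client.DownloadString(url);
        }
    }
}
```

For the page in the question, the call would be Console.WriteLine(PageFetcher.Fetch("http://www.google.com/"));.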