Parse raw HTTP headers
I have a string of raw HTTP and I would like to represent the fields in an object. Is there any way to parse the individual headers from an HTTP string?
'GET /search?sourceid=chrome&ie=UTF-8&q=ergterst HTTP/1.1\r\nHost: www.google.com\r\nConnection: keep-alive\r\nAccept: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\r\nUser-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.45 Safari/534.13\r\nAccept-Encoding: gzip,deflate,sdch\r\nAvail-Dictionary: GeNLY2f-\r\nAccept-Language: en-US,en;q=0.8\r\n
[...]'
There are excellent tools in the Standard Library both for parsing RFC 822-style headers and for parsing entire HTTP requests. Here is an example request string (note that Python treats it as one big string, even though we are breaking it across several lines for readability) that we can feed to my examples:
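As a stand-in for that example string, here is a minimal sketch that reuses a shortened version of the request from the question. It assumes Python 3 and a bytes literal; the name `request_text` is only illustrative.

```python
# Adjacent bytes literals are concatenated by Python into one value,
# so this is a single request even though it spans several lines.
request_text = (
    b'GET /search?sourceid=chrome&ie=UTF-8&q=ergterst HTTP/1.1\r\n'
    b'Host: www.google.com\r\n'
    b'Connection: keep-alive\r\n'
    b'Accept-Encoding: gzip,deflate,sdch\r\n'
    b'Accept-Language: en-US,en;q=0.8\r\n'
    b'\r\n'  # blank line that ends the header section
)

print(type(request_text).__name__)
print(request_text.splitlines()[0])
```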
As @TryPyPy points out, you can use Python's email message library to parse the headers, though we should add that the resulting Message object acts like a dictionary of headers once you are done creating it. But this, of course, ignores the request line, or makes you parse it yourself. It turns out that there is a much better solution.
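The email-library approach described above might look roughly like this in Python 3. It is a sketch, assuming `email.parser.BytesParser` and a shortened copy of the request from the question; the variable names are illustrative.

```python
from email.parser import BytesParser

request_text = (
    b'GET /search?sourceid=chrome&ie=UTF-8&q=ergterst HTTP/1.1\r\n'
    b'Host: www.google.com\r\n'
    b'Connection: keep-alive\r\n'
    b'Accept-Language: en-US,en;q=0.8\r\n'
    b'\r\n'
)

# The email parser knows nothing about the request line, so peel it off first.
request_line, headers_alone = request_text.split(b'\r\n', 1)
headers = BytesParser().parsebytes(headers_alone)

print(request_line)            # still raw; parsing it is left to you
print(len(headers))            # number of header fields
print(headers.keys())          # header names in order
print(headers['Host'])         # dictionary-style access
print(headers['connection'])   # lookups are case-insensitive
```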
The Standard Library will parse HTTP for you if you use its BaseHTTPRequestHandler. Though its documentation is a bit obscure (a problem with the whole suite of HTTP and URL tools in the Standard Library), all you have to do to make it parse a string is (a) wrap your string in a BytesIO(), (b) read the raw_requestline so that it stands ready to be parsed, and (c) capture any error codes that occur during parsing instead of letting it try to write them back to the client (since we do not have one!).
So here is our specialization of the Standard Library class:
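A sketch of such a specialization, following steps (a) to (c) above. It assumes Python 3's `http.server`; the class name `HTTPRequest` is illustrative. The extra `explain` parameter on `send_error` is there because some Standard Library code paths pass a third argument.

```python
from http.server import BaseHTTPRequestHandler
from io import BytesIO


class HTTPRequest(BaseHTTPRequestHandler):
    def __init__(self, request_text):
        # The parent __init__ is skipped on purpose: it expects a socket
        # and a server, and we have neither.
        self.rfile = BytesIO(request_text)             # (a) wrap the bytes
        self.raw_requestline = self.rfile.readline()   # (b) read the request line
        self.error_code = self.error_message = None
        self.parse_request()

    def send_error(self, code, message=None, explain=None):
        # (c) record the error instead of writing a response to a client
        self.error_code = code
        self.error_message = message
```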
Again, I wish the Standard Library folks had realized that HTTP parsing should be broken out in a way that did not require us to write nine lines of code to properly call it, but what can you do? Here is how you would use this simple class:
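A usage sketch under the same assumptions (Python 3, an illustrative `HTTPRequest` class). The class is repeated so that the snippet runs on its own. The request is a shortened copy of the one in the question, and the malformed request at the end is made up for illustration.

```python
from http.server import BaseHTTPRequestHandler
from io import BytesIO


class HTTPRequest(BaseHTTPRequestHandler):
    def __init__(self, request_text):
        self.rfile = BytesIO(request_text)
        self.raw_requestline = self.rfile.readline()
        self.error_code = self.error_message = None
        self.parse_request()

    def send_error(self, code, message=None, explain=None):
        self.error_code = code
        self.error_message = message


request = HTTPRequest(
    b'GET /search?sourceid=chrome&ie=UTF-8&q=ergterst HTTP/1.1\r\n'
    b'Host: www.google.com\r\n'
    b'Accept-Language: en-US,en;q=0.8\r\n'
    b'\r\n'
)
print(request.error_code)        # None when parsing succeeded
print(request.command)           # request method
print(request.path)              # path including the query string
print(request.request_version)   # protocol version
print(request.headers['Host'])   # headers behave like a mapping

# A request line with no path is rejected; the error is captured, not sent.
bad = HTTPRequest(b'GET\r\nHost: www.google.com\r\n\r\n')
print(bad.error_code)
print(bad.error_message)
```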
If there is an error during parsing, the error_code will not be None.
I prefer using the Standard Library like this because I suspect that its authors have already encountered and resolved any edge cases that might bite me if I try re-implementing an Internet specification myself with regular expressions.
Old Python 2 code
Here’s the original code for this answer, back when I first wrote it:
mimetools has been deprecated since Python 2.3 and was removed entirely in Python 3. Here is how you should do it in Python 3:
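One way to do this with the `email` package, which replaced mimetools. This is a sketch: the sample request is a shortened copy of the one in the question, and the variable names are illustrative.

```python
import email
import pprint

request_text = (
    'GET /search?sourceid=chrome&ie=UTF-8&q=ergterst HTTP/1.1\r\n'
    'Host: www.google.com\r\n'
    'Connection: keep-alive\r\n'
    'Accept-Encoding: gzip,deflate,sdch\r\n'
    '\r\n'
)

# Drop the request line, then hand the remaining header block to email.
request_line, headers_alone = request_text.split('\r\n', 1)
message = email.message_from_string(headers_alone)

# Convert to a plain dict (a repeated header name would keep only its last value).
headers = dict(message.items())
pprint.pprint(headers, width=100)
```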
This seems to work fine if you strip the GET line. A way to parse your example and add information from the first line to the object would be:
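A Python 3 sketch of that idea: split off the first line, parse the remaining headers, then store the request-line fields next to them. The function name `parse_raw_request` and the keys `method`, `path`, and `http-version` are hypothetical, chosen for illustration.

```python
import email


def parse_raw_request(raw):
    """Return the headers plus request-line fields as a single dict."""
    # Pop the first line for separate processing.
    request_line, _, header_text = raw.partition('\r\n')
    method, path, version = request_line.split(' ', 2)

    # Parse the remaining lines as RFC 822-style headers.
    fields = dict(email.message_from_string(header_text).items())

    # Add the request information alongside the headers.
    fields.update({'method': method, 'path': path, 'http-version': version})
    return fields


raw = (
    'GET /search?sourceid=chrome&ie=UTF-8&q=ergterst HTTP/1.1\r\n'
    'Host: www.google.com\r\n'
    'Connection: keep-alive\r\n'
    '\r\n'
)
parsed = parse_raw_request(raw)
print(parsed['method'], parsed['path'], parsed['http-version'])
print(parsed['Connection'])
```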
Using Python 3.7, urllib3.HTTPResponse, and http.client.parse_headers, with curl supplying the raw response:
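A reduced sketch of the `http.client.parse_headers` part only. It leaves out urllib3 and curl and uses a hard-coded, made-up response instead, so it shows one piece of the technique rather than the full pipeline. `parse_headers` reads from a binary file-like object up to the blank line, which leaves the body unread in the stream.

```python
from http.client import parse_headers
from io import BytesIO

# Hypothetical raw response, standing in for captured curl output.
raw_response = (
    b'HTTP/1.1 200 OK\r\n'
    b'Content-Type: text/html; charset=utf-8\r\n'
    b'Content-Length: 5\r\n'
    b'\r\n'
    b'hello'
)

stream = BytesIO(raw_response)

# Consume the status line first; parse_headers expects to start at the headers.
status_line = stream.readline().decode('iso-8859-1').rstrip('\r\n')
protocol, status, reason = status_line.split(' ', 2)

headers = parse_headers(stream)   # reads up to and including the blank line
body = stream.read()              # whatever remains is the body

print(protocol, status, reason)
print(headers['Content-Type'])
print(body)
```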
I wrote a simple function that returns a dictionary object. I hope it helps you. ^_^
Python 3
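A sketch of what such a function could look like, using plain string handling only. The name `parse_headers_to_dict` is hypothetical. Splitting on the first colon only keeps values that contain colons themselves (for example a Host with a port) intact.

```python
def parse_headers_to_dict(raw):
    """Return the header fields of a raw HTTP request as a dict."""
    headers = {}
    # Skip the first line (the request line) and walk the rest.
    for line in raw.split('\r\n')[1:]:
        name, sep, value = line.partition(':')
        if sep:  # ignore the blank line and anything without a colon
            headers[name.strip()] = value.strip()
    return headers


raw = (
    'GET /search?sourceid=chrome&ie=UTF-8&q=ergterst HTTP/1.1\r\n'
    'Host: www.google.com:80\r\n'
    'Connection: keep-alive\r\n'
    'Accept-Language: en-US,en;q=0.8\r\n'
    '\r\n'
)
result = parse_headers_to_dict(raw)
print(result)
```

Unlike the Standard Library parsers, this does not handle folded header lines or repeated header names, so it suits only simple input.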
In a Pythonic way:
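One compact form this could take is a dict comprehension over the header lines. This is a sketch, not necessarily what the answer contained. It skips the request line and splits each remaining line on its first colon; the sample request is a shortened copy of the one in the question.

```python
request_text = (
    'GET /search?sourceid=chrome&ie=UTF-8&q=ergterst HTTP/1.1\r\n'
    'Host: www.google.com\r\n'
    'Connection: keep-alive\r\n'
    'Accept-Encoding: gzip,deflate,sdch\r\n'
    '\r\n'
)

headers = {
    name.strip(): value.strip()
    for name, value in (
        line.split(':', 1)
        for line in request_text.splitlines()[1:]  # [1:] drops the request line
        if ':' in line
    )
}
print(headers)
```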
In Python 3:
From this question: How to parse raw HTTP request in Python 3?
Here are some Python packages aimed at proper HTTP protocol parsing: