Parsing every part of an HTTP header field value
I'm parsing HTTP data directly from packets (you can assume the TCP stream has already been reassembled).
I'm looking for the best way to parse HTTP as accurately as possible.
The main issue here is the HTTP headers.
Looking at the base HTTP/1.1 RFC, header parsing looks complex.
The RFC describes very complex regular expressions for the different parts of a header.
Should I write these regular expressions to parse the different parts of the HTTP headers?
The basic parsing I've written so far handles the generic HTTP header:
message-header = field-name ":" [ field-value ]
I've also implemented replacing inner LWS with SP, and combining repeated headers that share the same field-name into a single comma-separated value, as described in section 4.2.
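For readers who want something concrete, the generic step described above might look roughly like the following. This is only a sketch under my own assumptions: the function name parse_headers is hypothetical, field names are lowercased for lookup, and the section numbers are taken to refer to RFC 2616.

```python
import re

def parse_headers(raw: str) -> dict:
    """Parse a raw header block into {lowercased field-name: field-value}.

    Hypothetical helper: unfolds continuation lines, replaces runs of
    LWS with a single SP, and joins repeated field-names with ", ".
    """
    headers = {}
    # A CRLF followed by SP or HT continues the previous header line.
    unfolded = re.sub(r"\r\n[ \t]+", " ", raw)
    for line in unfolded.split("\r\n"):
        if not line:
            break  # an empty line ends the header block
        name, sep, value = line.partition(":")
        if not sep:
            continue  # no colon: not a header line, skipped in this sketch
        name = name.strip().lower()
        # Collapse internal whitespace runs to a single SP.
        value = re.sub(r"[ \t]+", " ", value).strip()
        if name in headers:
            headers[name] += ", " + value  # combine repeated headers
        else:
            headers[name] = value
    return headers
```

One caveat with this sketch: Set-Cookie is commonly sent more than once and its values cannot safely be joined with commas, so a real parser would probably keep those as a list instead.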
However, section 14.9, for example, shows that parsing the different parts of the field-value requires a much more complex parsing scheme.
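To illustrate the kind of complexity meant here: a Cache-Control directive such as private may carry a quoted list of field names, so splitting the value on every comma gives the wrong result. The sketch below uses a hypothetical helper, split_list, that skips separators inside quoted-strings. It is an illustration of the problem, not a complete implementation of the RFC grammar.

```python
def split_list(value: str, sep: str = ",") -> list:
    """Split a field-value on sep, ignoring separators inside quoted-strings."""
    items, buf = [], []
    in_quotes = False
    escaped = False
    for ch in value:
        if escaped:
            buf.append(ch)          # character after a backslash is literal
            escaped = False
        elif ch == "\\" and in_quotes:
            buf.append(ch)          # start of a quoted-pair
            escaped = True
        elif ch == '"':
            buf.append(ch)
            in_quotes = not in_quotes
        elif ch == sep and not in_quotes:
            items.append("".join(buf).strip())
            buf = []
        else:
            buf.append(ch)
    items.append("".join(buf).strip())
    # Empty list elements are dropped.
    return [item for item in items if item]

value = 'private="Set-Cookie, X-Foo", max-age=60'
naive = value.split(",")       # three pieces, the quoted list is broken apart
careful = split_list(value)    # two directives, the quoted list stays intact
```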
How do you suggest I handle the complex parts of HTTP parsing (specifically the field-value), assuming I want to give users of the parser the full capabilities of HTTP and to parse every part of it?
Design suggestions for this would also be appreciated.
Thanks.
Comments (2)
I would follow the Single Responsibility Principle. Rather than trying to create a single monolithic parser that knows every detail of every HTTP header known to man, go simpler. Write a small, extensible parser that is itself responsible only for parsing the field name and associating that name with the raw value. Then use pluggable extensions, each responsible for parsing a single kind of header. When you create an instance of your parser, inject a collection of extensions and map each extension to the set of field names it knows how to parse.
You kill two birds with one stone with this approach. Your core parser remains simple and targeted, and you gain the ability to extend it without having to mess around with its guts, which results in more robust code.
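A minimal sketch of that design, written in Python purely for illustration. All names here (HeaderParser, parse_int, parse_directives) are hypothetical, and the directive parser is deliberately simplified: it does not handle commas inside quoted-strings.

```python
class HeaderParser:
    """Core parser: only separates the field-name from the raw field-value."""

    def __init__(self, extensions=()):
        # extensions: iterable of (parse_function, field_names) pairs.
        self._by_name = {}
        for parse, names in extensions:
            for name in names:
                self._by_name[name.lower()] = parse

    def parse_line(self, line):
        name, _, raw = line.partition(":")
        name, raw = name.strip().lower(), raw.strip()
        parse = self._by_name.get(name)
        # Fall back to the raw string when no extension is registered.
        return name, (parse(raw) if parse else raw)


def parse_int(raw):
    """Extension for headers whose value is a single integer."""
    return int(raw)


def parse_directives(raw):
    """Extension for directive lists such as Cache-Control (simplified)."""
    result = {}
    for item in raw.split(","):
        key, sep, val = item.strip().partition("=")
        if key:
            # A bare directive maps to True, key=value maps to the value.
            result[key.lower()] = val.strip('"') if sep else True
    return result


parser = HeaderParser([
    (parse_int, ["Content-Length", "Age"]),
    (parse_directives, ["Cache-Control"]),
])
```

With this shape, supporting a new header means writing one more small function and adding it to the list passed to the constructor. The core class does not change.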
There are a bunch of parsers inside the System.Net.Http.Headers namespace. It's worth having a look.