Let me start off by saying that I'm using the twisted.web framework. twisted.web's file uploading didn't work like I wanted it to (it only included the file data, and not any other information), cgi.parse_multipart doesn't work like I want it to (same thing; twisted.web uses this function), cgi.FieldStorage didn't work (because I'm getting the POST data through Twisted, not a CGI interface; as far as I can tell, FieldStorage tries to get the request via stdin), and twisted.web2 didn't work for me because the use of Deferred confused and infuriated me (too complicated for what I want).
That being said, I decided to try and just parse the HTTP request myself.
Using Chrome, the HTTP request is formed like this:
------WebKitFormBoundary7fouZ8mEjlCe92pq
Content-Disposition: form-data; name="upload_file_nonce"

11b03b61-9252-11df-a357-00266c608adb
------WebKitFormBoundary7fouZ8mEjlCe92pq
Content-Disposition: form-data; name="file"; filename="login.html"
Content-Type: text/html

<!DOCTYPE html>
<html>
<head>
...
------WebKitFormBoundary7fouZ8mEjlCe92pq
Content-Disposition: form-data; name="file"; filename=""

------WebKitFormBoundary7fouZ8mEjlCe92pq--
Is this always how it will be formed? I'm parsing it with regular expressions, like so (pardon the wall of code):
(Note: I snipped out most of the code to show only what I thought was relevant, namely the regular expressions. This is from the __init__ method, the only method so far, of an Uploads class I built. The full code can be seen in the revision history. I hope I didn't mismatch any parentheses.)
if line == "--{0}--".format(boundary):
    finished = True

if in_header == True and not line:
    in_header = False
    if 'type' not in current_file:
        ignore_current_file = True

if in_header == True:
    m = re.match(
        "Content-Disposition: form-data; name=\"(.*?)\"; filename=\"(.*?)\"$", line)
    if m:
        input_name, current_file['filename'] = m.group(1), m.group(2)
    m = re.match("Content-Type: (.*)$", line)
    if m:
        current_file['type'] = m.group(1)
else:
    if 'data' not in current_file:
        current_file['data'] = line
    else:
        current_file['data'] += line
You can see that I start a new "file" dict whenever a boundary is reached. I set in_header to True to say that I'm parsing headers. When I reach a blank line, I switch it to False, but not before checking whether a Content-Type was set for that form value; if not, I set ignore_current_file, since I'm only looking for file uploads.
I know I should be using a library, but I'm sick to death of reading documentation, trying to get different solutions to work in my project, and still having the code look reasonable. I just want to get past this part -- and if parsing an HTTP POST with file uploads is this simple, then I shall stick with that.
Note: this code works perfectly for now; I'm just wondering whether it will choke on requests from certain browsers.
Comments (3)
My solution to this problem was to parse the content with cgi.FieldStorage.
You're trying to avoid reading documentation, but I think the best advice is to actually read:
to make sure you don't miss any cases. An easier route might be to use the poster library.
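If pulling in a third-party library is not appealing, one possible middle ground is to let Python's standard email package do the MIME parsing instead of hand-written regular expressions. The sketch below is only an illustration, not something from the question's code: parse_form_data is a hypothetical helper, and it assumes you can get the request's Content-Type header value (including the boundary) and the raw body as bytes.

```python
from email import message_from_bytes
from email.policy import default


def parse_form_data(content_type, body):
    """Parse a multipart/form-data body into a list of dicts.

    content_type: the request's Content-Type header value (with boundary).
    body: the raw request body as bytes.
    Both arguments are assumptions about what the web framework exposes.
    """
    # The email parser expects headers first, so prepend the Content-Type.
    prefix = b"Content-Type: " + content_type.encode("ascii") + b"\r\n\r\n"
    message = message_from_bytes(prefix + body, policy=default)

    parts = []
    for part in message.iter_parts():
        parts.append({
            # Field name from Content-Disposition, whatever the parameter order.
            "name": part.get_param("name", header="content-disposition"),
            # None for plain form fields that carry no filename parameter.
            "filename": part.get_filename(),
            # Defaults to text/plain when the part has no Content-Type header.
            "type": part.get_content_type(),
            # Payload as bytes.
            "data": part.get_payload(decode=True),
        })
    return parts
```

With this shape, a part can be treated as a file upload when its "filename" is not None, rather than by checking whether a Content-Type header was present. The whole body is held in memory, so this is not meant for very large uploads.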
The Content-Disposition header has no defined order for its fields, and it may contain more fields than just the filename. So your match for filename may fail; there may not even be a filename!
See RFC 2183 (edit: that's for mail; see RFC 1806, RFC 2616 and maybe more for HTTP).
Also, in these kinds of regexps I would suggest replacing every space with \s*, and not relying on character case.
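To make those suggestions concrete, here is a rough sketch of a more tolerant header match. It is only one possible approach: parse_content_disposition and the two pattern names are made up for this example, and it does not handle backslash escapes inside quoted values or RFC 2231 style encoded parameters.

```python
import re

# Header name and disposition type, ignoring case and optional whitespace.
_DISPOSITION_RE = re.compile(
    r'\s*content-disposition\s*:\s*form-data\b', re.IGNORECASE)

# One ";key=value" parameter; the value is either quoted or a bare token.
_PARAM_RE = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^;\s]*))')


def parse_content_disposition(line):
    """Return the header's parameters as a dict, or None for other lines."""
    m = _DISPOSITION_RE.match(line)
    if m is None:
        return None
    params = {}
    # Parameters are collected in whatever order they appear.
    for key, quoted, bare in _PARAM_RE.findall(line[m.end():]):
        params[key.lower()] = quoted or bare
    return params
```

A caller would then check for "filename" in the returned dict instead of requiring it at a fixed position, so a field without a filename simply yields a dict that has only "name".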