How do I download a file returned "indirectly" (??) from an HTML form submission? (Python, urllib, urllib2, etc.)
EDIT: Problem solved. Ultimately it turned out to be a matter of "http:" instead of "https:" in the URL (just a stupid mistake on my part). But it was the nice clean code example from cetver that helped me isolate the problem. Thanks to all who offered suggestions.
Putting this URL into Firefox triggers the appropriate download and save-as dialog:
https://www.virwox.com/orders.php?download_open=Download&format_open=.xls
The above link is the same as submitting the form with the "Download" button on the page https://www.virwox.com/orders.php.
Here is the relevant HTML for the form that generates the above URL:
<form action='orders.php' method='get'><fieldset><legend>Open Orders (2):</legend>
<input type='submit' value='Download' name='download_open' />
<select name='format_open'>
<option value='.xls'>.xls</option>
<option value='.csv'>.csv</option>
<option value='.xml'>.xml</option></select>
</form>
But when I try the following Python code (which I sort of expected would not work)...
# get orders list
openOrders_url = virwoxTopLevel_url+"/orders.php"
openOrders_params = urlencode( { "download_open":"Download", "format_open":".xml" } )
openOrders_request = urllib2.Request(openOrders_url,openOrders_params,headers)
openOrders_response = virwox_opener.open(openOrders_request)
openOrders_xml = openOrders_response.read()
print(openOrders_xml)
...openOrders_xml ends up just being the original page (https://www.virwox.com/orders.php).
How does Firefox know there is also a file to be downloaded, and how do I detect and download this file in Python?
Please note that this is not a security/login issue, as I would not even be able to get the orders.php page if I were having authentication trouble.
EDIT: I am wondering if this has something to do with redirection (I am using the basic redirection handler), or maybe I should be using something like urllib.urlretrieve().
EDIT: Here is the code for the complete program, just in case it is relevant...
import urllib
import urllib2
import cookielib
import pprint
from urllib import urlencode
username=###############
password=###############
virwoxTopLevel_url = "http://www.virwox.com/"
overview_url = "https://www.virwox.com/index.php"
# Header
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = { 'User-Agent' : user_agent }
# Handlers...
# cookie handler...
cookie_handler= urllib2.HTTPCookieProcessor( cookielib.CookieJar() )
# redirect handler...
redirect_handler= urllib2.HTTPRedirectHandler()
# create "opener" (OpenerDirector instance)
virwox_opener = urllib2.build_opener(redirect_handler,cookie_handler)
# login
login_url = "https://www.virwox.com/index.php"
values = { 'uname' : username, 'password' : password }
login_data = urllib.urlencode(values)
login_request = urllib2.Request(login_url,login_data,headers)
login_response = virwox_opener.open(login_request)
overview_html = login_response.read();
virwox_json_url = "http://api.virwox.com/api/json.php"
getTest = urllib.urlencode( { "method":"getMarketDepth", "symbols[0]":"EUR/SLL","symbols[1]":"USD/SLL","buyDepth":1,"sellDepth":1,"id":1 } )
get_response = urllib2.urlopen(virwox_json_url,getTest)
#print get_response.read()
# get orders list
openOrders_url = virwoxTopLevel_url+"/orders.php"
openOrders_params = urlencode( { "download_open":"Download", "format_open":".xml" } )
openOrders_request = urllib2.Request(openOrders_url,openOrders_params,headers)
openOrders_response = virwox_opener.open(openOrders_request)
openOrders_xml = openOrders_response.read()
# the following prints the html of the /orders.php page not the desired download data:
print "******************************************"
print(openOrders_xml)
print "******************************************"
print openOrders_response.info()
print openOrders_response.geturl()
print "******************************************"
# the following prints nothing, I assume because without the cookie handler it fails to authenticate
# (note that authentication is done by the PHP program, not HTTP authentication, so there is no "authentication handler" above)
print urllib2.urlopen("https://www.virwox.com/orders.php?download_open=Download&format_open=.xml").read()
Comments (3)
The code below is untested.
Something like:
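The code that originally accompanied this answer did not survive in this copy, so the following is not the answerer's code. It is only a minimal sketch of the same idea, written for Python 3, where `urllib.request` and `http.cookiejar` take the place of `urllib2` and `cookielib`. It assumes the login field names `uname` and `password` shown in the question; `build_download_url`, `make_opener` and `download_orders` are hypothetical helper names. Since the form uses method='get', the fields are placed in the query string rather than passed as request data (passing data would turn the request into a POST), and every URL uses https.

```python
import http.cookiejar
import urllib.parse
import urllib.request

BASE_URL = "https://www.virwox.com/"  # note: https, not http


def build_download_url(fmt=".xml"):
    """Return the URL a browser requests when the GET form is submitted."""
    query = urllib.parse.urlencode(
        {"download_open": "Download", "format_open": fmt}
    )
    return urllib.parse.urljoin(BASE_URL, "orders.php") + "?" + query


def make_opener():
    """Build an opener that keeps the session cookie between requests."""
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def download_orders(opener, username, password, fmt=".xml"):
    """Log in, then fetch the orders file (performs network requests)."""
    login_data = urllib.parse.urlencode(
        {"uname": username, "password": password}
    ).encode("utf-8")
    # Supplying data makes this request a POST, as in the question's login step.
    opener.open(urllib.parse.urljoin(BASE_URL, "index.php"), login_data).close()
    # No data argument here, so this request is a GET like the form submission.
    with opener.open(build_download_url(fmt)) as response:
        return response.read()
```

Calling `download_orders(make_opener(), username, password)` would then return the raw bytes of the file, assuming the login succeeds.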
Looks like your question is already answered, but you might like to take a look at the Requests package. It is basically a nice wrapper around the standard lib tools. The following (probably) does what you want.
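The snippet this answer refers to is missing from this copy. As a stand-in, here is a rough sketch of how the same flow might look with Requests; it is not the answerer's original code. The field names `uname` and `password` come from the question, while `preview_url` and `download_orders` are hypothetical helper names.

```python
import requests

LOGIN_URL = "https://www.virwox.com/index.php"
ORDERS_URL = "https://www.virwox.com/orders.php"


def order_params(fmt=".xml"):
    """Form fields sent by the 'Download' button."""
    return {"download_open": "Download", "format_open": fmt}


def preview_url(fmt=".xml"):
    """Return the final GET URL without sending anything over the network."""
    return requests.Request("GET", ORDERS_URL, params=order_params(fmt)).prepare().url


def download_orders(username, password, fmt=".xml"):
    """Log in and fetch the orders file (performs network requests)."""
    with requests.Session() as session:  # the session keeps the login cookie
        session.post(LOGIN_URL, data={"uname": username, "password": password})
        response = session.get(ORDERS_URL, params=order_params(fmt))
        response.raise_for_status()
        return response.content
```

`preview_url()` is handy for comparing the URL the script would request against the one that works in the browser.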
You may need a urllib2.HTTPPasswordMgr, like this (untested, since I don't have your uname/pw):
Then you can check the response to see if it contains the XML data you need.
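Both snippets from this answer are missing in this copy. The sketch below is an approximation in Python 3 (`urllib.request` instead of `urllib2`), not the answerer's code. It uses `HTTPPasswordMgrWithDefaultRealm`, a subclass of `HTTPPasswordMgr` that does not require knowing the realm. `make_auth_opener` and `looks_like_download` are hypothetical helper names, and it is only an assumption that the server marks the file with a `Content-Disposition: attachment` header or an XML content type. Note also that the question says login is handled by the PHP application rather than by HTTP authentication, in which case this handler may have no effect.

```python
import urllib.request

ORDERS_URL = (
    "https://www.virwox.com/orders.php"
    "?download_open=Download&format_open=.xml"
)


def make_auth_opener(username, password, top_level_url="https://www.virwox.com/"):
    """Build an opener that answers HTTP Basic auth challenges for the site."""
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    # realm=None: use these credentials for any realm under top_level_url
    password_mgr.add_password(None, top_level_url, username, password)
    handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
    return urllib.request.build_opener(handler)


def looks_like_download(headers):
    """Guess from the response headers whether the body is a file, not a page."""
    return (
        headers.get_content_disposition() == "attachment"
        or headers.get_content_type() in ("text/xml", "application/xml")
    )


# Usage (performs a network request):
#   response = make_auth_opener(username, password).open(ORDERS_URL)
#   if looks_like_download(response.headers):
#       data = response.read()
```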