Downloading files with Python urllib / urllib2
I am trying to download files from a website using urllib as described in this thread: link text
import urllib
urllib.urlretrieve("http://www.example.com/songs/mp3.mp3", "mp3.mp3")
I am able to download the files (mostly PDFs), but all I get is corrupted files that cannot be opened. I suspect it's because the website requires a login.
How can the above function be modified to handle cookies? I already know the names of the form fields that carry the username & password information. When I print the return values of urlretrieve I get messages like:
a, b = urllib.urlretrieve("http://www.example.com/songs/mp3.mp3", "mp3.mp3")
print a, b
>> **cache-control:** no-cache, no-store, must-revalidate, s-maxage=300, proxy-revalidate
>> **connection:** close
I am able to download the files manually if I enter their URLs in the browser. Thanks
2 Answers
First of all, urllib2 actually supports cookies, and cookie handling should be easy. Second, you can check what kind of file you actually downloaded. E.g., as far as I know, all MP3 files start with the bytes "ID3".
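A minimal sketch of both ideas. The login URL and the form field names (`username`, `password`) here are hypothetical placeholders; the real names come from the site's login form, as the question notes. The `try`/`except` import lets the same sketch run under Python 2's `urllib2`/`cookielib` (as in the question) or their Python 3 equivalents:

```python
try:
    # Python 2 module names, as used in the question
    import urllib2
    import cookielib
    from urllib import urlencode
except ImportError:
    # Python 3 equivalents of the same modules
    import urllib.request as urllib2
    import http.cookiejar as cookielib
    from urllib.parse import urlencode

# The CookieJar stores whatever session cookie the site sets after login;
# the HTTPCookieProcessor sends it back automatically on later requests.
cookie_jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))

# Hypothetical login URL and form field names -- replace with the real ones.
login_data = urlencode({"username": "me", "password": "secret"})

# Uncommented, these lines would log in and then download with the session:
# opener.open("http://www.example.com/login", login_data)
# data = opener.open("http://www.example.com/songs/mp3.mp3").read()

def has_magic(data, magic):
    """True if the downloaded bytes start with the expected file signature."""
    return data[:len(magic)] == magic

# MP3 files with an ID3 tag start with b"ID3"; PDF files start with b"%PDF".
# A "corrupted" download that is really an HTML login page fails this check.
```

Checking the first few bytes this way quickly tells you whether you got a real PDF/MP3 or the site's login page served under the file's URL.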
It might be possible that the server you are requesting from is looking for certain headers, such as User-Agent. You can try mimicking browser behavior by sending additional headers.
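A sketch of that suggestion; the User-Agent string below is just an example of a browser-like value, and the compatibility import covers both Python 2's `urllib2` and Python 3's `urllib.request`. Extra headers can be attached either to a single request or to an opener:

```python
try:
    import urllib2                      # Python 2, as in the question
except ImportError:
    import urllib.request as urllib2    # Python 3 equivalent

UA = "Mozilla/5.0 (X11; Linux x86_64)"  # example browser-like value

# Option 1: set the header on one Request object.
req = urllib2.Request(
    "http://www.example.com/songs/mp3.mp3",
    headers={"User-Agent": UA},
)

# Option 2: set default headers on an opener, applied to every request
# it makes (this replaces the default "Python-urllib/x.y" User-Agent).
opener = urllib2.build_opener()
opener.addheaders = [("User-Agent", UA)]

# Either urllib2.urlopen(req) or opener.open(url) would then fetch the
# file with the browser-like header (not executed here).
```

If the cookie approach from the other answer is also needed, the same `addheaders` assignment works on an opener built with an `HTTPCookieProcessor`.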