python - Problem when I try to fetch documents from a website
I tried to download documents from this page: Securities Class Action Filings (http://securities.stanford.edu/fmi/xsl/SCACPUDB/recordlist.xsl?-db=SCACPUDB&-lay=Search&FIC_DateFiled_Quater=Q1&FIC_DateFiled_Year=2011&-sortfield.1=FIC_DateFiled&-sortfield.2=LitigationName&-sortorder.1=ascend&-max=25&-find)
I tried to download the 25 documents on the page.
I thought it was simple, and here's my code:
from BeautifulSoup import BeautifulSoup
import re
import urllib2
import os

if __name__ == "__main__":
    pre_url = "http://securities.stanford.edu"
    url = "http://securities.stanford.edu/fmi/xsl/SCACPUDB/recordlist.xsl?-db=SCACPUDB&-lay=Search&FIC_DateFiled_Quater=Q1&FIC_DateFiled_Year=2011&-sortfield.1=FIC_DateFiled&-sortfield.2=LitigationName&-sortorder.1=ascend&-max=25&-find"
    response = urllib2.urlopen(url)
    soup = BeautifulSoup(response.read()).findAll('tr')
    url_list = []
    for s in soup[8:]:
        url_list.append(pre_url + s.a['href'])
    for x in url_list:
        name = x.split("/")[4]
        context = urllib2.urlopen(x).read()
        soup = BeautifulSoup(context)
        file = open(name + ".txt", "w")
        file.write(soup.prettify())
    print "DONE"
After executing the script, I had downloaded all 25 files successfully.
But then I found that 10 of them were full of garbage characters!
How come?
Can anyone help me?
Thanks a lot, and I'm sorry for my poor English.
Update:
This is one of the pages that the script downloads incorrectly:
http://securities.stanford.edu/1046/BWEN00_01/
Comments (2)
The sample page is encoded in UTF-16 without properly providing that fact in the header.
Try
page.decode('utf-16')
to see the page in proper Unicode characters instead of bytes.
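A minimal sketch of that decode step (the literal below is just a stand-in for the downloaded bytes, which are not reproduced here; a real UTF-16 response begins with a byte-order mark):

```python
# Stand-in for the raw bytes urllib2 returns for one of the problem pages.
raw = u"Securities Class Action Filings".encode('utf-16')

# Decoding turns the UTF-16 byte string into proper Unicode characters.
page = raw.decode('utf-16')
print(page)
```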
It's possible that your problem is that you're opening the files in text mode, but they're being downloaded in binary mode. Replace the above expression with
and see if it improves things.
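The replacement expression itself is not shown above; one common reading (an assumption here, not stated in the answer) is to open the output file in binary mode so the downloaded bytes reach disk unchanged:

```python
import os
import tempfile

# Hypothetical stand-in for the downloaded UTF-16 content.
raw = u"Securities Class Action Filings".encode('utf-16')

# Open in binary mode ('wb' rather than 'w') so the bytes are written untouched.
path = os.path.join(tempfile.mkdtemp(), 'BWEN00_01.txt')
with open(path, 'wb') as f:
    f.write(raw)

# Reading back in binary and decoding recovers the original text.
with open(path, 'rb') as f:
    print(f.read().decode('utf-16'))
```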