python - Problem when trying to fetch documents from a website



I tried to download documents from this page: Securities Class Action Filings (http://securities.stanford.edu/fmi/xsl/SCACPUDB/recordlist.xsl?-db=SCACPUDB&-lay=Search&FIC_DateFiled_Quater=Q1&FIC_DateFiled_Year=2011&-sortfield.1=FIC_DateFiled&-sortfield.2=LitigationName&-sortorder.1=ascend&-max=25&-find)

I tried to download the 25 documents on the page.
I thought it would be simple; here's my code:

from BeautifulSoup import BeautifulSoup
import re
import urllib2
import os

if __name__ == "__main__":
  pre_url = "http://securities.stanford.edu"
  url = "http://securities.stanford.edu/fmi/xsl/SCACPUDB/recordlist.xsl?-db=SCACPUDB&-lay=Search&FIC_DateFiled_Quater=Q1&FIC_DateFiled_Year=2011&-sortfield.1=FIC_DateFiled&-sortfield.2=LitigationName&-sortorder.1=ascend&-max=25&-find" 
  response = urllib2.urlopen(url)
  soup = BeautifulSoup(response.read()).findAll('tr')
  url_list = []
  for s in soup[8:]:
    url_list.append(pre_url + s.a['href'])
  for x in url_list:
    name = x.split("/")[4]  
    context = urllib2.urlopen(x).read()
    soup = BeautifulSoup(context)
    file = open(name + ".txt", "w")
    file.write(soup.prettify())
  print "DONE"

After executing the script, I had downloaded 25 files successfully.
But then I found that 10 of them were full of garbage characters!
How come?
Can anyone help me?

Thanks a lot, and I'm sorry for my poor English.

Update:
Here is one of the pages that the script downloads incorrectly:
http://securities.stanford.edu/1046/BWEN00_01/


Comments (2)

爱已欠费 2024-12-07 21:35:53


The sample page is encoded in UTF-16, but the server never declares that in its headers; note that the Content-Type below is plain text/html, with no charset parameter.

>>> page = urllib2.urlopen( "http://securities.stanford.edu/1046/BWEN00_01" )
>>> page.info().headers
['Date: Mon, 22 Aug 2011 13:13:56 GMT\r\n', 'Server: Apache/1.3.33 (Darwin) mod_jk/1.2.2 DAV/1.0.3\r\n', 'Cache-Control: max-age=60\r\n', 'Expires: Mon, 22 Aug 2011 13:14:56 GMT\r\n', 'Last-Modified: Thu, 21 Jul 2011 22:06:51 GMT\r\n', 'ETag: "18b9a6e-9af6-4e28a2fb"\r\n', 'Accept-Ranges: bytes\r\n', 'Content-Length: 39670\r\n', 'Connection: close\r\n', 'Content-Type: text/html\r\n']

Try page.read().decode('utf-16') to see the page as proper Unicode characters instead of raw bytes.
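
One way to fold this into the question's download loop is to sniff for a UTF-16 byte-order mark before parsing. Here is a minimal sketch, assuming the same Python 2 / urllib2 / BeautifulSoup 3 setup as the question and that the affected pages start with a BOM (fetch_unicode is a made-up helper name):

import codecs
import urllib2
from BeautifulSoup import BeautifulSoup

def fetch_unicode(url):
    # The server sends no charset, so read the raw bytes first.
    raw = urllib2.urlopen(url).read()
    # Assumption: the UTF-16 pages begin with a byte-order mark,
    # which the 'utf-16' codec detects and strips automatically.
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        return raw.decode('utf-16')
    # Fall back for the pages that were already readable.
    return raw.decode('utf-8', 'replace')

html = fetch_unicode("http://securities.stanford.edu/1046/BWEN00_01/")
soup = BeautifulSoup(html)
print soup.prettify()  # prettify() returns UTF-8 bytes in BeautifulSoup 3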

你的他你的她 2024-12-07 21:35:53
open(name + ".txt", "w")

您的问题可能是您以文本模式打开文件,但它们是以二进制模式下载的。将上面的表达式替换为

open(name + ".txt", "wb")

并看看它是否会改善情况。

open(name + ".txt", "w")

Your problem may be that you're opening the files in text mode while the content you download is binary. On Windows, text mode translates newline bytes on write, which can corrupt UTF-16 data. Replace the above expression with

open(name + ".txt", "wb")

and see if it improves things.
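
For reference, applied to the question's download loop that change would look roughly like this (a sketch; the with statement also closes each file, which the original code never does):

for x in url_list:
    name = x.split("/")[4]
    context = urllib2.urlopen(x).read()
    soup = BeautifulSoup(context)
    # "wb" writes prettify()'s bytes untouched; plain "w" on Windows
    # rewrites newline bytes and can mangle UTF-16 content.
    with open(name + ".txt", "wb") as f:
        f.write(soup.prettify())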
