Problem scraping data with BeautifulSoup

Posted on 2024-09-08 00:42:07

I have written the following trial code to retrieve the titles of legislative acts from the European Parliament.

import urllib2
from BeautifulSoup import BeautifulSoup

search_url = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-2010-%.4d&language=EN"

for number in xrange(1,10):   
    url = search_url % number
    page = urllib2.urlopen(url).read()
    soup = BeautifulSoup(page)
    title = soup.findAll("title")
    print title

However, whenever I run it, I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 20, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 70: ordinal not in range(128)

I have narrowed it down to BeautifulSoup not being able to read the fourth document in the loop. Can anyone explain to me what I am doing wrong?

With kind regards

Thomas

Comments (3)

晚雾 2024-09-15 00:42:07

BeautifulSoup works in Unicode, so it's not responsible for that encoding error. More likely, the problem comes from the print statement: your standard output seems to be ASCII (i.e., sys.stdout.encoding is 'ascii' or absent), so you would indeed get such an error when trying to print a string containing non-ASCII characters, such as the en dash (u'\u2013') in the fourth title.
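
The failure can be reproduced without BeautifulSoup or the network. The sketch below assumes Python 3 (the question itself uses Python 2) and uses a hypothetical helper name, `can_encode`, to show that the en dash from the fourth title has no ASCII representation but encodes fine as UTF-8:

```python
# Title of the fourth report, which contains an en dash (U+2013).
title = u"REPORT on equality between women and men in the European Union \u2013 2009 - A7-0004/2010"


def can_encode(text, encoding):
    """Return True if `text` can be encoded with `encoding` (hypothetical helper)."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


ascii_ok = can_encode(title, "ascii")  # the en dash has no ASCII representation
utf8_ok = can_encode(title, "utf-8")   # UTF-8 can represent any Unicode character
print(ascii_ok, utf8_ok)
```

Printing to a stream whose encoding is ASCII performs the same encoding step implicitly, which is where the UnicodeEncodeError would come from.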

What's your OS? How is your console (a.k.a. terminal) configured (e.g., on Windows, which "codepage")? Did you set PYTHONIOENCODING in the environment to control sys.stdout.encoding, or are you just hoping the encoding will be picked up automatically?
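
One way to answer these questions is to ask Python what it thinks the output encoding is, and to degrade gracefully when that encoding cannot represent a character. A minimal Python 3 sketch; `stdout_encoding` and `printable` are hypothetical helper names, not part of any library:

```python
import sys


def stdout_encoding(default="ascii"):
    """Return the encoding Python reports for stdout, or `default` if unset."""
    return getattr(sys.stdout, "encoding", None) or default


def printable(text, encoding=None):
    """Replace characters the target encoding cannot represent with '?'."""
    encoding = encoding or stdout_encoding()
    return text.encode(encoding, errors="replace").decode(encoding)


print("stdout encoding:", stdout_encoding())
print(printable(u"European Union \u2013 2009"))
```

Setting PYTHONIOENCODING=utf-8 in the environment before starting Python should change what `stdout_encoding()` reports, which makes the replacement step unnecessary.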

On my Mac, where the encoding is correctly detected, running your code (save for also printing the number together with each title, for clarity) works fine and shows:

$ python ebs.py 
1 [<title>REPORT Report on the proposal for a Council regulation temporarily suspending autonomous Common Customs Tariff duties on imports of certain industrial products into the autonomous regions of Madeira and the Azores - A7-0001/2010</title>]
2 [<title>REPORT Report on the proposal for a Council directive concerning mutual assistance for the recovery of claims relating to taxes, duties and other measures - A7-0002/2010</title>]
3 [<title>REPORT Report on the proposal for a regulation of the European Parliament and of the Council amending Council Regulation (EC) No 1085/2006 of 17 July 2006 establishing an Instrument for Pre-Accession Assistance (IPA) - A7-0003/2010</title>]
4 [<title>REPORT on equality between women and men in the European Union – 2009 - A7-0004/2010</title>]
5 [<title>REPORT Report on the proposal for a Council decision on the conclusion by the European Community of the Convention on the International Recovery of Child Support and Other Forms of Family Maintenance - A7-0005/2010</title>]
6 [<title>REPORT on the proposal for a Council directive on administrative cooperation in the field of taxation - A7-0006/2010</title>]
7 [<title>REPORT Report on promoting good governance in tax matters - A7-0007/2010</title>]
8 [<title>REPORT Report on the proposal for a Council Directive amending Directive 2006/112/EC as regards an optional and temporary application of the reverse charge mechanism in relation to supplies of certain goods and services susceptible to fraud - A7-0008/2010</title>]
9 [<title>REPORT Recommendation on the proposal for a Council decision concerning the conclusion, on behalf of the European Community, of the Additional Protocol to the Cooperation Agreement for the Protection of the Coasts and Waters of the North-East Atlantic against Pollution - A7-0009/2010</title>]
$ 

相思故 2024-09-15 00:42:07

Replacing

print title

with

for t in title:
    print(t)

or

print('\n'.join(t.string for t in title))

works. I'm not entirely sure, however, why print <somelist> sometimes works and sometimes doesn't.

始终不够 2024-09-15 00:42:07

If you want to print the titles to a file, you need to specify an encoding that can represent the non-ASCII character; UTF-8 should work fine. To do this, add:

import codecs
out = codecs.open('titles.txt', 'w', 'utf8')

at the top of the script

and print to the file:

print >> out, title
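
A rough Python 3 equivalent of the same idea: the file is opened with an explicit UTF-8 encoding, so the en dash is written without error. The temporary directory and the sample title list are assumptions for illustration; the answer itself writes to titles.txt in the working directory.

```python
import io
import os
import tempfile

# Sample data standing in for the scraped titles (assumption for illustration).
titles = [u"REPORT on equality between women and men in the European Union \u2013 2009"]

# Write into a temporary directory so the sketch leaves the working directory alone.
path = os.path.join(tempfile.mkdtemp(), "titles.txt")

# An explicit encoding means the write does not depend on the console settings.
with io.open(path, "w", encoding="utf-8") as out:
    for title in titles:
        out.write(title + u"\n")

# Read the file back with the same encoding to confirm the text round-trips.
with io.open(path, "r", encoding="utf-8") as handle:
    written = handle.read()
```

Writing one title per line, rather than the list itself, keeps the file free of the list's bracket and quote formatting.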