Python - BeautifulSoup handles GBK-encoded HTML poorly - problem scraping a Chinese web page

Posted 2024-09-30 09:30:53


I have been tinkering with the following script:

#    -*- coding: utf8 -*-
import codecs
from BeautifulSoup import BeautifulSoup, NavigableString, UnicodeDammit
import urllib2,sys
import time
try:
    import timeoutsocket # http://www.timo-tasi.org/python/timeoutsocket.py
    timeoutsocket.setDefaultSocketTimeout(10)
except ImportError:
    pass

h=u'\u3000\u3000\u4fe1\u606f\u901a\u4fe1\u6280\u672f'

address=urllib2.urlopen('http://stock.eastmoney.com/news/1408,20101022101395594.html').read()
soup=BeautifulSoup(address)

p=soup.findAll('p')
t=p[2].string[:10]

with the following output:

>>> print t
¡¡¡¡ÐÅϢͨ
>>> print h
  信息通
>>> t
u'\xa1\xa1\xa1\xa1\xd0\xc5\xcf\xa2\xcd\xa8'
>>> h
u'\u3000\u3000\u4fe1\u606f\u901a'
>>> h.encode('gbk')
'\xa1\xa1\xa1\xa1\xd0\xc5\xcf\xa2\xcd\xa8'

Simply put: when I pass this HTML through BeautifulSoup, it takes the GBK-encoded text and treats it as if it were already Unicode, not recognizing that it first needs to be decoded. "h" and "t" should be identical, though, since h is just the same text taken from the HTML file and converted by hand.
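Indeed, re-encoding t one byte per character and decoding those bytes as GBK recovers the same characters as h. A quick check in the same session (a sketch, not part of the original script; latin-1 is used here only because it maps each code point U+0000-U+00FF straight back to its byte):

>>> t.encode('latin-1').decode('gbk')
u'\u3000\u3000\u4fe1\u606f\u901a'
>>> t.encode('latin-1').decode('gbk') == h[:5]
True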

How do I solve this problem?

Best,

wheaton


Comments (1)

樱花坊 2024-10-07 09:30:53


The file's meta tag claims that the character set is GB2312, but the data contains a character from the newer GBK/GB18030 repertoire, and this is what trips BeautifulSoup up:

simon@lucifer:~$ python
Python 2.7 (r27:82508, Jul  3 2010, 21:12:11) 
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib2
>>> data = urllib2.urlopen('http://stock.eastmoney.com/news/1408,20101022101395594.html').read()
>>> data.decode("gb2312")
Traceback (most recent call last):
  File "", line 1, in 
UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 20148-20149: illegal multibyte sequence

At this point UnicodeDammit bails out on the declared encoding and falls back: it tries chardet, then UTF-8, and finally Windows-1252, which always succeeds; by the looks of it, that Windows-1252 result is what you got.
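You can inspect that guess directly by handing the raw bytes to UnicodeDammit. A minimal sketch, assuming BeautifulSoup 3, whose UnicodeDammit exposes the codec it settled on as originalEncoding and the decoded text as unicode:

>>> from BeautifulSoup import UnicodeDammit
>>> dammit = UnicodeDammit(data)
>>> dammit.originalEncoding  # whichever codec the fallback chain settled on
>>> dammit.unicode[:10]      # the (mis)decoded text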

If we tell the decoder to replace unrecognised characters (the 'replace' error handler, which substitutes U+FFFD), we can see the character that's missing from GB2312:

>>> print data[20140:20160].decode("gb2312", "replace")
毒尾气二�英的排放难

Using the correct encoding:

>>> print data[20140:20160].decode("gb18030", "replace")
毒尾气二噁英的排放难
>>> from BeautifulSoup import BeautifulSoup
>>> s = BeautifulSoup(data, fromEncoding="gb18030")
>>> print s.findAll("p")[2].string[:10]
  信息通信技术是&

Also:

>>> print s.findAll("p")[2].string
  信息通信技术是“十二五”规划重点发展方向,行业具有很强的内在增长潜
力,增速远高于GDP。软件外包、服务外包、管理软件、车载导航、网上购物、网络游戏、
移动办公、移动网络游戏、网络视频等均存在很强的潜在需求,使信息技术行业继续保持较
高增长。
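More generally, GB2312 is a subset of GBK, which in turn is a subset of GB18030, so it is always safe to upgrade a page's declared gb2312 or gbk charset to gb18030 before decoding. A minimal sketch of that safeguard, as a hypothetical helper around the same BeautifulSoup 3 fromEncoding parameter used above:

import urllib2
from BeautifulSoup import BeautifulSoup

# GB2312 < GBK < GB18030: the superset codec decodes any valid page that
# declares one of the smaller charsets, so upgrade before parsing.
SUPERSET = {"gb2312": "gb18030", "gbk": "gb18030"}

def soup_from_url(url, declared="gb2312"):
    data = urllib2.urlopen(url).read()
    encoding = SUPERSET.get(declared.lower(), declared)
    return BeautifulSoup(data, fromEncoding=encoding)

s = soup_from_url("http://stock.eastmoney.com/news/1408,20101022101395594.html")
print s.findAll("p")[2].string[:10]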