Python - BeautifulSoup handles GBK-encoded HTML poorly - problem scraping a Chinese web page
I have been tinkering with the following script:
# -*- coding: utf8 -*-
import codecs
from BeautifulSoup import BeautifulSoup, NavigableString, UnicodeDammit
import urllib2, sys
import time

try:
    import timeoutsocket  # http://www.timo-tasi.org/python/timeoutsocket.py
    timeoutsocket.setDefaultSocketTimeout(10)
except ImportError:
    pass

h = u'\u3000\u3000\u4fe1\u606f\u901a\u4fe1\u6280\u672f'
address = urllib2.urlopen('http://stock.eastmoney.com/news/1408,20101022101395594.html').read()
soup = BeautifulSoup(address)
p = soup.findAll('p')
t = p[2].string[:10]
with the following output:

>>> print t
¡¡¡¡ÐÅϢͨ
>>> print h
信息通
>>> t
u'\xa1\xa1\xa1\xa1\xd0\xc5\xcf\xa2\xcd\xa8'
>>> h
u'\u3000\u3000\u4fe1\u606f\u901a'
>>> h.encode('gbk')
'\xa1\xa1\xa1\xa1\xd0\xc5\xcf\xa2\xcd\xa8'
Simply put: when I pass this HTML through BeautifulSoup, it takes the GBK-encoded text and treats it as Unicode, not recognizing that it needs to be decoded first. "h" and "t" should be the same, however, since h is just me taking the text from the HTML file and converting it manually.
How do I solve this problem?
Best,
wheaton
The file's meta tag claims that the character set is GB2312, but the data contains a character from the newer GBK/GB18030 and this is what's tripping BeautifulSoup up:
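The original answer's demonstration is missing here; a minimal sketch of the same failure (Python 3 syntax, using '镕' U+9555, a character that exists in GBK/GB18030 but not in the older GB2312 table, as a stand-in for whatever character the page actually contained):

```python
# '镕' (U+9555) is encodable in GBK but absent from the GB2312 repertoire
data = '信息镕'.encode('gbk')

# decoding with the charset the meta tag claims (gb2312) blows up
try:
    data.decode('gb2312')
    failed = False
except UnicodeDecodeError:
    failed = True
print(failed)  # True: gb2312 cannot map the GBK-only byte pair
```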
At this point, UnicodeDammit bails out, tries chardet, UTF-8 and finally Windows-1252, which always succeeds - this is what you got, by the looks of it.
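Which also means the damage is reversible: for these bytes the Windows-1252 decode was a straight byte-for-byte mapping, so re-encoding t back to bytes and decoding with GBK recovers the text. A sketch in Python 3 syntax (latin-1 is used for the byte round-trip because it maps every code point 0x00-0xFF back to the same byte value):

```python
# what BeautifulSoup handed back: GBK bytes mis-decoded as single-byte text
t = '\xa1\xa1\xa1\xa1\xd0\xc5\xcf\xa2\xcd\xa8'

# turn the code points back into the original bytes, then decode properly
fixed = t.encode('latin-1').decode('gbk')
print(fixed)  # '\u3000\u3000信息通' - matches h
```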
If we tell the decoder to replace unrecognised characters with a '?', we can see the character that's missing in GB2312:
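That step might look like the following sketch (the page bytes are simulated with the GBK-only character again; note that Python's `errors='replace'` handler actually inserts the replacement character U+FFFD rather than a literal '?'):

```python
data = '信息镕'.encode('gbk')

# errors='replace' substitutes U+FFFD for the byte pair GB2312 cannot map
print(data.decode('gb2312', errors='replace'))
```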
Using the correct encoding:
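For instance (a sketch; gb18030 is a superset of both GB2312 and GBK, so it decodes everything the page could legally contain):

```python
data = '信息镕'.encode('gbk')

# gb18030 handles the full repertoire, including the GBK-only character
print(data.decode('gb18030'))  # 信息镕
```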
Also:
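The closing note is missing from this copy; a plausible point, added here as my own hedged sketch: GB18030 is byte-compatible with GB2312 and GBK over their shared repertoire, so always decoding with gb18030 never changes the result for well-formed GB2312 text:

```python
# for characters in the shared repertoire, all three encodings agree byte-for-byte
text = '信息通信技术'
assert text.encode('gb2312') == text.encode('gbk') == text.encode('gb18030')
```

In the asker's script the encoding can be forced when constructing the soup: BeautifulSoup 3 takes a `fromEncoding` argument (bs4 calls it `from_encoding`), e.g. `BeautifulSoup(address, fromEncoding='gb18030')`.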