Escaping &hellip; with BeautifulSoup
I am currently using BeautifulSoup to scrape some websites, but I have a problem with some specific characters. The code inside UnicodeDammit seems to indicate that these are (again) Microsoft-invented ones.
I'm using the newest version of BeautifulSoup that I can (3.0.8.1), as I am still on Python 2.5.
The following code illustrates my problem:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup('...Baby One More Time (Digital Deluxe Version&hellip;')
print soup
'...Baby One More Time (Digital Deluxe Version&hellip;'
As you can see, the problem is the &hellip; entity ('…') at the end, which your browser probably rendered correctly. Obviously the raw entity is not what I am interested in.
It would be nice to get this character's Unicode representation or something similar.
Even simply ignoring it would solve my particular problem.
How can I do this with BeautifulSoup?
Comments (2)
Found the solution myself:
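The snippet that followed this sentence is not preserved here. BeautifulSoup 3 does accept a convertEntities argument (for example convertEntities=BeautifulSoup.HTML_ENTITIES) that turns named entities into Unicode characters. As a parser-independent sketch, assuming you are free to pre-process the string before handing it to BeautifulSoup, the entity can also be decoded with the standard library alone. decode_named_entities is a hypothetical helper name used for illustration, not part of BeautifulSoup:

```python
import re

try:
    from htmlentitydefs import name2codepoint  # Python 2 (the asker's setup)
except ImportError:
    from html.entities import name2codepoint  # same table under its Python 3 name

try:
    unichr  # exists on Python 2
except NameError:
    unichr = chr  # Python 3: chr already returns a Unicode character

# Matches named entities such as &hellip; (numeric entities are not handled here).
_NAMED_ENTITY = re.compile(r'&(\w+);')


def decode_named_entities(text):
    """Replace known named entities with the Unicode character they stand for."""
    def _substitute(match):
        name = match.group(1)
        if name in name2codepoint:
            return unichr(name2codepoint[name])
        return match.group(0)  # leave unknown entities untouched
    return _NAMED_ENTITY.sub(_substitute, text)


title = decode_named_entities(
    u'...Baby One More Time (Digital Deluxe Version&hellip;')
```

After this, title ends with the single character U+2026 (horizontal ellipsis) instead of the literal entity, and it can be passed to BeautifulSoup or stripped as needed.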
MS may have invented it, but &hellip; is part of HTML 4: http://www.w3.org/TR/REC-html40/sgml/entities.html
Perhaps your Lib/htmlentitydefs.py is missing or out-of-date, as that's what BeautifulSoup uses to convert entities. If you look at the Python 2.5 source tree, you will clearly see it defined on line 126.
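A quick way to check the suggestion above is to ask the entity table directly whether it knows hellip. This is only a diagnostic sketch; the variable names are illustrative, and the import fallback is an assumption added so the same lines also run on Python 3, where the module is called html.entities:

```python
try:
    from htmlentitydefs import name2codepoint  # Python 2, i.e. Lib/htmlentitydefs.py
except ImportError:
    from html.entities import name2codepoint  # Python 3 equivalent

# None here would mean the installed table lacks the entity,
# which would point to a missing or out-of-date module.
codepoint = name2codepoint.get('hellip')
is_defined = codepoint is not None
print(codepoint)
```

If the table is intact, codepoint is 8230 (0x2026, the horizontal ellipsis), so the entity itself is known to Python and the remaining question is only whether BeautifulSoup was asked to convert it.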