Escaping … with BeautifulSoup

Posted on 2024-09-07 23:13:51

I am currently using BeautifulSoup to scrape some websites, but I have a problem with some specific characters. The code inside UnicodeDammit seems to indicate that these are (again) Microsoft-invented ones.

I'm using the newest version of BeautifulSoup (3.0.8.1), as I am still on Python 2.5.

The following code illustrates my problem:

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup('...Baby One More Time (Digital Deluxe Version&hellip;')
print soup

'...Baby One More Time (Digital Deluxe Version&hellip;'

As you can see, the problem is the '…' (&hellip;) character at the end (which your browser probably rendered correctly). Obviously that's not what I am interested in.

It would be nice to have this character's Unicode representation or something similar.
Even simply ignoring it would solve my particular problem.

How can I do this with BeautifulSoup?

说不完的你爱 2024-09-14 23:13:51

Found the solution myself:

soup = BeautifulSoup('...Baby One More Time (Digital Deluxe Version&hellip;', convertEntities="html")
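
The convertEntities argument above is what the answer used with BeautifulSoup 3. For readers on a current Python 3 rather than Python 2.5, the same kind of entity conversion can be sketched with the standard library alone. This is an assumption about the reader's environment, not what the answer itself did: `html.unescape` (Python 3.4+) is swapped in for BeautifulSoup's conversion, and the variable names `raw` and `text` are made up for illustration.

```python
# Sketch only: html.unescape is part of the Python 3.4+ standard library,
# not part of BeautifulSoup 3. It is used here as a stand-in for the
# entity conversion that convertEntities="html" asks BeautifulSoup to do.
from html import unescape

raw = '...Baby One More Time (Digital Deluxe Version&hellip;'

# Named entities such as &hellip; are replaced by the Unicode
# characters they stand for; all other text is left untouched.
text = unescape(raw)

print(text)
print(hex(ord(text[-1])))  # code point of the final character
```

If the goal is simply to ignore the character, as the question mentions, it can be dropped afterwards with `text.replace(u'\u2026', '')`.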

另类 2024-09-14 23:13:51

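MS may have invented it, but &hellip; is part of HTML 4: http://www.w3.org/TR/REC-html40/sgml/entities.html

Maybe your Lib/htmlentitydefs.py is missing or outdated, since that is what BeautifulSoup uses to convert entities.

If you look at the Python 2.5 source tree, you will clearly see it defined on line 126.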
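
A quick way to check whether the local entity table knows about hellip is to look it up directly. The sketch below assumes the `name2codepoint` dictionary, which lives in `htmlentitydefs` on Python 2 and in `html.entities` on Python 3; the variable name `codepoint` is only illustrative.

```python
# Look up the hellip entity in Python's own entity table.
try:
    from htmlentitydefs import name2codepoint    # Python 2 module named in the answer
except ImportError:
    from html.entities import name2codepoint     # Python 3 location of the same table

# A KeyError here would point to a missing or outdated entity table.
codepoint = name2codepoint['hellip']

print(hex(codepoint))
```

If the lookup succeeds, the entity table itself is fine, and the unconverted output is more likely down to how BeautifulSoup was called.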

About the author: 爱*していゐ
