Why does Beautiful Soup truncate this page?
I am trying to pull a list of resource/database names and IDs from a listing of resources that my school library has subscriptions to. There are pages listing the different resources, and I can use urllib2 to get the pages, but when I pass a page to BeautifulSoup, it truncates its tree just before the end of the entry for the first resource in the list. The problem seems to be in the image link used to add the resource to a search set. This is where things get cut off; here's the HTML:
<a href="http://www2.lib.myschool.edu:7017/V/ACDYFUAMVRFJRN4PV8CIL7RUPC9QXMQT8SFV2DVDSBA5GBJCTT-45899?func=find-db-add-res&resource=XYZ00618&z122_key=000000000&function-in=www_v_find_db_0" onclick='javascript:addToz122("XYZ00618","000000000","myImageXYZ00618","http://discover.lib.myschool.edu:8331/V/ACDYFUAMVRFJRN4PV8CIL7RUPC9QXMQT8SFV2DVDSBA5GBJCTT-45900");return false;'>
<img name="myImageXYZ00618" id="myImageXYZ00618" src="http://www2.lib.myschool.edu:7017/INS01/icon_eng/v-add_favorite.png" title="Add to My Sets" alt="Add to My Sets" border="0">
</a>
And here is my python code:
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("http://discover.lib.myschool.edu:8331/V?func=find-db-1-title&mode=titles&scan_start=latp&scan_utf=D&azlist=Y&restricted=all")
print BeautifulSoup(page).prettify()
In BeautifulSoup's version, the opening <a href...> shows up, but the <img> doesn't, and the <a> is immediately closed, as are the rest of the open tags, all the way to </html>.
The only distinguishing trait I see for these "add to sets" images is that they are the only ones to have name and id attributes. I can't see why that would cause BeautifulSoup to stop parsing immediately, though.
Note: I am almost entirely new to Python, but seem to be understanding it all right.
Thank you for your help!
You can try Beautiful Soup with html5lib rather than the built-in parser.
html5lib is more lenient and often parses pages that the built-in parser truncates. See the docs at http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree
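With BeautifulSoup 4 the parser backend can be chosen by name. A minimal sketch, assuming the bs4 and html5lib packages are installed; the markup is an invented fragment modeled on the question:

```python
from bs4 import BeautifulSoup

# A fragment with a self-closing <img>, modeled on the question's markup.
markup = '<a href="http://example.invalid"><img name="myImage" id="myImage" border="0"/></a>'

# The "html5lib" backend parses the way a browser does and recovers from
# markup that stricter parsers may truncate.
soup = BeautifulSoup(markup, "html5lib")
print(soup.find("img")["id"])  # prints "myImage"
```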
I was using Firefox's "view selection source", which apparently cleans up the HTML for me. When I viewed the original source, this is what I saw.
By putting a space after the border="0" attribute, I can get BS to parse the page.
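That workaround can be scripted before the markup ever reaches the parser. A sketch with a hypothetical helper name, using only the standard library, that inserts a space between a quoted attribute value and a self-closing />:

```python
import re

def space_before_self_close(html):
    # Hypothetical helper: turn e.g. border="0"/> into border="0" />
    # so that an SGML-based parser such as BeautifulSoup 3's does not
    # stumble over the quote glued to the "/>".
    return re.sub(r'"\s*/>', '" />', html)

print(space_before_self_close('<img name="myImageXYZ00618" border="0"/>'))
# prints <img name="myImageXYZ00618" border="0" />
```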
I strongly recommend using html5lib + lxml instead of beautiful soup. It uses a real HTML parser (very similar to the one in Firefox) and lxml provides a very flexible way to query the resulting tree (css-selectors or xpath).
There are tons of bugs and strange behaviors in BeautifulSoup, which makes it not the best solution for a lot of HTML markup you can't trust.
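As a sketch of the lxml side of that combination (assuming the lxml package is installed; element names are taken from the question's markup):

```python
from lxml import html

# Parse a fragment modeled on the question's markup. lxml's HTML parser
# is lenient, like a browser's.
doc = html.fromstring(
    '<a href="http://example.invalid">'
    '<img name="myImageXYZ00618" id="myImageXYZ00618" border="0"/>'
    '</a>'
)

# XPath query: every img element that carries an id attribute.
ids = [img.get("id") for img in doc.xpath("//img[@id]")]
print(ids)  # prints ['myImageXYZ00618']
```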
If I remember correctly, BeautifulSoup uses "name" in its tree as the name of the tag. In this case, "a" would be the "name" of the anchor tag.
That doesn't seem like it should break it though. What version of Python and BS are you using?