Why does Beautiful Soup truncate this page?

Posted 2024-07-15 17:29:28


I am trying to pull a list of resource/database names and IDs from a listing of resources that my school library has subscriptions to. There are pages listing the different resources, and I can use urllib2 to get the pages, but when I pass a page to BeautifulSoup, it truncates its tree just before the end of the entry for the first resource in the list. The problem seems to be in the image link used to add the resource to a search set. This is where things get cut off; here's the HTML:

<a href="http://www2.lib.myschool.edu:7017/V/ACDYFUAMVRFJRN4PV8CIL7RUPC9QXMQT8SFV2DVDSBA5GBJCTT-45899?func=find-db-add-res&resource=XYZ00618&z122_key=000000000&function-in=www_v_find_db_0" onclick='javascript:addToz122("XYZ00618","000000000","myImageXYZ00618","http://discover.lib.myschool.edu:8331/V/ACDYFUAMVRFJRN4PV8CIL7RUPC9QXMQT8SFV2DVDSBA5GBJCTT-45900");return false;'>
    <img name="myImageXYZ00618" id="myImageXYZ00618" src="http://www2.lib.myschool.edu:7017/INS01/icon_eng/v-add_favorite.png" title="Add to My Sets" alt="Add to My Sets" border="0">
</a>

And here is my python code:

import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("http://discover.lib.myschool.edu:8331/V?func=find-db-1-title&mode=titles&scan_start=latp&scan_utf=D&azlist=Y&restricted=all")
print BeautifulSoup(page).prettify()

In BeautifulSoup's version, the opening <a href...> shows up, but the <img> doesn't, and the <a> is immediately closed, as are the rest of the open tags, all the way to </html>.

The only distinguishing trait I see for these "add to sets" images is that they are the only ones to have name and id attributes. I can't see why that would cause BeautifulSoup to stop parsing immediately, though.

Note: I am almost entirely new to Python, but seem to be understanding it all right.

Thank you for your help!


4 Answers

§对你不离不弃 2024-07-22 17:29:28


You can try beautiful soup with html5lib rather than the built-in parser.

BeautifulSoup(markup, "html5lib")

html5lib is more lenient and often parses pages that the built-in parser truncates. See the docs at http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree
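
A minimal sketch of that suggestion, assuming bs4 and html5lib are installed alongside the question's Python 2 / urllib2 setup; the URL and the name/id filtering come straight from the question:

import urllib2
from bs4 import BeautifulSoup

# html5lib recovers from the malformed attribute markup instead of truncating the tree
page = urllib2.urlopen("http://discover.lib.myschool.edu:8331/V?func=find-db-1-title&mode=titles&scan_start=latp&scan_utf=D&azlist=Y&restricted=all")
soup = BeautifulSoup(page, "html5lib")

# The "add to sets" images are the only <img> tags with name and id attributes,
# so they are a convenient hook for pulling out each resource ID
for img in soup.find_all("img", attrs={"name": True, "id": True}):
    print img["id"]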

俯瞰星空 2024-07-22 17:29:28


I was using Firefox's "view selection source", which apparently cleans up the HTML for me. When I viewed the original source, this is what I saw:

<img name="myImageXYZ00618" id="myImageXYZ00618" src='http://www2.lib.myschool.edu:7017/INS01/icon_eng/v-add_favorite.png' alt='Add to My Sets' title='Add to My Sets' border="0"title="Add to clipboard PAIS International (CSA)" alt="Add to clipboard PAIS International (CSA)">

By putting a space after the border="0" attribute, I can get BS to parse the page.
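
One way to apply that fix programmatically, as a sketch: re-insert the missing space before handing the markup to BeautifulSoup. It assumes the missing space between border="0" and title="..." is the only breakage, as in the snippet above.

import urllib2
from BeautifulSoup import BeautifulSoup

raw = urllib2.urlopen("http://discover.lib.myschool.edu:8331/V?func=find-db-1-title&mode=titles&scan_start=latp&scan_utf=D&azlist=Y&restricted=all").read()

# The raw markup runs border="0" straight into the next title="..." attribute;
# putting the space back lets the old parser get past it
fixed = raw.replace('"title=', '" title=')

print BeautifulSoup(fixed).prettify()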

洋洋洒洒 2024-07-22 17:29:28


I strongly recommend using html5lib + lxml instead of beautiful soup. It uses a real HTML parser (very similar to the one in Firefox) and lxml provides a very flexible way to query the resulting tree (css-selectors or xpath).

There are tons of bugs and strange behaviors in BeautifulSoup that make it a poor choice for a lot of HTML markup you can't trust.
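
A sketch of that approach, assuming html5lib and lxml are installed; it fetches the same URL as the question and pulls the resource IDs with an XPath query. Note that html5lib's lxml tree builder puts elements in the XHTML namespace, so the query needs a prefix mapping.

import urllib2
import html5lib

doc = html5lib.parse(urllib2.urlopen("http://discover.lib.myschool.edu:8331/V?func=find-db-1-title&mode=titles&scan_start=latp&scan_utf=D&azlist=Y&restricted=all").read(),
                     treebuilder="lxml")

# Elements come back in the XHTML namespace, hence the "h:" prefix
ns = {"h": "http://www.w3.org/1999/xhtml"}
for img in doc.xpath("//h:img[@name and @id]", namespaces=ns):
    print img.get("id")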

ゝ杯具 2024-07-22 17:29:28


If I remember correctly, BeautifulSoup uses "name" in its tree as the name of the tag. In this case "a" would be the "name" of the anchor tag.

That doesn't seem like it should break it though. What version of Python and BS are you using?
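
A small check of that point, using the BeautifulSoup 3 import from the question: a tag's .name (the tag name) and its HTML name attribute live in different places, so the attribute by itself should not confuse the parser.

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup('<img name="myImageXYZ00618" id="myImageXYZ00618">')
img = soup.img

print img.name     # img              (the tag's own name)
print img["name"]  # myImageXYZ00618  (the HTML "name" attribute)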
