BeautifulSoup chokes on a jQuery script; are there any known workarounds?

Posted on 2024-10-02 05:29:47

I'm giving BeautifulSoup an HTML document, and simply constructing a BeautifulSoup object instance from the full HTML makes it choke on the following line of a jQuery script that's embedded within the HTML:

        var txt = "Logged in as: <a href=\"http://somedomain.com/the-blah/\">" + uname + "</a> <small>(<a href=\"http://somedomain.com/the-blah/\">The Blah</a> | <a href=\"http://somedomain.com/the-blah/?action=logout\">logout</a>)</small>";

The full stack trace for the error is the following:

    /usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.pyc in __init__(self, *args, **kwargs)
       1497             kwargs['smartQuotesTo'] = self.HTML_ENTITIES
       1498         kwargs['isHTML'] = True
    -> 1499         BeautifulStoneSoup.__init__(self, *args, **kwargs)
       1500
       1501     SELF_CLOSING_TAGS = buildTagMap(None,

    /usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.pyc in __init__(self, markup, parseOnlyThese, fromEncoding, markupMassage, smartQuotesTo, convertEntities, selfClosingTags, isHTML, builder)
       1228         self.markupMassage = markupMassage
       1229         try:
    -> 1230             self._feed(isHTML=isHTML)
       1231         except StopParsing:
       1232             pass

    /usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.pyc in _feed(self, inDocumentEncoding, isHTML)
       1261         self.builder.reset()
       1262
    -> 1263         self.builder.feed(markup)
       1264         # Close out any unfinished strings and close all the open tags.
       1265         self.endData()

    /usr/lib/python2.6/HTMLParser.pyc in feed(self, data)
        106         """
        107         self.rawdata = self.rawdata + data
    --> 108         self.goahead(0)
        109
        110     def close(self):

    /usr/lib/python2.6/HTMLParser.pyc in goahead(self, end)
        146             if startswith('<', i):
        147                 if starttagopen.match(rawdata, i): # < + letter
    --> 148                     k = self.parse_starttag(i)
        149                 elif startswith("</", i):
        150                     k = self.parse_endtag(i)

    /usr/lib/python2.6/HTMLParser.pyc in parse_starttag(self, i)
        227     def parse_starttag(self, i):
        228         self.__starttag_text = None
    --> 229         endpos = self.check_for_whole_start_tag(i)
        230         if endpos < 0:
        231             return endpos

    /usr/lib/python2.6/HTMLParser.pyc in check_for_whole_start_tag(self, i)
        302                 return -1
        303             self.updatepos(i, j)
    --> 304             self.error("malformed start tag")
        305         raise AssertionError("we should not get here!")
        306

    /usr/lib/python2.6/HTMLParser.pyc in error(self, message)
        113
        114     def error(self, message):
    --> 115         raise HTMLParseError(message, self.getpos())
        116
        117     __starttag_text = None

    HTMLParseError: malformed start tag, at line 193, column 110

From what I can glean, it has something to do with the angle brackets being inside quotes, which seems to throw the parser off. What kind of workaround is there, or is there another library that handles these edge cases better? Alternatively, is there a way to tell it to ignore all JavaScript content?

Comments (1)

Posted on 2024-10-09 05:29:48

The easiest way would probably be to delete all the scripts. See the section "Removing elements" in the documentation: http://www.crummy.com/software/BeautifulSoup/documentation.html#Removing%20elements
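
Since the traceback in the question shows the exception being raised inside the BeautifulSoup constructor, the scripts would probably have to be removed from the raw markup before parsing rather than from the parsed tree. Below is a minimal sketch of that idea using only the standard `re` module. `strip_scripts`, `SCRIPT_RE`, and the sample markup are hypothetical names made up for illustration; they are not part of BeautifulSoup.

```python
import re

# Matches a whole <script ...>...</script> element, case-insensitively
# and across newlines. The non-greedy .*? stops at the first closing tag.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>",
                       re.IGNORECASE | re.DOTALL)


def strip_scripts(html):
    """Return html with every <script> element removed (hypothetical helper)."""
    return SCRIPT_RE.sub("", html)


# Sample markup (an assumption) modelled on the line from the question.
sample = (
    '<html><body><p>Hello</p>'
    '<script type="text/javascript">'
    'var txt = "Logged in as: <a href=\\"http://somedomain.com/the-blah/\\">"'
    ' + uname + "</a>";'
    '</script>'
    '<p>World</p></body></html>'
)

cleaned = strip_scripts(sample)
# `cleaned` no longer contains the script and can be handed to the
# parser, e.g. BeautifulSoup(cleaned).
```

A regex is not an HTML parser, so this is a pre-filter rather than a robust solution: it would cut a script short if the JavaScript itself contained the literal text `</script>` inside a string.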
