BeautifulSoup 对 jQuery 脚本感到窒息，有什么已知的解决方法吗？

发布于 2024-10-02 05:29:47 字数 3043 浏览 5 评论 0原文

我给 BeautifulSoup 一个 html 文档，只需用完整的 html 构造一个 BeautifulSoup 对象实例，它似乎就会被嵌入 html 中的 jQuery 脚本的以下行阻塞：

        var txt = "Logged in as: <a href=\"http://somedomain.com/the-blah/\">" + uname + "</a> <small>(<a href=\"http://somedomain.com/the-blah/\">The Blah</a> | <a href=\"http://somedomain.com/the-blah/?action=logout\">logout</a>)</small>";

错误的完整堆栈跟踪如下：

    /usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.pyc in __init__(self, *args, **kwargs)
   1497             kwargs['smartQuotesTo'] = self.HTML_ENTITIES
   1498         kwargs['isHTML'] = True
-> 1499         BeautifulStoneSoup.__init__(self, *args, **kwargs)
   1500 
   1501     SELF_CLOSING_TAGS = buildTagMap(None,

/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.pyc in __init__(self, markup, parseOnlyThese, fromEncoding, markupMassage, smartQuotesTo, convertEntities, selfClosingTags, isHTML, builder)
   1228         self.markupMassage = markupMassage
   1229         try:
-> 1230             self._feed(isHTML=isHTML)
   1231         except StopParsing:
   1232             pass

/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.pyc in _feed(self, inDocumentEncoding, isHTML)
   1261         self.builder.reset()
   1262 
-> 1263         self.builder.feed(markup)
   1264         # Close out any unfinished strings and close all the open tags.

   1265         self.endData()

/usr/lib/python2.6/HTMLParser.pyc in feed(self, data)
    106         """
    107         self.rawdata = self.rawdata + data
--> 108         self.goahead(0)
    109 
    110     def close(self):

/usr/lib/python2.6/HTMLParser.pyc in goahead(self, end)
    146             if startswith('<', i):
    147                 if starttagopen.match(rawdata, i): # < + letter
--> 148                     k = self.parse_starttag(i)
    149                 elif startswith("</", i):
    150                     k = self.parse_endtag(i)

/usr/lib/python2.6/HTMLParser.pyc in parse_starttag(self, i)
    227     def parse_starttag(self, i):
    228         self.__starttag_text = None
--> 229         endpos = self.check_for_whole_start_tag(i)
    230         if endpos < 0:
    231             return endpos

/usr/lib/python2.6/HTMLParser.pyc in check_for_whole_start_tag(self, i)
    302                 return -1
    303             self.updatepos(i, j)
--> 304             self.error("malformed start tag")
    305         raise AssertionError("we should not get here!")
    306 

/usr/lib/python2.6/HTMLParser.pyc in error(self, message)
    113 
    114     def error(self, message):
--> 115         raise HTMLParseError(message, self.getpos())
    116 
    117     __starttag_text = None

HTMLParseError: malformed start tag, at line 193, column 110

据我所知，它与引号内的尖括号有关，这似乎被抛弃了。有什么样的解决方法，或者是否有另一个库可以更好地处理这些边缘情况？或者，有没有办法告诉它忽略所有 javascript 内容？

原文

I'm giving BeautifulSoup an html document and simply by constructing a BeautifulSoup object instance with the full html, it seems to choke on the following line of a jQuery script that's embedded within the html:

        var txt = "Logged in as: <a href=\"http://somedomain.com/the-blah/\">" + uname + "</a> <small>(<a href=\"http://somedomain.com/the-blah/\">The Blah</a> | <a href=\"http://somedomain.com/the-blah/?action=logout\">logout</a>)</small>";

The full stack trace for the error is the following:

    /usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.pyc in __init__(self, *args, **kwargs)
   1497             kwargs['smartQuotesTo'] = self.HTML_ENTITIES
   1498         kwargs['isHTML'] = True
-> 1499         BeautifulStoneSoup.__init__(self, *args, **kwargs)
   1500 
   1501     SELF_CLOSING_TAGS = buildTagMap(None,

/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.pyc in __init__(self, markup, parseOnlyThese, fromEncoding, markupMassage, smartQuotesTo, convertEntities, selfClosingTags, isHTML, builder)
   1228         self.markupMassage = markupMassage
   1229         try:
-> 1230             self._feed(isHTML=isHTML)
   1231         except StopParsing:
   1232             pass

/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.pyc in _feed(self, inDocumentEncoding, isHTML)
   1261         self.builder.reset()
   1262 
-> 1263         self.builder.feed(markup)
   1264         # Close out any unfinished strings and close all the open tags.

   1265         self.endData()

/usr/lib/python2.6/HTMLParser.pyc in feed(self, data)
    106         """
    107         self.rawdata = self.rawdata + data
--> 108         self.goahead(0)
    109 
    110     def close(self):

/usr/lib/python2.6/HTMLParser.pyc in goahead(self, end)
    146             if startswith('<', i):
    147                 if starttagopen.match(rawdata, i): # < + letter
--> 148                     k = self.parse_starttag(i)
    149                 elif startswith("</", i):
    150                     k = self.parse_endtag(i)

/usr/lib/python2.6/HTMLParser.pyc in parse_starttag(self, i)
    227     def parse_starttag(self, i):
    228         self.__starttag_text = None
--> 229         endpos = self.check_for_whole_start_tag(i)
    230         if endpos < 0:
    231             return endpos

/usr/lib/python2.6/HTMLParser.pyc in check_for_whole_start_tag(self, i)
    302                 return -1
    303             self.updatepos(i, j)
--> 304             self.error("malformed start tag")
    305         raise AssertionError("we should not get here!")
    306 

/usr/lib/python2.6/HTMLParser.pyc in error(self, message)
    113 
    114     def error(self, message):
--> 115         raise HTMLParseError(message, self.getpos())
    116 
    117     __starttag_text = None

HTMLParseError: malformed start tag, at line 193, column 110

From what I can glean it has something to do with the angle brackets being within quotes, it seems to be thrown off by this. What kind of work around is there, or is there another library that handles these edge cases better? Or alternatively, is there a way to tell it to ignore all javascript content?

分享到QQ

分享到微博