BeautifulSoup chokes on a jQuery script: any known workarounds?
I'm giving BeautifulSoup an HTML document, and simply constructing a BeautifulSoup object from the full HTML makes it choke on the following line of a jQuery script embedded in the page:
var txt = "Logged in as: <a href=\"http://somedomain.com/the-blah/\">" + uname + "</a> <small>(<a href=\"http://somedomain.com/the-blah/\">The Blah</a> | <a href=\"http://somedomain.com/the-blah/?action=logout\">logout</a>)</small>";
The full stack trace for the error is the following:
/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.pyc in __init__(self, *args, **kwargs)
1497 kwargs['smartQuotesTo'] = self.HTML_ENTITIES
1498 kwargs['isHTML'] = True
-> 1499 BeautifulStoneSoup.__init__(self, *args, **kwargs)
1500
1501 SELF_CLOSING_TAGS = buildTagMap(None,
/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.pyc in __init__(self, markup, parseOnlyThese, fromEncoding, markupMassage, smartQuotesTo, convertEntities, selfClosingTags, isHTML, builder)
1228 self.markupMassage = markupMassage
1229 try:
-> 1230 self._feed(isHTML=isHTML)
1231 except StopParsing:
1232 pass
/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.pyc in _feed(self, inDocumentEncoding, isHTML)
1261 self.builder.reset()
1262
-> 1263 self.builder.feed(markup)
1264 # Close out any unfinished strings and close all the open tags.
1265 self.endData()
/usr/lib/python2.6/HTMLParser.pyc in feed(self, data)
106 """
107 self.rawdata = self.rawdata + data
--> 108 self.goahead(0)
109
110 def close(self):
/usr/lib/python2.6/HTMLParser.pyc in goahead(self, end)
146 if startswith('<', i):
147 if starttagopen.match(rawdata, i): # < + letter
--> 148 k = self.parse_starttag(i)
149 elif startswith("</", i):
150 k = self.parse_endtag(i)
/usr/lib/python2.6/HTMLParser.pyc in parse_starttag(self, i)
227 def parse_starttag(self, i):
228 self.__starttag_text = None
--> 229 endpos = self.check_for_whole_start_tag(i)
230 if endpos < 0:
231 return endpos
/usr/lib/python2.6/HTMLParser.pyc in check_for_whole_start_tag(self, i)
302 return -1
303 self.updatepos(i, j)
--> 304 self.error("malformed start tag")
305 raise AssertionError("we should not get here!")
306
/usr/lib/python2.6/HTMLParser.pyc in error(self, message)
113
114 def error(self, message):
--> 115 raise HTMLParseError(message, self.getpos())
116
117 __starttag_text = None
HTMLParseError: malformed start tag, at line 193, column 110
From what I can glean, the parser is thrown off by the angle brackets inside the quoted JavaScript string. Is there a workaround, or another library that handles these edge cases better? Alternatively, is there a way to tell it to ignore all JavaScript content?
Comments (1)
The easiest way would probably be to delete all the scripts. See the section "Removing elements" in the documentation: http://www.crummy.com/software/BeautifulSoup/documentation.html#Removing%20elements
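Since the error in the question is raised while the BeautifulSoup object is being constructed, one possible variant of this idea is to strip the script elements from the raw markup before it ever reaches the parser. The following is only a minimal sketch under that assumption: the helper name `strip_scripts`, the pattern `SCRIPT_RE`, and the sample markup are hypothetical, and a regular expression is a heuristic rather than a real HTML parser, so unusual markup may slip through.

```python
import re

# Hypothetical pattern: matches a whole <script ...>...</script> element,
# case-insensitively and across line breaks. This is a heuristic, not a parser.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)


def strip_scripts(html):
    """Return the markup with every <script> element removed."""
    return SCRIPT_RE.sub("", html)


# Made-up sample page containing a shortened version of the line from the question.
sample = (
    '<html><head><script type="text/javascript">\n'
    'var txt = "Logged in as: <a href=\\"http://somedomain.com/the-blah/\\">" + uname + "</a>";\n'
    '</script></head><body><p>Hello</p></body></html>'
)

cleaned = strip_scripts(sample)
print(cleaned)
```

The cleaned string could then be passed to the BeautifulSoup constructor in place of the original markup. If the document does parse, removing the script tags from the parsed tree as described in the linked documentation section is likely the more robust route.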