Subclassing the BeautifulSoup HTML parser, getting a type error
I wrote a little wrapper around BeautifulSoup's great HTML parser.
Recently I tried to improve the code and make all BeautifulSoup methods available directly in the wrapper class (instead of through a class property), and I thought subclassing the BeautifulSoup parser would be the best way to achieve this.
Here is the class:
class ScrapeInputError(Exception): pass
from BeautifulSoup import BeautifulSoup
class Scrape(BeautifulSoup):
    """base class to be subclassed
    basically a subclassed BeautifulSoup wrapper that provides
    basic url fetching with urllib2
    and the basic html parsing with beautifulsoup
    and some basic cleaning of head, scripts etc"""
    def __init__(self, file):
        self._file = file
        # very basic input validation
        import re
        if not re.search(r"^http://", self._file):
            raise ScrapeInputError("please enter a url that starts with http://")
        import urllib2
        #from BeautifulSoup import BeautifulSoup
        self._page = urllib2.urlopen(self._file)  # fetching the page
        BeautifulSoup.__init__(self, self._page)
        #self._soup = BeautifulSoup(self._page)  # calling the html parser
This way I can just instantiate the class with
x = Scrape("http://someurl.com")
and traverse the tree with x.elem or x.find.
This works wonderfully with some BeautifulSoup methods (see above) but fails with others: those that iterate over the object, like "for e in x:".
The error message:
Traceback (most recent call last):
  File "<pyshell#86>", line 2, in <module>
    print e
  File "C:\Python27\lib\idlelib\rpc.py", line 595, in __call__
    value = self.sockio.remotecall(self.oid, self.name, args, kwargs)
  File "C:\Python27\lib\idlelib\rpc.py", line 210, in remotecall
    seq = self.asynccall(oid, methodname, args, kwargs)
  File "C:\Python27\lib\idlelib\rpc.py", line 225, in asynccall
    self.putmessage((seq, request))
  File "C:\Python27\lib\idlelib\rpc.py", line 324, in putmessage
    s = pickle.dumps(message)
  File "C:\Python27\lib\copy_reg.py", line 77, in _reduce_ex
    raise TypeError("a class that defines __slots__ without "
TypeError: a class that defines __slots__ without defining __getstate__ cannot be pickled
I researched the error message but couldn't find anything I could work with, because I don't want to play with the inner implementation of BeautifulSoup (and honestly I don't know or understand __slots__ or __getstate__). I just want to use the functionality.
Instead of subclassing, I tried returning a BeautifulSoup object from the __init__ of the class, but the __init__ method returns None.
I'd be glad for any help here.
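For readers unfamiliar with the message at the bottom of the traceback, here is a minimal sketch that reproduces it without BeautifulSoup or IDLE. It assumes a hypothetical class named Slotted standing in for whatever object IDLE tried to send to the shell window, and it asks for pickle protocol 0 explicitly, since that is the default behind the plain pickle.dumps(message) call shown in the Python 2.7 traceback.

```python
import pickle


class Slotted(object):
    """Hypothetical stand-in: defines __slots__ but no __getstate__."""
    __slots__ = ("text",)

    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text


e = Slotted("hello")

try:
    # Protocol 0 is the default used by pickle.dumps(message) on Python 2.
    pickle.dumps(e, 0)
    failed = False
except TypeError as exc:
    failed = True
    print("cannot pickle the object itself: %s" % exc)

# A plain string has no such restriction, so it pickles without trouble.
payload = pickle.dumps(str(e), 0)
print(pickle.loads(payload))
```

The sketch only illustrates the pickling step; it does not depend on whether the object came from a subclass of BeautifulSoup or from BeautifulSoup itself.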
Comments (1)
The error is not happening in BeautifulSoup code. Rather, your IDLE is not able to retrieve and print the object. Try print str(e) instead.
Anyway, subclassing BeautifulSoup in your situation may not be the best idea. Do you really want to inherit all of the parsing methods (like convert_charref, handle_pi or error)? Worse, if you override something that BeautifulSoup uses, it may break in a hard-to-find way.
I don't know your situation, but I suggest preferring composition over inheritance (i.e. having a BeautifulSoup object in an attribute). You can easily (if in a slightly hacky way) expose only the specific methods you need on the wrapper.
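One way the composition approach could look, as a minimal sketch under stated assumptions. TinySoup is a hypothetical stand-in that splits text into words so the example runs without BeautifulSoup installed, the soup_factory parameter is invented for illustration, and URL fetching is left out. The idea is only that the wrapper keeps the soup in an attribute and re-binds the few methods it wants to expose.

```python
class TinySoup(object):
    """Hypothetical stand-in for a parsed soup (words instead of tags)."""

    def __init__(self, markup):
        self._items = markup.split()

    def find(self, name):
        # Return the first matching item, or None if there is no match.
        for item in self._items:
            if item == name:
                return item
        return None

    def findAll(self, name):
        return [item for item in self._items if item == name]

    def __iter__(self):
        return iter(self._items)


class Scrape(object):
    """Wrapper that holds a soup object instead of inheriting from it."""

    def __init__(self, markup, soup_factory=TinySoup):
        self._soup = soup_factory(markup)
        # Expose only the chosen methods by re-binding them on the wrapper.
        self.find = self._soup.find
        self.findAll = self._soup.findAll

    def __iter__(self):
        # Special methods are looked up on the class, so iteration needs
        # a real method here rather than an instance attribute.
        return iter(self._soup)


x = Scrape("a b a c")
print(x.find("b"))
print(x.findAll("a"))
print(list(x))
```

With the real library, the asker's own BeautifulSoup(self._page) call would take the place of soup_factory(markup); the parsing internals such as convert_charref then stay on the inner object and cannot be overridden by accident.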