Subclassing the BeautifulSoup HTML parser raises a TypeError

Posted on 2024-12-08 09:48:27

I wrote a small wrapper around BeautifulSoup's great HTML parser.

Recently I tried to improve the code and make all BeautifulSoup methods available directly on the wrapper class (instead of through a class property), and I thought subclassing the BeautifulSoup parser would be the best way to achieve this.

Here is the class:

class ScrapeInputError(Exception):pass
from BeautifulSoup import BeautifulSoup

class Scrape(BeautifulSoup):
    """base class to be subclassed
    basically a subclassed BeautifulSoup wrapper that provides
    basic url fetching with urllib2
    and the basic html parsing with beautifulsoup
    and some basic cleaning of head,scripts etc'"""

    def __init__(self,file):
        self._file = file
        #very basic input validation
        import re
        if not re.search(r"^http://",self._file):
            raise ScrapeInputError,"please enter a url that starts with http://"

        import urllib2
        #from BeautifulSoup import BeautifulSoup
        self._page = urllib2.urlopen(self._file) #fetching the page
        BeautifulSoup.__init__(self,self._page)
        #self._soup = BeautifulSoup(self._page) #calling the html parser

This way I can simply instantiate the class with

x = Scrape("http://someurl.com")

and then traverse the tree with x.elem or x.find.

This works wonderfully with some BeautifulSoup methods (see above) but fails with others, namely those that rely on iteration, such as "for e in x:".

The error message:

 Traceback (most recent call last):
  File "<pyshell#86>", line 2, in <module>
    print e
  File "C:\Python27\lib\idlelib\rpc.py", line 595, in __call__
    value = self.sockio.remotecall(self.oid, self.name, args, kwargs)
  File "C:\Python27\lib\idlelib\rpc.py", line 210, in remotecall
    seq = self.asynccall(oid, methodname, args, kwargs)
  File "C:\Python27\lib\idlelib\rpc.py", line 225, in asynccall
    self.putmessage((seq, request))
  File "C:\Python27\lib\idlelib\rpc.py", line 324, in putmessage
    s = pickle.dumps(message)
  File "C:\Python27\lib\copy_reg.py", line 77, in _reduce_ex
    raise TypeError("a class that defines __slots__ without "
TypeError: a class that defines __slots__ without defining __getstate__ cannot be pickled

I researched the error message but couldn't find anything I could work with, because I don't want to tinker with the inner implementation of BeautifulSoup (and honestly I don't know or understand __slots__ or __getstate__). I just want to use the functionality.

Instead of subclassing, I also tried returning a BeautifulSoup object from the class's __init__, but __init__ always returns None.

I'd be glad of any help here.


Comments (1)

羞稚 2024-12-15 09:48:27

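The error is not happening in BeautifulSoup code. As the traceback shows, it is raised in IDLE's rpc.py while IDLE pickles the object to send it for printing, so it is your IDLE that cannot retrieve and print the object. Try print str(e) instead.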
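To illustrate the point about pickling, here is a minimal sketch in Python 3 (the question uses Python 2). SlottedNode and try_pickle are hypothetical names made up for this example, not part of BeautifulSoup or IDLE. The sketch assumes the older pickle protocol 0, which is what Python 2's pickle.dumps uses by default; under it, an object whose class declares __slots__ without __getstate__ is rejected, while its plain string form is not.

```python
import pickle


class SlottedNode(object):
    """Hypothetical stand-in for a parser object that declares __slots__."""
    __slots__ = ("text",)

    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text


def try_pickle(obj):
    """Return None if obj pickles with protocol 0, else the error message."""
    try:
        pickle.dumps(obj, protocol=0)
    except TypeError as exc:
        return str(exc)
    return None


node = SlottedNode("<p>hello</p>")
print(try_pickle(node))       # an error message: the slotted object is rejected
print(try_pickle(str(node)))  # None: a plain string pickles without trouble
```

This is why converting to a plain string before printing is likely to sidestep the problem: only the string, not the parser object, has to cross IDLE's RPC boundary.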


Anyway, subclassing BeautifulSoup in your situation may not be the best idea. Do you really want to inherit all of the parsing methods (like convert_charref, handle_pi or error)? Worse, if you override something that BeautifulSoup uses, it may break in a hard-to-find way.

I don't know your situation, but I suggest preferring composition over inheritance (i.e. having a BeautifulSoup object in an attribute). You can easily (if in a slightly hacky way) expose specific methods like this:

class Scrape(object):
    def __init__(self, ...):
        self.soup = ...
        ...
        self.find = self.soup.find
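If exposing methods one by one becomes tedious, a common alternative is to forward unknown attributes to the wrapped object with __getattr__. The following is only a sketch of that idea in Python 3: StubSoup is a hypothetical stand-in so the example runs without BeautifulSoup installed, and in real use you would pass a BeautifulSoup instance instead. Note that iteration ("for e in x") has to be forwarded explicitly, because special methods such as __iter__ are looked up on the class and bypass __getattr__.

```python
class StubSoup(object):
    """Hypothetical stand-in for a parsed BeautifulSoup document."""

    def __init__(self, items):
        self._items = list(items)

    def find(self, name):
        # Return the first item equal to name, or None if absent.
        for item in self._items:
            if item == name:
                return item
        return None

    def __iter__(self):
        return iter(self._items)


class Scrape(object):
    """Wrapper that owns a parsed document and delegates to it."""

    def __init__(self, soup):
        # soup is any parsed-document object, e.g. a BeautifulSoup instance.
        self._soup = soup

    def __getattr__(self, name):
        # Only called when normal lookup on Scrape fails.
        # Refuse private names to avoid infinite recursion if _soup is unset.
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(self._soup, name)

    def __iter__(self):
        # Special methods bypass __getattr__, so forward iteration by hand.
        return iter(self._soup)


x = Scrape(StubSoup(["html", "head", "body"]))
print(x.find("body"))
for e in x:
    print(e)
```

Compared with subclassing, the wrapper's own attributes (such as the URL or the fetched page) live in a separate object and cannot collide with names the parser uses internally.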