BeautifulSoup web scraping problem

Posted 2024-12-02 05:06:41


I'm using BeautifulSoup and mechanize to find some content on a web page. The thing is that SOMETIMES the string I'm looking for is not found, and I don't know what the problem could be.

It worked fine for many web pages over many months, but then it suddenly stopped working. When that happens I have to restart my program and hope for the right result.

The problem is in data.find(text=re.compile('string to find')). The web page always downloads, but sometimes the string is not found. Could the problem be malformed HTML in the loaded page? Any ideas?

Here is a very small part of the code I'm working with, just for your review:

from BeautifulSoup import BeautifulSoup as soup   # BeautifulSoup 3 (Python 2)
from mechanize import Browser
import sys, re

def get_page(url):
    mech = Browser()
    page = mech.open(url)
    return page.read()

def test():
    data = soup(get_page('some url'))

    # Returns the first NavigableString matching the regex, or None.
    div_pages = data.find(text=re.compile('string to find'))

    try:
        pager = div_pages.strip().split(' ')
    except AttributeError:   # div_pages is None when the string was not found
        print div_pages
        sys.exit()

    print 'ok'

if __name__ == '__main__':
    test()
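Worth noting about the snippet above: find() returns None when the regex matches nothing, so div_pages.strip() raises AttributeError; that is the case the try/except is catching. A minimal sketch of this failure mode, using plain strings and re only (no BeautifulSoup; all names here are hypothetical):

```python
import re

def first_match(text, pattern):
    """Return the first substring matching pattern, or None (like soup.find)."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

hit = first_match("pages 1 2 3", r"pages \d")
assert hit == "pages 1"        # found: a plain string, .strip() works

miss = first_match("nothing here", r"pages \d")
assert miss is None            # not found: None, not an empty string
try:
    miss.strip()               # same error div_pages.strip() raises
except AttributeError:
    pass                       # the except branch in test() handles this case
```

Checking "if div_pages is None" before calling .strip() makes the failure explicit instead of relying on the exception.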


Comments (1)

漆黑的白昼 2024-12-09 05:06:41


I helped someone out with a similar problem and it turned out to be the encoding of the string. This link (from their docs) might help you on the way - Beautiful Soup Gives You Unicode, Dammit
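The encoding issue the answer describes can be reproduced with plain strings. A small sketch, assuming a hypothetical page served as UTF-8 that the parser guesses wrong (BeautifulSoup's Unicode, Dammit heuristics can misfire when the page omits or lies about its charset):

```python
import re

# Hypothetical page text; the server sends UTF-8 bytes.
raw = "Página 2 de 10".encode("utf-8")

good = raw.decode("utf-8")     # correct codec: "Página 2 de 10"
bad = raw.decode("latin-1")    # wrong guess: "PÃ¡gina 2 de 10"

pattern = re.compile("Página")
assert pattern.search(good) is not None   # string found
assert pattern.search(bad) is None        # the "sometimes not found" symptom
```

When the search string contains non-ASCII characters, a wrong encoding guess silently turns a match into a miss even though the page downloaded fine.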
