BeautifulSoup web scraping problem
I'm using BeautifulSoup and mechanize to find some content on a web page. The problem is that SOMETIMES the string I'm looking for is not found, and I don't know what could be wrong.
It worked fine for many months across many web pages, but suddenly it stopped working. Then I have to restart my program and hope for the right result.
The problem is in data.find(text=re.compile('string to find')). The web page always downloads, but sometimes the string is not found. Could the problem be malformed HTML in the loaded page? Any ideas?
Here is a very small part of the code I'm working with, just for your review:
from BeautifulSoup import BeautifulSoup as soup
from mechanize import Browser
import sys, re

def get_page(url):
    mech = Browser()
    page = mech.open(url)
    return page.read()

def test():
    data = soup(get_page('some url'))
    div_pages = data.find(text=re.compile('string to find'))
    try:
        pager = div_pages.strip().split(' ')
    except:
        print div_pages
        sys.exit()
    print 'ok'

if __name__ == '__main__':
    test()
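One likely failure mode here: BeautifulSoup's find(text=...) returns None when no text node matches, so div_pages.strip() raises AttributeError, and the bare except hides what actually happened. A minimal standalone sketch of the same contract, using re.search in place of the soup call so it runs without BeautifulSoup or the real page (the HTML strings are illustrative only):

```python
import re

def extract_pager(html, pattern='string to find'):
    # re.search, like soup.find(text=...), returns None when there is
    # no match, so test for None before calling string methods on it.
    match = re.search(pattern, html)
    if match is None:
        return None  # page downloaded fine; the text just is not there
    return match.group(0).strip().split(' ')

print(extract_pager('<div>string to find here</div>'))  # ['string', 'to', 'find']
print(extract_pager('<div>nothing relevant</div>'))     # None
```

Checking for None explicitly (instead of catching everything) at least distinguishes "the text is missing from the parsed page" from an unrelated error.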
I helped someone out with a similar problem and it turned out to be the encoding of the string. This link (from their docs) might help you on the way - Beautiful Soup Gives You Unicode, Dammit
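The linked doc describes how Beautiful Soup's UnicodeDammit converts incoming bytes to Unicode by trying candidate encodings until one decodes cleanly. A rough stdlib approximation of that idea (the candidate list below is an illustrative assumption, not the library's actual detection logic):

```python
def decode_page(raw):
    # Try common encodings in order, in the spirit of UnicodeDammit.
    # This particular list is an assumption for illustration.
    for enc in ('utf-8', 'cp1252', 'latin-1'):
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going rather than crash on stray bytes.
    return raw.decode('utf-8', errors='replace')

print(decode_page(u'Sacr\u00e9 bleu!'.encode('utf-8')))   # Sacré bleu!
print(decode_page(u'Sacr\u00e9 bleu!'.encode('cp1252')))  # Sacré bleu!
```

If the server intermittently serves a different (or mislabeled) encoding, the bytes can still parse as HTML while the text you search for no longer matches, which would explain why the string is only sometimes found.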