Scraping Javascript driven web pages with PyQt4 - how to access pages that need authentication?
I have to scrape a very, very simple page on our company's intranet in order to automate one of our internal processes (reporting whether a function's output was successful or not).
I found the following example:
    import sys
    from PyQt4.QtGui import *
    from PyQt4.QtCore import *
    from PyQt4.QtWebKit import *

    class Render(QWebPage):
        def __init__(self, url):
            # A QApplication must exist before any QWebPage is created.
            self.app = QApplication(sys.argv)
            QWebPage.__init__(self)
            self.loadFinished.connect(self._loadFinished)
            self.mainFrame().load(QUrl(url))
            # Block here until _loadFinished() quits the event loop.
            self.app.exec_()

        def _loadFinished(self, result):
            # Keep the frame so the rendered HTML can be read afterwards.
            self.frame = self.mainFrame()
            self.app.quit()

    url = 'http://sitescraper.net'
    r = Render(url)
    html = r.frame.toHtml()
It's from http://blog.sitescraper.net/2010/06/scraping-javascript-webpages-in-python.html and it's almost perfect. I just need to be able to provide authentication to view the page.
I've been looking through the documentation for PyQt4 and I'll admit a lot of it is over my head. If anyone could help, I'd appreciate it.
Edit:
Unfortunately gruszczy's method didn't work for me. When I did something similar through urllib2, I used the following code and it worked:
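    import base64
    import urllib2

    username = 'user'
    password = 'pass'
    req = urllib2.Request(url)
    # Send HTTP Basic credentials up front in the Authorization header.
    # b64encode, unlike encodestring, never inserts newlines into the token.
    base64string = base64.b64encode('%s:%s' % (username, password))
    authheader = "Basic %s" % base64string
    req.add_header("Authorization", authheader)
    handle = urllib2.urlopen(req)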
I figured it out. Here's what I ended up with in case it can help someone else.
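For anyone landing here, a minimal sketch of one way to do this, and not necessarily the exact code I ended up with: build the same Basic `Authorization` header as in the urllib2 snippet, attach it to a `QNetworkRequest`, and load that request instead of a bare `QUrl`. The function names `basic_auth_header` and `render_with_basic_auth` are made up for illustration, and the sketch assumes the intranet page uses plain HTTP Basic authentication.

```python
import base64
import sys


def basic_auth_header(username, password):
    """Return the value of an HTTP Basic 'Authorization' header as bytes."""
    credentials = ("%s:%s" % (username, password)).encode("utf-8")
    return b"Basic " + base64.b64encode(credentials)


def render_with_basic_auth(url, username, password):
    """Load url in a QWebPage with Basic credentials and return its HTML."""
    # PyQt4 is imported here so the helper above works without it installed.
    from PyQt4.QtCore import QByteArray, QUrl
    from PyQt4.QtGui import QApplication
    from PyQt4.QtNetwork import QNetworkRequest
    from PyQt4.QtWebKit import QWebPage

    app = QApplication(sys.argv)
    page = QWebPage()

    # Attach the header to the request for the main page only;
    # sub-resources fetched by the page do not inherit it.
    request = QNetworkRequest(QUrl(url))
    request.setRawHeader(QByteArray(b"Authorization"),
                         QByteArray(basic_auth_header(username, password)))

    # Stop the event loop once the page has finished loading.
    page.loadFinished.connect(lambda ok: app.quit())
    page.mainFrame().load(request)
    app.exec_()
    return page.mainFrame().toHtml()
```

Calling `render_with_basic_auth(url, 'user', 'pass')` with the intranet URL should then return the rendered HTML, the same way `r.frame.toHtml()` does in the original example. If the server answers with an authentication challenge instead of accepting a preemptive header, connecting to the page's `networkAccessManager().authenticationRequired` signal may be a better fit.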