How to automatically add Google Alerts using Python Mechanize

Posted 2024-12-01 15:04:46

I'm aware of a Python API for sale here (http://oktaykilic.com/my-projects/google-alerts-api-python/), but I'd like to understand why the way I'm doing it now isn't working.

Here is what I have so far:

import ClientForm
import mechanize


class GAlerts():

    def __init__(self, uName = 'USERNAME', passWord = 'PASSWORD'):
        self.uName = uName
        self.passWord = passWord

    def addAlert(self):
        self.cj = mechanize.CookieJar()
        loginURL = 'https://www.google.com/accounts/ServiceLogin?hl=en&service=alerts&continue=http://www.google.com/alerts'
        alertsURL = 'http://www.google.com/alerts'

        # fetch the Google login page
        initialRequest = mechanize.Request(loginURL)
        response = mechanize.urlopen(initialRequest)

        # fill in the login form (first form on the page)
        forms = ClientForm.ParseResponse(response, backwards_compat=False)
        forms[0]['Email'] = self.uName
        forms[0]['Passwd'] = self.passWord

        # submit the login form and collect cookies
        request2 = forms[0].click()
        response2 = mechanize.urlopen(request2)
        self.cj.extract_cookies(response, initialRequest)

        # now go to the alerts page with the cookies attached
        request3 = mechanize.Request(alertsURL)
        self.cj.add_cookie_header(request3)
        response3 = mechanize.urlopen(request3)

        # parse the forms on this page and set the search query
        formsAdd = ClientForm.ParseResponse(response3, backwards_compat=False)
        formsAdd[0]['q'] = 'Hines Ward'

        # submit the alert form and print the resulting page
        request4 = formsAdd[0].click()
        self.cj.add_cookie_header(request4)
        response4 = mechanize.urlopen(request4)
        print(response4.read())


myAlerter = GAlerts()
myAlerter.addAlert()

As far as I can tell, it logs in successfully and reaches the alerts homepage, but when I enter a query and "click" submit, it sends me to a page that says "Please enter a valid e-mail address". Is there some kind of authentication I'm missing? I also don't understand how to change the values in Google's custom drop-down menus. Any ideas?

Thanks

Comments (2)

尤怨 2024-12-08 15:04:46

The custom drop-down menus are implemented in JavaScript, so the proper solution is to figure out the URL parameters and reproduce them yourself. This may be why it doesn't work as expected right now: you are omitting required URL parameters that JavaScript normally sets when you visit the site in a browser.
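As a rough sketch of what "reproducing the parameters" could look like, the snippet below builds the form body by hand with the standard library instead of relying on the parsed form. Only `q` comes from the question; the `e` field, the `t` field, their values, and the use of the alerts URL as the submission target are hypothetical placeholders. The real names and values have to be read from the request your browser actually sends.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_alert_request(url, query, email, js_params):
    """Build (but do not send) a POST request that includes every field
    the browser would submit, including ones normally set by JavaScript.

    The field name "e" and the keys in js_params are hypothetical;
    replace them with the names seen in a captured browser request.
    """
    fields = {"q": query, "e": email}
    fields.update(js_params)  # values the drop-downs would have set
    body = urlencode(fields).encode("ascii")
    return Request(url, data=body)  # supplying data makes this a POST


if __name__ == "__main__":
    request = build_alert_request(
        "http://www.google.com/alerts",   # assumed target, taken from the question
        "Hines Ward",
        "user@example.com",
        {"t": "7"},                       # hypothetical drop-down parameter
    )
    print(request.get_method(), request.data)
```

Comparing the printed body with the one captured from the browser shows which fields are still missing.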

The lazy solution is to use the galerts library; it looks like it does exactly what you need.

A few hints for future projects involving mechanize (or screen-scraping in general):

  • Use Fiddler, an extremely useful HTTP debugging tool. It captures HTTP traffic from most browsers and lets you see exactly what your browser requests. You can then craft the desired request manually, and if it doesn't work, you just have to compare the two. Tools like Firebug or Google Chrome's developer tools come in handy too, especially when there are lots of async requests. (You have to call set_proxies on your browser object to use it with Fiddler; see the documentation.)
  • For debugging purposes, do something like for f in self.forms(): print(f). This shows you all the forms mechanize recognized on a page, along with their names.
  • Handling cookies is repetitive, so there's an easy way to automate it. Just do this in your browser class constructor: self.set_cookiejar(cookielib.CookieJar()). This keeps track of cookies automatically.
  • I relied for a long time on custom parsers like BeautifulSoup (and I still use it for some special cases), but in most cases the fastest approach to web screen scraping is XPath (for example, lxml has a very good implementation).

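To illustrate the cookie tip without requiring mechanize, here is a minimal sketch using the standard library's equivalent pieces: a cookie jar attached to an opener, so every request made through that opener stores and resends cookies automatically. This is a swapped-in technique, not the mechanize API itself. The local `DemoHandler` server, its `/login` and `/alerts` paths, and the `session=abc123` cookie are all hypothetical stand-ins so the example runs offline.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical site: /login sets a cookie, any other path echoes it."""

    def do_GET(self):
        self.send_response(200)
        if self.path == "/login":
            body = b"logged in"
            self.send_header("Set-Cookie", "session=abc123; Path=/")
        else:
            body = (self.headers.get("Cookie") or "no cookie").encode("ascii")
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_with_shared_jar():
    """Log in, then request a second page through the same opener."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_port
    try:
        jar = http.cookiejar.CookieJar()
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({}),          # ignore proxy environment variables
            urllib.request.HTTPCookieProcessor(jar),  # store and resend cookies
        )
        opener.open(base + "/login").read()
        # No manual extract_cookies / add_cookie_header calls are needed here.
        echoed = opener.open(base + "/alerts").read().decode("ascii")
        return echoed, [cookie.name for cookie in jar]
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(fetch_with_shared_jar())
```

The point of the design is that the jar lives on the opener rather than being applied to each request by hand, which is what the question's code does with `extract_cookies` and `add_cookie_header`.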
星軌x 2024-12-08 15:04:46

Mechanize doesn't handle JavaScript, and those drop-down menus are built with JavaScript. If you want to automate something where JavaScript is involved, I suggest using Selenium, which also has Python bindings.

http://seleniumhq.org/
