How to scrape a _private_ Google Group?
I'd like to scrape the discussion list of a private Google Group. It's a multi-page list and I might have to do this again later, so scripting sounds like the way to go.
Since this is a private group, I need to log in to my Google account first.
Unfortunately, I can't manage to log in using wget or Ruby's Net::HTTP. Surprisingly, Google Groups is not accessible through the Client Login interface, so all the code samples are useless.
My Ruby script is embedded at the end of the post. The response to the authentication query is a 200 OK, but there are no cookies in the response headers and the body contains the message "Your browser's cookie functionality is turned off. Please turn it on."
I got the same output with wget. See the bash script at the end of this message.
I don't know how to work around this. Am I missing something? Any ideas?
Thanks in advance.
John
Here is the ruby script:
# a ruby script
require 'net/https'
http = Net::HTTP.new('www.google.com', 443)
http.use_ssl = true
path = '/accounts/ServiceLoginAuth'
email = '[email protected]'
password = 'topsecret'
# form inputs from the login page
data = "Email=#{email}&Passwd=#{password}&dsh=7379491738180116079&GALX=irvvmW0Z-zI"
headers = { 'Content-Type' => 'application/x-www-form-urlencoded',
            'User-Agent' => "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.2 (KHTML, like Gecko) Chrome/6.0" }
# Post the request and print out the response to retrieve our authentication token
# (on Ruby 1.9 and later, Net::HTTP#post returns only the response object)
resp = http.post(path, data, headers)
puts "#{resp.code} #{resp.message}"
resp.each_header { |h, v| puts "#{h}=#{v}" }
#warning: peer certificate won't be verified in this SSL session
Here is the bash script:
# A bash script for wget
# Options are kept in an array so the quoted values survive word splitting
CMD=(
  --keep-session-cookies --save-cookies cookies.tmp
  --no-check-certificate
  --post-data='Email=[email protected]&Passwd=topsecret&dsh=-8408553335275857936&GALX=irvvmW0Z-zI'
  --user-agent='Mozilla'
  https://www.google.com/accounts/ServiceLoginAuth
)
echo wget "${CMD[@]}"
wget "${CMD[@]}"
# Quote the URL so the shell does not treat '?' as a glob character
wget --load-cookies="cookies.tmp" 'http://groups.google.com/group/mygroup/topics?tsc=2'
Comments (3)
Have you tried Mechanize for Ruby?
The Mechanize library automates interaction with websites; you could log in to Google and browse your private Google Group, saving what you need.
Here is an example where Mechanize is used for Gmail scraping.
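The "cookie functionality is turned off" message in the question suggests that the login POST arrived without the cookies set by the login page. Mechanize keeps a cookie jar and submits forms together with their hidden fields, so it handles that round trip for you. As a rough illustration of what that involves, here is a minimal sketch using only the standard library. The helper names (`cookie_header`, `hidden_inputs`, `login`) are hypothetical, the regex-based form parsing is a simplification, both URLs are assumed to be on the same host, and nothing in this thread confirms that Google's login accepts this flow.

```ruby
require 'net/http'
require 'uri'

# Build one Cookie request header from the Set-Cookie values of a response.
# Only the leading name=value pair of each value is kept; attributes such as
# Path or Expires are dropped.
def cookie_header(set_cookie_values)
  set_cookie_values.map { |c| c.split(';', 2).first.strip }.join('; ')
end

# Collect name/value pairs of <input type="hidden"> tags with a simple regex.
# A real HTML parser would be more robust; this is only a sketch.
def hidden_inputs(html)
  fields = {}
  html.scan(/<input\b[^>]*>/i).each do |tag|
    next unless tag =~ /type\s*=\s*["']hidden["']/i
    name  = tag[/name\s*=\s*["']([^"']*)["']/i, 1]
    value = tag[/value\s*=\s*["']([^"']*)["']/i, 1]
    fields[name] = value.to_s if name
  end
  fields
end

# Hypothetical flow: GET the page that serves the login form, then POST the
# form back with the cookies and hidden fields that page handed out.
# Assumes login_page_url and post_url share the same host.
def login(login_page_url, post_url, email, password)
  page_uri = URI(login_page_url)
  http = Net::HTTP.new(page_uri.host, page_uri.port)
  http.use_ssl = (page_uri.scheme == 'https')
  page = http.get(page_uri.request_uri)

  form = hidden_inputs(page.body).merge('Email' => email, 'Passwd' => password)
  headers = {
    'Content-Type' => 'application/x-www-form-urlencoded',
    'Cookie'       => cookie_header(page.get_fields('set-cookie') || [])
  }
  http.post(URI(post_url).request_uri, URI.encode_www_form(form), headers)
end
```

With this shape, values such as `dsh` and `GALX` are read from the freshly fetched form instead of being hard-coded, and `URI.encode_www_form` takes care of escaping the email and password. The response returned by `login` can be inspected with `get_fields('set-cookie')` to see whether any session cookies came back.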
I did this previously by logging in manually with Firefox and then using Chickenfoot to automate the browsing and scraping.
Found this PHP solution for scraping private Google Groups.