Scraping Facebook with Python
I'm interested in getting the number of friends each of my friends on Facebook has. Apparently the official Facebook API does not allow getting the friends of friends, so I need to get around this (somewhat sensible) limitation somehow. I tried the following:
# Python 2 (urllib2 and cookielib were renamed in Python 3)
import urllib, urllib2, cookielib

username = '[email protected]'
password = 'mypassword'

# Cookie-aware opener so the login session persists across requests
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login_data = urllib.urlencode({'email' : username, 'pass' : password})

# POST the credentials with a browser-like User-Agent header
request = urllib2.Request('https://login.facebook.com/login.php')
request.add_header('User-Agent','Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.12) Gecko/20101027 Fedora/3.6.12-1.fc14 Firefox/3.6.12')
opener.open(request, login_data)

# Fetch the home page through the same opener (and cookie jar)
resp = opener.open('http://facebook.com')
print resp.read()
But I only end up with a captcha page. Any idea how FB detects that the request is not coming from a "normal" browser? I could add an extra step and solve the captcha, but that would add unnecessary complexity to the program, so I would rather avoid it. When I use a web browser with the same User-Agent string, I don't get a captcha.
Alternatively, does anyone have any saner ideas on how to accomplish my goal, i.e. get a list of friends of friends?
Comments (2)
Have you tried tracing and comparing HTTP transactions with Fiddler2 or Wireshark? Fiddler can even trace https, as long as your client code can be made to work with bogus certs.
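Before reaching for a proxy, it can help to look at which headers the script sets explicitly. The sketch below is in Python 3 (where urllib2 became urllib.request) and makes no network calls. The helper names build_opener_with_headers and headers_to_send are made up for illustration, and the shortened User-Agent string is only a placeholder. It shows that a request opened from a bare URL string, like the question's second call opener.open('http://facebook.com'), carries the opener's default "Python-urllib" User-Agent rather than the one set on the login request. Whether that difference is what triggers the captcha is not something this sketch can confirm.

```python
import urllib.request
from http.cookiejar import CookieJar

# Placeholder browser-like User-Agent (an assumption; use whatever your browser sends)
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:1.9.2.12) Gecko/20101027 Firefox/3.6.12"


def build_opener_with_headers(user_agent):
    """Return a cookie-aware opener whose default headers apply to every request."""
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )
    # Replace the default ("User-agent", "Python-urllib/x.y") entry
    opener.addheaders = [("User-agent", user_agent)]
    return opener


def headers_to_send(opener, request):
    """Merge opener defaults with per-request headers; request headers win,
    which mirrors how urllib applies addheaders when sending."""
    merged = {name.capitalize(): value for name, value in opener.addheaders}
    merged.update(dict(request.header_items()))
    return merged


# A plain request through a default opener keeps the Python-urllib User-Agent
default_opener = urllib.request.build_opener()
plain_request = urllib.request.Request("http://facebook.com")
print(headers_to_send(default_opener, plain_request))

# The same request through the customised opener carries the browser-like one
custom_opener = build_opener_with_headers(BROWSER_UA)
print(headers_to_send(custom_opener, plain_request))
```

This only lists headers that were set explicitly. urllib adds others (such as Host and Connection) at send time, and a real browser sends several more (Accept, Accept-Language, and so on), so a trace with Fiddler or Wireshark remains the way to see the full difference.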
I tried a lot of ways to scrape Facebook, and the only one that worked for me was this:
Install Selenium: the Firefox plugin, the server, and the Python client library.
Then, with the Firefox plugin, you can record the actions you perform to log in and export them as a Python script. Use that script as the base for your work and it will work. Basically, I added to this script a request to my web server to fetch a list of things to inspect on FB, and at the end of the script I send the results back to my server.
I could NOT find a way to do it directly from my server with a browser simulator such as mechanize or similar! I guess it needs to be done from a client browser.
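As a rough illustration of the browser-driven approach, here is a minimal Python 3 sketch using the Selenium WebDriver client. It is not the exported script described above. The function names collect_pages and main and the example URL are hypothetical, and the login steps are left as a placeholder for whatever the recorded script contains. It assumes the selenium package and a Firefox driver are installed.

```python
def collect_pages(driver, urls):
    """Visit each URL with an already logged-in WebDriver and return {url: html}."""
    pages = {}
    for url in urls:
        driver.get(url)                  # a real browser loads the page
        pages[url] = driver.page_source  # HTML after the browser has rendered it
    return pages


def main():
    # Third-party dependency (assumption: installed with `pip install selenium`)
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        # The recorded login steps would go here.
        # Hypothetical target list; the answer fetched its list from a web server.
        pages = collect_pages(driver, ["https://example.com/"])
        print({url: len(html) for url, html in pages.items()})
    finally:
        driver.quit()  # always close the browser, even on errors
```

Calling main() opens a Firefox window, so it has to run on a machine with a browser available. Keeping collect_pages separate from the driver setup means the same loop works with any WebDriver instance.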