Using wget to fetch Facebook profile / friends pages

Posted 2024-11-26 10:01:39

I am trying to fetch a Facebook user's profile page using "wget", but I keep getting a non-profile page called "browser.php" that has nothing to do with that particular user. The profile page's URL, as I see it in the browser, has the following format:

http://www.facebook.com/user-name

and that's what I have been using as the argument to the wget command:

wget http://www.facebook.com/user-name

I am also interested in using wget to fetch a user's friends list, but even that gives me the same unhelpful result ("browser.php"):

wget http://www.facebook.com/user-name?sk=friends&v=friends
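(One shell-level pitfall worth noting with that second command: an unquoted & is interpreted by most shells, so everything after it may never reach wget at all; quoting the URL avoids this:)

    wget "http://www.facebook.com/user-name?sk=friends&v=friends"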

Could someone kindly advise me on what I'm doing wrong here? In other words, am I missing some key options for the wget command, or is wget simply not suited to this scenario?

Any help will be greatly appreciated.

To add context to this query: I need to figure out how to fetch these pages from Facebook using wget, as that would help me write a script/program to look up friends' profile URLs in the HTML source and then search those pages for other keywords, etc. I am basically hoping this would let me do some kind of selective crawling (with Facebook's permission, of course) of people I am not connected to.

Comments (6)

在梵高的星空下 2024-12-03 10:01:39

First, Facebook has probably set things up so that certain user agents (e.g. wget) cannot crawl its pages: it redirects those user agents to a different page, which probably says something like "your browser is not supported". They do that to protect people from exactly what you are doing. However, you can tell wget to identify itself as a different agent using the -U argument (read the wget man page), e.g. wget -U Mozilla http://....

Second, Facebook's privacy settings rarely allow you to read much (if any) information unless you are logged in as a user, and probably only as a user who is a friend of the profile you are trying to scrape.

Third, there is a Facebook API that you are expected to use to crawl and extract information from Facebook -- you are likely in violation of the Acceptable Use policy if you try to obtain information in any other way.
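A minimal sketch of the user-agent trick from the first point, assuming a made-up but plausible browser user-agent string (the string and output file name are only illustrations):

    # Sketch: identify wget as an ordinary browser; the UA string is only an example.
    wget -U "Mozilla/5.0 (X11; Linux x86_64) Firefox/10.0" \
         -O profile.html \
         "http://www.facebook.com/user-name"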

二智少女猫性小仙女 2024-12-03 10:01:39

I don't know why you want to use wget; Facebook offers an excellent API.

wget --user-agent=Firefox http://www.facebook.com/markzuckerberg

will save the publicly available content to a file.

You should consider using their API.

Facebook Developers
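For comparison, a rough sketch of the API route (USER_ID and ACCESS_TOKEN are placeholders, not real values; you would obtain a token via the Facebook Developers site, and exact endpoints may have changed since this was written):

    # Sketch only: query the Graph API for public profile fields instead of scraping HTML.
    wget -O profile.json \
         "https://graph.facebook.com/USER_ID?fields=id,name&access_token=ACCESS_TOKEN"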

一笔一画续写前缘 2024-12-03 10:01:39

If you want to save the logged-in page, you can log in with Firefox with "Keep me logged in" selected, then copy those cookies to a file and use them with wget's cookie-loading option. There will still be quite a bit of dynamically script-loaded content that wget isn't going to save.

There are many ways to skin this cat. If you need to extract a specific item, check out the API. If you simply want to archive a snapshot of the page as it would appear in a web browser, try CutyCapt. It's much like wget, except that it parses the entire document as a web browser would and stores an image of the page.
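A minimal sketch of the cookie approach, assuming the Firefox cookies have been exported to a Netscape-format cookies.txt (the file name, user-agent string, and URL are placeholders):

    # Sketch: reuse exported browser cookies so wget fetches the logged-in version of the page.
    wget --load-cookies cookies.txt \
         --user-agent="Mozilla/5.0" \
         -O profile.html \
         "http://www.facebook.com/user-name"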

花想c 2024-12-03 10:01:39

Check the following open-source projects:

  • facebook-cli, a command-line utility for interacting with the Facebook API.
  • facebook-friends, which can generate an HTML page of all of your Facebook friends.

攒眉千度 2024-12-03 10:01:39

You can easily reuse Firefox cookies to log in, see:

Who can see your friend list is configurable, so if someone sets it to "Friends only", you cannot extract that information.

I also recommend using the mobile site, which uses pagination instead of AJAX loading and has much simpler, smaller HTML: https://m.facebook.com/USER/friends?startindex=24

And here are the (very restrictive) scraping terms: https://www.facebook.com/apps/site_scraping_tos_terms.php
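A rough sketch of walking that pagination with wget (USER, cookies.txt, and the step of 24 per page are assumptions taken from the URL above):

    # Sketch: fetch successive pages of the mobile friends list, reusing exported browser cookies.
    for i in 0 24 48 72; do
        wget --load-cookies cookies.txt \
             --user-agent="Mozilla/5.0" \
             -O "friends-$i.html" \
             "https://m.facebook.com/USER/friends?startindex=$i"
    done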

季末如歌 2024-12-03 10:01:39

To download a Facebook page using wget, you can use the DevTools in your web browser (available in Chrome, Firefox, Opera, and others).

First, capture the request as a curl command: go to the Network tab (refresh the page if necessary, or tick Preserve log), find the page you are interested in (you can filter the list), right-click the request, and select Copy as cURL. Then paste the command into the terminal.

To convert the command from curl format to wget, make the following changes (a converted example is sketched after the lists below):

  • remove the --compressed parameter,
  • change every -H to --header.

Consider also adding the following wget parameters:

  • -k or --convert-links, to convert the links in the document to make them suitable for local viewing.
  • -p or --page-requisites, to download all the files that are necessary to properly display a page.
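For illustration, a Copy as cURL command and its hand-converted wget equivalent might look roughly like this (the header and cookie values are placeholders, not real ones):

    # What DevTools gives you (trimmed down):
    curl 'https://m.facebook.com/USER' \
         -H 'accept: text/html' \
         -H 'cookie: c_user=PLACEHOLDER; xs=PLACEHOLDER' \
         --compressed

    # Converted by hand for wget, with the extra options suggested above:
    wget --header 'accept: text/html' \
         --header 'cookie: c_user=PLACEHOLDER; xs=PLACEHOLDER' \
         -k -p 'https://m.facebook.com/USER'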

See also:
