Data scraping question

Posted on 2024-09-16 08:56:28


I am scraping wall posts from a Facebook page. Here is the URL:

http://www.facebook.com/GMHTheBook?v=wall&ref=ts#!/GMHTheBook?v=wall&ref=ts

I successfully scraped all the visible wall posts using cURL.

Problem:

At the end of the visible wall posts there is an "Older Posts" link, which shows more wall posts once you click it. How do I trigger that link programmatically so that the additional wall posts load and I can scrape them as well?

Is there a solution using any method? I am using cURL, but I am open to any approach that handles this situation.

Update:

I am now using the following code to get all the data, find the next link, fetch the data for that URL, and so on:

ini_set('display_errors', true);
error_reporting(E_ALL);

$data = json_decode(file_get_contents(($url)), true);

$names = array();
$stories = array();

foreach($data['data'] as $post)
{
    $names[] = $post['from']['name'];
    $stories[] = $post['message'];
}

$url = $data['paging']['next'];

// this is meant to scrap data recurssively from the next links
while($url !== '')
{
    $url = $data['paging']['next'];
    $data = json_decode(file_get_contents(($url)), true);

    foreach($data['data'] as $post)
    {
        $names[] = $post['from']['name'];
        $stories[] = $post['message'];
    }

    $url = urldecode($data['paging']['next']);
    echo $url . '<br />';
}


for($j = 0; $j < count($names); $j++)
{
  $data .= $names[$j] . '|' . $stories[$j] . "\n";
}

$h = fopen("data.txt", "a+");
fwrite($h, $data);
fclose($h);

The problem is that the script keeps running with no output at all, and no file is created. I have also raised the script's time limit, and allow_url_fopen is set to on. Is there anything wrong in the script, or am I perhaps not doing the recursion the right way? Is there any solution or alternative?


Comments (3)

冷默言语 2024-09-23 08:56:28


You should use the Graph API. The data you are scraping is available in JSON format at

and contains links for getting previous/next pages, e.g. paging.

Example:

$data = json_decode(file_get_contents(($url)));
foreach($data->data as $post) {
    echo $post->from->name, ': ',
         $post->message,
         PHP_EOL;
}

The above will output all the posts on the wall. For paging do

echo $data->paging->previous;
echo $data->paging->next;

This will output two URLs. All you have to do is load them again.
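Loading the next URL repeatedly could be sketched roughly as below. This is only an illustration under some assumptions: the function name `collectPosts` and the `$fetch` callable are hypothetical, the response is assumed to have the `data` and `paging` keys shown above, and the loop stops on an empty page, a repeated URL, or a page cap, in case the API keeps handing back a next link after the posts run out.

```php
<?php
// Sketch only. $fetch is a hypothetical callable: it takes a URL and
// returns the response body as a string, or false on failure.
function collectPosts(string $url, callable $fetch, int $maxPages = 50): array
{
    $posts = [];
    $seen  = [];

    for ($page = 0; $page < $maxPages && $url !== ''; $page++) {
        if (isset($seen[$url])) {
            break; // the same URL came back again, so stop
        }
        $seen[$url] = true;

        $body = $fetch($url);
        $data = is_string($body) ? json_decode($body, true) : null;
        if (!is_array($data) || empty($data['data'])) {
            break; // failed request, invalid JSON, or an empty page
        }

        foreach ($data['data'] as $post) {
            // Some posts may lack a message, so fall back to an empty string.
            $posts[] = [
                'name'    => $post['from']['name'] ?? '',
                'message' => $post['message'] ?? '',
            ];
        }

        // A missing "next" link ends the loop on the following check.
        $url = $data['paging']['next'] ?? '';
    }

    return $posts;
}
```

With allow_url_fopen enabled, something like `collectPosts($url, 'file_get_contents')` should work, since `file_get_contents` accepts a single URL argument. Passing the fetcher in also makes it easy to try the loop against canned JSON first.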

皇甫轩 2024-09-23 08:56:28


The button/link probably starts an XMLHttpRequest, so use Firebug, the developer console, or whatever tool you prefer to see in your browser which URL it requests and with which HTTP headers. Then make the same request with cURL and you should get the same response.
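Replaying such a request with PHP's cURL extension might look like the sketch below. The helper names `xhrCurlOptions` and `fetchLikeBrowser` are hypothetical, and the headers are whatever you copy from the browser's network panel; nothing here is taken from a real Facebook request.

```php
<?php
// Sketch only: build the cURL options for replaying a request seen in the
// browser. $headers maps header names to the values copied from the browser.
function xhrCurlOptions(string $url, array $headers): array
{
    $lines = [];
    foreach ($headers as $name => $value) {
        $lines[] = $name . ': ' . $value; // cURL expects "Name: value" strings
    }

    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects like a browser would
        CURLOPT_HTTPHEADER     => $lines,
    ];
}

// Performs the request and returns the body as a string, or false on failure.
function fetchLikeBrowser(string $url, array $headers)
{
    $ch = curl_init();
    curl_setopt_array($ch, xhrCurlOptions($url, $headers));
    return curl_exec($ch); // the handle is released when $ch goes out of scope
}
```

Keeping the option building separate from the request makes it possible to check the headers being sent without touching the network.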

烟若柳尘 2024-09-23 08:56:28
http://www.facebook.com/ajax/stream/profile.php?__a=1&profile_id=139878432710216&viewer_id=(your facebook id)&filter=1&max_time=1283023194&_log_clicktype=Filter%20Stories%20or%20Pagination&ajax_log=1

It is loaded via Ajax. You also need to figure out what these variables mean. max_time is probably the point in time from which posts are shown.

OK, the link above can be shortened (same output):

http://www.facebook.com/ajax/stream/profile.php?__a=1&profile_id=139878432710216&max_time=1283023194
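The shortened URL can be assembled from its parts, for instance as in this sketch. The function name `olderPostsUrl` is hypothetical, and the meaning of `max_time` is only the guess stated above, not something confirmed.

```php
<?php
// Sketch only: rebuild the shortened URL from a profile id and a max_time value.
function olderPostsUrl(string $profileId, int $maxTime): string
{
    $query = http_build_query(
        ['__a' => 1, 'profile_id' => $profileId, 'max_time' => $maxTime],
        '',
        '&' // force a plain "&" separator regardless of php.ini settings
    );

    return 'http://www.facebook.com/ajax/stream/profile.php?' . $query;
}

// Prints the same shortened URL as above.
echo olderPostsUrl('139878432710216', 1283023194), PHP_EOL;
```

If the guess about max_time holds, passing an earlier timestamp would presumably return an earlier batch of posts.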