Facebook-like on-demand meta content scraper

Posted 2024-09-03 14:29:24


You've probably seen how FB scrapes the link you post on Facebook (status, message, etc.) live, right after you paste it into the link field, and displays various metadata: a thumbnail of the image, various images from a page link, or a video thumbnail from a video-related link (like YouTube).

Any ideas how one would replicate this functionality? I'm thinking about a couple of Gearman workers, or even better just JavaScript that does an XHR request and parses the content based on regexes or something similar... Any ideas? Any links? Has someone already tried to do the same and wrapped it in a nice class? Anything? :)

Thanks!


Comments (3)

穿越时光隧道 2024-09-10 14:29:25


Facebook looks at various meta information in the HTML of the page that you paste into a link field. The title and description are two obvious ones but a developer can also use <link rel="image_src" href="thumbnail.jpg" /> to provide a preferred screengrab. I guess you could check for these things. If this tag is missing you could always use a website thumbnail generation service.
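
A minimal sketch of that tag check, assuming plain PHP with the built-in DOMDocument and no external library; the function name extractLinkPreview and the example URL are placeholders, not anything Facebook actually exposes:

    // Pull the title, meta description and image_src link out of a page's HTML.
    function extractLinkPreview($url)
    {
        $html = @file_get_contents($url);   // fetch the raw HTML
        if ($html === false) {
            return null;
        }

        $doc = new DOMDocument();
        @$doc->loadHTML($html);             // suppress warnings from sloppy markup

        $preview = array('title' => null, 'description' => null, 'image' => null);

        // <title>
        $titles = $doc->getElementsByTagName('title');
        if ($titles->length > 0) {
            $preview['title'] = trim($titles->item(0)->textContent);
        }

        // <meta name="description" content="...">
        foreach ($doc->getElementsByTagName('meta') as $meta) {
            if (strtolower($meta->getAttribute('name')) === 'description') {
                $preview['description'] = trim($meta->getAttribute('content'));
            }
        }

        // <link rel="image_src" href="...">
        foreach ($doc->getElementsByTagName('link') as $link) {
            if (strtolower($link->getAttribute('rel')) === 'image_src') {
                $preview['image'] = $link->getAttribute('href');
            }
        }

        return $preview;
    }

    // Hypothetical usage:
    // print_r(extractLinkPreview('http://example.com/some-page'));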


As I am developing a project like that, I can tell you it is not as easy as it seems: encoding issues, content rendered with JavaScript, and the sheer number of non-semantic websites are some of the big problems I've run into. Extracting video info and trying to get auto-play behavior in particular is always tricky, and sometimes impossible. You can see a demo at http://www.embedify.me; it is written in .NET, but it has a service interface so you can call it via JavaScript, and there is also a JavaScript API to get the same UI/behavior as in FB.
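
For what it's worth, here is a rough sketch of the kind of encoding normalization that problem calls for, assuming the raw HTML has already been fetched; the charset regex and the fallback encoding list are illustrative guesses, not what embedify.me actually does:

    // Normalize fetched HTML to UTF-8 before parsing it.
    function normalizeToUtf8($html)
    {
        // Prefer an explicit charset declared in the markup, if any
        if (preg_match('/charset=["\']?([A-Za-z0-9_\-]+)/i', $html, $m)) {
            $charset = strtoupper($m[1]);
        } else {
            // Otherwise fall back to heuristic detection over a few common encodings
            $charset = mb_detect_encoding($html, array('UTF-8', 'ISO-8859-1', 'Windows-1251', 'GB2312'), true);
        }

        if ($charset && $charset !== 'UTF-8') {
            $html = mb_convert_encoding($html, 'UTF-8', $charset);
        }
        return $html;
    }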

森林迷了鹿 2024-09-10 14:29:24


FB scrapes the meta tags from the HTML.

I.e. when you enter a URL, FB displays the page title, followed by the URL (truncated), and then the contents of the <meta name="description"> element.

As for the selection of thumbnails, I think maybe FB chooses only those that exceed certain dimensions, i.e. skipping over button graphics, 1px spacers, etc.
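
A small sketch of that kind of size filter, assuming the candidate image URLs have already been collected (for example by the scrapeUrl() function below); the 50x50 threshold is an arbitrary guess, not Facebook's actual rule:

    // Keep only images that are at least $minWidth x $minHeight pixels.
    function filterThumbnailCandidates(array $imageUrls, $minWidth = 50, $minHeight = 50)
    {
        $candidates = array();
        foreach ($imageUrls as $imgUrl) {
            // getimagesize() returns array(width, height, ...) or false on failure
            $size = @getimagesize($imgUrl);
            if ($size !== false && $size[0] >= $minWidth && $size[1] >= $minHeight) {
                $candidates[] = $imgUrl;
            }
        }
        return $candidates;
    }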

Edit: I don't know exactly what you're looking for, but here's a function in PHP for scraping the relevant data from pages.
This uses the simple HTML DOM library from http://simplehtmldom.sourceforge.net/

I've had a look at how FB does it, and it looks like the scraping is done on the server side.

    // Requires the Simple HTML DOM library (simple_html_dom.php) for file_get_html()
    require_once 'simple_html_dom.php';

    class ScrapedInfo
    {
        public $url;
        public $title;
        public $description;
        public $imageUrls;
    }

    function scrapeUrl($url)
    {
        $info = new ScrapedInfo();
        $info->url = $url;
        $html = file_get_html($info->url);

        // Grab the page title
        $titleNode = $html->find('title', 0);
        $info->title = $titleNode ? trim($titleNode->plaintext) : '';

        // Grab the page description from <meta name="description">
        foreach ($html->find('meta') as $meta) {
            if (strtolower($meta->name) == "description") {
                $info->description = trim($meta->content);
            }
        }

        // Grab the image URLs
        $base = parse_url($url);
        $imgArr = array();
        foreach ($html->find('img') as $element) {
            $rawUrl = $element->src;

            // Turn any relative URLs into absolutes
            if (substr($rawUrl, 0, 4) != "http") {
                if (substr($rawUrl, 0, 1) == "/") {
                    // Root-relative path: prepend scheme and host
                    $imgArr[] = $base['scheme'] . '://' . $base['host'] . $rawUrl;
                } else {
                    // Document-relative path: resolve against the page's directory
                    $imgArr[] = rtrim(dirname($url), '/') . '/' . $rawUrl;
                }
            } else {
                $imgArr[] = $rawUrl;
            }
        }
        $info->imageUrls = $imgArr;

        return $info;
    }
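
For completeness, a hypothetical usage of the function above (the URL is a placeholder):

    $info = scrapeUrl('http://example.com/some-article.html');
    echo $info->title . "\n";
    echo $info->description . "\n";
    print_r($info->imageUrls);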
