Facebook-like on-demand meta content scraper

Posted 2024-09-03 14:29:24


You've probably seen how FB scrapes the link you post on Facebook (status, message, etc.) live, right after you paste it into the link field, and displays various metadata: a thumbnail of the image, various images from a page link, or a video thumbnail from a video-related link (like YouTube).

Any ideas how one would replicate this functionality? I'm thinking about a couple of Gearman workers, or even better just JavaScript that does an XHR request and parses the content based on regexes or something similar... Any ideas? Any links? Has someone already tried to do the same and wrapped it in a nice class? Anything? :)

Thanks!


Comments (3)

穿越时光隧道 2024-09-10 14:29:25


Facebook looks at various meta information in the HTML of the page that you paste into a link field. The title and description are two obvious ones but a developer can also use <link rel="image_src" href="thumbnail.jpg" /> to provide a preferred screengrab. I guess you could check for these things. If this tag is missing you could always use a website thumbnail generation service.
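
A minimal sketch of that tag check, assuming plain PHP with the built-in DOMDocument and no external library; the function name extractLinkPreview and the example URL are placeholders, not anything Facebook actually exposes:

    // Pull the title, meta description and image_src link out of a page's HTML.
    function extractLinkPreview($url)
    {
        $html = @file_get_contents($url);   // fetch the raw HTML
        if ($html === false) {
            return null;
        }

        $doc = new DOMDocument();
        @$doc->loadHTML($html);             // suppress warnings from sloppy markup

        $preview = array('title' => null, 'description' => null, 'image' => null);

        // <title>
        $titles = $doc->getElementsByTagName('title');
        if ($titles->length > 0) {
            $preview['title'] = trim($titles->item(0)->textContent);
        }

        // <meta name="description" content="...">
        foreach ($doc->getElementsByTagName('meta') as $meta) {
            if (strtolower($meta->getAttribute('name')) === 'description') {
                $preview['description'] = trim($meta->getAttribute('content'));
            }
        }

        // <link rel="image_src" href="...">
        foreach ($doc->getElementsByTagName('link') as $link) {
            if (strtolower($link->getAttribute('rel')) === 'image_src') {
                $preview['image'] = $link->getAttribute('href');
            }
        }

        return $preview;
    }

    // Hypothetical usage:
    // print_r(extractLinkPreview('http://example.com/some-page'));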


As I am developing a project like that, I can tell you it is not as easy as it seems: encoding issues, content rendered with JavaScript, and the sheer number of non-semantic websites are some of the big problems I've run into. Extracting video info and trying to get auto-play behavior in particular is always tricky, and sometimes impossible. You can see a demo at http://www.embedify.me; it is written in .NET, but it has a service interface so you can call it via JavaScript, and there is also a JavaScript API to get the same UI/behavior as in FB.
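
For what it's worth, here is a rough sketch of the kind of encoding normalization that problem calls for, assuming the raw HTML has already been fetched; the charset regex and the fallback encoding list are illustrative guesses, not what embedify.me actually does:

    // Normalize fetched HTML to UTF-8 before parsing it.
    function normalizeToUtf8($html)
    {
        // Prefer an explicit charset declared in the markup, if any
        if (preg_match('/charset=["\']?([A-Za-z0-9_\-]+)/i', $html, $m)) {
            $charset = strtoupper($m[1]);
        } else {
            // Otherwise fall back to heuristic detection over a few common encodings
            $charset = mb_detect_encoding($html, array('UTF-8', 'ISO-8859-1', 'Windows-1251', 'GB2312'), true);
        }

        if ($charset && $charset !== 'UTF-8') {
            $html = mb_convert_encoding($html, 'UTF-8', $charset);
        }
        return $html;
    }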

森林迷了鹿 2024-09-10 14:29:24


FB scrapes the meta tags from the HTML.

I.e. when you enter a URL, FB displays the page title, followed by the URL (truncated), and then the contents of the <meta name="description"> element.

As for the selection of thumbnails, I think maybe FB chooses only those that exceed certain dimensions, i.e. skipping over button graphics, 1px spacers, etc.
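
A small sketch of that kind of size filter, assuming the candidate image URLs have already been collected (for example by the scrapeUrl() function below); the 50x50 threshold is an arbitrary guess, not Facebook's actual rule:

    // Keep only images that are at least $minWidth x $minHeight pixels.
    function filterThumbnailCandidates(array $imageUrls, $minWidth = 50, $minHeight = 50)
    {
        $candidates = array();
        foreach ($imageUrls as $imgUrl) {
            // getimagesize() returns array(width, height, ...) or false on failure
            $size = @getimagesize($imgUrl);
            if ($size !== false && $size[0] >= $minWidth && $size[1] >= $minHeight) {
                $candidates[] = $imgUrl;
            }
        }
        return $candidates;
    }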

Edit: I don't know exactly what you're looking for, but here's a function in PHP for scraping the relevant data from pages.
This uses the simple HTML DOM library from http://simplehtmldom.sourceforge.net/

I've had a look at how FB does it, and it looks like the scraping is done on the server side.

    // Requires the Simple HTML DOM library (simple_html_dom.php) for file_get_html()
    require_once 'simple_html_dom.php';

    class ScrapedInfo
    {
        public $url;
        public $title;
        public $description;
        public $imageUrls;
    }

    function scrapeUrl($url)
    {
        $info = new ScrapedInfo();
        $info->url = $url;
        $html = file_get_html($info->url);

        // Grab the page title
        $titleNode = $html->find('title', 0);
        $info->title = $titleNode ? trim($titleNode->plaintext) : '';

        // Grab the page description from <meta name="description">
        foreach ($html->find('meta') as $meta) {
            if (strtolower($meta->name) == "description") {
                $info->description = trim($meta->content);
            }
        }

        // Grab the image URLs
        $base = parse_url($url);
        $imgArr = array();
        foreach ($html->find('img') as $element) {
            $rawUrl = $element->src;

            // Turn any relative URLs into absolutes
            if (substr($rawUrl, 0, 4) != "http") {
                if (substr($rawUrl, 0, 1) == "/") {
                    // Root-relative path: prepend scheme and host
                    $imgArr[] = $base['scheme'] . '://' . $base['host'] . $rawUrl;
                } else {
                    // Document-relative path: resolve against the page's directory
                    $imgArr[] = rtrim(dirname($url), '/') . '/' . $rawUrl;
                }
            } else {
                $imgArr[] = $rawUrl;
            }
        }
        $info->imageUrls = $imgArr;

        return $info;
    }
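
For completeness, a hypothetical usage of the function above (the URL is a placeholder):

    $info = scrapeUrl('http://example.com/some-article.html');
    echo $info->title . "\n";
    echo $info->description . "\n";
    print_r($info->imageUrls);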
