How to mimic Facebook's "link share" feature with Node.js and JavaScript

Posted on 2024-11-01 08:08:24

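What I want to mimic is the link share feature that Facebook provides. You simply enter a URL, and FB automatically fetches an image, the title, and a short description from the target website. How would one program this in JavaScript with Node.js and whatever other JavaScript libraries may be required? I found an example that uses PHP's fopen function, but I would rather not include PHP in this project.

Is what I'm asking about an example of web scraping? Is all I need to do retrieve the data from inside the target site's meta tags, and then grab the image tags using CSS selectors?

If someone could point me in the right direction, I'd greatly appreciate it. Thanks!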

誰認得朕 2024-11-08 08:08:24

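Take a look at this post. It discusses scraping with Node.js.
Here you have a lot of earlier information on scraping with JavaScript and jQuery.

That said, Facebook doesn't actually guess what the title, description, and preview are. They (at least most of the time) get that information from meta tags present on sites that want to be more accessible to Facebook users.

Maybe you could use that existing metadata to pull out the title, description, and image preview. The documentation on the available metadata is here.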
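
As a rough illustration of reading that metadata, here is a minimal sketch that assumes the meta tags in question are Open Graph tags (`og:title`, `og:description`, `og:image`) and that the page's HTML has already been downloaded into a string. The function names `extractMetaTags` and `buildLinkPreview` are hypothetical, and the regex scan is only there to keep the example dependency-free; a real HTML parser would cope better with unusual markup.

```javascript
// Sketch: collect <meta> tags from an HTML string into a Map of key -> content.
// Assumption: attribute values are quoted and contain no '>' characters.
function extractMetaTags(html) {
  const meta = new Map();
  const tagPattern = /<meta\b[^>]*>/gi;
  const attrPattern = /([a-zA-Z:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/g;
  for (const tag of html.match(tagPattern) || []) {
    const attrs = {};
    for (const m of tag.matchAll(attrPattern)) {
      attrs[m[1].toLowerCase()] = m[2] !== undefined ? m[2] : m[3];
    }
    // Open Graph tags use "property"; classic tags such as description use "name".
    const key = attrs.property || attrs.name;
    if (key && attrs.content !== undefined && !meta.has(key)) {
      meta.set(key, attrs.content); // the first occurrence of a key wins
    }
  }
  return meta;
}

// Hypothetical helper: prefer Open Graph values, fall back to the plain
// description meta tag, and use null when nothing is available.
function buildLinkPreview(html) {
  const meta = extractMetaTags(html);
  return {
    title: meta.get('og:title') || null,
    description: meta.get('og:description') || meta.get('description') || null,
    image: meta.get('og:image') || null,
  };
}
```

Calling `buildLinkPreview(html)` on a page without any of these tags returns an object whose fields are all `null`, which is the point at which a scraping fallback would have to take over.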

维持三分热 2024-11-08 08:08:24

Yes, web scraping is required, and that's the easy part. The hard part is a generic algorithm for finding the heading and the relevant text and images.

How to scrape

You can use jsdom to download the page and build a DOM structure on your server, then scrape that structure with jQuery, also on the server. You can find a good tutorial at blog.nodejitsu.com/jsdom-jquery-in-5-lines-on-nodejs, as suggested by @generalhenry above.

What to scrape

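I guess a good way to find the heading would be:

// Try h1 first, then fall back to h2..h6; h stays undefined if none exist.
var h;
for (var i = 1; i <= 6; i++) {
  var found = $('h' + i).first();
  if (found.length) { // a jQuery object is always truthy, so test its length
    h = found;
    break;
  }
}

Now h holds the heading, or is undefined if no heading was found. An alternative is simply to use the page's title tag. :)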
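
To make the heading-then-title fallback concrete, here is a minimal sketch that works on a raw HTML string instead of a jQuery object. The names `stripTags` and `pickTitle` are hypothetical, and regular expressions stand in for the DOM only so that the example runs without any libraries; with jsdom and jQuery available, the loop above is the more natural form.

```javascript
// Remove any nested tags from a fragment and collapse whitespace.
function stripTags(fragment) {
  return fragment.replace(/<[^>]*>/g, '').replace(/\s+/g, ' ').trim();
}

// Sketch: return the text of the highest-level non-empty heading (h1..h6),
// else the <title> text, else undefined.
function pickTitle(html) {
  for (let level = 1; level <= 6; level++) {
    const pattern = new RegExp(
      '<h' + level + '\\b[^>]*>([\\s\\S]*?)</h' + level + '\\s*>', 'i');
    const match = pattern.exec(html);
    if (match) {
      const text = stripTags(match[1]);
      if (text) return text; // skip headings that contain no text
    }
  }
  const title = /<title\b[^>]*>([\s\S]*?)<\/title\s*>/i.exec(html);
  const titleText = title ? stripTags(title[1]) : '';
  return titleText || undefined;
}
```

Note that this prefers an h1 over an h2 even when the h2 appears earlier in the page, which matches the loop order used above.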

As for the images: list all, or the first few, images on the page that are reasonably large, so as to filter out sprites used for buttons, arrows, etc.
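
The size filter could look something like the sketch below. It assumes the image candidates have already been collected into plain objects with `src`, `width`, and `height` fields (for example from the `img` tags' attributes). The function name `pickPreviewImages` and the thresholds `MIN_SIDE` and `MAX_IMAGES` are hypothetical values chosen for illustration, not figures from this answer.

```javascript
// Assumed thresholds: anything narrower or shorter than MIN_SIDE pixels is
// treated as a sprite or icon, and at most MAX_IMAGES candidates are kept.
const MIN_SIDE = 100;
const MAX_IMAGES = 3;

// Sketch: keep the first few images whose declared dimensions are large enough.
// Images with a missing width or height fail the comparison and are dropped.
function pickPreviewImages(images) {
  return images
    .filter((img) => img.width >= MIN_SIDE && img.height >= MIN_SIDE)
    .slice(0, MAX_IMAGES)
    .map((img) => img.src);
}
```

Dropping images without declared dimensions is a simplification; a fuller version might download them and measure the actual size.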

While fetching the remote data, make sure that the ProcessExternalResources flag is off. This ensures that script tags for ads do not pollute the fetched page.

And yes, the relevant text would be in some tags after h.
