Extracting *relevant* images from a web page

Posted on 2024-09-07 10:03:51

I have a couple of Twitter-powered news aggregation websites. I have been planning to add images from the articles that I find on Twitter.

If I download the page and extract images using the <img> tag, I get a bunch of images, not all of them relevant to the article. For example, images of buttons, icons, ads, etc. are captured. How do I extract the image accompanying the article? I know there is a solution: the Facebook link sharer does this pretty well.

Mithun

Duplicate of: How to find and extract "main" image in website

Comments (4)

稀香 2024-09-14 10:03:51

Download all images from the page and blacklist any that come from an ad server. Then find some heuristic that will get you the correct image.

I think something like:

  • Biggest resolution: +5 pts
  • Biggest file size: +10 pts
  • JPEG: +2 pts

Then take the image with the most points and throw the rest away.

This probably works for the majority of sites.

(It would require some fiddling with the heuristics, though.)
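
The scoring idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Candidate` record, the `AD_HOSTS` blacklist, and the function names are hypothetical, and the image dimensions and file sizes are assumed to have been measured already after download. The point values are the ones suggested in this answer.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical blacklist of ad hosts; a real list would be much longer.
AD_HOSTS = {"ads.example.com", "doubleclick.net"}


@dataclass
class Candidate:
    url: str
    width: int       # pixels
    height: int      # pixels
    size_bytes: int  # size of the downloaded file


def is_ad(url: str) -> bool:
    """True if the image host is a blacklisted host or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in AD_HOSTS)


def pick_main_image(candidates):
    """Return the highest-scoring non-ad candidate, or None if none remain."""
    pool = [c for c in candidates if not is_ad(c.url)]
    if not pool:
        return None
    biggest_area = max(c.width * c.height for c in pool)
    biggest_size = max(c.size_bytes for c in pool)

    def score(c):
        points = 0
        if c.width * c.height == biggest_area:
            points += 5   # biggest resolution
        if c.size_bytes == biggest_size:
            points += 10  # biggest file size
        if urlparse(c.url).path.lower().endswith((".jpg", ".jpeg")):
            points += 2   # JPEG, judged by file extension only
        return points

    return max(pool, key=score)
```

Judging the format by file extension is a shortcut; inspecting the response's content type or the file header would be more reliable, and the weights would likely need tuning per site.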

零度℉ 2024-09-14 10:03:51

It's been a long time, but this may help next time.

You can use this API: https://urlmeta.org/

It's very simple to use, and the result is just what we need.

Example of using the API:

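<?php
$url = "http://timesofindia.indiatimes.com/business/india-business/Raghuram-Rajan-not-fit-to-be-RBI-Governor-Subramanian-Swamy/articleshow/52236298.cms";

// URL-encode the target so that its own special characters survive in the query string.
$result = file_get_contents('https://api.urlmeta.org/?url=' . urlencode($url));
$array = json_decode($result, true); // decode the JSON into an associative array
print_r($array['meta']['image']);

?>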

And that's the result you need.

北风几吹夏 2024-09-14 10:03:51

I came up with a solution that is a bit hacky but works for me. Here is what I do to get thumbnails:

  1. Say the headline of the page I find is "this is a headline".
  2. I use this as a query to the Google Image API and then extract the first thumbnail I find.

It actually works quite well in the majority of cases. Check it out for yourself: http://cricketfresh.in

Mithun

PS: I think this is a good answer. I will give credit to anyone who comes up with a more elegant one.

羁绊已千年 2024-09-14 10:03:51

I would guess that Facebook has a link extractor for the various sites it supports, something like id="content" -> img (1st).

Guess I was wrong. It seems that Facebook uses the Open Graph Protocol to define which image (og:image) and which metadata to use.
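
For pages that publish Open Graph metadata, reading the og:image tag might look something like the sketch below. It uses only Python's standard `html.parser` module; the class and function names are made up for illustration, and it assumes the page HTML has already been downloaded into a string.

```python
from html.parser import HTMLParser


class OGImageParser(HTMLParser):
    """Collects the content of every <meta property="og:image"> tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Attribute names arrive lowercased; values may be None if omitted.
        prop = (attr.get("property") or "").lower()
        if prop == "og:image" and attr.get("content"):
            self.images.append(attr["content"])


def find_og_image(html: str):
    """Return the first og:image URL found in the markup, or None."""
    parser = OGImageParser()
    parser.feed(html)
    return parser.images[0] if parser.images else None
```

Not every site sets og:image, so a `None` result would still need a fallback, for example a scoring heuristic over the page's <img> tags.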
