Extracting *relevant* images from a web page
I have a couple of Twitter-powered news aggregation websites. I have been planning to add images from the articles that I find on Twitter.
If I download the page and extract images using the <img> tag, I get a bunch of images, not all of them relevant to the article. For example, images of buttons, icons, ads, etc. are captured. How do I extract the image accompanying the article? I know there is a solution: the Facebook link sharer does this pretty well.
Mithun
Duplicate of: How to find and extract "main" image in website
Comments (4)
Download all images from the page and blacklist any that come from an ad server.
Then find some heuristic that gets you the correct image.
I think something like:
Then take the image with the most points and throw the rest away.
This probably works for the majority of sites, though it would require some fiddling with the heuristics.
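A minimal sketch of that points idea in Python, using only the standard library. The specific rules and weights here (the ad-host blacklist, the size thresholds, the junk keywords) are assumptions made up for illustration, not the rules from the answer above, and it only looks at the width and height attributes in the markup rather than downloading each image.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical blacklist and keywords; tune these for the sites you aggregate.
AD_HOSTS = {"doubleclick.net", "googlesyndication.com"}
JUNK_WORDS = ("icon", "logo", "button", "sprite", "banner")


class ImgCollector(HTMLParser):
    """Collects the attributes of every <img> tag on the page."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images.append(dict(attrs))


def _to_int(value):
    """Parse a width/height attribute, treating missing or odd values as 0."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0


def is_ad(url):
    """True if the image is served from a blacklisted host or its subdomain."""
    host = urlparse(url).hostname or ""
    return any(host == h or host.endswith("." + h) for h in AD_HOSTS)


def score(img):
    """Assign points to one image; higher means more likely the article image."""
    w, h = _to_int(img.get("width")), _to_int(img.get("height"))
    src = (img.get("src") or "").lower()
    points = 0
    if w >= 200 and h >= 150:
        points += 3  # reasonably large
    if w and h and max(w, h) / min(w, h) > 3:
        points -= 2  # very elongated, typical of banners
    if any(word in src for word in JUNK_WORDS):
        points -= 3  # file name looks like page decoration
    if img.get("alt"):
        points += 1  # has descriptive alt text
    return points


def best_image(html, base_url):
    """Return the absolute URL of the highest-scoring image, or None."""
    parser = ImgCollector()
    parser.feed(html)
    candidates = []
    for img in parser.images:
        if not img.get("src"):
            continue
        url = urljoin(base_url, img["src"])
        if is_ad(url):
            continue
        candidates.append((score(img), url))
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[0])[1]
```

Many pages omit width and height in the markup, so a real version would probably have to fetch the candidates and measure them, which is where the fiddling comes in.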
It's been a long time, but this may help next time.
You can use this API: https://urlmeta.org/
It's very simple to use, and the result is just what we need.
Example of using the API:
And that's the result you need.
I came up with a solution that is a bit hacky but works for me. Here is what I do to get thumbnails:
It actually works quite well in the majority of cases. Check it out for yourself: http://cricketfresh.in
Mithun
PS: I think this is a good answer, but I will give credit to anyone who comes up with a more elegant one.
I would guess that Facebook has a link extractor for each of the sites it supports, something like id="content" -> img (1st).
I guess I was wrong. It seems that Facebook uses the Open Graph Protocol to define which image (og:image) and which metadata to use.
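For pages that publish Open Graph tags, the image can be read straight out of the markup. Here is a rough sketch in Python using only the standard library; the names OGImageParser and find_og_image are made up for this example, and it assumes the page declares the image as <meta property="og:image" content="...">, taking the first such tag it finds.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class OGImageParser(HTMLParser):
    """Remembers the content of the first <meta property="og:image"> tag."""

    def __init__(self):
        super().__init__()
        self.og_image = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.og_image is not None:
            return
        a = dict(attrs)
        if (a.get("property") or "").lower() == "og:image" and a.get("content"):
            self.og_image = a["content"]


def find_og_image(html, base_url=""):
    """Return the og:image URL (made absolute against base_url), or None."""
    parser = OGImageParser()
    parser.feed(html)
    if parser.og_image is None:
        return None
    return urljoin(base_url, parser.og_image)


page = '<html><head><meta property="og:image" content="http://example.com/lead.jpg" /></head></html>'
print(find_og_image(page))
```

Pages without the tag return None, so this is probably best used as a first attempt, with some <img>-based heuristic as the fallback.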