Techniques for extracting the "best" image from a web page

Posted on 2024-08-24 20:55:05

I'm trying to build something akin to Facebook's "Share" functionality for my website.

I've gotten to the point where I can accept a URL, scrape it for meta keywords and suitably get titles/descriptions, but I'm a bit stuck as to the best way to determine 'likely' photos the user may want to share.

I currently use SimpleXMLElement to turn the page into a traversable DOM, find all the <img> tags, and turn their sources into absolute URLs. After that, I'm not sure how to go about finding a suitable thumbnail.
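
For reference, the extraction step described above can be sketched as follows. This is only an illustration in Python using the standard library (the question itself uses PHP's SimpleXMLElement); the names ImgCollector and image_urls are made up for this sketch.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgCollector(HTMLParser):
    """Collects the absolute URL of every <img src=...> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative paths against the URL of the page.
            self.urls.append(urljoin(self.base_url, src))


def image_urls(html, base_url):
    """Return absolute image URLs in document order."""
    parser = ImgCollector(base_url)
    parser.feed(html)
    return parser.urls
```

Keeping the URLs in document order is deliberate, since position on the page is one of the heuristics under consideration.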

Do I download them all, and go by file size? Do I use some sort of heuristic like, "was encountered in the middle of the page"?

Does anyone else have any recommendations, suggestions, or tips?

Comments (2)

双马尾 2024-08-31 20:55:05

I wrote something similar a while ago to get images from scraped blog posts. My approach was roughly to build a list of all images on the page, filter out the unlikely ones, and assign 'priority points' to the rest:

  • Ignore images hosted from a blacklist taken from AdBlocker's list
  • Ignore indirect images, e.g. those referenced from stylesheets or inside an IFRAME
  • Ignore images under 50 pixels wide or high
  • Ignore images which repeat more than once
  • Assign priority points to images hosted on a whitelist of hosts (e.g. photobucket, imageshack.us)
  • Assign priority points to the largest 3 images on the page
  • Assign priority points to images on the same host
  • Assign priority points to images with an ALT attribute defined
  • Assign priority points to images appearing in a P tag

Then pick the one with the most priority points. It certainly wasn't foolproof or overly scientific but it got something useful far more often than not.
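
A minimal sketch of that filter-then-score idea, written in Python purely for illustration. The Image fields, the pick_image name, and the choice of one point per rule are assumptions of this sketch, not the original implementation. It also assumes the image list came only from <img> tags, so stylesheet and IFRAME images are already excluded.

```python
from collections import Counter
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Image:
    url: str
    width: int
    height: int
    alt: str = ""
    in_paragraph: bool = False  # True if the <img> sits inside a <p>


def pick_image(images, page_host, blacklist=frozenset(), whitelist=frozenset()):
    """Drop unlikely images, then return the one with the most points."""
    counts = Counter(img.url for img in images)
    candidates = [
        img for img in images
        if urlparse(img.url).hostname not in blacklist  # ad hosts
        and img.width >= 50 and img.height >= 50        # too small
        and counts[img.url] == 1                        # repeated on the page
    ]
    if not candidates:
        return None

    by_area = sorted(candidates, key=lambda i: i.width * i.height, reverse=True)
    largest_urls = {img.url for img in by_area[:3]}

    def points(img):
        host = urlparse(img.url).hostname
        # Assumption: every rule is worth exactly one point.
        return sum([
            host in whitelist,
            img.url in largest_urls,
            host == page_host,
            bool(img.alt),
            img.in_paragraph,
        ])

    return max(candidates, key=points)
```

Ties go to whichever candidate appears first on the page, because max returns the first maximal element.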

羁绊已千年 2024-08-31 20:55:05

I don't have any direct experience doing this so I'm not sure that there is any specific best practice, but in general I think a heuristic approach looking at several factors would make sense because of the variability found in website implementations.

I would look at two sets of items: image properties, and the context of where and how the images are placed.

Image Properties:

  • Width and height meet minimum thresholds
  • Aspect ratio is reasonable (background images that tile may have extreme aspect ratios, which provides a good indication that the image may not be suitable)
  • More than one color exists in image (harder to detect, but may avoid various background images)

Image Context:

  • Image does not repeat on page (this avoids using icons and other design elements that may repeat)
  • Occurs after h1, h2, etc. tags on the page; this addresses your point about images coming from the middle of the page, again avoiding design elements.
  • Has an alt attribute (though this is not used consistently, so it may not provide much useful information)

I would assign weights to the items above and then rank the images you find according to how well each one satisfies the rules.
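
As a rough illustration of weighted ranking, here is a Python sketch. The weights, the thresholds (MIN_SIDE, MAX_ASPECT), and the Candidate fields are all hypothetical values chosen for the example and would need tuning against real pages. The colour-count check is left out because it requires decoding the image data.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    url: str
    width: int
    height: int
    occurrences: int     # how many times this URL appears on the page
    after_heading: bool  # appears after an h1/h2/... in document order
    has_alt: bool


# Hypothetical weights and thresholds, not recommended values.
WEIGHTS = {"size": 3.0, "aspect": 2.0, "unique": 2.0, "after_heading": 1.0, "alt": 0.5}
MIN_SIDE = 100    # minimum width and height in pixels
MAX_ASPECT = 3.0  # longest side may be at most 3x the shortest


def score(c):
    """Sum the weights of every rule the candidate satisfies."""
    if c.width <= 0 or c.height <= 0:
        return 0.0
    aspect = max(c.width, c.height) / min(c.width, c.height)
    rules = {
        "size": min(c.width, c.height) >= MIN_SIDE,
        "aspect": aspect <= MAX_ASPECT,
        "unique": c.occurrences == 1,
        "after_heading": c.after_heading,
        "alt": c.has_alt,
    }
    return sum(WEIGHTS[name] for name, ok in rules.items() if ok)


def rank(candidates):
    """Best candidate first."""
    return sorted(candidates, key=score, reverse=True)
```

Unlike a hard filter, a failed rule here only lowers the score, so a page with nothing but small images still yields a best guess.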

Also, note that some pages may use CSS (or Flash, etc.) to display images. These are outside the set of images your algorithm considers (as you have defined it); perhaps not a big deal, but something to consider.
