Techniques for extracting the "best" image from a web page
I'm trying to build something akin to Facebook's "Share" functionality for my website.
I've gotten to the point where I can accept a URL, scrape it for meta keywords, and extract a suitable title and description, but I'm a bit stuck as to the best way to determine the 'likely' photos the user may want to share.
I currently use SimpleXMLElement to turn the page into a traversable DOM and find all the <img> tags, turning their src values into absolute URLs. After that, I'm not sure how to go about finding a suitable thumbnail.
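For illustration, the collection step described above could look roughly like the sketch below. It is written in Python with the standard library rather than PHP's SimpleXMLElement, and the names `ImgCollector` and `image_urls` are hypothetical. It assumes the page does not use a <base href> element, which would change how relative URLs resolve.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgCollector(HTMLParser):
    """Collects the src of every <img> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # urljoin handles relative, root-relative and already-absolute URLs.
            self.urls.append(urljoin(self.base_url, src))


def image_urls(html, base_url):
    """Return absolute URLs for all <img> tags in document order."""
    collector = ImgCollector(base_url)
    collector.feed(html)
    return collector.urls
```

Calling `image_urls(page_html, page_url)` returns the candidates in the order they appear in the markup, which keeps position on the page available as a later ranking signal.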
Do I download them all, and go by file size? Do I use some sort of heuristic like, "was encountered in the middle of the page"?
Does anyone else have any recommendations, suggestions, or tips?
Comments (2)
I wrote something similar a while ago to get images from scraped blog posts. My approach was to get a list of all the images on the page and then assign each one 'priority points' according to a set of criteria.
Then pick the one with the most priority points. It certainly wasn't foolproof or overly scientific but it got something useful far more often than not.
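A minimal sketch of the priority-points idea follows. The specific criteria used in this answer are not available here, so every rule and point value below is an illustrative assumption, as are the function names and the dictionary keys (`src`, `width`, `height`, `alt`).

```python
def priority_points(img):
    """Return illustrative priority points for one image.

    `img` is a dict with hypothetical keys: 'src', 'width', 'height', 'alt'.
    The rules and point values are assumptions, not the answer's own list.
    """
    points = 0
    width = img.get("width") or 0
    height = img.get("height") or 0
    if width >= 200 and height >= 200:
        points += 2  # large enough to plausibly be real content
    if width and height and max(width, height) / min(width, height) > 3:
        points -= 2  # very elongated: likely a banner or divider
    if img.get("alt"):
        points += 1  # described images are more often content
    src = img.get("src", "").lower()
    if any(word in src for word in ("logo", "icon", "sprite", "spacer")):
        points -= 3  # common file names for site chrome
    return points


def pick_best(images):
    """Return the image with the most points, or None for an empty list."""
    return max(images, key=priority_points, default=None)
```

When several images tie, `max` keeps the first one it encounters, so earlier images on the page win ties.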
I don't have any direct experience doing this so I'm not sure that there is any specific best practice, but in general I think a heuristic approach looking at several factors would make sense because of the variability found in website implementations.
I would look at two sets of items: image properties and the context of where and how the images are placed.
Image Properties:
Image Context:
I would assign weights to the previous items and then rank the images you find according to how well each image satisfies the rules.
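One way to read "weights" and "how well each image satisfies the rules" is to let each rule return a value between 0 and 1 and combine them as a weighted sum. The sketch below does that. The three rules, their thresholds, the weights, and the `in_content` flag are all hypothetical stand-ins for whatever properties and context signals you actually collect.

```python
def size_score(img):
    # Saturates at 400x400 pixels; the threshold is an arbitrary assumption.
    area = img.get("width", 0) * img.get("height", 0)
    return min(area / (400 * 400), 1.0)


def squareness_score(img):
    # 1.0 for a square image, approaching 0 for very elongated ones.
    w, h = img.get("width", 0), img.get("height", 0)
    return min(w, h) / max(w, h) if w and h else 0.0


def in_content_score(img):
    # 'in_content' is a hypothetical flag set by the page-parsing code,
    # e.g. True when the image sits inside the main article element.
    return 1.0 if img.get("in_content") else 0.0


# (weight, rule) pairs; the weights are illustrative and sum to 1.
WEIGHTED_RULES = [
    (0.5, size_score),
    (0.2, squareness_score),
    (0.3, in_content_score),
]


def score(img):
    """Weighted sum of how well the image satisfies each rule."""
    return sum(weight * rule(img) for weight, rule in WEIGHTED_RULES)


def rank_images(images):
    """Return the images sorted from best to worst score."""
    return sorted(images, key=score, reverse=True)
```

Returning the full ranking rather than a single winner makes it easy to offer the user the top few candidates to choose from.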
Also, note that some pages may use CSS (or Flash, etc.) to display images. These fall outside the set of images your algorithm sees, since it only looks at <img> tags; perhaps not a big deal, but something to consider.
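If CSS-displayed images turn out to matter, one partial remedy is to also scan inline style attributes for url(...) references. The sketch below, with the hypothetical names `InlineBackgroundCollector` and `background_image_urls`, covers only that case. Images referenced from external stylesheets or <style> blocks, and anything rendered by Flash, would still be missed.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Matches url(x), url('x') and url("x") inside a CSS declaration.
CSS_URL = re.compile(r"""url\(\s*['"]?([^'")\s]+)['"]?\s*\)""", re.IGNORECASE)


class InlineBackgroundCollector(HTMLParser):
    """Collects url(...) references from inline styles that mention 'background'."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        style = dict(attrs).get("style") or ""
        if "background" not in style.lower():
            return
        # Takes every url(...) in the attribute, not only the background one.
        for ref in CSS_URL.findall(style):
            self.urls.append(urljoin(self.base_url, ref))


def background_image_urls(html, base_url):
    """Return absolute URLs of images referenced from inline styles."""
    collector = InlineBackgroundCollector(base_url)
    collector.feed(html)
    return collector.urls
```

The resulting URLs can be merged into the same candidate list as the <img> sources before ranking.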