Similar to Reddit's r/pic sub-reddit, I want to aggregate media from various sources. Some sites use the oEmbed spec to expose media on the page, but not all sites do. I was browsing through Reddit's source because they essentially 'scrape' the links that users submit and retrieve images, videos, etc. They create thumbnails, which are then displayed alongside the link on their site. I would like to do something similar, so I looked at their code[1]. It seems they have custom scrapers for each domain they recognize, plus a generic Scraper class that uses simple logic to get images from any domain: it retrieves the web page, parses the HTML, determines the largest image on the page, and uses that image to generate a thumbnail.
Since it's open source, I could probably reuse the code for my application, but unfortunately I have chosen Perl, as this is a hobby project and I'm trying to learn Perl. Is there a Perl module with similar functionality? If not, is there a Perl module similar to the Python Imaging Library? Two things would be handy: determining image sizes without actually downloading the whole image, and thumbnail generation.
Thanks!
[1] https://github.com/reddit/reddit/blob/master/r2/r2/lib/scraper.py
Comments (2)
Image::Size is the specialised module for determining image dimensions in various formats. Reading the first 1000 octets or so of a resource into a buffer should be enough to cover the various image headers, and Image::Size can then operate on that buffer. I have not tested this.
I do not know of any general scraping module with an API for HTTP range requests that would avoid downloading the whole image resource, but it is easy to subclass WWW::Mechanize.
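A rough, untested-against-real-sites sketch of that idea, using plain LWP::UserAgent (which WWW::Mechanize inherits from) rather than a subclass. The helper names `fetch_head_bytes` and `size_from_bytes` and the 1000-octet default are my own assumptions for illustration, not part of either module. It relies on `imgsize` accepting a reference to a scalar that holds the image data.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use Image::Size qw(imgsize);

# Hypothetical helper: fetch only the first $limit octets of $url.
# Returns the raw bytes, or nothing if the request failed.
sub fetch_head_bytes {
    my ($url, $limit) = @_;
    $limit //= 1000;    # assumed default, taken from the "1000 octets or so" guess

    # max_size caps how much of the body LWP keeps if the server ignores Range.
    my $ua  = LWP::UserAgent->new(timeout => 10, max_size => $limit);
    my $res = $ua->get($url, Range => 'bytes=0-' . ($limit - 1));

    # Both 206 Partial Content and a plain 200 count as success here.
    return unless $res->is_success;
    return substr($res->content, 0, $limit);
}

# Return (width, height, type) for image data held in memory,
# or an empty list if Image::Size could not work out the dimensions.
sub size_from_bytes {
    my ($bytes) = @_;
    my ($width, $height, $type) = imgsize(\$bytes);
    return unless defined $width;
    return ($width, $height, $type);
}
```

Usage would be something like `my ($w, $h, $type) = size_from_bytes(fetch_head_bytes($url));`, after checking that the fetch returned data. One caveat: in a JPEG the dimensions can sit behind a large EXIF block, so 1000 octets may not always be enough and the limit may need raising for that format.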
Try PerlMagick; installation instructions are also listed there.
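For the thumbnail part, a minimal sketch with PerlMagick (the Image::Magick module) might look like the following. The helper name `make_thumbnail` and the 70-pixel default are assumptions of mine, chosen only for illustration. It uses the `Read`, `Thumbnail`, `Write` and `Get` methods, and the usual PerlMagick idiom of checking whether the return value stringifies to an error message.

```perl
use strict;
use warnings;
use Image::Magick;

# Hypothetical helper: write a thumbnail of $src to $dst that fits within
# $max x $max pixels, keeping the aspect ratio. Returns (width, height)
# of the resulting thumbnail.
sub make_thumbnail {
    my ($src, $dst, $max) = @_;
    $max //= 70;    # arbitrary default for this sketch

    my $image = Image::Magick->new;

    # PerlMagick return values stringify to '' on success
    # and to an error message on failure.
    my $err = $image->Read($src);
    die "Read failed: $err" if "$err";

    # Thumbnail resizes to fit the geometry and strips embedded profiles.
    $err = $image->Thumbnail(geometry => "${max}x${max}");
    die "Thumbnail failed: $err" if "$err";

    $err = $image->Write($dst);
    die "Write failed: $err" if "$err";

    return $image->Get('width', 'height');
}
```

For example, `make_thumbnail('photo.jpg', 'photo_thumb.png', 100)` would scale the image to fit inside a 100x100 box. If the image is already in memory from a download, `BlobToImage` can be used in place of `Read`.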