Upload images from a web page
I want to implement a feature similar to http://www.tineye.com/parse?url=yahoo.com - allow users to upload images from any web page.
The main problem for me is that it takes too much time for web pages with a large number of images.
I'm doing this in Django (using curl or urllib) according to the following scheme:
Grab the HTML of the page (takes about 1 sec for big pages):
file = urllib.urlopen(requested_url)
html_string = file.read()
Parse it with an HTML parser (BeautifulSoup), looking for img tags and writing all image src values to a list (takes about 1 sec too for big pages).
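That parsing step isn't included in the question; a minimal sketch of what it could look like, assuming Python 2 with BeautifulSoup 3 to match the code above (get_image_srcs is a hypothetical helper name):

from BeautifulSoup import BeautifulSoup

def get_image_srcs(html_string):
    # Parse the page and collect the src attribute of every img tag.
    soup = BeautifulSoup(html_string)
    return [img["src"] for img in soup.findAll("img") if img.get("src")]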
Check the sizes of all images in my list and, if they are big enough, return them in a JSON response (takes very long, about 15 sec, when there are about 80 images on a web page). Here's the code of the function:
import urllib
from PIL import ImageFile

def get_image_size(uri):
    # Download only the first 1 KB and feed it to PIL's incremental parser;
    # that is usually enough to read the dimensions from the image header.
    file = urllib.urlopen(uri)
    try:
        data = file.read(1024)
        if not data:
            return None
        p = ImageFile.Parser()
        p.feed(data)
        if p.image:
            return p.image.size
        # not an image (or the header didn't fit in the first 1 KB)
        return None
    finally:
        file.close()
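The calling code isn't shown in the question; a rough sketch of the filtering step described above, with a hypothetical size threshold (MIN_WIDTH, MIN_HEIGHT and filter_big_images are made-up names):

MIN_WIDTH, MIN_HEIGHT = 200, 200  # hypothetical threshold for "big enough"

def filter_big_images(image_uris):
    # Call get_image_size once per URI and keep only images above the threshold.
    big = []
    for uri in image_uris:
        size = get_image_size(uri)
        if size and size[0] >= MIN_WIDTH and size[1] >= MIN_HEIGHT:
            big.append({"src": uri, "width": size[0], "height": size[1]})
    return big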
As you can see, I'm not loading the full image to get its size, only 1 KB of it. But it still takes too much time when there are a lot of images (I'm calling this function once for each image found).
So how can I make it work faster?
Maybe there is some way to avoid making a request for every single image?
Any help will be highly appreciated.
Thanks!
2 Answers
I can think of a few optimisations:
Example of a HEAD request:
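The answer's actual example isn't included here; below is a minimal sketch of a HEAD request in Python 2 (httplib), assuming the idea is to read Content-Length from the response headers without downloading the image body (get_content_length is a hypothetical name):

import httplib
from urlparse import urlparse

def get_content_length(uri):
    # Send a HEAD request so only the response headers are transferred,
    # then read Content-Length to judge the file size.
    parsed = urlparse(uri)
    conn = httplib.HTTPConnection(parsed.netloc)
    conn.request("HEAD", parsed.path or "/")
    response = conn.getresponse()
    length = response.getheader("content-length")
    conn.close()
    return int(length) if length is not None else None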
You can use the headers attribute of the file-like object returned by urllib2.urlopen (I don't know about urllib).
Here's a test I wrote for it. As you can see, it is rather fast, though I imagine some websites would block too many repeated requests.
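The answerer's original test script isn't included in this copy; a minimal sketch of the idea, assuming Python 2 and a couple of hypothetical image URLs:

import time
import urllib2

# Hypothetical list of image URLs to check.
uris = (
    "http://example.com/images/first.jpg",
    "http://example.com/images/second.png",
)

start = time.time()
for uri in uris:
    response = urllib2.urlopen(uri)
    # response.headers is a dict-like object with the HTTP response headers.
    print uri, response.headers.get("content-length")
    response.close()
print "took %.2f seconds" % (time.time() - start)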