Upload images from a web page



I want to implement a feature similar to this: http://www.tineye.com/parse?url=yahoo.com - allow users to upload images from any web page.

The main problem for me is that it takes too much time for web pages with a large number of images.

I'm doing this in Django (using curl or urllib) according to the following scheme:

  1. Grab the HTML of the page (takes about 1 sec for big pages):

    file = urllib.urlopen(requested_url)
    html_string = file.read()
    
  2. Parse it with an HTML parser (BeautifulSoup), look for img tags, and write each image's src to a list (also takes about 1 sec for big pages; a sketch of this step is shown after the code below).

  3. Check the sizes of all images in the list and, if they are big enough, return them in a JSON response (takes very long, about 15 sec, when there are about 80 images on a web page). Here's the code of the function:


    import urllib
    from PIL import ImageFile

    def get_image_size(uri):
        # Read only the first 1 KB and feed it to PIL's incremental parser;
        # for most formats that is enough to get the dimensions from the header.
        file = urllib.urlopen(uri)
        try:
            data = file.read(1024)
        finally:
            file.close()  # always release the connection
        if not data:
            return None
        p = ImageFile.Parser()
        p.feed(data)
        if p.image:
            return p.image.size
        # not an image (or the header is not within the first 1 KB)
        return None

As you can see, I'm not loading the full image to get its size, only the first 1 KB of it. But it still takes too much time when there are a lot of images (I'm calling this function once for each image found).
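For reference, step 2 above might look roughly like this. This is a minimal sketch, assuming BeautifulSoup 3's findAll (with bs4 the call is find_all) and that relative src values need to be resolved against the page URL; get_image_urls is just an illustrative name:

    from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; with bs4: from bs4 import BeautifulSoup
    from urlparse import urljoin

    def get_image_urls(requested_url, html_string):
        # Collect the absolute URL of every <img src=...> on the page.
        soup = BeautifulSoup(html_string)
        urls = []
        for img in soup.findAll('img'):
            src = img.get('src')
            if src:
                urls.append(urljoin(requested_url, src))
        return urls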

So how can I make it work faster?

Maybe there is some way to avoid making a request for every single image?

Any help will be highly appreciated.

Thanks!

Comments (2)

り繁华旳梦境 2024-11-07 14:59:46


I can think of a few optimisations:

  1. parse as you are reading the file from the stream
  2. use a SAX parser (which works well with the point above)
  3. use HEAD requests to get the size of the images
  4. put your image URLs in a queue, then use a few threads to connect and fetch the file sizes with HEAD requests (see the sketch after the example below)

Example of a HEAD request:

$ telnet m.onet.pl 80
Trying 213.180.150.45...
Connected to m.onet.pl.
Escape character is '^]'.
HEAD /_m/33fb7563935e11c0cba62f504d91675f,59,29,134-68-525-303-0.jpg HTTP/1.1
host: m.onet.pl

HTTP/1.0 200 OK
Server: nginx/0.8.53
Date: Sat, 09 Apr 2011 18:32:44 GMT
Content-Type: image/jpeg
Content-Length: 37545
Last-Modified: Sat, 09 Apr 2011 18:29:22 GMT
Expires: Sat, 16 Apr 2011 18:32:44 GMT
Cache-Control: max-age=604800
Accept-Ranges: bytes
Age: 6575
X-Cache: HIT from emka1.m10r2.onet
Via: 1.1 emka1.m10r2.onet:80 (squid)
Connection: close

Connection closed by foreign host.
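Putting suggestions 3 and 4 together might look roughly like this (a minimal sketch using Python 2's httplib, threading and Queue; head_content_length and sizes_for are illustrative names, error handling is minimal, and HTTPS is not handled):

    import httplib
    import threading
    import Queue
    from urlparse import urlparse

    def head_content_length(url):
        # Send a HEAD request and read Content-Length from the headers,
        # so the image body is never downloaded.
        parsed = urlparse(url)
        conn = httplib.HTTPConnection(parsed.netloc)
        try:
            conn.request("HEAD", parsed.path or "/")
            length = conn.getresponse().getheader("Content-Length")
            return int(length) if length is not None else None
        except Exception:
            return None
        finally:
            conn.close()

    def sizes_for(urls, num_threads=10):
        # Fan the URLs out over a few worker threads; each worker drains the
        # queue and records the Content-Length it got back.
        tasks = Queue.Queue()
        results = {}

        def worker():
            while True:
                try:
                    url = tasks.get_nowait()
                except Queue.Empty:
                    return
                results[url] = head_content_length(url)

        for url in urls:
            tasks.put(url)
        threads = [threading.Thread(target=worker) for _ in range(num_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return results

Note that Content-Length is the size of the file in bytes, not its pixel dimensions, so it only works as a rough "big enough" filter; for exact dimensions you still need the first bytes of the image itself (as in the ImageFile.Parser approach above).
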
貪欢 2024-11-07 14:59:46


You can use the headers attribute of the file-like object returned by urllib2.urlopen (I don't know about urllib).

Here's a test I wrote for it. As you can see, it is rather fast, though I imagine some websites would block too many repeated requests.

|milo|laurie|¥ cat test.py
import urllib2
uri = "http://download.thinkbroadband.com/1GB.zip"

def get_file_size(uri):
    file = urllib2.urlopen(uri)
    content_header, = [header for header in file.headers.headers if header.startswith("Content-Length")]
    _, str_length = content_header.split(':')
    length = int(str_length.strip())
    return length

if __name__ == "__main__":
    get_file_size(uri)
|milo|laurie|¥ time python2 test.py
python2 test.py  0.06s user 0.01s system 35% cpu 0.196 total
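Tying this back to the original question, the same idea could be used to filter the parsed image URLs by Content-Length before deciding which ones to return. This is a minimal sketch; MIN_BYTES and filter_big_images are illustrative names, and getheader is used instead of scanning headers.headers:

    import urllib2

    MIN_BYTES = 10 * 1024  # hypothetical "big enough" threshold, in bytes

    def filter_big_images(image_urls, min_bytes=MIN_BYTES):
        # Keep only the URLs whose Content-Length header reports a large enough file;
        # the response body is never read, so only headers travel over the wire.
        big = []
        for url in image_urls:
            try:
                response = urllib2.urlopen(url)
                length = response.headers.getheader("Content-Length")
                response.close()
            except Exception:
                continue
            if length is not None and int(length) >= min_bytes:
                big.append(url)
        return big

One caveat: some servers answer with chunked transfer encoding and send no Content-Length at all, in which case this approach has nothing to go on and you would have to fall back to reading the first bytes of the image.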