使用 pycurl 获取很多页面？

发布于 2024-08-15 18:06:46 字数 260 浏览 13 评论 0原文

我想从网站获取许多页面，例如

curl "http://farmsubsidy.org/DE/browse?page=[0000-3603]" -o "de.#1"

在 python 中获取页面数据，而不是磁盘文件。有人可以发布 pycurl 代码来执行此操作吗，
或者如果可能的话，快速 urllib2 （不是一次一个），
或者说“算了，curl 更快更健壮”？谢谢

原文

I want to get many pages from a website, like

curl "http://farmsubsidy.org/DE/browse?page=[0000-3603]" -o "de.#1"

but get the pages' data in python, not disk files.
Can someone please post pycurl code to do this,
or fast urllib2 (not one-at-a-time) if that's possible,
or else say "forget it, curl is faster and more robust" ? Thanks

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

终难遇 2024-08-22 18:06:46

所以你有两个问题，让我用一个例子向你展示。请注意，pycurl 已经完成了多线程/而不是一次一个，无需您的努力。

#! /usr/bin/env python

import sys, select, time
import pycurl,StringIO

c1 = pycurl.Curl()
c2 = pycurl.Curl()
c3 = pycurl.Curl()
c1.setopt(c1.URL, "http://www.python.org")
c2.setopt(c2.URL, "http://curl.haxx.se")
c3.setopt(c3.URL, "http://slashdot.org")
s1 = StringIO.StringIO()
s2 = StringIO.StringIO()
s3 = StringIO.StringIO()
c1.setopt(c1.WRITEFUNCTION, s1.write)
c2.setopt(c2.WRITEFUNCTION, s2.write)
c3.setopt(c3.WRITEFUNCTION, s3.write)

m = pycurl.CurlMulti()
m.add_handle(c1)
m.add_handle(c2)
m.add_handle(c3)

# Number of seconds to wait for a timeout to happen
SELECT_TIMEOUT = 1.0

# Stir the state machine into action
while 1:
    ret, num_handles = m.perform()
    if ret != pycurl.E_CALL_MULTI_PERFORM:
        break

# Keep going until all the connections have terminated
while num_handles:
    # The select method uses fdset internally to determine which file descriptors
    # to check.
    m.select(SELECT_TIMEOUT)
    while 1:
        ret, num_handles = m.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break

# Cleanup
m.remove_handle(c3)
m.remove_handle(c2)
m.remove_handle(c1)
m.close()
c1.close()
c2.close()
c3.close()
print "http://www.python.org is ",s1.getvalue()
print "http://curl.haxx.se is ",s2.getvalue()
print "http://slashdot.org is ",s3.getvalue()

最后，这些代码主要基于 pycurl 站点上的示例 =.=

可能您真的应该阅读文档。人们花了大量的时间在上面。

So you have 2 problem and let me show you in one example. Notice the pycurl already did the multithreading/not one-at-a-time w/o your hardwork.

#! /usr/bin/env python

import sys, select, time
import pycurl,StringIO

c1 = pycurl.Curl()
c2 = pycurl.Curl()
c3 = pycurl.Curl()
c1.setopt(c1.URL, "http://www.python.org")
c2.setopt(c2.URL, "http://curl.haxx.se")
c3.setopt(c3.URL, "http://slashdot.org")
s1 = StringIO.StringIO()
s2 = StringIO.StringIO()
s3 = StringIO.StringIO()
c1.setopt(c1.WRITEFUNCTION, s1.write)
c2.setopt(c2.WRITEFUNCTION, s2.write)
c3.setopt(c3.WRITEFUNCTION, s3.write)

m = pycurl.CurlMulti()
m.add_handle(c1)
m.add_handle(c2)
m.add_handle(c3)

# Number of seconds to wait for a timeout to happen
SELECT_TIMEOUT = 1.0

# Stir the state machine into action
while 1:
    ret, num_handles = m.perform()
    if ret != pycurl.E_CALL_MULTI_PERFORM:
        break

# Keep going until all the connections have terminated
while num_handles:
    # The select method uses fdset internally to determine which file descriptors
    # to check.
    m.select(SELECT_TIMEOUT)
    while 1:
        ret, num_handles = m.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break

# Cleanup
m.remove_handle(c3)
m.remove_handle(c2)
m.remove_handle(c1)
m.close()
c1.close()
c2.close()
c3.close()
print "http://www.python.org is ",s1.getvalue()
print "http://curl.haxx.se is ",s2.getvalue()
print "http://slashdot.org is ",s3.getvalue()

Finally, these code is mainly based on an example on the pycurl site =.=

may be you should really read doc. ppl spend huge time on it.

回复收藏 0 原文

世界等同你 2024-08-22 18:06:46

这是一个基于 urllib2 和线程的解决方案。

import urllib2
from threading import Thread

BASE_URL = 'http://farmsubsidy.org/DE/browse?page='
NUM_RANGE = range(0000, 3603)
THREADS = 2

def main():
    for nums in split_seq(NUM_RANGE, THREADS):
        t = Spider(BASE_URL, nums)
        t.start()

def split_seq(seq, num_pieces):
    start = 0
    for i in xrange(num_pieces):
        stop = start + len(seq[i::num_pieces])
        yield seq[start:stop]
        start = stop

class Spider(Thread):
    def __init__(self, base_url, nums):
        Thread.__init__(self)
        self.base_url = base_url
        self.nums = nums
    def run(self):
        for num in self.nums:
            url = '%s%s' % (self.base_url, num)
            data = urllib2.urlopen(url).read()
            print data

if __name__ == '__main__':
    main()

here is a solution based on urllib2 and threads.

import urllib2
from threading import Thread

BASE_URL = 'http://farmsubsidy.org/DE/browse?page='
NUM_RANGE = range(0000, 3603)
THREADS = 2

def main():
    for nums in split_seq(NUM_RANGE, THREADS):
        t = Spider(BASE_URL, nums)
        t.start()

def split_seq(seq, num_pieces):
    start = 0
    for i in xrange(num_pieces):
        stop = start + len(seq[i::num_pieces])
        yield seq[start:stop]
        start = stop

class Spider(Thread):
    def __init__(self, base_url, nums):
        Thread.__init__(self)
        self.base_url = base_url
        self.nums = nums
    def run(self):
        for num in self.nums:
            url = '%s%s' % (self.base_url, num)
            data = urllib2.urlopen(url).read()
            print data

if __name__ == '__main__':
    main()

回复收藏 0 原文

烟花肆意 2024-08-22 18:06:46

您可以将其放入 for 循环内的 bash 脚本中。

然而，使用 python 解析每个页面可能会取得更好的成功。
http://www.securitytube.net/爬网以获得乐趣和利润-video.aspx
您将能够获取准确的数据并将其同时保存到数据库中。
http:// www.securitytube.net/Storing-Mined-Data-from-the-Web-for-Fun-and-Profit-video.aspx

回复收藏 0 原文

最后的乘客 2024-08-22 18:06:46

如果你想使用 python 抓取网站，你应该看看 scrapy http://scrapy.org

回复收藏 0 原文

晨曦慕雪 2024-08-22 18:06:46

使用 BeautifulSoup4 和 requests -

抓取头页：

page = Soup(requests.get(url='http://rootpage.htm').text)

创建请求数组：

from requests import async

requests = [async.get(url.get('href')) for url in page('a')]
responses = async.map(requests)

[dosomething(response.text) for response in responses]

顺便说一句，请求需要 gevent 来执行此操作。

Using BeautifulSoup4 and requests -

Grab head page:

page = Soup(requests.get(url='http://rootpage.htm').text)

Create an array of requests:

from requests import async

requests = [async.get(url.get('href')) for url in page('a')]
responses = async.map(requests)

[dosomething(response.text) for response in responses]

Requests requires gevent to do this btw.

回复收藏 0 原文

神魇的王 2024-08-22 18:06:46

我可以推荐您使用 human_curl 的异步模块

看示例：

from urlparse import urljoin 
from datetime import datetime

from human_curl.async import AsyncClient 
from human_curl.utils import stdout_debug

def success_callback(response, **kwargs):
    """This function call when response successed
    """
    print("success callback")
    print(response, response.request)
    print(response.headers)
    print(response.content)
    print(kwargs)

def fail_callback(request, opener, **kwargs):
    """Collect errors
    """
    print("fail callback")
    print(request, opener)
    print(kwargs)

with AsyncClient(success_callback=success_callback,
                 fail_callback=fail_callback) as async_client:
    for x in xrange(10000):
        async_client.get('http://google.com/', params=(("x", str(x)),)
        async_client.get('http://google.com/', params=(("x", str(x)),),
                        success_callback=success_callback, fail_callback=fail_callback)

用法非常简单。然后页面成功加载失败的 async_client 调用您的回调。您还可以指定并行连接的数量。

I can recommend you to user async module of human_curl

Look example:

from urlparse import urljoin 
from datetime import datetime

from human_curl.async import AsyncClient 
from human_curl.utils import stdout_debug

def success_callback(response, **kwargs):
    """This function call when response successed
    """
    print("success callback")
    print(response, response.request)
    print(response.headers)
    print(response.content)
    print(kwargs)

def fail_callback(request, opener, **kwargs):
    """Collect errors
    """
    print("fail callback")
    print(request, opener)
    print(kwargs)

with AsyncClient(success_callback=success_callback,
                 fail_callback=fail_callback) as async_client:
    for x in xrange(10000):
        async_client.get('http://google.com/', params=(("x", str(x)),)
        async_client.get('http://google.com/', params=(("x", str(x)),),
                        success_callback=success_callback, fail_callback=fail_callback)

Usage very simple. Then page success loaded of failed async_client call you callback. Also you can specify number on parallel connections.

回复收藏 0 原文

~没有更多了~