Maintaining a session while scraping pages with Eventlet?

Published 2024-08-22 07:35:15

I'm trying to do some scraping of a site that requires authentication (not http auth). The script I'm using is based on this eventlet example. Basically,

urls = ["https://mysecuresite.com/data.aspx?itemid=blah1",
     "https://mysecuresite.com/data.aspx?itemid=blah2",
     "https://mysecuresite.com/data.aspx?itemid=blah3"]

import eventlet
from eventlet.green import urllib2  

def fetch(url):
  print "opening", url
  body = urllib2.urlopen(url).read()
  print "done with", url
  return url, body

pool = eventlet.GreenPool(10)
for url, body in pool.imap(fetch, urls):
  print "got body from", url, "of length", len(body)

Establishing the session is not simple at all; I have to load the login page, extract some variables from the login form, then send a POST request with the auth details and those variables. After the session is good, the rest of the requests are simple GET requests.
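
Roughly, that session setup looks like the sketch below (a minimal illustration only; the login URL, form field names, and regex are placeholders, not the real site's):

import re
import urllib
from eventlet.green import urllib2

def make_session(login_url, username, password):
    # one opener per session; the HTTPCookieProcessor keeps the auth cookies across requests
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
    login_page = opener.open(login_url).read()
    # pull whatever hidden fields the login form requires
    fields = dict(re.findall(r'<input type="hidden" name="([^"]+)" value="([^"]*)"', login_page))
    fields.update({"username": username, "password": password})
    opener.open(login_url, urllib.urlencode(fields))  # POST; the session cookies stick to the opener
    return opener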

Using the above code as a reference point, how would I create a session that the rest of the pool would use? (I need the subsequent requests to be made in parallel)

Comments (3)

奢华的一滴泪 2024-08-29 07:35:15

I'm not an expert on this by any means, but it looks like the standard way to maintain session state with urllib2 is to create a custom opener instance for each session. That looks like this:

opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())

Then you use that opener to do whatever authentication you have to, and all the session state will remain within the opener object itself. Then you can pass the opener object as an argument for the parallel requests.

Here is an example script that logs in to secondlife.com for multiple users in parallel, and makes multiple page requests for each user, also in parallel. The login procedure for this particular site is tricky because it involves capturing a CSRF token from the first request before being able to log in with the second. For that reason, the login method is quite messy. The principle should be the same, though, for whatever site you're interested in.

import eventlet
from eventlet.green import urllib2
import re

login_url = 'https://secure-web28.secondlife.com/my/account/login.php?lang=en&type=second-life-member&nextpage=/my/index.php?lang=en'

pool = eventlet.GreenPool(10)

def fetch_title(opener, url):
    # fetch the page through the session's opener and pull out its <title>
    match = re.search(r'<title>(.*)</title>', opener.open(url).read())
    if match:
        return match.group(1)
    else:
        return "no title"

def login(login_url, fullname, password):
    # each session gets its own opener; the HTTPCookieProcessor holds its login cookies
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
    login_page = opener.open(login_url).read()
    # the CSRF token from the login form has to be echoed back in the POST body
    csrf_token = re.search(r'<input type="hidden" name="CSRFToken" value="(.*)"/>', login_page).group(1)
    username, lastname = fullname.split()
    auth = "CSRFToken=%s&form[type]=second-life-member&form[nextpage]=/my/index.php?lang=en&form[persistent]=Y&form[form_action]=Log%%20In&form[form_lang]=en&form[username]=%s&form[lastname]=%s&form[password]=%s&submit=Submit" % (
        csrf_token, username, lastname, password)
    logged_in = opener.open(login_url, auth).read()
    return opener


def login_and_fetch(login_url, fullname, password, page_urls):
    opener = login(login_url, fullname, password)
    # note that this deliberately uses the global pool
    pile = eventlet.GreenPile(pool)
    for url in page_urls:
        pile.spawn(fetch_title, opener, url)

    return pile

login_urls = [login_url] *2
usernames = [...]
passwords = [...]
page_urls = [['https://secure-web28.secondlife.com/my/account/?lang=en-US',
        'https://secure-web28.secondlife.com/my/community/events/index.php?lang=en-US']] * 2

for user_iter in pool.imap(login_and_fetch, login_urls, usernames, passwords, page_urls):
    for title in user_iter:
        print "got title", title

掐死时间 2024-08-29 07:35:15

As suggested below, use mechanize. It'll take care of the low-level details, like cookie management, for you.

However, to make a 3rd-party library work with eventlet, you need to replace the stdlib socket and ssl objects with counterparts that are asynchronous under the hood.

This is doable in eventlet, but it's not very straightforward here.
I recommend using gevent, where all you have to do is

from gevent import monkey; monkey.patch_all()

and then 3rd party libraries should just work.

Here's an example.
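
To make that concrete, here is a minimal sketch (not the example that was originally linked) of the gevent + mechanize combination; the login URL, form position, and field names are assumptions:

from gevent import monkey; monkey.patch_all()  # patch as early as possible, before other imports
import gevent
import mechanize

def login(login_url, username, password):
    cj = mechanize.CookieJar()
    br = mechanize.Browser()
    br.set_cookiejar(cj)
    br.set_handle_robots(False)
    br.open(login_url)
    br.select_form(nr=0)         # assume the login form is the first form on the page
    br["username"] = username    # placeholder field names
    br["password"] = password
    br.submit()                  # the session cookies now live in the cookie jar
    return cj

def fetch(cj, url):
    # a fresh Browser per greenlet, all sharing the logged-in cookie jar
    br = mechanize.Browser()
    br.set_cookiejar(cj)
    br.set_handle_robots(False)
    return url, br.open(url).read()

cj = login("https://mysecuresite.com/login.aspx", "me", "secret")
urls = ["https://mysecuresite.com/data.aspx?itemid=blah%d" % i for i in range(1, 4)]
jobs = [gevent.spawn(fetch, cj, url) for url in urls]
gevent.joinall(jobs)
for job in jobs:
    url, body = job.value
    print "got body from", url, "of length", len(body)

Giving each greenlet its own Browser while sharing one CookieJar keeps the greenlets from fighting over the Browser's per-page state (current response, history).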

这个俗人 2024-08-29 07:35:15

You can use the mechanize library to make establishing the session easier, then use one of the various threading/multiprocessing techniques, such as a threading pool recipe (the first hit on Google is probably a bit overkill; make sure you read the comments).
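
As a rough sketch of that combination, the stdlib's thread-backed multiprocessing.dummy.Pool can stand in for the recipe; the login URL and form field names below are assumptions, not taken from the question's site:

import mechanize
from multiprocessing.dummy import Pool   # a thread pool behind the multiprocessing API

def login(login_url, username, password):
    cj = mechanize.CookieJar()
    br = mechanize.Browser()
    br.set_cookiejar(cj)
    br.set_handle_robots(False)
    br.open(login_url)
    br.select_form(nr=0)                  # assume the login form is the first on the page
    br["username"] = username             # placeholder field names
    br["password"] = password
    br.submit()
    return cj                             # the cookie jar now carries the session

def fetch(args):
    cj, url = args
    br = mechanize.Browser()              # one Browser per worker call...
    br.set_cookiejar(cj)                  # ...sharing the logged-in cookies
    br.set_handle_robots(False)
    return url, br.open(url).read()

cj = login("https://mysecuresite.com/login.aspx", "me", "secret")
urls = ["https://mysecuresite.com/data.aspx?itemid=blah%d" % i for i in range(1, 4)]
pool = Pool(10)
for url, body in pool.imap(fetch, [(cj, url) for url in urls]):
    print "got body from", url, "of length", len(body)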
