Why do I get a new session ID on every page fetch in my Perl WWW::Mechanize script?
So I'm scraping a site that I have access to via HTTPS. I can log in and start the process, but each time I hit a new page (URL) the cookie session ID changes. How do I keep the logged-in cookie session ID?
#!/usr/bin/perl -w
use strict;
use warnings;
use WWW::Mechanize;
use HTTP::Cookies;
use LWP::Debug qw(+);
use HTTP::Request;
use LWP::UserAgent;
use HTTP::Request::Common;
my $un = 'username';
my $pw = 'password';
my $url = 'https://subdomain.url.com/index.do';
my $agent = WWW::Mechanize->new(cookie_jar => {}, autocheck => 0);
$agent->{onerror}=\&WWW::Mechanize::_warn;
$agent->agent('Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.3) Gecko/20100407 Ubuntu/9.10 (karmic) Firefox/3.6.3');
$agent->get($url);
$agent->form_name('form');
$agent->field(username => $un);
$agent->field(password => $pw);
$agent->click("Log In");
print "After Login Cookie: ";
print $agent->cookie_jar->as_string();
print "\n\n";
my $searchURL='https://subdomain.url.com/search.do';
$agent->get($searchURL);
print "After Search Cookie: ";
print $agent->cookie_jar->as_string();
print "\n";
The output:
After Login Cookie: Set-Cookie3: JSESSIONID=367C6D; path="/thepath"; domain=subdomina.url.com; path_spec; secure; discard; version=0
After Search Cookie: Set-Cookie3: JSESSIONID=855402; path="/thepath"; domain=subdomain.com.com; path_spec; secure; discard; version=0
Also, I think the site requires a CERT (well, in the browser it does). Would this be the correct way to add it?
$ENV{HTTPS_CERT_FILE} = 'SUBDOMAIN.URL.COM'; ## Insert this after the use HTTP::Request...
Also, for the CERT I'm using the first option in this list. Is this correct?
X.509 Certificate (PEM)
X.509 Certificate with chain (PEM)
X.509 Certificate (DER)
X.509 Certificate (PKCS#7)
X.509 Certificate with chain (PKCS#7)
Comments (3)
Set up the cookie jar, something akin to this:
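The snippet that followed this answer did not survive on this page, so what follows is only a minimal sketch of one way to do it. It assumes an in-memory HTTP::Cookies jar attached through the cookie_jar accessor that WWW::Mechanize inherits from LWP::UserAgent, and the variable names are illustrative. Note that the cookie_jar => {} argument in the question already asks LWP to create such a jar, so this step alone may not change the behavior.

```perl
use strict;
use warnings;
use HTTP::Cookies;
use WWW::Mechanize;

# An in-memory jar; pass file => '...' to HTTP::Cookies->new to persist it.
my $cookie_jar = HTTP::Cookies->new;

my $agent = WWW::Mechanize->new(autocheck => 0);

# Attach the jar; every later request made through $agent reads and updates it.
$agent->cookie_jar($cookie_jar);
```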
When your user-agent isn't doing something you think it should be doing, compare its requests with those of an interactive browser. Firefox plugins are handy for this sort of thing.
You're probably missing part of the process that the server expects. You probably aren't logging in or interacting correctly, and that could be for all sorts of reasons. For instance, there might be JavaScript on the page that WWW::Mechanize isn't handling.
When you can pinpoint what an interactive browser is doing that you are not, you'll know where you need to improve your script.
In your script, you can also watch what is happening by turning on debugging in LWP, which Mech is built on:
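The debugging snippet is missing here. The question's script already loads LWP::Debug, which recent libwww-perl releases document as deprecated, so the sketch below uses a different technique: LWP::UserAgent's add_handler hooks, which WWW::Mechanize inherits. The handler bodies are an assumption about what is useful to print; they write the request line, the status line, and the headers to STDERR so the Cookie and Set-Cookie headers can be compared between requests.

```perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new(cookie_jar => {}, autocheck => 0);

# Runs just before each request is sent; the Cookie header is already in place.
$mech->add_handler(request_send => sub {
    my ($request) = @_;
    print STDERR ">>> ", $request->method, " ", $request->uri, "\n",
                 $request->headers->as_string, "\n";
    return;    # returning nothing lets the request proceed normally
});

# Runs once each response has been fully received.
$mech->add_handler(response_done => sub {
    my ($response) = @_;
    print STDERR "<<< ", $response->status_line, "\n",
                 $response->headers->as_string, "\n";
    return;
});
```

With these in place, each $mech->get or form submission shows which JSESSIONID was sent and which one, if any, the server set in reply.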
rjh already answered the certificate part of your question.
If your session cookie changes on every page load, then you are likely not logging in correctly. But you could try forcing the JSESSIONID to be the same for each request. Construct your own cookie jar and tell WWW::Mechanize to use it:
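The original snippet is not preserved on this page. A minimal sketch of such a jar, assuming a file-backed HTTP::Cookies object handed to the WWW::Mechanize constructor, might look like this. The file name cookies.txt is a placeholder.

```perl
use strict;
use warnings;
use HTTP::Cookies;
use WWW::Mechanize;

# 'cookies.txt' is a placeholder path. Nothing is written until save is called
# (autosave => 1 could be added to write the file when the jar is destroyed).
my $cookie_jar = HTTP::Cookies->new(
    file           => 'cookies.txt',
    ignore_discard => 1,    # keep session ("discard") cookies when saving
);

my $mech = WWW::Mechanize->new(cookie_jar => $cookie_jar, autocheck => 0);
```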
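The ignore_discard => 1 option means that even session cookies are saved to disk (normally they are discarded for security reasons).
Then, after logging in, save the jar to disk (HTTP::Cookies provides a save method for this).
Then, after each request, reload the saved copy (the revert method), so that any replacement session cookie is thrown away.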
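The calls themselves are missing from this page. The sketch below assumes the intended pattern is HTTP::Cookies' save after login and revert after each request, and it simulates the server's responses with set_cookie so that it runs without a network connection. The session_id helper and the temporary file are illustrative, not part of the original answer.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);
use HTTP::Cookies;

# Temporary location standing in for the jar's cookie file.
my $dir         = tempdir(CLEANUP => 1);
my $cookie_file = File::Spec->catfile($dir, 'cookies.txt');

my $cookie_jar = HTTP::Cookies->new(file => $cookie_file, ignore_discard => 1);

# Return the JSESSIONID value currently in the jar, or undef if there is none.
sub session_id {
    my ($jar) = @_;
    my $id;
    $jar->scan(sub { $id = $_[2] if $_[1] eq 'JSESSIONID' });
    return $id;
}

# Stand-in for the login response. Arguments are:
# version, key, value, path, domain, port, path_spec, secure, maxage, discard
$cookie_jar->set_cookie(0, 'JSESSIONID', '367C6D', '/thepath',
                        'subdomain.url.com', undef, 1, 1, undef, 1);
$cookie_jar->save;      # after logging in

# Stand-in for a later response that replaces the session cookie.
$cookie_jar->set_cookie(0, 'JSESSIONID', '855402', '/thepath',
                        'subdomain.url.com', undef, 1, 1, undef, 1);
$cookie_jar->revert;    # after each request: reload the saved jar

print "Session ID after revert: ", session_id($cookie_jar), "\n";
```

In a real script the two set_cookie calls would be replaced by the login submission and by each later $mech->get. If the server issued a new ID because it never accepted the first one, sending the old ID back will not by itself make the login succeed.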
Alternatively, you could subclass HTTP::Cookies and override the set_cookie method to reject re-setting the session cookie if it already exists.
Some browsers (Internet Explorer, for example) prompt for a security certificate even if one is not needed. If you are not getting any errors and the response content looks good, you probably don't need to set one.
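To illustrate the subclassing alternative mentioned above, here is a minimal sketch. The class name StickySessionJar is hypothetical, and the cookie name JSESSIONID is taken from the question's output. The override ignores any attempt to set JSESSIONID once the jar already holds one and passes every other cookie through unchanged.

```perl
use strict;
use warnings;

# Hypothetical subclass; only set_cookie is overridden.
package StickySessionJar {
    use parent 'HTTP::Cookies';

    sub set_cookie {
        my ($self, @args) = @_;
        my $key = $args[1];    # arguments are: version, key, value, path, domain, ...

        if (defined $key && $key eq 'JSESSIONID') {
            my $already_set = 0;
            $self->scan(sub { $already_set = 1 if $_[1] eq 'JSESSIONID' });
            return $self if $already_set;    # keep the first session cookie
        }
        return $self->SUPER::set_cookie(@args);
    }
}

package main;

my $jar = StickySessionJar->new;
$jar->set_cookie(0, 'JSESSIONID', '367C6D', '/thepath',
                 'subdomain.url.com', undef, 1, 1, undef, 1);
$jar->set_cookie(0, 'JSESSIONID', '855402', '/thepath',
                 'subdomain.url.com', undef, 1, 1, undef, 1);

print $jar->as_string;    # still shows JSESSIONID=367C6D
```

Such a jar could be passed as cookie_jar => StickySessionJar->new to WWW::Mechanize->new. One consequence of this design is that it also ignores the server's attempts to expire the session cookie, for example on logout.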
If you do have a certificate file, check the POD for Crypt::SSLeay. Your certificate is PEM-encoded so yes, you want to set $ENV{HTTPS_CERT_FILE} to the path of your cert. You might want to set $ENV{HTTPS_DEBUG} = 1 to see what's happening.
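As a rough sketch of that setup, assuming Crypt::SSLeay is the SSL backend in use: the variable names come from the Crypt::SSLeay documentation, while both file paths are placeholders. The value is a path to a PEM file on disk, not a host name, and a client certificate normally needs its private key as well.

```perl
use strict;
use warnings;

# Set these before the first HTTPS request is made. Both paths are placeholders.
$ENV{HTTPS_CERT_FILE} = '/path/to/client-cert.pem';    # PEM-encoded client certificate
$ENV{HTTPS_KEY_FILE}  = '/path/to/client-key.pem';     # matching private key (PEM)
$ENV{HTTPS_DEBUG}     = 1;                             # verbose SSL diagnostics
```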