Why do I get a new session ID every time I fetch a page in my Perl WWW::Mechanize script?

Posted on 2024-08-30 09:59:07

I'm scraping a site that I have access to via HTTPS. I can log in and start the process, but each time I hit a new page (URL) the session ID in the cookie changes. How do I keep the session ID I got at login?

#!/usr/bin/perl -w
use strict;
use warnings;
use WWW::Mechanize;
use HTTP::Cookies;
use LWP::Debug qw(+);
use HTTP::Request;
use LWP::UserAgent;
use HTTP::Request::Common;

my $un = 'username';
my $pw = 'password';

my $url = 'https://subdomain.url.com/index.do';

my $agent = WWW::Mechanize->new(cookie_jar => {}, autocheck => 0);
$agent->{onerror}=\&WWW::Mechanize::_warn;
$agent->agent('Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.3) Gecko/20100407 Ubuntu/9.10 (karmic) Firefox/3.6.3');
$agent->get($url);

$agent->form_name('form');
$agent->field(username => $un);
$agent->field(password => $pw);
$agent->click("Log In");

print "After Login Cookie: ";
print $agent->cookie_jar->as_string();
print "\n\n";

my $searchURL='https://subdomain.url.com/search.do';
$agent->get($searchURL);    

print "After Search Cookie: ";
print $agent->cookie_jar->as_string();
print "\n";

The output:

After Login Cookie: Set-Cookie3: JSESSIONID=367C6D; path="/thepath"; domain=subdomina.url.com; path_spec; secure; discard; version=0

After Search Cookie: Set-Cookie3: JSESSIONID=855402; path="/thepath"; domain=subdomain.com.com; path_spec; secure; discard; version=0

Also, I think the site requires a certificate (in the browser it does). Would this be the correct way to add it?

$ENV{HTTPS_CERT_FILE} = 'SUBDOMAIN.URL.COM'; ## Insert this after the use HTTP::Request...

Also, for the certificate I'm using the first option in this list. Is this correct?

X.509 Certificate (PEM)
X.509 Certificate with chain (PEM)
X.509 Certificate (DER)
X.509 Certificate (PKCS#7)
X.509 Certificate with chain (PKCS#7)

Comments (3)

安人多梦 2024-09-06 09:59:17

Set up a cookie jar, something like this:

my $cookie = HTTP::Cookies->new(file => 'cookie', autosave => 1);
my $mech = WWW::Mechanize->new(cookie_jar => $cookie, ....);

蓝海 2024-09-06 09:59:16

When your user-agent isn't doing something you think it should be doing, compare its requests with those of an interactive browser. Firefox plugins are handy for this sort of thing.

You're probably missing part of the process that the server expects. You probably aren't logging in or interacting correctly, and that could be for all sorts of reasons. For instance, there might be JavaScript on the page that WWW::Mechanize isn't handling.

When you can pinpoint what an interactive browser is doing that you are not, you'll know where you need to improve your script.

In your script, you can also watch what is happening by turning on debugging in LWP, which Mech is built on:

 use LWP::Debug qw(+); 
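
A note for newer installations: recent libwww-perl releases document LWP::Debug as deprecated, and it may print nothing at all. A rough sketch of an alternative is to attach handlers to the user agent and print the cookie headers yourself. The handler phases `request_send` and `response_done` come from LWP::UserAgent; the output format and the commented-out URL are only illustrative.

```perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new(cookie_jar => {}, autocheck => 0);

# Show which Cookie header (if any) goes out with each request.
$mech->add_handler(request_send => sub {
    my ($request) = @_;
    printf STDERR "> %s %s\n  Cookie: %s\n",
        $request->method, $request->uri,
        $request->header('Cookie') // '(none)';
    return;    # return nothing so the request is sent as usual
});

# Show every Set-Cookie header the server sends back.
$mech->add_handler(response_done => sub {
    my ($response) = @_;
    print STDERR "< ", $response->status_line, "\n";
    print STDERR "  Set-Cookie: $_\n" for $response->header('Set-Cookie');
    return;
});

# Then drive the site as before, for example:
# $mech->get('https://subdomain.url.com/index.do');
```

If the JSESSIONID from the login response never shows up in the Cookie header of the next request, the jar is rejecting it or not matching it to the URL. If it is sent and the server still issues a new one, the server did not accept the login.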

rjh already answered the certificate part of your question.

空名 2024-09-06 09:59:16

If your session cookie changes on every page load, you are most likely not logging in correctly. But you could try forcing the JSESSIONID to be the same for each request. Construct your own cookie jar and tell WWW::Mechanize to use it:

my $cookie_jar = HTTP::Cookies->new(file => 'cookies', autosave => 1, ignore_discard => 1);
my $agent = WWW::Mechanize->new(cookie_jar => $cookie_jar, autocheck => 0);

The ignore_discard => 1 means that even session cookies are saved to disk (normally they are discarded for security reasons).

Then, after logging in, call:

$cookie_jar->save;

Then, after each request:

$cookie_jar->revert;  # re-loads the save
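
To see what save and revert do without touching the real site, here is a small offline sketch. The two set_cookie calls are stand-ins for what the server would send (the values are taken from the output in the question), the temporary file location is an assumption, and autosave is left out so the saved file stays a snapshot of the login state.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use HTTP::Cookies;

# Throwaway location for the cookie file (illustrative only).
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/cookies";

my $cookie_jar = HTTP::Cookies->new(file => $file, ignore_discard => 1);

# Stand-in for the session cookie the server sets at login.
# Arguments: version, key, value, path, domain, port,
#            path_spec, secure, maxage, discard
$cookie_jar->set_cookie(0, 'JSESSIONID', '367C6D', '/thepath',
                        'subdomain.url.com', undef, 1, 1, undef, 1);
$cookie_jar->save;      # snapshot taken right after login

# Stand-in for a later response that replaces the session cookie.
$cookie_jar->set_cookie(0, 'JSESSIONID', '855402', '/thepath',
                        'subdomain.url.com', undef, 1, 1, undef, 1);

$cookie_jar->revert;    # empty the jar and reload the login snapshot
print $cookie_jar->as_string;
```

After the revert, the jar should again hold the value that was saved at login rather than the replacement.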

Alternatively, you could subclass HTTP::Cookies and override the set_cookie method so that it refuses to re-set the session cookie once one exists.
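
A minimal sketch of that subclass idea follows. The package name StickySessionJar is made up for illustration, and the rule it applies (keep the first JSESSIONID seen for a domain, ignore later ones) is one possible reading of the suggestion. Note that it also ignores a server's attempt to expire the session, so treat it as a diagnostic workaround rather than a fix.

```perl
use strict;
use warnings;

package StickySessionJar;    # hypothetical name
use parent 'HTTP::Cookies';

my $SESSION_COOKIE = 'JSESSIONID';

sub set_cookie {
    my ($self, @args) = @_;
    my (undef, $key, undef, undef, $domain) = @args;

    if (defined $key && $key eq $SESSION_COOKIE && defined $domain) {
        # Look for a session cookie already stored for this domain.
        my $already_set = 0;
        $self->scan(sub {
            my (undef, $seen_key, undef, undef, $seen_domain) = @_;
            $already_set = 1
                if $seen_key eq $key && $seen_domain eq $domain;
        });
        return $self if $already_set;    # keep the first one
    }
    return $self->SUPER::set_cookie(@args);
}

package main;

my $jar = StickySessionJar->new;

# Simulate the server setting the session cookie twice.
$jar->set_cookie(0, 'JSESSIONID', '367C6D', '/thepath',
                 'subdomain.url.com', undef, 1, 1, undef, 1);
$jar->set_cookie(0, 'JSESSIONID', '855402', '/thepath',
                 'subdomain.url.com', undef, 1, 1, undef, 1);

print $jar->as_string;
```

To use it with the script from the question, pass an instance as the jar: WWW::Mechanize->new(cookie_jar => StickySessionJar->new, autocheck => 0).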


Also, I think the site requires a certificate (in the browser it does). Would this be the correct way to add it?

Some browsers (Internet Explorer for example) prompt for a security certificate even if one is not needed. If you are not getting any errors and the response content looks good, you probably don't need to set one.

If you do have a certificate file, check the POD for Crypt::SSLeay. Your certificate is PEM-encoded, so yes, you want to set $ENV{HTTPS_CERT_FILE} to the path of your cert file (not the host name). You might also want to set $ENV{HTTPS_DEBUG} = 1 to see what's happening.
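
As a sketch of that setup, the lines below use the environment variables described in the Crypt::SSLeay POD. Both file paths are placeholders, and the separate key file is an assumption: it is only needed when the private key is not stored alongside the certificate.

```perl
use strict;
use warnings;

# Placeholder paths: point these at your own PEM files.
$ENV{HTTPS_CERT_FILE} = '/path/to/client-cert.pem';
$ENV{HTTPS_KEY_FILE}  = '/path/to/client-key.pem';

# Ask Crypt::SSLeay to print SSL diagnostics.
$ENV{HTTPS_DEBUG} = 1;
```

These variables are read by Crypt::SSLeay (Net::SSL). If your LWP installation uses IO::Socket::SSL for HTTPS instead, they will probably be ignored, and the certificate would need to be passed through the user agent's ssl_opts option.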
