Perl WWW::Mechanize cookie problem
I am trying to automate the collection of links from a site that asks for a captcha first.
For this, I capture the captcha image so it can be solved externally, and then submit the solution as part of the form fields.
Somehow it doesn't work. I suspect a cookie problem, but I'm not sure, and I would appreciate it if anyone could figure this out.
Here is the code. First I create the mech object along with its cookie jar:
$cookie_jar = HTTP::Cookies->new;
$agent = WWW::Mechanize->new(cookie_jar => $cookie_jar);
$agent->get("http://www.site.com/page.html");
I find the link of interest:
$link = $agent->find_link(tag => "a", text_regex => qr{regex});
$url = $link->url;
$agent->get($url);
At this stage the site presents a captcha. I extract the image and save it so it can be solved by a human, who then enters the solution to continue:
$captcha = $agent->find_image(url_regex => qr{captcha\.php});
$agent->get($captcha->url, ':content_file' => 'captcha.jpg');
print "Please solve captcha at http://my.own.site/captcha.jpg\n";
$agent->back;
print "Enter answer: ";
$solved = <>;
Now that the script has the captcha solution entered manually, it can continue by submitting the form:
$agent->form_with_fields('code');
$agent->set_fields(code => $solved, action => 'download');
$agent->submit;
However, this doesn't work. The result is the page asking for the captcha again, rather than the expected page with the info I'm after.
I am wondering if the cookie gets lost/reset when I do the $agent->back after saving the captcha image?
Thanks for any hints!
2 Answers
I found a much easier way to handle this problem. Here it is:
Works like a charm.
It is highly possible that the site you are accessing has some means to detect and hinder free surfing, that is, going back one or more pages and then forward again. This is usually done by associating a unique id with each page, so that when you submit the same id twice it is clear that you surfed back and then moved forward again from there. As you say, this is related to using back().
What I wonder is whether you really need to go back at all. The key is to download the image outside of the agent, so that the agent state does not get modified. You could use a second agent or curl for that, since you have the direct URL to the image...