Trying to get the source code of a webpage with Perl
I'm trying to get the HTML source of a webpage using the Perl "get" function. I wrote the code 5 months ago and it worked fine, but yesterday I made a small edit, and after that it failed to work no matter what I tried.
Here is the code I tried.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
my $link = 'www.google.com';
my $sou = get($link) or die "cannot retrieve code\n";
print $sou;
The code runs without errors, but it's not able to retrieve the source; instead it displays
cannot retrieve code
This might be a bit late,
I have been struggling with the same problem and I think I have figured out why this occurs. I usually web-scrape websites with Python, and I have found that it is best to include some extra header info in the GET request. This fools the website into thinking the bot is a person, so it grants the bot access to the site rather than returning a 400 Bad Request status code.
So I applied this thinking to my Perl script, which was similar to yours, and just added some extra header info. The result gave me the source code for the website with no struggle.
Here is the code:
I used LWP::UserAgent, as it lets you add extra header information.
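(The original code block did not survive; below is a minimal sketch of the approach described, assuming a browser-style User-Agent is the extra header the site wants. The header values and URL are illustrative.)

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# LWP::UserAgent, unlike LWP::Simple, lets us set request headers.
my $ua = LWP::UserAgent->new;

# Pretend to be a browser; many sites reject the default
# "libwww-perl/..." agent string with a 4xx response.
$ua->agent('Mozilla/5.0 (Windows NT 10.0; Win64; x64)');
$ua->default_header('Accept' => 'text/html,application/xhtml+xml');

# Note the scheme: a bare 'www.google.com' is not a valid URL here.
my $link = 'http://www.google.com';

my $response = $ua->get($link);
$response->is_success
    or die 'cannot retrieve code: ' . $response->status_line . "\n";

print $response->decoded_content;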
I hope this helped,
ME.
P.S. Sorry if you already have the answer for this; I just wanted to help.