尝试用perl获取网页的源代码

发布于 2024-12-18 01:38:06 字数 352 浏览 1 评论 0原文

我正在尝试使用 Perl“get”函数获取网页的 html 源。我 5 个月前就写好了代码，运行得很好，但昨天我做了一个小编辑，但之后无论我如何尝试，它都无法运行。这是我尝试过的代码。

#!usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
my $link = 'www.google.com';
my $sou = get($link) or die "cannot retrieve code\n";
print $sou;

该代码工作正常，但它无法检索源，而是显示

cannot retrieve code

原文

I'm trying to get a html source of a webpage using the Perl "get" function. I have written the code 5 months back and it was working fine, but yesterday I made a small edit, but it failed to work after that, no matter how much I tried.
Here is the code I tried.

#!usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
my $link = 'www.google.com';
my $sou = get($link) or die "cannot retrieve code\n";
print $sou;

The code works fine , but its not able to retrieve the source, instead it displays

cannot retrieve code

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

命硬 2024-12-25 01:38:06

my $link = 'http://www.google.com';

my $link = 'http://www.google.com';

回复收藏 0 原文

百思不得你姐 2024-12-25 01:38:06

这可能有点晚了，

我一直在努力解决同样的问题，我想我已经弄清楚为什么会发生这种情况。我通常使用 python 来抓取网站，我发现最好在 get 请求中包含一些额外的标头信息。这会欺骗网站，让其认为机器人是一个人，并让机器人访问该网站，但不会调用 400 错误请求状态代码。

因此，我将这种想法应用到我的 Perl 脚本中，该脚本与您的脚本类似，只是添加了一些额外的标头信息。结果毫不费力地给了我该网站的源代码。

这是代码：

#!/usr/bin/perl

# This line specifies the LWP version and if not put in place the code will fail.
use LWP 5.64;
 
# This line defines the virtual browser.
$browser = LWP::UserAgent->new;


# This line defines the header infomation that will be given to the website (eg. google) incase the website invokes a 400 bad request status code.
@ns_headers = (
 'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)',
 'Accept' => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
 'Accept-Charset' => 'iso-8859-1,*,utf-8',
 'Accept-Language' => 'en-US',
 );

# This line defines the url that the user agent will browse. 
$url = 'https://www.google.com/';
 
# This line is used to request data from the specified url above.
$response = $browser->get($url, @ns_headers)  or die "cannot retrieve code\n";;



# Decodes responce so the HTML source code is visable.
$HTML = $response->decoded_content;

print($HTML);

我有 LWP::Useragent，因为它可以让您添加额外的标头信息。

我希望这对

我有帮助。

附言。抱歉，如果您已经有了答案，我只是想提供帮助。

This might be a bit late,

I have been struggling with the same problem and I think I have figured why this occurs. I usually web-scrape websites with python and I have figured out that It is ideal to include some extra header info to the get requests.This fools the website into thinking the bot is a person and gives the bot access to the website and does not invoke a 400 bad request status code.

So I applied this thinking to my Perl script, which was similar to yours, and just added some extra header info. The result gave me the source code for the website with no strugle.

Here is the code:

#!/usr/bin/perl

# This line specifies the LWP version and if not put in place the code will fail.
use LWP 5.64;
 
# This line defines the virtual browser.
$browser = LWP::UserAgent->new;


# This line defines the header infomation that will be given to the website (eg. google) incase the website invokes a 400 bad request status code.
@ns_headers = (
 'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)',
 'Accept' => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
 'Accept-Charset' => 'iso-8859-1,*,utf-8',
 'Accept-Language' => 'en-US',
 );

# This line defines the url that the user agent will browse. 
$url = 'https://www.google.com/';
 
# This line is used to request data from the specified url above.
$response = $browser->get($url, @ns_headers)  or die "cannot retrieve code\n";;



# Decodes responce so the HTML source code is visable.
$HTML = $response->decoded_content;

print($HTML);

I have LWP::Useragent as this has the ability for you to add extra header infomation.

I hope this helped,

ME.

PS. Sorry if you already have the answer for this, just wanted to help.

回复收藏 0 原文

~没有更多了~