我可以将网页源从curl 传输到perl 吗?

发布于 2024-12-19 22:32:09 字数 581 浏览 1 评论 0原文

我正在解析许多网站的源代码,这是一个包含数千页的庞大网络。现在我想在 perĺ 中搜索内容,我想查找关键字出现的次数。

为了解析网页,我使用curl并将输出通过管道传输到“grep -c”,但这不起作用,所以我想使用perl。可以完全利用perl来抓取页面吗?

例如

cat RawJSpiderOutput.txt | grep parsed | awk -F " " '{print $2}' | xargs -I replaceStr curl replaceStr?myPara=en | perl -lne '$c++while/myKeywordToSearchFor/g;END{print$c}' 

说明:在上面的文本文件中,我有可用和不可用的 URL。通过“Grep parsed”我获取可用的 URL。使用 awk,我选择包含纯可用 URL 的第二列。到目前为止,一切都很好。现在回答这个问题:使用 Curl,我获取源代码(也附加一些参数)并将每个页面的整个源代码通过管道传输到 perl,以便计算“myKeywordToSearchFor”的出现次数。如果可能的话,我很乐意在 Perl 中执行此操作。

谢谢!

I'm parsing the sourcecode of many websites, an entire huge web with thousands of pages. Now I want to search for stuff in perĺ, I want to find the number of occurrences of a keyword.

For parsing the webpages I use curl and pipe the output to "grep -c" which doesn't work, so I want to use perl. Can be perl utilised completely to crawl a page?

E.g.

cat RawJSpiderOutput.txt | grep parsed | awk -F " " '{print $2}' | xargs -I replaceStr curl replaceStr?myPara=en | perl -lne '$c++while/myKeywordToSearchFor/g;END{print$c}' 

Explanation: In the textfile above I have usable and unusable URLs. With "Grep parsed" I fetch the usable URLs. With awk I select the 2nd column with contains the pure usable URL. So far so good. Now to this question: With Curl I fetch the source (appending some parameter, too) and pipe the whole source code of each page to perl in order to count "myKeywordToSearchFor" occurrences. I would love to do this in perl only if it is possible.

Thanks!

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

笙痞 2024-12-26 22:32:09

这仅使用 Perl(未经测试):

use strict;
use warnings;

use File::Fetch;

my $count;
open my $SPIDER, '<', 'RawJSpiderOutput.txt' or die $!;
while (<$SPIDER>) {
    chomp;
    if (/parsed/) {
        my $url = (split)[1];
        $url .= '?myPara=en';
        my $ff = File::Fetch->new(uri => $url);
        $ff->fetch or die $ff->error;
        my $fetched = $ff->output_file;
        open my $FETCHED, '<', $fetched or die $!;
        while (<$FETCHED>) {
            $count++ if /myKeyword/;
        }
        unlink $fetched;
    }
}
print "$count\n";

This uses Perl only (untested):

use strict;
use warnings;

use File::Fetch;

my $count;
open my $SPIDER, '<', 'RawJSpiderOutput.txt' or die $!;
while (<$SPIDER>) {
    chomp;
    if (/parsed/) {
        my $url = (split)[1];
        $url .= '?myPara=en';
        my $ff = File::Fetch->new(uri => $url);
        $ff->fetch or die $ff->error;
        my $fetched = $ff->output_file;
        open my $FETCHED, '<', $fetched or die $!;
        while (<$FETCHED>) {
            $count++ if /myKeyword/;
        }
        unlink $fetched;
    }
}
print "$count\n";
遗失的美好 2024-12-26 22:32:09

尝试一些更像,

   perl -e 'while(<>){my @words = split ' ';for my $word(@words){if(/myKeyword/){++$c}}} print "$c\n"'

   while (<>)               # as long as we're getting input (into “$_”)
   { my @words = split ' '; # split $_ (implicit) into whitespace, so we examine each word
     for my $word (@words)  #  (and don't miss two keywords on one line)
     { if (/myKeyword/)     # whenever it's found,
       { ++$c } } }         # increment the counter (auto-vivified)
   print "$c\n"             # and after end of file is reached, print the counter

或,拼写出strict之类的内容

   use strict;
   my $count = 0;
   while (my $line = <STDIN>) # except that <> is actually more magical than this
   { my @words = split ' ' => $line;
     for my $word (@words)
     { ++$count; } } }
   print "$count\n";

Try something more like,

   perl -e 'while(<>){my @words = split ' ';for my $word(@words){if(/myKeyword/){++$c}}} print "$c\n"'

i.e.

   while (<>)               # as long as we're getting input (into “$_”)
   { my @words = split ' '; # split $_ (implicit) into whitespace, so we examine each word
     for my $word (@words)  #  (and don't miss two keywords on one line)
     { if (/myKeyword/)     # whenever it's found,
       { ++$c } } }         # increment the counter (auto-vivified)
   print "$c\n"             # and after end of file is reached, print the counter

or, spelled out strict-like

   use strict;
   my $count = 0;
   while (my $line = <STDIN>) # except that <> is actually more magical than this
   { my @words = split ' ' => $line;
     for my $word (@words)
     { ++$count; } } }
   print "$count\n";
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文