Extracting text from HTML (Perl)

Posted 2024-09-08 12:36:47

I'm writing a script that loads a page and extracts information from it.
I'm writing the script in Perl.

Problem: I don't know how to make the script select the right links. When I run it, it picks up a link like this one, which is not what I want:

<a href="http://valeptr.com/scripts/runner.php?BA=6672&hash=08c5c66839a468a11b7574e6ce02e0&url=http%3A%2F%2Fdizzydollarsgpt.com%2Fmembers%2Fregister.php%3Fref%3Dthomasd24" target="_blank"><img alt="DizzyDollarsGPT" border="0" src="enter.php_files/runner.jpeg" /></a>

What I want to get is this:

<a href="http://valeptr.com/scripts/runner.php?PA=33425"
target="_ptc" onclick="javascript:reloadpage(11)">
<img src="1appsearch.php_files/runner_007.gif"
alt="Xray-cash" border="0">

The full code is here:

#!/usr/bin/perl
#=======================================================================
#
# FILE: ValePTR.pl
#
# USAGE: ./ValePTR.pl user password
#
# DESCRIPTION:
#
# OPTIONS: ---
# REQUIREMENTS: libgetopt-declare-perl
# BUGS: ---
# NOTES: ---
# AUTHOR: Alejandro
# VERSION: 1.0
# CREATED: Monday, 5 July 2010
# REVISION: 1
#=======================================================================

use warnings;
use strict;
use HTML::TreeBuilder;
use WWW::Mechanize;
use Getopt::Long;
my($content, $search_result, @search_results);

    # Build the browser object with a fake UserAgent.
    my $Explorador = WWW::Mechanize->new( agent => 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030624' );
    $Explorador->get("file://home/alejandro/enter.php.html"); # Fetch the URL in preparation for the HTTP POST
    #$Explorador->field('username','miuser'); # Find the username field and fill in the user name
    #$Explorador->field('password','mipass'); # Find the password field and fill in the password
    #$Explorador->submit(); # Perform the HTTP POST

        #print $Explorador->content();
#parse $content with treebuilder
my $page = HTML::TreeBuilder->new();
$page->parse($Explorador->content());
$page->eof();


@search_results= $page->look_down(
sub{ $_[0]-> tag() eq 'a' and ($_[0]->attr('href'))}
);

foreach $search_result (@search_results){
my($url, $title, $summary);

$title = $page->look_down(
sub{ $_[0]-> tag() eq 'a' and ($_[0]->attr('href'))}
);
if($title)
{
print 'title: '.$title->as_HTML,"\n";
}
}


$page->delete;

The full HTML is here: http://gist.github.com/465568

P.S. Please help me; I've been at this for about 3 hours without success.

In short, what the script picks up is everything of this form:

http://valeptr.com/scripts/runner.php?BA=

and what I want to pick up is:

http://valeptr.com/scripts/runner.php?PA=

Comments (2)

撕心裂肺的伤痛 2024-09-15 12:36:47

Your call to look_down() can't distinguish between the links you want and the links you don't. Try a stronger filter like

@search_results = $page->look_down(
    sub { $_[0]->tag eq 'a' &&
          ($_[0]->attr('href') || '') =~ /\?PA=/ }); # only match http://...?PA=...
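
To see why this pattern separates the two kinds of link, here is a minimal sketch that applies the same regex to plain strings instead of parsed elements. The helper name `is_pa_link` is hypothetical, and the BA URL is shortened from the one in the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): true when the href
# contains a "?PA=" query parameter, false for undef or anything else.
sub is_pa_link {
    my ($href) = @_;
    return ( defined $href && $href =~ /\?PA=/ ) ? 1 : 0;
}

# The two link shapes from the question (BA URL shortened).
my @hrefs = (
    'http://valeptr.com/scripts/runner.php?BA=6672',
    'http://valeptr.com/scripts/runner.php?PA=33425',
);

# Print only the links that pass the filter.
print "$_\n" for grep { is_pa_link($_) } @hrefs;
```

The `defined` check matters because an `a` element without an `href` attribute returns undef from `attr('href')`, which would otherwise trigger a warning under `use warnings`.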
玻璃人 2024-09-15 12:36:47

I would be inclined to use HTML::TokeParser::Simple for this just to avoid the overhead of building a document tree:

#!/usr/bin/perl

use strict; use warnings;

use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new('t.html');

while ( my $tag = $parser->get_tag('a') ) {
    my $href = $tag->get_attr('href');
    next unless defined $href and $href =~ /runner\.php\?PA=[0-9]+\z/;

    print $tag->as_is;

    while ( my $token = $parser->get_token ) {
        print $token->as_is;
        last if $token->is_end_tag('/a');
    }
    print "\n";
}

Output:

<a href="http://valeptr.com/scripts/runner.php?PA=33425"
target="_ptc" onclick="javascript:reloadpage(11)">
<img src="1appsearch.php_files/runner_007.gif"
alt="Xray-cash" border="0">
</a> ... etc
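
If only the name is wanted rather than the whole tag, one option is to read the `alt` attribute of the image inside each matching link. The following is a rough sketch under that assumption, using HTML::TreeBuilder (the module the question already loads) rather than the token parser above. The sample markup is trimmed from the question, and the function name `pa_link_names` is hypothetical.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample markup trimmed from the question: one BA link and one PA link.
my $html = <<'HTML';
<a href="http://valeptr.com/scripts/runner.php?BA=6672" target="_blank"><img alt="DizzyDollarsGPT" border="0" src="enter.php_files/runner.jpeg" /></a>
<a href="http://valeptr.com/scripts/runner.php?PA=33425" target="_ptc"><img src="1appsearch.php_files/runner_007.gif" alt="Xray-cash" border="0"></a>
HTML

# Hypothetical helper: returns the alt text of the first image inside
# every <a> element whose href contains "?PA=".
sub pa_link_names {
    my ($content) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($content);
    my @names;
    for my $link ( $tree->look_down( _tag => 'a', href => qr/\?PA=/ ) ) {
        my $img = $link->look_down( _tag => 'img' ) or next;
        my $alt = $img->attr('alt');
        push @names, $alt if defined $alt;
    }
    $tree->delete;    # free the tree explicitly, as the question's script does
    return @names;
}

print "$_\n" for pa_link_names($html);
```

Passing `href => qr/\?PA=/` as a criterion lets `look_down` do the filtering itself, so elements without an `href` are skipped without an explicit `defined` check.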
