Perl web scraper: extracting content from DIV tags that have only a "style" attribute?

Posted on 2024-09-10 03:14:30

I've been stuck on this all day. I'm still fairly new to parsing and scraping in Perl, but I thought I had it down until now. I have tried several Perl modules (HTML::TokeParser, HTML::TokeParser::Simple, Web::Scraper and a few others). I have the following string (in reality an entire HTML page; this shows only the relevant part), and I am trying to extract "test1" and "test1_a", and so on (those values are just placeholders). So I think I first need to extract this from each inner DIV:

"<span style="float: left;">test1</span>test1_a"

Then I need to parse that to get the two values. I don't know why this is giving me so much trouble: I thought I could do it with HTML::TokeParser::Simple, but I couldn't get it to return the value inside the DIV. I wonder if that's because the DIV contains another set of tags (the <span> tags).

The string (representing the HTML page):

<div id="dataID" style="font-size: 8.5pt; width: 250px; color: rgb(0, 51, 102); margin-right: 10px; float: right;">
<div style="width: 250px; text-align: right;"><span style="float: left;">test1</span>test1_a</div>
<div style="width: 250px; text-align: right;"><span style="float: left;">test2</span>test2_a</div>
<div style="width: 250px; text-align: right;"><span style="float: left;">test3</span>test3_a</div>

My attempt with Web::Scraper:

my $uri = URI->new($theurl);

my $proxyscraper = scraper {
    process 'div[style=~"width: 250px; text-align: right;"]',
        'proxiesextracted[]' => scraper {
            process '.style', style => 'TEXT';
        };
    result 'proxiesextracted';
};

I'm mostly guessing my way through Web::Scraper, since it has essentially no documentation; I pieced the code above together from the examples shipped with the module and one I found online. Any advice is greatly appreciated.

Comments (2)

镜花水月 2024-09-17 03:14:30

If you want a DOM parser (easier tree browsing, slightly slower), try HTML::TreeBuilder.

From the HTML::Element man page (the module ships together with HTML::TreeBuilder):

Note also that look_down considers "" (empty-string) and undef to be different things, in attribute values. So this:

  $h->look_down("alt", "")

matches only elements that have an "alt" attribute whose value is the empty string.

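Which leads us to your answer:

use HTML::TreeBuilder;

# See the HTML::TreeBuilder POD: there are a few ways to construct a tree
# (new_from_file for a file name or filehandle, new_from_content for a string).
# $html is assumed to hold the page source.
my $tb = HTML::TreeBuilder->new_from_content($html);

# A plain string must equal the whole attribute value, so match the style
# with a regex; in list context look_down returns every matching element.
print $_->as_text, "\n"
    for $tb->look_down( _tag => 'div', style => qr/text-align:\s*right/ );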
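
Since as_text on the DIV returns the span text and the trailing text run together, here is a minimal sketch of one way to get the two values separately with HTML::TreeBuilder. The sub name pairs_with_treebuilder, the regex used to pick the inner DIVs, and the closing </div> added to the sample are assumptions for illustration, not part of the original post.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample markup from the question; the final </div> is added here (assumption).
my $html = <<'HTML';
<div id="dataID" style="font-size: 8.5pt; width: 250px; color: rgb(0, 51, 102); margin-right: 10px; float: right;">
<div style="width: 250px; text-align: right;"><span style="float: left;">test1</span>test1_a</div>
<div style="width: 250px; text-align: right;"><span style="float: left;">test2</span>test2_a</div>
<div style="width: 250px; text-align: right;"><span style="float: left;">test3</span>test3_a</div>
</div>
HTML

# Hypothetical helper: returns a list of [span text, trailing text] pairs.
sub pairs_with_treebuilder {
    my ($source) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($source);
    my @pairs;

    # Select DIVs by a fragment of their style attribute (regex criterion).
    for my $div ( $tree->look_down( _tag => 'div', style => qr/text-align:\s*right/ ) ) {
        my $span  = $div->look_down( _tag => 'span' );
        my $label = $span ? $span->as_text : '';

        # content_list returns child elements (refs) and text segments
        # (plain strings); keep only the DIV's own text.
        my $value = join '', grep { !ref } $div->content_list;

        push @pairs, [ $label, $value ];
    }

    $tree->delete;    # free the tree explicitly
    return @pairs;
}

printf "%s => %s\n", @$_ for pairs_with_treebuilder($html);
```

Matching on a fragment of the style string is fragile if the page changes its inline CSS; anchoring on the outer id="dataID" element first would be a reasonable alternative.
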
小清晰的声音 2024-09-17 03:14:30

Using Web::Scraper, try:

#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper::Simple;
use Web::Scraper;

$Data::Dumper::Indent = 1;

my $html = '<div id="dataID" style="font-size: 8.5pt; width: 250px; color: rgb(0, 51, 102); margin-right: 10px; float: right;">
<div style="width: 250px; text-align: right;"><span style="float: left;">test1</span>test1_a</div>
<div style="width: 250px; text-align: right;"><span style="float: left;">test2</span>test2_a</div>
<div style="width: 250px; text-align: right;"><span style="float: left;">test3</span>test3_a</div>';


my $proxyscraper = scraper {
    process '//div[@id="dataID"]/div', 'proxiesextracted[]' => scraper {
       process '//span', 'data1' => 'TEXT';
       process '//text()', 'data2' => 'TEXT';
     }
};

my $results = $proxyscraper->scrape( $html );

print Dumper($results);

It gives:

$results = {
  'proxiesextracted' => [
    {
      'data2' => 'test1_a',
      'data1' => 'test1'
    },
    {
      'data2' => 'test2_a',
      'data1' => 'test2'
    },
    {
      'data2' => 'test3_a',
      'data1' => 'test3'
    }
  ]
};

Hope this helps
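
For comparison, the question mentions HTML::TokeParser. Below is a minimal sketch of the same extraction with that module, assuming every matching DIV contains exactly one <span> followed by plain text. The sub name pairs_with_tokeparser, the style regex, and the closing </div> in the sample are assumptions for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample markup from the question; the final </div> is added here (assumption).
my $html = <<'HTML';
<div id="dataID" style="font-size: 8.5pt; width: 250px; color: rgb(0, 51, 102); margin-right: 10px; float: right;">
<div style="width: 250px; text-align: right;"><span style="float: left;">test1</span>test1_a</div>
<div style="width: 250px; text-align: right;"><span style="float: left;">test2</span>test2_a</div>
<div style="width: 250px; text-align: right;"><span style="float: left;">test3</span>test3_a</div>
</div>
HTML

# Hypothetical helper: returns a list of [span text, trailing text] pairs.
sub pairs_with_tokeparser {
    my ($source) = @_;
    my $parser = HTML::TokeParser->new( \$source )
        or die "cannot create parser";
    my @pairs;

    # get_tag('div') skips ahead to the next <div> start tag and returns
    # [ tag, attribute hashref, attribute order, original text ].
    while ( my $tag = $parser->get_tag('div') ) {
        my $style = $tag->[1]{style} // '';
        next unless $style =~ /text-align:\s*right/;

        # Assumes a <span> follows inside this DIV; otherwise get_tag
        # would skip forward to a later one.
        $parser->get_tag('span');
        my $label = $parser->get_trimmed_text('/span');   # text up to </span>
        $parser->get_tag('/span');                        # consume </span>
        my $value = $parser->get_trimmed_text('/div');    # text up to </div>

        push @pairs, [ $label, $value ];
    }
    return @pairs;
}

printf "%s => %s\n", @$_ for pairs_with_tokeparser($html);
```

The nested <span> is not a problem in itself for a token parser: the text before and after </span> simply arrives as two separate text tokens, so each has to be read with its own call.
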
