Perl: Why does this web-scraping regex work inconsistently?

Posted on 2025-01-03 18:52:17

I have run into another problem in relation to a site I am trying to scrape.

Basically I have stripped most of what I don't want from the page content and thanks to some help given here have managed to isolate the dates I wanted. Most of it seems to be working fine, despite some initial problems matching a non-breaking space. However, I am now having difficulty with the final regex, which is intended to split each line of data into fields. Each line represents the price of a share price index. The fields on each line are:

  1. A name of arbitrary length made up of characters from the Latin alphabet, sometimes with a comma or ampersand, and no digits.
  2. A number with two digits after the decimal point (the absolute value of the index).
  3. A number with two digits after the decimal point (the change in the value).
  4. A number with two digits after the decimal point followed by a percent sign (the percentage change in value).

Here is an example string, before splitting:
"Fishery, Agriculture & Forestry243.45-1.91-0.78% Mining360.74-4.15-1.14% Construction465.36-1.01-0.22% Foods783.2511.281.46% Textiles & Apparels412.070.540.13% Pulp & Paper333.31-0.29-0.09% Chemicals729.406.010.83% "

The regex I am using to split this line is this:

$mystr =~ s/\n(.*?)(\d{1,4}\.\d{2})(\-?\d{1,3}\.\d{2})(.*?%)\n/\n$1 == $2 == $3 == $4\n/ig;

It works sometimes but not other times and I cannot work out why this should be. (The doubled equal signs in the example output below are used to make the field split more easily visible.)

Fishery, Agriculture & Forestry == 243.45 == -1.91 == -0.78%
Mining360.74-4.15-1.14%
Construction == 465.36 == -1.01 == -0.22%
Foods783.2511.281.46%

I thought the minus sign was an issue for those indices that saw a negative change in the price of the index, but sometimes it works despite the minus sign.

Q. Why is the final regex shown below failing to split the fields consistently?

Example code follows.

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::Tree;

my $url_full = "http://www.tse.or.jp/english/market/STATISTICS/e06_past.html";

my $content = get($url_full);
# get dates:
(my @dates) = $content =~ /(?<=dateFormat\(')\d{4}\/\d{2}\/\d{2}(?='\))/g;
foreach my $date (@dates) { # convert to yyyy-mm-dd
    $date =~ s/\//-/ig;
}
my $tree = HTML::Tree->new();
$tree->parse($content);
my $mystr = $tree->as_text;

$mystr =~ s/\xA0//gi; # remove non-breaking spaces
# remove first chunk of text:
$mystr =~
  s/^(TSE.*?)IndustryIndexChange ?/IndustryIndexChange\n$dates[0]\n\n/gi;
$mystr =~ s/IndustryIndexChange ?/IndustryIndexChange/ig;
$mystr =~ s/IndustryIndexChange/Industry Index Change\n/ig;
$mystr =~ s/% /%\n/gi; # percent symbol is the marker for end of line
# indicate breaks between days:
$mystr =~ s/Stock.*?IndustryIndexChange/\nDAY DELIMITER\n/gi;
$mystr =~ s/Exemption from Liability.*$//g; # remove boilerplate at bottom

# and here's the problem regex...
# try to split it:
$mystr =~
  s/\n(.*?)(\d{1,4}\.\d{2})(\-?\d{1,3}\.\d{2})(.*?%)\n/\n$1 == $2 == $3 == $4\n/ig;

print $mystr;

Comments (3)

毁虫ゝ 2025-01-10 18:52:17

It appears to be doing every other one.

My guess is that your records have a single \n between them, but your pattern starts and ends with a \n. So the final \n on the first match consumes the \n that the second match needed to find the second record. The net result is that it picks up every other record.

You might be better off wrapping your pattern in ^ and $ (instead of \n and \n), and using the m flag on the s///.
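
A minimal sketch of that suggestion, using a few records from the sample string in the question. The variable names `$data`, `$broken` and `$fixed` are illustrative and do not come from the original script. The first substitution repeats the newline-delimited pattern and ends up skipping every other record; the second anchors the pattern with ^ and $ under the m flag, which match at line boundaries without consuming the newline.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample records from the question, one per line.
my $data = join "\n",
    'Fishery, Agriculture & Forestry243.45-1.91-0.78%',
    'Mining360.74-4.15-1.14%',
    'Construction465.36-1.01-0.22%',
    'Foods783.2511.281.46%';

# Original approach: the pattern consumes a newline on both sides,
# so the newline that ends one match is no longer available to start the next.
my $broken = "\n$data\n";
$broken =~ s/\n(.*?)(\d{1,4}\.\d{2})(-?\d{1,3}\.\d{2})(.*?%)\n/\n$1 == $2 == $3 == $4\n/g;

# Suggested approach: ^ and $ with /m are zero-width, so every line can match.
my $fixed = $data;
$fixed =~ s/^(.*?)(\d{1,4}\.\d{2})(-?\d{1,3}\.\d{2})(.*?%)$/$1 == $2 == $3 == $4/mg;

print "--- newline-delimited pattern ---$broken";
print "--- anchored pattern with /m ---\n$fixed\n";
```

With the anchored version the replacement no longer needs to put the newlines back, because they were never part of the match.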

傲鸠 2025-01-10 18:52:17

The problem is that you have \n both at the start and at the end of the regex.

Consider something like this:

$s = 'abababa';
$s =~ s/aba/axa/g;

that will set $s to axabaxa, not axaxaxa, because there are only two non-overlapping occurrences of aba.
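
One possible way around the overlap, offered as a sketch and not as part of the answer above, is to turn the delimiters into zero-width lookaround assertions so that they are checked but not consumed. The variable names `$plain` and `$lookaround` are illustrative.

```perl
use strict;
use warnings;

my $plain = 'abababa';
$plain =~ s/aba/axa/g;                # each match consumes its trailing 'a'

my $lookaround = 'abababa';
$lookaround =~ s/(?<=a)b(?=a)/x/g;    # the surrounding 'a's are asserted, not consumed

print "$plain\n";         # axabaxa
print "$lookaround\n";    # axaxaxa
```

Applied to the regex in the question, the same idea would mean replacing the trailing \n with (?=\n), so that the newline remains available to begin the next match.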

等待圉鍢 2025-01-10 18:52:17

My interpretation (pseudocode) -

one   = [a-zA-Z,& ]+
two   = -?\d{1,4}\.\d\d
three = <<two>>
four  = <<two>>%

regex = (<<one>>)(<<two>>)(<<three>>)(<<four>>)
      = ([a-zA-Z,& ]+)(-?\d{1,4}\.\d\d)(-?\d{1,4}\.\d\d)(-?\d{1,4}\.\d\d%)
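
A runnable sketch of this composition in Perl, assuming the decimal point is meant literally (escaped as \.) and that the numeric fields may carry a leading minus sign, as the sample data in the question does. The names `$record`, `$line` and `@rows` are illustrative. Because the pattern does not depend on newlines, it can be applied directly to the unsplit sample string.

```perl
use strict;
use warnings;

# Building blocks, following the names used in the pseudocode.
my $one   = qr/[a-zA-Z,& ]+/;       # name: letters, commas, ampersands, spaces
my $two   = qr/-?\d{1,4}\.\d\d/;    # number with two decimals (sign and escaped dot are assumptions)
my $three = $two;
my $four  = qr/$two%/;

my $record = qr/($one)($two)($three)($four)/;

# A shortened version of the sample string from the question.
my $line = 'Fishery, Agriculture & Forestry243.45-1.91-0.78% Mining360.74-4.15-1.14% Foods783.2511.281.46% ';

my @rows;
while ($line =~ /$record/g) {
    my ($name, $index, $change, $pct) = ($1, $2, $3, $4);
    $name =~ s/^\s+//;    # drop the space left over after the previous record's '%'
    push @rows, [$name, $index, $change, $pct];
}

print join(' == ', @$_), "\n" for @rows;
```

Restricting the name to a character class without digits is what lets the first numeric field start unambiguously, so no lazy quantifier is needed.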

However, you are already presented with 'structured' data in the form of HTML. Why not take advantage of this?

"HTML parsing in perl" references MOJO
for DOM-based parsing in Perl, and unless there are serious performance reasons,
I'd highly recommend such an approach.
