How can I scrape the date arguments of a JavaScript function using LWP and a regex?

Posted on 2025-01-03 22:02:12 · 752 characters · 3 views · 0 comments

I'm having difficulty scraping dates from a specific web page because each date is apparently an argument passed to a JavaScript function. I have written a few simple scrapers in the past without any major issues, so I didn't expect problems, but I am struggling with this one. The page has 5-6 dates in regular yyyy/mm/dd format, like this: dateFormat('2012/02/07')

Ideally I would like to remove everything except the half-dozen dates, which I want to save in an array. At this point, I can't even successfully get one date, let alone all of them. It is probably just a malformed regex that I have been looking at for so long that I can no longer spot the error.

Q1. Why am I not getting a match with the regex below?

Q2. Following on from the above question, how can I scrape all the dates into an array? I was thinking of assuming x dates on the page, looping x times, and pushing the captured group onto an array on each iteration, but that seems rather clunky.

Problem code follows.

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::Tree;

my $url_full = "http://www.tse.or.jp/english/market/STATISTICS/e06_past.html";
my $content = get($url_full);
#dateFormat('2012/02/07');
$content =~ s/.*dateFormat\('(\d{4}\/\d{2}\/\d{2}\s{2})'\);.*/$1/; # get any date without regard to greediness etc

Comments (1)

¢蛋碎的人ぎ生 2025-01-10 22:02:12

Why do you have two whitespace characters in your pattern?

$content =~ s/.*dateFormat\('(\d{4}\/\d{2}\/\d{2}\s{2})'\);.*/$1/;
                                                 ^^^^^

They are not in your format example, dateFormat('2012/02/07').

I would say this is the reason why your pattern does not match.
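To make the point concrete, here is a minimal sketch that runs both variants of the pattern against a sample string. The `$sample` line is a hypothetical stand-in for the page content, not the real page, and the trailing `;` from the original pattern is dropped because the format example does not show one.

```perl
use strict;
use warnings;

# Hypothetical sample line standing in for the fetched page content.
my $sample = "document.write(dateFormat('2012/02/07'));";

# Original pattern: requires two whitespace characters after the day.
my $with_ws = $sample =~ /dateFormat\('(\d{4}\/\d{2}\/\d{2}\s{2})'\)/ ? 1 : 0;

# Same pattern with \s{2} removed; list context returns the captured date.
my ($date) = $sample =~ /dateFormat\('(\d{4}\/\d{2}\/\d{2})'\)/;

print "with \\s{2}: ", ( $with_ws ? "match" : "no match" ), "\n";
print "without:    ", ( defined $date ? $date : "no match" ), "\n";
```

On this sample, only the variant without `\s{2}` captures the date.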

Capture all dates

You can simply collect all matches into an array like this:

( my @Result ) = $content =~ /(?<=dateFormat\(')\d{4}\/\d{2}\/\d{2}(?='\))/g;

(?<=dateFormat\(') is a positive lookbehind assertion. It ensures that the literal text dateFormat(' comes immediately before the date pattern, without including it in the match.

(?='\)) is a positive lookahead assertion. It ensures that the literal text ') comes immediately after the date pattern, again without including it in the match.

The g modifier makes the pattern search for all matches in the string; in list context, the match operator returns all of them.
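As an alternative to the lookarounds, a plain capture group combined with the g modifier gives the same list of dates. The sketch below assumes a hypothetical multi-line `$content` in place of the real page, and uses `m{...}` delimiters so the slashes in the date need no escaping.

```perl
use strict;
use warnings;

# Hypothetical multi-line sample standing in for $content from get().
my $content = <<'HTML';
<td><script>dateFormat('2012/02/07');</script></td>
<td><script>dateFormat('2012/02/06');</script></td>
<td>2011/01/01 (not wrapped in dateFormat, so ignored)</td>
<td><script>dateFormat('2012/02/03');</script></td>
HTML

# In list context, a /g match with one capture group returns every
# captured substring, so no explicit loop or counter is needed.
my @dates = $content =~ m{dateFormat\('(\d{4}/\d{2}/\d{2})'\)}g;

print "$_\n" for @dates;
```

With the real page, `$content` would come from `get($url_full)`. LWP::Simple's `get` returns undef on failure, so it is worth checking that the content is defined before matching.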
