How can I scrape the date argument of a JavaScript function using LWP and a regular expression?

Posted 2025-01-03 22:02:12


I'm having difficulty scraping dates from a specific web page because each date is apparently an argument passed to a JavaScript function. I have written a few simple scrapers in the past without any major issues, so I didn't expect problems, but I am struggling with this one. The page has 5-6 dates in regular yyyy/mm/dd format, like this: dateFormat('2012/02/07')

Ideally I would like to remove everything except the half-dozen dates, which I want to save in an array. At this point, I can't even get one date successfully, let alone all of them. It is probably just a malformed regex that I have been looking at for so long that I can no longer spot the mistake.

Q1. Why am I not getting a match with the regex below?

Q2. Following on from the above question, how can I scrape all the dates into an array? I was thinking of assuming x dates on the page, looping x times, and pushing the captured group onto an array on each iteration, but that seems rather clunky.

Problem code follows.

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::Tree;

my $url_full = "http://www.tse.or.jp/english/market/STATISTICS/e06_past.html";
my $content = get($url_full);
#dateFormat('2012/02/07');
$content =~ s/.*dateFormat\('(\d{4}\/\d{2}\/\d{2}\s{2})'\);.*/$1/; # get any date without regard to greediness etc


￠蛋碎的人ぎ生 2025-01-10 22:02:12

Why do you have two whitespace characters in your pattern?

$content =~ s/.*dateFormat\('(\d{4}\/\d{2}\/\d{2}\s{2})'\);.*/$1/;
                                                 ^^^^^

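They are not in your format example, dateFormat('2012/02/07'). The \s{2} requires exactly two whitespace characters between the date and the closing quote, and the example has none.

I would say this is the reason why your pattern does not match.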
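The effect of the stray \s{2} can be checked offline with a short script. This is only a sketch: the $sample string is a hypothetical stand-in for one line of the fetched page, and the variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample line standing in for the fetched page content.
my $sample = "dateFormat('2012/02/07');";

# Pattern from the question: \s{2} demands two whitespace characters
# between the date and the closing quote.
my $with_ws = qr/dateFormat\('(\d{4}\/\d{2}\/\d{2}\s{2})'\);/;

# Same pattern with \s{2} removed.
my $without_ws = qr/dateFormat\('(\d{4}\/\d{2}\/\d{2})'\);/;

# 1 if the original pattern matches the sample, 0 otherwise.
my $matched_with = ( $sample =~ $with_ws ) ? 1 : 0;

# In list context the match returns the captured date (or an empty list).
my ($date) = $sample =~ $without_ws;

print "with \\s{2}:    ", ( $matched_with ? "match" : "no match" ), "\n";
print "without \\s{2}: ", ( defined $date ? "match ($date)" : "no match" ), "\n";
```

With this sample, the first pattern should fail and the second should capture the date.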

Capture all dates

You can simply get all matches into an array like this:

( my @Result ) = $content =~ /(?<=dateFormat\(')\d{4}\/\d{2}\/\d{2}(?='\))/g;

(?<=dateFormat\(') is a positive lookbehind assertion. It ensures that dateFormat(' comes immediately before the date pattern, without being included in the match.

(?='\)) is a positive lookahead assertion. It ensures that ') comes immediately after the date pattern, again without being included in the match.

The g modifier makes the pattern search for all matches in the string; in list context, the match returns all of them at once.
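As a self-contained illustration, here is a sketch that collects every date into an array without touching the network. It swaps the lookarounds for a plain capture group, which in list context with the g modifier also returns one element per match. The $content heredoc is a hypothetical stand-in for the page body; in the real script it would be the string returned by get($url_full), which LWP::Simple returns as undef when the fetch fails, so that case is worth checking first.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the page body; the real markup may differ.
my $content = <<'HTML';
<td><script>dateFormat('2012/02/07');</script></td>
<td><script>dateFormat('2012/02/06');</script></td>
<td><script>dateFormat('2012/02/03');</script></td>
HTML

# In list context, a /g match with one capture group returns the
# captured text of every match, in order of appearance.
my @dates = $content =~ /dateFormat\('(\d{4}\/\d{2}\/\d{2})'\)/g;

# Print one date per line.
print "$_\n" for @dates;
```

No loop counter or assumed number of dates is needed: the array simply ends up with as many elements as there are dateFormat('...') calls in the string.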
