How can I use LWP and regular expressions to scrape the date arguments of a JavaScript function?
I'm having difficulty scraping dates from a specific web page because each date is apparently an argument passed to a JavaScript function. I have written a few simple scrapers in the past without any major issues, so I didn't expect problems, but I am struggling with this one. The page has 5-6 dates in regular yyyy/mm/dd format, like this: dateFormat('2012/02/07')
Ideally I would like to remove everything except the half-dozen dates, which I want to save in an array. At this point, I can't even get one date, let alone all of them. It is probably just a malformed regex that I have been looking at for so long that I can no longer spot the error.
Q1. Why am I not getting a match with the regex below?
Q2. Following on from the above, how can I scrape all the dates into an array? I was thinking of assuming x dates on the page, looping x times, and pushing the captured group onto an array on each pass, but that seems rather clunky.
Problem code follows.
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use HTML::Tree;
my $url_full = "http://www.tse.or.jp/english/market/STATISTICS/e06_past.html";
my $content = get($url_full);
#dateFormat('2012/02/07');
$content =~ s/.*dateFormat\('(\d{4}\/\d{2}\/\d{2}\s{2})'\);.*/$1/; # get any date without regard to greediness etc
Comments (1)
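Why do you have two whitespace characters (\s{2}) in your pattern?
They are not in your example, dateFormat('2012/02/07'), so the pattern requires two whitespace characters before the closing quote that the page never contains.
I would say this is the reason why your pattern does not match.
Capture all dates
You can get all the matches into an array with a single match in list context, built from the following pieces.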
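A minimal sketch of such a list-context match, assuming the page source is already held in $content. The sample string here is a hypothetical stand-in for the fetched page, and the array name @dates is only an illustrative choice:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the page source returned by get($url_full).
my $content = <<'HTML';
<script>dateFormat('2012/02/07');</script>
<script>dateFormat('2012/02/06');</script>
<script>dateFormat('2012/02/03');</script>
HTML

# In list context, a /g match without capture groups returns every matched substring.
# The lookbehind and lookahead require dateFormat(' before and ') after each date,
# but neither becomes part of the match itself.
my @dates = $content =~ m{(?<=dateFormat\(')\d{4}/\d{2}/\d{2}(?='\))}g;

print "$_\n" for @dates;
```

Using m{...} as the delimiter avoids having to escape the slashes inside the dates; the same pattern works with /.../ if each slash is written as \/.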
(?<=dateFormat\(') is a positive lookbehind assertion. It ensures that dateFormat(' appears immediately before the date pattern, without being included in the match.
(?='\)) is a positive lookahead assertion. It ensures that ') follows immediately after the date pattern.
The g modifier makes the pattern search for all matches in the string.