How can I use Perl to scrape text from a web page that is generated dynamically by JavaScript?

Posted on 2024-08-28 19:31:34

I am trying to pull information from a website using Perl. However, the section of the page I need is generated by JavaScript, so all you see in the source is:

<div id="results"></div>

I need to somehow pull out the contents of that div and save them to a file using Perl, proxies, or whatever works. For example, the information I want to save is what this would return:

document.getElementById('results').innerHTML;

I am not sure whether this is possible, or whether anyone has ideas on how to do it.
I was using a lynx source dump for other pages, but since I can't screen-scrape this page in a straightforward way, I came here to ask about it!

If anyone is interested, the page is http://downloadcenter.trendmicro.com/index.php?clk=left_nav&clkval=pattern_file&regs=NABU and the info I am trying to get is the row about ConsumerOPR.

Comments (5)

掩于岁月 2024-09-04 19:31:34

You'll need to reverse-engineer what the Javascript is doing. Does it fire off an AJAX request to populate the <div>? If so, it should be pretty easy to sniff the request using Firebug and then duplicate it with LWP::UserAgent or WWW::Mechanize to get the information.
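As a rough illustration of duplicating such a request, here is a minimal sketch. It is written in Python rather than Perl, using only the standard library; the same steps map onto a POST with LWP::UserAgent. The endpoint and form fields are the ones reported in another answer in this thread and may no longer be valid; the function names are my own.

```python
import urllib.parse
import urllib.request

# Endpoint and form fields as reported by another answer in this thread.
# Treat them as assumptions: the site may have changed since then.
AJAX_URL = "http://downloadcenter.trendmicro.com/ajx/pattern_result.php"
FORM_FIELDS = {"q": "patresult_page", "reg": "NABU"}


def build_ajax_request(url=AJAX_URL, fields=FORM_FIELDS):
    """Build (but do not send) the POST request the page's script issues."""
    body = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(url, data=body, method="POST")


def fetch_fragment(request, timeout=30):
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Usage (performs a real network call, so it is left commented out):
# html_fragment = fetch_fragment(build_ajax_request())
```

Keeping the request construction separate from the network call makes it easy to compare what you send against what Firebug shows before you actually hit the server.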

If the Javascript is just doing pure DOM manipulation, then that means the data must exist somewhere else in the page or the Javascript already. So figure out where it's coming from and grab it.
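For that second case, a minimal sketch of pulling data out of an inline script might look like the following (Python, standard library only). The page source, the variable name `patternData`, and the values in it are all invented for illustration, and the approach only works when the assigned literal happens to be valid JSON on a single line.

```python
import json
import re

# Hypothetical page source: the variable name and its contents are made up.
SAMPLE_HTML = """
<div id="results"></div>
<script>
var patternData = [{"name": "ConsumerOPR", "version": "1.234.00"}];
document.getElementById('results').innerHTML = render(patternData);
</script>
"""


def extract_embedded_json(html, variable):
    """Return the parsed JSON literal assigned to `variable`, or None."""
    pattern = re.compile(
        r"\bvar\s+" + re.escape(variable) + r"\s*=\s*(.+?);\s*$",
        re.MULTILINE,
    )
    match = pattern.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))


rows = extract_embedded_json(SAMPLE_HTML, "patternData")
consumer_rows = [row for row in rows if row["name"] == "ConsumerOPR"]
```

If the data is built up with string concatenation or is not valid JSON, a regex like this will not be enough and you are back to running the script in a real browser.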

Finally, if none of those options are adequate, you may need to just use a real browser to do it. There are a few options for automating browser behavior, like WWW::Mechanize::Firefox or Win32::IE::Mechanize.

痴意少年 2024-09-04 19:31:34

As the content of your page is generated by JavaScript, you need the ability to:

  • Execute some JavaScript code
    • Possibly even some complex JS code that makes Ajax requests and so on
  • And do it with an engine that supports the functions and methods available in a browser (such as DOM manipulation)

One solution is to actually start a browser, navigate to that page, and then parse the page it has loaded to extract the information.

I've never used it for scraping, but the Selenium suite might help here: with Selenium RC, you can start a real browser and drive it, and then you have functions to get data out of it.

It's not very fast, and it's pretty heavy (it has to start a browser!), but it works quite well: you'll be using Firefox, for example, to navigate to your page, which means a real JavaScript engine that a lot of people use every day ;-)

彼岸花似海 2024-09-04 19:31:34

This might be what you're looking for (in PHP):

<?php
// Endpoint observed in Firebug when the page loads its results.
$url = 'http://downloadcenter.trendmicro.com/ajx/pattern_result.php';

$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, 1);
// Same form fields the browser sends.
curl_setopt($ch, CURLOPT_POSTFIELDS, 'q=patresult_page&reg=NABU');
// Return the response as a string instead of printing it directly.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$content = curl_exec($ch);
curl_close($ch);

echo $content;
exit;

Once you have the content, you can parse out the results you need with something like phpQuery (http://code.google.com/p/phpquery/) or a comparable Perl module.

And/or do the parsing yourself.
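If you go the do-it-yourself route, the parsing step could be sketched roughly like this (Python, standard library only, rather than PHP or Perl). The sample fragment, including its column names and version numbers, is invented; I am assuming the real response is an HTML table with one row per pattern file, which you should check against what the server actually returns.

```python
from html.parser import HTMLParser

# Hypothetical response fragment: the layout and values are made up.
SAMPLE_FRAGMENT = """
<table>
  <tr><th>Pattern</th><th>Version</th></tr>
  <tr><td>ConsumerOPR</td><td><a href="#">1.234.00</a></td></tr>
  <tr><td>SomeOtherPattern</td><td>5.678.00</td></tr>
</table>
"""


class TableRowCollector(HTMLParser):
    """Collect every table row as a list of stripped cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def find_rows(html, keyword):
    """Return the rows in which at least one cell contains `keyword`."""
    collector = TableRowCollector()
    collector.feed(html)
    collector.close()
    return [row for row in collector.rows if any(keyword in c for c in row)]


consumer_opr_rows = find_rows(SAMPLE_FRAGMENT, "ConsumerOPR")
```

A dedicated HTML library will cope better with messy markup; this only shows the shape of the row-filtering step.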

FYI: all I did was use Firebug to inspect the requests and recreate them with PHP/cURL.

你的往事 2024-09-04 19:31:34

To work with the dynamically created HTML, you can use the Firefox Chickenfoot plugin.
Or, if you need something that works from a command-line script, use bindings to Perl. I have done this with Python before.
