How can I use Perl to scrape text from a web page that is generated dynamically by JavaScript?

Posted 2024-08-28 19:31:34


There is a website I am trying to pull information from in Perl. However, the section of the page I need is generated with JavaScript, so all you see in the source is:

<div id="results"></div>

I need to somehow pull out the contents of that div and save it to a file using Perl, proxies, or whatever works. For example, the information I want to save would be

document.getElementById('results').innerHTML;

I am not sure whether this is possible, or whether anyone has any ideas or a way to do this.
I was using a lynx source dump for other pages, but since I can't straightforwardly screen-scrape this page, I came here to ask about it!

If anyone is interested, the page is http://downloadcenter.trendmicro.com/index.php?clk=left_nav&clkval=pattern_file&regs=NABU and the info I am trying to get is the row about ConsumerOPR.


掩于岁月 2024-09-04 19:31:34

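You will need to reverse-engineer what the JavaScript is doing. Does it fire off an AJAX request to populate the <div>? If so, it should be easy to sniff the request with Firebug and then replay it with LWP::UserAgent or WWW::Mechanize to get the information.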
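As a rough illustration of the replay approach, here is a minimal Perl sketch using LWP::UserAgent. It assumes the endpoint and form fields reported in the PHP answer further down this page (`/ajx/pattern_result.php` with `q=patresult_page` and `reg=NABU`); the site may have changed since then, and the helper names are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Endpoint and form fields as reported in the PHP answer on this page;
# treat them as assumptions, since the site may have changed.
my $ENDPOINT = 'http://downloadcenter.trendmicro.com/ajx/pattern_result.php';

sub form_fields {
    return [ q => 'patresult_page', reg => 'NABU' ];
}

# POST the form and return the response body, or die on an HTTP error.
sub fetch_results {
    require LWP::UserAgent;    # CPAN module, loaded only when we actually fetch
    my $ua  = LWP::UserAgent->new( timeout => 30 );
    my $res = $ua->post( $ENDPOINT, form_fields() );
    die 'Request failed: ' . $res->status_line . "\n" unless $res->is_success;
    return $res->decoded_content;
}

# Write a string to a file as UTF-8.
sub save_to_file {
    my ( $path, $content ) = @_;
    open my $fh, '>:encoding(UTF-8)', $path or die "Cannot open $path: $!\n";
    print {$fh} $content;
    close $fh or die "Cannot close $path: $!\n";
    return;
}

# Only touch the network when an output filename is given.
if ( my $outfile = shift @ARGV ) {
    save_to_file( $outfile, fetch_results() );
    print "Saved results to $outfile\n";
}
```

Run it as `perl fetch_results.pl results.html` (the script name is arbitrary) to save the raw response, which can then be parsed separately.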

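If the JavaScript is only doing pure DOM manipulation, then the data must already exist somewhere else in the page or in the JavaScript. Work out where it comes from and grab it there.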
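If the data does turn out to be embedded in the page's own JavaScript, a regular expression plus a JSON decoder is often enough. The sketch below is purely illustrative: the variable name `patternData` and the sample markup are made up, and it only works when the embedded literal happens to be valid JSON.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);    # core module since Perl 5.14

# Pull a JSON literal assigned to a JavaScript variable out of page source.
# Returns the decoded Perl structure, or undef if the variable is not found.
sub extract_js_data {
    my ( $html, $var_name ) = @_;

    # Non-greedy match up to the first "};" or "];" that ends the assignment.
    my ($json) = $html =~ /\b\Q$var_name\E\s*=\s*(\{.*?\}|\[.*?\])\s*;/s
        or return;
    return decode_json($json);
}

# Hypothetical page source, for illustration only.
my $sample = <<'HTML';
<div id="results"></div>
<script>
var patternData = {"name": "ConsumerOPR", "file": "example.zip"};
</script>
HTML

my $data = extract_js_data( $sample, 'patternData' );
print "$data->{name}: $data->{file}\n" if $data;
```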

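Finally, if neither of those options is adequate, you may need to use a real browser to do the work. There are a few options for automating browser behavior, such as WWW::Mechanize::Firefox or Win32::IE::Mechanize.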


蒲公英的约定 2024-09-04 19:31:34

"Bringing the Browser to the Server" by John Resig might be useful.


痴意少年 2024-09-04 19:31:34

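Since the content of the page is generated by JavaScript, you need to be able to:

- Execute some JavaScript code
- Possibly even some complex JS code that makes Ajax requests and so on
- Do so using an engine that supports the functions and methods present in a browser (such as DOM manipulation)

One solution could be to actually start a browser, navigate to the page, and then parse the page it has loaded to extract the information.

I have never used it for scraping, but the Selenium suite might help here: with Selenium RC you can start a real browser and drive it, and you should then have functions for getting data out of it.

It is not very fast, and it is fairly heavy (it has to start a browser!), but it works rather well: you would be using Firefox, for instance, to navigate to your page, which means a real JavaScript engine that lots of people use every day ;-)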
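For the real-browser route, a minimal Perl sketch might look like the following. The answer names Selenium RC; this sketch instead assumes the Selenium::Remote::Driver module from CPAN and a Selenium server already listening on localhost:4444. It polls because the div is filled in after the page loads. The 20-second timeout and the helper names are arbitrary choices for illustration.

```perl
use strict;
use warnings;
use Time::HiRes qw(sleep time);    # core; allows sub-second polling

# Call $check repeatedly until it returns a true value or $timeout seconds
# have passed. Returns the true value, or undef on timeout.
sub wait_for {
    my ( $check, $timeout, $interval ) = @_;
    my $deadline = time() + $timeout;
    while (1) {
        my $value = $check->();
        return $value if $value;
        return if time() >= $deadline;
        sleep($interval);
    }
}

# Load the page in a real browser and return the rendered contents of
# <div id="results">, or undef if it stays empty until the timeout.
sub rendered_results {
    my ($url) = @_;
    require Selenium::Remote::Driver;    # CPAN module, loaded on demand
    my $driver = Selenium::Remote::Driver->new(
        remote_server_addr => 'localhost',    # assumed Selenium server
        port               => 4444,
        browser_name       => 'firefox',
    );
    $driver->get($url);
    my $html = wait_for(
        sub {
            $driver->execute_script(
                q{return document.getElementById('results').innerHTML;});
        },
        20, 0.5,
    );
    $driver->quit;
    return $html;
}

# Only start a browser when a URL is given on the command line.
if ( my $url = shift @ARGV ) {
    my $html = rendered_results($url);
    print defined $html ? $html : "No content appeared in #results\n";
}
```

Polling on the div's own content avoids guessing a fixed delay, at the cost of treating an empty result as "not loaded yet".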


彼岸花似海 2024-09-04 19:31:34

This might be what you're looking for (in PHP):

<?php
// AJAX endpoint found by inspecting the page's requests in Firebug
$url = 'http://downloadcenter.trendmicro.com/ajx/pattern_result.php';

$ch = curl_init();
// Disables certificate checks; only relevant if the URL is https
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, 1);
// Form fields sent with the POST request
curl_setopt($ch, CURLOPT_POSTFIELDS, 'q=patresult_page&reg=NABU');
// Return the response as a string instead of printing it directly
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$content = curl_exec($ch);
curl_close($ch);

echo $content;
exit;

Once you have the content, you can use something like http://code.google.com/p/phpquery/ to parse out the results you need, or a similar Perl equivalent.

And/or do the parsing yourself.

FYI: all I did was use Firebug to inspect the requests and recreate them with PHP/cURL.
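For the "do the parsing yourself" route in Perl, a small regex-based sketch is shown below. It assumes the response contains ordinary `<tr>` rows with the label in one of the cells, which has not been verified against the real response; the sample markup and the `find_row` helper are made up. For anything less regular, an HTML parser module from CPAN would be more robust.

```perl
use strict;
use warnings;

# Return the text of each cell in the first table row that mentions $label,
# or an empty list if no row matches.
sub find_row {
    my ( $html, $label ) = @_;
    for my $row ( $html =~ m{<tr\b[^>]*>(.*?)</tr>}gis ) {
        next unless $row =~ /\Q$label\E/;
        my @cells = $row =~ m{<t[dh]\b[^>]*>(.*?)</t[dh]>}gis;
        for (@cells) {
            s/<[^>]+>//g;       # drop any nested tags
            s/^\s+|\s+$//g;     # trim surrounding whitespace
        }
        return @cells;
    }
    return;
}

# Hypothetical response fragment, for illustration only.
my $sample = <<'HTML';
<table>
  <tr><th>Pattern</th><th>File</th></tr>
  <tr><td> ConsumerOPR </td><td><a href="#">example.zip</a></td></tr>
</table>
HTML

print join( ' | ', find_row( $sample, 'ConsumerOPR' ) ), "\n";
```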
