Parsing HTML and following JavaScript links

Posted on 2024-12-11 05:36:15

An academic colleague has asked me to extract information from a website. I need to link the content of a table on a web page (not too hard) with the contents of a text file that is only reachable (as far as I can tell) by clicking on a JavaScript link, e.g.

<a id="tk1" href="javascript:__doPostBack('tk1$ContentPlaceHolder1$grid$tk$OpenFileButton','')">

The table is conveniently inside a table with id='tk1', which is nice, but how do I follow the link that pulls the text file?

Ideally I'd like to do this in R... I can grab the relevant table in text format by saying

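library(XML)
# URL of the page to scrape (elided in the original question)
u <- "the url of interest..."
# Parse every <table> on the page into a list of data frames
tables <- readHTMLTable(u)
# Keep the table(s) whose name contains 'tk1'
interestingTable <- tables[grep('tk1', names(tables))]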

This gives the text in the table, but how do I grab the HTML for that particular table? And how do I "click" the button and get the text file behind it?

I note that the page has a form with very large hidden values; the site appears to be ASP.NET-driven and uses impenetrable URLs.

Many thanks!

Comments (2)

愛上了 2024-12-18 05:36:15

This is somewhat tricky, and not fully integrated in R, but some system()-fiddling will get you started.


var page = new WebPage();
page.open('http://www.menne-biomed.de/uni/JavaButton.html', function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        var ua = page.evaluate(function () {
            // Read the javascript: href of the link with id 'tk1'
            var t = document.getElementById('tk1').href;
            // Capture the argument between the parentheses of doPostBack(...)
            var re = /\((.*)\)/;
            // Evaluate it in the page context to resolve the variable to a URL
            return eval(re.exec(t)[1]);
        });
        console.log(ua); // Outputs http://cran.at.r-project.org/
    }
    phantom.exit();
});

  • With phantomjs on the path and the script above saved as javabutton.js, call

    phantomjs javabutton.js

The link will be displayed on the console. Use any method to get it into RCurl.

Not elegant, but maybe someone will wrap phantomjs into R one day. In case the link to JavaButton.html is lost, here is its code.

<!DOCTYPE html>
<html>
<head>
<script>
inaccesibleJavascriptVar = 'http://' + 'cran.at.r-project.org/';
function doPostBack(myref)
{
    window.location.href = myref;
    return false;
}
</script>
</head>
<body>
<a id="tk1" href="javascript:doPostBack(inaccesibleJavascriptVar)">Click here</a>
</body>
</html>
尐籹人 2024-12-18 05:36:15

Have a look at the RCurl package:

http://www.omegahat.org/RCurl/
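
For context on what would have to be posted: in ASP.NET Web Forms, __doPostBack(target, argument) normally copies its two arguments into the hidden form fields __EVENTTARGET and __EVENTARGUMENT and then submits the page's form together with the other hidden fields (such as __VIEWSTATE). So the "click" can probably be reproduced as a form POST. The following is only a minimal sketch in JavaScript of how those fields could be assembled from the link in the question. The function names parsePostBackHref and buildPostBody are hypothetical, and the __VIEWSTATE value is a placeholder, since the real one has to be read from the page itself.

```javascript
// Sketch: turn an ASP.NET "javascript:__doPostBack(...)" href into the
// form fields that a real click would submit. Helper names are hypothetical.

// Extract the two single-quoted arguments of __doPostBack('target','argument').
function parsePostBackHref(href) {
    var match = /__doPostBack\('([^']*)','([^']*)'\)/.exec(href);
    if (match === null) {
        return null; // not a postback link
    }
    return { eventTarget: match[1], eventArgument: match[2] };
}

// Build a URL-encoded POST body from the page's hidden fields plus the
// two event fields that __doPostBack would have filled in.
function buildPostBody(hiddenFields, postBack) {
    var fields = {};
    Object.keys(hiddenFields).forEach(function (name) {
        fields[name] = hiddenFields[name];
    });
    fields.__EVENTTARGET = postBack.eventTarget;
    fields.__EVENTARGUMENT = postBack.eventArgument;
    return Object.keys(fields).map(function (name) {
        return encodeURIComponent(name) + '=' + encodeURIComponent(fields[name]);
    }).join('&');
}

// The href from the question.
var href = "javascript:__doPostBack('tk1$ContentPlaceHolder1$grid$tk$OpenFileButton','')";
var postBack = parsePostBackHref(href);

// Placeholder value: a real page supplies its own long __VIEWSTATE string.
var body = buildPostBody({ __VIEWSTATE: 'PLACEHOLDER' }, postBack);

console.log(postBack.eventTarget);
console.log(body);
```

The resulting name/value pairs are what a form-posting function such as RCurl's postForm would need to send to the page's own URL. Whether the server then returns the text file directly depends on the site, so this is a starting point, not a guaranteed recipe.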
