Parsing HTML and following JavaScript links
An academic colleague has asked me to extract information from a website. I need to link the content of a table on a web page (not too hard) with the contents of a text file that is only reachable (as far as I can tell) by clicking on a JavaScript link, e.g.
<a id="tk1" href="javascript:__doPostBack('tk1$ContentPlaceHolder1$grid$tk$OpenFileButton','')">
The table is conveniently inside a table with id='tk1', which is nice, but how do I follow the link that pulls the text file?
Ideally I'd like to do this in R. I can grab the relevant table in text format with:
u <- "..."  # the URL of interest (elided in the original post)
library(XML)
tables <- readHTMLTable(u)  # one data frame per <table>, named by table id where present
interestingTable <- tables[grep('tk1', names(tables))]
This gives the text in the table, but how do I grab the HTML for that particular table? And how do I "click" the button and get the text file behind it?
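For the first half of that question, a minimal sketch using the same XML package: parse the page into a document, select the table node with XPath, and serialise it back to HTML. The `page` string below is a made-up stand-in for the real site, and matching on `contains(@id, 'tk1')` is an assumption that mirrors the `grep('tk1', ...)` call above.

```r
library(XML)

# Hypothetical stand-in for the real page; in practice, parse the downloaded HTML.
page <- '<html><body>
<table id="tk1"><tr><td>row 1</td>
<td><a id="lnk" href="javascript:__doPostBack(\'a$b\',\'\')">open</a></td></tr></table>
</body></html>'

doc <- htmlParse(page, asText = TRUE)

# Select the table whose id contains "tk1" (same idea as the grep() above).
nodes <- getNodeSet(doc, "//table[contains(@id, 'tk1')]")

# Raw HTML of that table as a single string.
table_html <- saveXML(nodes[[1]])

# The href of every link inside the table, including the javascript: ones.
hrefs <- xpathSApply(nodes[[1]], ".//a[@href]", xmlGetAttr, "href")
```

`hrefs` then holds the `javascript:__doPostBack(...)` strings, which is the piece needed to work out what a "click" would submit.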
I note that there is a form with massive hidden values; the site appears to be ASP.NET driven and uses impenetrable URLs.
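Those hidden values are the key to the "click". In ASP.NET Web Forms, `__doPostBack(target, argument)` writes its two arguments into the hidden fields `__EVENTTARGET` and `__EVENTARGUMENT` and submits the form, so one possible approach is to rebuild that POST by hand. The sketch below only prepares the form fields; the sample page, its field values, and the `page_url` placeholder are assumptions, and a real site may also need cookies carried over from the first request.

```r
library(XML)

# Split javascript:__doPostBack('target','argument') into its two arguments.
# Assumes single-quoted arguments with no space after the comma, as in the question.
parse_postback <- function(href) {
  pattern <- "__doPostBack\\('([^']*)','([^']*)'\\)"
  m <- regmatches(href, regexec(pattern, href))[[1]]
  if (length(m) != 3) stop("not a __doPostBack link: ", href)
  list(target = m[2], argument = m[3])
}

# Collect every hidden input on the page and set the two event fields.
postback_fields <- function(doc, href) {
  pb <- parse_postback(href)
  hidden <- getNodeSet(doc, "//input[@type='hidden']")
  fields <- vapply(hidden, xmlGetAttr, character(1), "value", "")
  names(fields) <- vapply(hidden, xmlGetAttr, character(1), "name", "")
  fields[["__EVENTTARGET"]] <- pb$target
  fields[["__EVENTARGUMENT"]] <- pb$argument
  fields
}

# Hypothetical stand-in for the real page and its hidden state.
page <- '<html><body><form action="page.aspx" method="post">
<input type="hidden" name="__VIEWSTATE" value="dummy-state" />
<input type="hidden" name="__EVENTVALIDATION" value="dummy-validation" />
</form></body></html>'
doc <- htmlParse(page, asText = TRUE)
href <- "javascript:__doPostBack('tk1$ContentPlaceHolder1$grid$tk$OpenFileButton','')"
fields <- postback_fields(doc, href)

# The request itself is not run here; page_url is a placeholder for the real address.
# library(RCurl)
# txt <- postForm(page_url, .params = as.list(fields), style = "POST")
```

If the server accepts the rebuilt form, the response body should be whatever the button would have returned in a browser; if it does not, a headless browser is the fallback.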
Many thanks!
Comments (2)
This is somewhat tricky, and not fully integrated in R, but some system()-fiddling will get you started.
With phantomjs on your PATH, call
phantomjs javabutton.js
The link will be displayed on the console. Use any method to get it into RCurl.
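One way to do that hand-off from inside R is `system2()` with `stdout = TRUE`, which returns the command's console output as a character vector. This is only a sketch: `run_and_capture` is a hypothetical helper name, and treating the last output line as the link is an assumption about what javabutton.js prints.

```r
# Run an external command and return its non-empty output lines.
run_and_capture <- function(command, args = character()) {
  out <- system2(command, args, stdout = TRUE)
  out[nzchar(out)]
}

# Assumes phantomjs is on the PATH and javabutton.js is in the working directory;
# also assumes the script prints the link as its last line.
# link <- run_and_capture("phantomjs", "javabutton.js")
# library(RCurl)
# txt <- getURL(link[length(link)])
```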
Not elegant, but maybe someone will wrap phantomjs into R one day. In case the link to JaveButton.html should be lost, here it is as code.
Have a look at the RCurl package:
http://www.omegahat.org/RCurl/