Downloading a file via a JavaScript link with PhantomJS
I am attempting to scrape the FanGraphs leaderboard page whose URL appears in my script below.
If you click the small "Export Data" button at the top-right of the table, a JavaScript function runs and my browser downloads the data as a .csv file. I'd like to write a PhantomJS script that does this automatically. Any ideas?
The button is coded in the HTML like this:
<a id="LB_cmdCSV" href="javascript:__doPostBack('LB$cmdCSV','')">Export Data</a></div>
I also found this function in the HTML source code:
<script type="text/javascript">
//<![CDATA[
var theForm = document.forms['form1'];
if (!theForm) {
    theForm = document.form1;
}
function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;
        theForm.submit();
    }
}
//]]>
</script>
I'm very new to PhantomJS/JavaScript and could use some pointers here. I think I've found all the info I need to do this automatically (correct me if I'm wrong), but I'm not sure where to start on coding it. Thanks for any help.
EDIT - This is what my script looks like right now:
var page = new WebPage();
url = 'http://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2011&month=0&season1=2011&ind=0&team=0&rost=0&players=0';
page.open(encodeURI(url), function (status) {
    if (status !== "success") {
        console.log("Unable to access website");
    } else {
        page.evaluate(function() {
            __doPostBack('LB$cmdCSV', '');
        });
    }
    phantom.exit(0);
});
Comments (4)
What has worked very well for me is simulating mouse clicks on the desired element.
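A minimal sketch of what simulating a click could look like in a PhantomJS script, not necessarily how this answer did it. It assumes the page URL is passed as the first command-line argument, that the export link still has the id LB_cmdCSV shown in the question, and that a 5 second wait (an arbitrary value) is enough for the postback to go out. The helper name centerOf is made up for illustration. Note that this only triggers the click; it does not by itself save the returned file.

```javascript
// Hypothetical helper: centre of a bounding rectangle, used as the click target.
function centerOf(rect) {
    return {
        x: rect.left + rect.width / 2,
        y: rect.top + rect.height / 2
    };
}

// The PhantomJS-specific part only runs when the phantom global exists.
if (typeof phantom !== 'undefined') {
    var system = require('system');
    var page = require('webpage').create();

    page.open(system.args[1], function (status) {
        if (status !== 'success') {
            console.log('Unable to access website');
            phantom.exit(1);
            return;
        }
        // Copy the rectangle into a plain object so it can cross
        // from the page context back into the script.
        var rect = page.evaluate(function () {
            var el = document.getElementById('LB_cmdCSV');
            if (!el) { return null; }
            var r = el.getBoundingClientRect();
            return { left: r.left, top: r.top, width: r.width, height: r.height };
        });
        if (!rect) {
            console.log('Export link not found');
            phantom.exit(1);
            return;
        }
        // Coordinates are viewport-relative; this assumes the page is not scrolled.
        var point = centerOf(rect);
        page.sendEvent('click', point.x, point.y);
        // Assumption: wait 5 seconds before exiting so the request is not cut off.
        setTimeout(function () { phantom.exit(0); }, 5000);
    });
}
```

Sending a real click event goes through the same path as a user click, so any onclick handlers or javascript: hrefs on the element run as they would in a browser.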
Couldn't you just run the code __doPostBack('LeaderBoard1$cmdCSV',''); within the context of the web page?
I haven't tested this code within PhantomJS, but theoretically it should work since running the __doPostBack method from Google Chrome's developer console worked. If in doubt about running JavaScript code in PhantomJS, Google Chrome's developer console is a great way to test out the code as it runs on WebKit like PhantomJS. I hope this helps.
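A sketch of running the postback in the page context, under a few assumptions: the URL is passed as the first command-line argument, the event target is the 'LB$cmdCSV' value from the question (this answer writes it as 'LeaderBoard1$cmdCSV', so check the page source), and 5 seconds is an arbitrary wait. The helper name firePostBack is hypothetical. The main difference from the script in the question is that phantom.exit is delayed, since exiting right after page.evaluate may end the process before the form submission goes anywhere.

```javascript
// Hypothetical helper: calls the page's own __doPostBack inside the page
// context. Returns true if the function existed and was called.
function firePostBack(page, target, argument) {
    return page.evaluate(function (t, a) {
        if (typeof __doPostBack !== 'function') { return false; }
        __doPostBack(t, a);
        return true;
    }, target, argument);
}

// The PhantomJS-specific part only runs when the phantom global exists.
if (typeof phantom !== 'undefined') {
    var system = require('system');
    var page = require('webpage').create();
    var started = false;

    // Log each finished response so the reply to the postback is visible.
    page.onResourceReceived = function (response) {
        if (response.stage === 'end') {
            console.log(response.status + ' ' + response.contentType + ' ' + response.url);
        }
    };

    page.open(system.args[1], function (status) {
        if (started) { return; } // ignore any later load notifications
        started = true;
        if (status !== 'success') {
            console.log('Unable to access website');
            phantom.exit(1);
            return;
        }
        console.log('postback fired: ' + firePostBack(page, 'LB$cmdCSV', ''));
        // Assumption: 5 seconds is enough for the server to answer.
        setTimeout(function () { phantom.exit(0); }, 5000);
    });
}
```

The logged content types show whether the server actually answered with CSV data, which is a useful first check before working out how to save it.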
It's an ASP.NET-powered website, so this is going to be a tad trickier than most: you will have to use cURL commands to mimic POSTing the entire form, including the viewstate and eventvalidation strings, back to the server. It would probably be easier to lift the data straight out of the page you have.
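To make the form-replay idea concrete, here is a rough sketch of building the POST body from the page's hidden fields. The function names hiddenFields and postBackBody are made up for illustration, and the regex-based parsing assumes quoted attributes and values without HTML entities (true for base64 viewstate strings, not for HTML in general).

```javascript
// Hypothetical helper: collects name/value pairs of <input type="hidden"> tags.
function hiddenFields(html) {
    var fields = {};
    var tags = html.match(/<input\b[^>]*>/gi) || [];
    tags.forEach(function (tag) {
        if (!/type\s*=\s*["']hidden["']/i.test(tag)) { return; }
        var name = /\bname\s*=\s*["']([^"']*)["']/i.exec(tag);
        var value = /\bvalue\s*=\s*["']([^"']*)["']/i.exec(tag);
        if (name) { fields[name[1]] = value ? value[1] : ''; }
    });
    return fields;
}

// Hypothetical helper: URL-encoded body that mimics what __doPostBack submits.
function postBackBody(html, target, argument) {
    var fields = hiddenFields(html);
    fields.__EVENTTARGET = target;     // which control "fired"
    fields.__EVENTARGUMENT = argument; // its argument, empty here
    return Object.keys(fields).map(function (key) {
        return encodeURIComponent(key) + '=' + encodeURIComponent(fields[key]);
    }).join('&');
}
```

The resulting string could then be sent with curl's --data option, or from PhantomJS with page.open(url, 'POST', body, callback). Other visible form fields may also need to be included, depending on what the server expects.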
I'm using Ruby on Rails and Watir Webdriver (https://github.com/watir/watir-webdriver).
I found that ASP.NET, when handling "doPostBack", identifies the browser from the User-Agent sent by the client. With PhantomJS, the user agent is reported as something like "Mozilla/5.0 (Unknown; Linux i686) AppleWebKit/534.34 (KHTML, like Gecko) Safari/534.34 PhantomJS/1.9.1".
Therefore it is necessary to change the client's user agent before accessing the page, which is what I did in my Rails setup.
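For a plain PhantomJS script (rather than the Watir setup this answer describes), the same idea might look like the sketch below. The User-Agent string is only an example value, the helper name useUserAgent is made up, and the URL is assumed to come from the first command-line argument.

```javascript
// Example desktop-browser User-Agent; the exact value is an assumption.
var DESKTOP_UA = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 ' +
    '(KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36';

// Hypothetical helper: sets the User-Agent on a page object and returns it.
function useUserAgent(page, userAgent) {
    page.settings.userAgent = userAgent;
    return page;
}

// The PhantomJS-specific part only runs when the phantom global exists.
if (typeof phantom !== 'undefined') {
    var system = require('system');
    // The setting has to be in place before the page is opened.
    var page = useUserAgent(require('webpage').create(), DESKTOP_UA);

    page.open(system.args[1], function (status) {
        // Report the User-Agent as the page itself sees it.
        var seen = page.evaluate(function () { return navigator.userAgent; });
        console.log(status + ': ' + seen);
        phantom.exit(status === 'success' ? 0 : 1);
    });
}
```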