Scraping Google Keyword Tools with CasperJS and PhantomJS
I'm currently trying to scrape Google Keyword Tools with CasperJS and PhantomJS (both excellent tools, thanks n1k0 and Ariya), but I can't get it to work.
Here is my current process:
- Log in with my Google Account (to avoid captchas in the Keyword Tools).
- Navigate to the Keyword Tools page.
- Fill in the search form and press Search.
I'm stuck at step 3: the search form is not a regular HTML form, so I can't use Casper#fill(). Instead, I'm accessing the fields directly. Here are some of the syntaxes I tried in order to change the value of the Word or phrase field:
this.evaluate(function() {
    // Trying to change the value...
    document.querySelector('textarea.sP3.sBFB').value = 'MY SUPER KEYWORDS';
    document.querySelector('textarea.sP3.sBFB').setAttribute('value', 'MY SUPER KEYWORDS');
    document.querySelector('textarea').value = 'MY SUPER KEYWORDS'; // there's only one <textarea> on the page
    // Trying to change other attributes...
    document.querySelector('textarea.sP3.sBFB').textContent = 'MY SUPER KEYWORDS';
    document.querySelector('textarea').style.backgroundColor = 'yellow';
});
Nothing works. I'm doing a Casper#capture() right after to see what the field contains. The capture confirms that I am on the right page and that I am logged in, but the <textarea> is empty.
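One thing worth ruling out is that the selector matches nothing: in that case `document.querySelector()` returns `null`, the first assignment throws, and the remaining lines in the same `evaluate()` call never run, which may go unnoticed unless page errors are being logged. The sketch below is a guarded version of the same idea. `setFieldValue` is a hypothetical helper name, and the assumption that the page reacts to `input` and `change` events is a guess, not something confirmed for the Keyword Tools page.

```javascript
// Hypothetical helper, meant to be passed to Casper#evaluate() so that it
// runs inside the page. It only relies on the page's global `document`.
function setFieldValue(selector, text) {
  var el = document.querySelector(selector);
  if (!el) {
    // Nothing matched: report it instead of throwing on `el.value`.
    return { found: false, value: null };
  }
  el.value = text;
  // Assumption: the page's own scripts listen for these events to notice edits.
  ['input', 'change'].forEach(function (type) {
    var evt = document.createEvent('HTMLEvents');
    evt.initEvent(type, true, true); // bubbles, cancelable
    el.dispatchEvent(evt);
  });
  // Plain object, so it can be serialized back to the CasperJS side.
  return { found: true, value: el.value };
}
```

Inside a CasperJS step this might be called as `var result = this.evaluate(setFieldValue, 'textarea', 'MY SUPER KEYWORDS');` and then printed with `this.echo(JSON.stringify(result));`, assuming a CasperJS version whose `evaluate()` accepts extra arguments. A result with `found: false` would point at the selector rather than at the assignment.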
Strangely, I can access other parts of the DOM: I could change the text of a link that said Advanced Options and Filters to ___VINCE SAYS HELLO___ (see capture) by doing the following:
this.evaluate(function() {
    document.querySelector('a.sLAB').textContent = '___VINCE SAYS HELLO___';
});
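Since one selector visibly works and another apparently does not, it can help to ask the page what each selector really matches before writing to anything. The following is a minimal sketch of such a check. `describeMatches` is a hypothetical helper, not part of CasperJS or PhantomJS.

```javascript
// Hypothetical diagnostic helper, meant to be passed to Casper#evaluate().
// Returns one plain object per element matched by `selector`.
function describeMatches(selector) {
  var nodes = document.querySelectorAll(selector);
  // A NodeList is not an Array, so borrow Array.prototype.map.
  return Array.prototype.map.call(nodes, function (el) {
    return {
      tag: el.tagName.toLowerCase(),
      className: el.className,
      // Only form controls expose a string `value`.
      value: typeof el.value === 'string' ? el.value : null
    };
  });
}
```

Calling it once with `'textarea'` and once with `'textarea.sP3.sBFB'` would show whether the class-based selector matches anything at all, and which class names the textarea actually carries in the session PhantomJS sees. An empty array for the second call would mean the class names differ from the ones copied from the browser.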
PS. I know scraping Google Keyword Tools is against the TOS, but I'm thinking this question might be of interest to anyone trying to scrape a JavaScript/Ajax-heavy site.
Comments (3)
You can't use elt.value on a textarea. Did you try with elt.textContent?

Why are you trying to scrape the results? Google already creates a CSV file for us. Try downloading that instead. The selector for that link must be something like $('.gux-combo gux-dropdown-c .sJK'). Will you use that for automating things?

I'm not sure exactly what's happening here, but the classes that you're using for targeting are different for me. The OneBoxKeywordsInputPanel-input textarea that I assume you're attempting to target has a second class, sPFB, and no other classes. It's possible that these cryptic classes are dynamic in some way. I'd recommend using the more descriptive class names instead. The following works just fine for me: