Screen scraping an aspx page with Python Mechanize - JavaScript form submission

Published 2024-11-09 19:08:44


I'm trying to scrape UK Food Ratings Agency data aspx search results pages (e.g. http://ratings.food.gov.uk/QuickSearch.aspx?q=po30 ) using Mechanize/Python on scraperwiki ( http://scraperwiki.com/scrapers/food_standards_agency/ ), but I'm running into a problem when trying to follow "next" page links, which have the form:

<input type="submit" name="ctl00$ContentPlaceHolder1$uxResults$uxNext" value="Next >" id="ctl00_ContentPlaceHolder1_uxResults_uxNext" title="Next >" />

The form handler looks like:

<form method="post" action="QuickSearch.aspx?q=po30" onsubmit="javascript:return WebForm_OnSubmit();" onkeypress="javascript:return WebForm_FireDefaultButton(event, 'ctl00_ContentPlaceHolder1_buttonSearch')" id="aspnetForm">
<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
<input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />
<input type="hidden" name="__LASTFOCUS" id="__LASTFOCUS" value="" />

An HTTP trace when I manually click Next links shows __EVENTTARGET as empty? All the cribs I can find on other scrapers show the manipulation of __EVENTTARGET as the way of handling Next pages.

Indeed, I'm not sure how the page I want to scrape ever loads the next page? Whatever I throw at the scraper, it only ever manages to load the first results page. (Even being able to change the number of results per page would be useful, but I can't see how to do that either!)

So - any ideas on how to scrape the 1+N'th results pages for N>0?


Comments (2)

旧人哭 2024-11-16 19:08:44


Mechanize doesn't handle JavaScript, but for this particular case it isn't needed.

First we open the result page with mechanize

import mechanize

url = 'http://ratings.food.gov.uk/QuickSearch.aspx?q=po30'
br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
br.open(url)
response = br.response().read()

Then we select the aspnet form:

br.select_form(nr=0) #Select the first (and only) form - it has no name so we reference by number

The form has 5 submit buttons - we want to submit the one that takes us to the next result page:

response = br.submit(name='ctl00$ContentPlaceHolder1$uxResults$uxNext').read()  #"Press" the next submit button
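What that submit actually sends is an ordinary POST: the clicked button contributes its own name/value pair, and the hidden ASP.NET fields ride along unchanged. __EVENTTARGET stays empty because it is only filled in (by __doPostBack) for link-style controls, which is why the HTTP trace in the question shows it blank. A standard-library sketch (Python 3) of the request body - the __VIEWSTATE value here is a placeholder; the real, very long value must be read from the page:

```python
from urllib.parse import urlencode

# Hidden ASP.NET fields plus the clicked button's own name/value pair.
# "..." stands in for the real __VIEWSTATE scraped from the page source.
fields = {
    "__EVENTTARGET": "",    # stays empty: a real submit button was clicked
    "__EVENTARGUMENT": "",
    "__LASTFOCUS": "",
    "__VIEWSTATE": "...",
    "ctl00$ContentPlaceHolder1$uxResults$uxNext": "Next >",
}

body = urlencode(fields)
print(body)
```

mechanize builds an equivalent body for you when `br.submit()` is called with the button's name, which is why no JavaScript ever needs to run.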

The other submit buttons in the form are:

ctl00$uxLanguageSwitch # Switch language to Welsh
ctl00$ContentPlaceHolder1$uxResults$Button1 # Search submit button
ctl00$ContentPlaceHolder1$uxResults$uxFirst # First result page
ctl00$ContentPlaceHolder1$uxResults$uxPrevious # Previous result page
ctl00$ContentPlaceHolder1$uxResults$uxLast # Last result page

In mechanize we can get form info like this:

for form in br.forms():
    print form
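If you are working from raw markup rather than a live mechanize session, the submit-control names can also be recovered with the standard library alone. A small Python 3 sketch, run here against the Next-button snippet from the question:

```python
from html.parser import HTMLParser

class SubmitFinder(HTMLParser):
    """Collect the name attribute of every <input type="submit">."""
    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # Also fires for self-closing <input ... /> tags.
        if tag == "input":
            d = dict(attrs)
            if d.get("type") == "submit" and d.get("name"):
                self.names.append(d["name"])

snippet = ('<input type="submit" name="ctl00$ContentPlaceHolder1$uxResults$uxNext" '
           'value="Next >" id="ctl00_ContentPlaceHolder1_uxResults_uxNext" title="Next >" />')

finder = SubmitFinder()
finder.feed(snippet)
print(finder.names)  # ['ctl00$ContentPlaceHolder1$uxResults$uxNext']
```

Feeding it a full results page instead of the snippet would list every submit button in the form.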
红墙和绿瓦 2024-11-16 19:08:44


Mechanize does not handle JavaScript.

There are many ways to handle this, however, including QtWebKit, python-spidermonkey, HtmlUnit (using Jython), or SeleniumRC.

Here is how it might be done with SeleniumRC:

import selenium
sel=selenium.selenium("localhost",4444,"*firefox", "http://ratings.food.gov.uk")   
sel.start()
sel.open("QuickSearch.aspx?q=po30")
sel.click('ctl00$ContentPlaceHolder1$uxResults$uxNext')

See also these related SO questions:

  1. How to click a link that has JavaScript
  2. Click on a JavaScript link within Python