抓取 ASP.net 网站:需要使用 Python Mechanize 对 Gridview 进行分页
我正在尝试抓取一个 asp.net 页面,我需要在其中翻阅 gridview 控件中的项目列表。我从未使用过 asp.net,但一直在网上搜索指针,但现在我遇到了困难。页面链接的形式为:
javascript:__doPostBack('ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ContentPlaceHolderFullWidthContent$ContentPlaceHolderMain$gridViewItems','Page$2')
我目前正在尝试使用 Python 中的 Mechanize 来实现此功能。我最初尝试了以下操作,假设 VIEWSTATE 变量将由 mechanize 处理。
br.form.set_all_readonly(False)
br['__EVENTTARGET'] = 'ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ContentPlaceHolderFullWidthContent$ContentPlaceHolderMain$gridViewItems'
br['__EVENTARGUMENT'] = 'Page$2'
response = br.submit(name="ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ContentPlaceHolderFullWidthContent$ContentPlaceHolderMain$itemLocator$btnItemSearch")
html = br.response().read()
使用网络监视器(Fiddler2),我注意到另外两个变量被填充,所以我也添加了这些变量:
br.select_form(nr=0)
br.form.new_control('hidden','ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ScriptManager1',attrs = dict(name='ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ScriptManager1'))
br.form.new_control('hidden','hiddenInputToUpdateATBuffer_CommonToolkitScripts',attrs = dict(name='hiddenInputToUpdateATBuffer_CommonToolkitScripts'))
br.form.new_control('hidden','__ASYNCPOST',attrs = dict(name='__ASYNCPOST'))
br.form.set_all_readonly(False)
br['hiddenInputToUpdateATBuffer_CommonToolkitScripts'] = '1'
br['__ASYNCPOST'] = 'TRUE'
br['ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ScriptManager1'] = 'ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ContentPlaceHolderFullWidthContent$ContentPlaceHolderMain$SearchResultsUpdatePanel|ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ContentPlaceHolderFullWidthContent$ContentPlaceHolderMain$gridViewItems'
br['__EVENTTARGET'] = 'ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ContentPlaceHolderFullWidthContent$ContentPlaceHolderMain$gridViewItems'
br['__EVENTARGUMENT'] = 'Page$2'
response = br.submit(name="ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ContentPlaceHolderFullWidthContent$ContentPlaceHolderMain$itemLocator$btnItemSearch")
html = br.response().read()
对于这两个变量,我返回的 html 仍然仅适用于第 1 页。
我认为可能存在一些潜在问题:
我不确定我是否正确提交。页面上有多个提交按钮,因此我正在搜索的按钮是“搜索”按钮,这是我以前用来访问第一页的按钮。我可以看到这就是显示第一页的原因。如果我使用没有名称的 br.submit() ,那么它会使用另一个提交控件,将您带到其他地方。
当您在浏览器中单击页码时,gridview 控件会更新,而无需重新加载页面。由于我没有运行 Javascript,也许我无法获取该数据,但我至少希望能够从 POST 中获取数据并进行解析。
任何帮助将不胜感激!
I'm trying to scrape an asp.net page where I need to page through the items a list of items that are in a gridview control. I've never used asp.net but have been searching the Net for pointers but now I've hit a brick wall. The page links are of the form:
javascript:__doPostBack('ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ContentPlaceHolderFullWidthContent$ContentPlaceHolderMain$gridViewItems','Page$2')
I'm currently trying to get this working using Mechanize in Python. I initially tried the following, assuming that the VIEWSTATE variables would be handled by mechanize.
br.form.set_all_readonly(False)
br['__EVENTTARGET'] = 'ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ContentPlaceHolderFullWidthContent$ContentPlaceHolderMain$gridViewItems'
br['__EVENTARGUMENT'] = 'Page$2'
response = br.submit(name="ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ContentPlaceHolderFullWidthContent$ContentPlaceHolderMain$itemLocator$btnItemSearch")
html = br.response().read()
Using a network monitor(Fiddler2), I noticed that two more variables were populated so I added these in too:
br.select_form(nr=0)
br.form.new_control('hidden','ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ScriptManager1',attrs = dict(name='ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ScriptManager1'))
br.form.new_control('hidden','hiddenInputToUpdateATBuffer_CommonToolkitScripts',attrs = dict(name='hiddenInputToUpdateATBuffer_CommonToolkitScripts'))
br.form.new_control('hidden','__ASYNCPOST',attrs = dict(name='__ASYNCPOST'))
br.form.set_all_readonly(False)
br['hiddenInputToUpdateATBuffer_CommonToolkitScripts'] = '1'
br['__ASYNCPOST'] = 'TRUE'
br['ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ScriptManager1'] = 'ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ContentPlaceHolderFullWidthContent$ContentPlaceHolderMain$SearchResultsUpdatePanel|ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ContentPlaceHolderFullWidthContent$ContentPlaceHolderMain$gridViewItems'
br['__EVENTTARGET'] = 'ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ContentPlaceHolderFullWidthContent$ContentPlaceHolderMain$gridViewItems'
br['__EVENTARGUMENT'] = 'Page$2'
response = br.submit(name="ctl00$ctl00$ctl00$ContentPlaceHolderEverything$ContentPlaceHolderFullWidthContent$ContentPlaceHolderMain$itemLocator$btnItemSearch")
html = br.response().read()
With both of these the html I get back is still for page 1 only.
I think there may be a couple of potential issues:
I'm not sure I'm doing the submit right. There are multiple submit buttons on the page so the one I'm searching for is the "search" button, which is what I previously used to get to the first page. I could see that being why the first page is displayed. If I use br.submit() without a name then it uses another submit control that takes you somewhere else.
When you click a page number in a browser, the gridview control updates without a page reload. As I'm not running Javascript, maybe I can't get that but I would at least expect to be able to get back the data from the POST and parse that.
Any help would be much appreciated!
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
通过根据此处的答案构建 xmlhttprequest 来设法做到这一点:
使用Python和Mechanize提交表单数据并进行身份验证
Managed to to it by building an xmlhttprequest per the answer here:
Using Python and Mechanize to submit form data and authenticate