Screen scraping an ASP.NET web page to retrieve data displayed in a GridView

Posted on 2024-07-15 23:17:12

I am using Ruby to screen scrape a web page (created in ASP.NET) which uses a GridView to display data. I can successfully read the data displayed on page 1 of the grid, but I can't figure out how to move to the next page in the grid to read all the data.

The problem is that the page number hyperlinks are not normal hyperlinks (with a URL) but JavaScript hyperlinks that cause a postback to the same page.

An example of the hyperlink:

<a href="javascript:__doPostBack('gvw_offices','Page$6')" style="color:Black;">6</a>

Comments (4)

鱼忆七猫命九 2024-07-22 23:17:12

If you're already using Ruby for processing, I recommend Watir, a Ruby library designed for browser testing. For one thing, it gives you a much nicer interface to the DOM elements on the page, and it makes clicking links like this easier:

ie.link(:text, '6').click

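Then, of course, you have easier methods for navigating the table as well. It's easy enough to automate this process:

# Process page 1 first; it is already displayed, and this assumes the
# pager renders no link for the current page.
(2..total_number_of_pages).each do |next_page|
  # Link text is a string, so convert the page number before matching.
  ie.link(:text, next_page.to_s).click
  # table processing goes here
end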

I don't know your use case, but this approach has its advantages and disadvantages. For one thing, it actually runs a browser instance, so if you need to run this frequently and quietly in the background in a completely automated way, it may not be the best approach. On the other hand, if it's OK to launch a browser instance, then you don't have to worry about all that postback nonsense, and you can just click the link as if you were a user.

Watir: http://wtr.rubyforge.org/

时光是把杀猪刀 2024-07-22 23:17:12

You'll need to figure out the actual URL.

Option 1a: Open the page in a browser with good developer support (e.g. Firefox with the web development tools) and look through the source to find where __doPostBack is defined. Figure out what URL it's constructing. Note that it might not be in the main page source, but instead in something that the page loads.

Option 1b: Ditto, but have Ruby do it. If you're fetching the page with Net::HTTP you've already got the tools to find the definition of __doPostBack (the body as a string, Ruby's grep, and the ability to request additional files, such as those in script tags).

Option 2: Monitor the traffic between a browser and the page (e.g. with a logging proxy) to find out what the URL is.

Option 3: Ask the owner of the web page.

Option 4: Guess. This may not be as bad as it sounds (e.g. if the original URL ends with "...?page=1" or something) but in general this is the least likely to work.

Edit (in response to your comment on the other question):

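Assuming you're using the Net::HTTP library, you can do a postback by replacing your get with a post, e.g. my_http.post(my_url, form_data) instead of my_http.get(my_url). Net::HTTP#post takes the request body as its second argument; here form_data stands for the URL-encoded form fields that the page would submit.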
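A minimal sketch of such a postback, using only Ruby's standard library. It assumes the usual ASP.NET Web Forms behaviour, where __doPostBack copies its two arguments into the hidden fields __EVENTTARGET and __EVENTARGUMENT and submits the form along with the other hidden fields (such as __VIEWSTATE). The helper names hidden_fields, postback_data and fetch_grid_page are hypothetical, and the regex-based parsing is a shortcut that does not decode HTML entities:

```ruby
require 'net/http'
require 'uri'

# Collect the name/value pairs of every <input type="hidden"> on the page.
# A regex is enough for a sketch; a real scraper would use an HTML parser.
def hidden_fields(html)
  fields = {}
  html.scan(/<input\b[^>]*>/i) do |tag|
    next unless tag =~ /type=["']hidden["']/i
    name  = tag[/\bname=["']([^"']*)["']/i, 1]
    value = tag[/\bvalue=["']([^"']*)["']/i, 1] || ''
    fields[name] = value if name
  end
  fields
end

# Build the form data for a postback: the page's hidden fields plus the
# event target and argument taken from the pager link.
def postback_data(html, target, argument)
  hidden_fields(html).merge(
    '__EVENTTARGET'   => target,
    '__EVENTARGUMENT' => argument
  )
end

# POST the form back to the same URL (hypothetical helper, not called here).
# Note: post_form does not carry cookies, so a session-based page needs more.
def fetch_grid_page(url, html, target, argument)
  Net::HTTP.post_form(URI(url), postback_data(html, target, argument))
end
```

For the link in the question, the call would be fetch_grid_page(url, first_page_html, 'gvw_offices', 'Page$6'), where url and first_page_html are whatever you already fetched. Each response carries fresh hidden fields, so the next postback should be built from the most recent page body rather than the first one.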

Edit (in response to danieltalsky's answer):

Watir may be a really good solution for you (I'm kicking myself for not having thought of it), but be aware that you may have to manually fire the event or jump through other hoops to get what you want. As a specific gotcha, with any asynchronous fetch like this you need to make sure that the full response has come back before you scrape it; that isn't a problem when you're doing the request inline yourself.

久伴你 2024-07-22 23:17:12

You will have to perform the postback. The data is passed back to the server with a form POST. Like Markus said, use something like Firebug or the Developer Tools in IE 8, plus Fiddler, to watch the traffic. But honestly, this is a web form using the bloated GridView, so you will be in for a fun adventure. ;)

淡莣 2024-07-22 23:17:12

You'll need to do some investigation to figure out what HTTP request the JavaScript execution is performing. I've used the Mozilla browser with the Firebug plugin and also the "Live HTTP Headers" plugin to help determine what is going on. It will likely become clear which requests you need to make in order to traverse to the next page. Make sure you pay attention to any cookies getting set.

I've had really good success using Mechanize for scraping. It wraps all of the HTTP communication, HTML parsing and searching (using Nokogiri), redirection, and holding onto cookies. But it doesn't know how to execute JavaScript, which is why you will need to figure out what HTTP request to perform on your own.
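As a rough illustration of that investigation step, the pager links themselves already say which control and which argument the postback sends. The sketch below pulls those pairs out of the raw HTML with the standard library only; it is not Mechanize or Nokogiri code. The names POSTBACK_HREF, pager_postbacks and page_arguments are hypothetical, and allowing for quotes written as &#39; is an assumption about how some pages encode the href:

```ruby
# Matches __doPostBack('target','argument') inside an href, with the quotes
# written either literally or as the HTML entity &#39;.
QUOTE = /(?:'|&#39;)/
POSTBACK_HREF = /__doPostBack\(#{QUOTE}([^'&]*)#{QUOTE},#{QUOTE}([^'&]*)#{QUOTE}\)/

# Return the unique [target, argument] pairs found in the page.
def pager_postbacks(html)
  html.scan(POSTBACK_HREF).uniq
end

# Keep only the paging postbacks, i.e. arguments of the form "Page$<n>".
def page_arguments(html)
  pager_postbacks(html).select { |_target, argument| argument.start_with?('Page$') }
end
```

Each pair returned by page_arguments corresponds to the values the browser would send as the event target and event argument when that link is clicked. With Mechanize, the same values would be set on the page's form before submitting it, and the agent would keep the cookies for you.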
