How do I scrape an ASP.NET site that does all interaction as postbacks?
Using Python, I built a scraper for an ASP.NET site (specifically a Jenzabar course search portlet) that creates a new session, loads the first search page, then simulates a search by posting back the required fields. However, something changed that I can't identify, and now I get HTTP 500 responses to every request. As far as I can see, there are no new fields in the browser's POST data.
I would ideally like to figure out how to fix my own scraper, but that is probably difficult to ask about on StackOverflow without including a ton of specific context, so I was wondering if there was a way to treat the page as a black box and just fire click events on the postback links I want, then get the HTML of the result.
I saw some answers on here about scraping with JavaScript, but they mostly seem to focus on waiting for the JavaScript to load and then returning a normalized representation of the page. I want to simulate the browser actually clicking on the links and following the same path to execute the request.
Comments (3)
Without knowing any specifics, my hunch is that you are using a hardcoded session ID and that the web server's app domain was recycled, creating new encryption/decryption keys and rendering your hardcoded session ID (which was encrypted with the old keys) useless.
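If that hunch applies, one way around it is to hardcode nothing: start from a fresh cookie jar on every run and re-read the hidden fields (`__VIEWSTATE`, `__EVENTVALIDATION`, and so on) from the page just loaded before posting back. Below is a minimal standard-library sketch of that idea. The function names and the `ctl00$btnSearch` target are made up for illustration, and it assumes the page posts back to its own URL, which may not hold for the Jenzabar portlet.

```python
import urllib.request
from html.parser import HTMLParser
from http.cookiejar import CookieJar
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback(html, event_target, event_argument="", extra=None):
    """Build a POST payload from the hidden fields of the page just loaded."""
    parser = HiddenFieldParser()
    parser.feed(html)
    payload = dict(parser.fields)
    # These two fields are what the page's __doPostBack() script fills in.
    payload["__EVENTTARGET"] = event_target
    payload["__EVENTARGUMENT"] = event_argument
    payload.update(extra or {})
    return payload


def fresh_session_postback(url, event_target, extra=None):
    """GET the page with an empty cookie jar, then POST back to the same URL.

    Assumption: the form posts to the URL it was loaded from.
    """
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )
    with opener.open(url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    data = urlencode(build_postback(html, event_target, extra=extra)).encode("ascii")
    with opener.open(url, data=data) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Something like `fresh_session_postback(search_url, "ctl00$btnSearch", extra={...})` would then run one search with a new session each time. The real control name has to be read from the page's own postback links.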
You could try using Firebug's Net tab to monitor all requests: browse around manually, then diff the requests you generate against the ones your screen scraper generates.
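The diff itself can be scripted once both request bodies have been copied out as URL-encoded strings. Here is a rough sketch; `diff_form_bodies` is a made-up helper, and it only compares form fields, so cookies and headers would still need to be checked by eye.

```python
from urllib.parse import parse_qsl


def diff_form_bodies(browser_body, scraper_body):
    """Compare two application/x-www-form-urlencoded request bodies.

    Repeated field names collapse to their last value, which is good
    enough for spotting a missing or stale hidden field.
    """
    browser = dict(parse_qsl(browser_body, keep_blank_values=True))
    scraper = dict(parse_qsl(scraper_body, keep_blank_values=True))
    shared = browser.keys() & scraper.keys()
    return {
        "missing_from_scraper": sorted(browser.keys() - scraper.keys()),
        "extra_in_scraper": sorted(scraper.keys() - browser.keys()),
        "different_values": sorted(k for k in shared if browser[k] != scraper[k]),
    }
```

Fields such as `__VIEWSTATE` will normally differ between any two sessions, so the more telling entries are usually the missing or extra ones.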
If you are just trying to simulate load, you might want to check out something like Selenium, which runs through a real browser and handles postbacks the way a browser does.
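For the "black box" approach the question asks about, a Selenium script can click the postback link and read back the resulting HTML. The following is only a sketch. It assumes the Selenium 4 Python bindings and a local Firefox with its driver are installed, the function names are invented for illustration, and a slow postback may need an explicit wait before `page_source` reflects the new page.

```python
def click_and_get_html(driver, url, link_text, by="link text"):
    """Load a page, click a link by its visible text, return the new HTML.

    `driver` is any WebDriver-like object. The default locator string
    "link text" is the value of Selenium's By.LINK_TEXT.
    """
    driver.get(url)
    # The browser runs the link's own __doPostBack() call, so no hidden
    # fields have to be rebuilt by hand.
    driver.find_element(by, link_text).click()
    return driver.page_source


def scrape_in_firefox(url, link_text):
    """Run click_and_get_html in a real Firefox session (needs selenium)."""
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    try:
        return click_and_get_html(driver, url, link_text, by=By.LINK_TEXT)
    finally:
        driver.quit()
```

Keeping the click logic separate from browser start-up means the same function can be reused with a different browser driver.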