Automatically generating Java code for HTTP screen scraping
I need to screen scrape some data from a website because it isn't available via their web service. When I've needed to do this previously, I've written the Java code myself using Apache's HTTP client library to make the relevant HTTP calls to download the data. I figured out which calls I needed to make by clicking through the relevant screens in a browser while using the Charles web proxy to log the corresponding HTTP calls.
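For reference, the kind of hand-written code I end up with looks roughly like this minimal sketch (assuming HttpClient 4.x; the URL is just a placeholder for one of the calls recorded with Charles):

    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    public class DownloadPage {
        public static void main(String[] args) throws Exception {
            // Placeholder URL standing in for a call recorded via the Charles proxy
            String url = "http://example.com/some/data/page";

            try (CloseableHttpClient client = HttpClients.createDefault();
                 CloseableHttpResponse response = client.execute(new HttpGet(url))) {
                // Read the response body so it can be parsed afterwards
                String body = EntityUtils.toString(response.getEntity());
                System.out.println(body);
            }
        }
    }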
As you can imagine, this is a fairly tedious process, and I'm wondering if there's a tool that can actually generate the Java code that corresponds to a browser session. I expect the generated code wouldn't be as pretty as code written manually, but I could always tidy it up afterwards. Does anyone know if such a tool exists? Selenium is one possibility I'm aware of, though I'm not sure if it supports this exact use case.
Thanks,
Don
Comments (5)
I would also add +1 for HtmlUnit, since its functionality is very powerful: if you need behaviour 'as though a real browser were scraping and using the page', that's definitely the best option available. HtmlUnit executes (if you want it to) the JavaScript in the page.
It currently has full-featured support for all the main JavaScript libraries and will execute JS code using them. Correspondingly, you can get handles to the JavaScript objects in the page programmatically within your test.
If, however, the scope of what you are trying to do is smaller, more along the lines of reading some of the HTML elements, and you don't much care about JavaScript, then using NekoHTML should suffice. It's similar to JDom in giving programmatic (rather than XPath) access to the tree. You would probably need to use Apache's HttpClient to retrieve pages.
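To give a flavour of the HtmlUnit side, a minimal sketch might look like the following (assuming a recent HtmlUnit 2.x where WebClient is AutoCloseable; the URL and the window.someAppObject expression are just placeholders):

    import com.gargoylesoftware.htmlunit.ScriptResult;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class HtmlUnitSketch {
        public static void main(String[] args) throws Exception {
            try (WebClient webClient = new WebClient()) {
                // Let HtmlUnit run the page's JavaScript, as a real browser would
                webClient.getOptions().setJavaScriptEnabled(true);

                // Placeholder URL
                HtmlPage page = webClient.getPage("http://example.com/");
                System.out.println(page.getTitleText());

                // Grab a handle to a JavaScript object the page defines
                ScriptResult result = page.executeJavaScript("window.someAppObject");
                Object handle = result.getJavaScriptResult();
                System.out.println(handle);
            }
        }
    }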
The manageability.org blog has an entry which lists a whole bunch of web page scraping tools for Java. However, I do not seem to be able to reach it right now, but I did find a text-only representation in Google's cache here.
You should take a look at HtmlUnit - it was designed for testing websites but works great for screen scraping and navigating through multiple pages. It takes care of cookies and other session-related stuff.
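For example, walking from one page to the next might look something like this (a sketch only, again assuming a recent HtmlUnit 2.x; the URL and the link text are hypothetical):

    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public class NavigationSketch {
        public static void main(String[] args) throws Exception {
            try (WebClient webClient = new WebClient()) {
                // Hypothetical starting page
                HtmlPage first = webClient.getPage("http://example.com/start");

                // Cookies and session state from the first request carry over automatically
                HtmlAnchor nextLink = first.getAnchorByText("Next page"); // hypothetical link text
                HtmlPage second = nextLink.click();

                System.out.println(second.asXml());
            }
        }
    }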
I would say I personally like to use HtmlUnit and Selenium as my two favorite tools for screen scraping.
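As a rough illustration of the Selenium side, a sketch assuming the Java WebDriver API with the headless HtmlUnitDriver (the URL and CSS selector are placeholders):

    import org.openqa.selenium.By;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.WebElement;
    import org.openqa.selenium.htmlunit.HtmlUnitDriver;

    public class SeleniumSketch {
        public static void main(String[] args) {
            // HtmlUnitDriver gives a headless browser; any other WebDriver implementation works too
            WebDriver driver = new HtmlUnitDriver();
            try {
                driver.get("http://example.com/"); // placeholder URL

                // Pull the text of an element; the CSS selector is a placeholder
                WebElement cell = driver.findElement(By.cssSelector("table.data td"));
                System.out.println(cell.getText());
            } finally {
                driver.quit();
            }
        }
    }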
A tool called The Grinder allows you to script a session to a site by going through its proxy. The output is Python (runnable in Jython).