Auto login and web scraping
I have a task where I need to auto-login to and scrape a particular website.

I have seen people suggest HtmlUnit and HttpClient, mostly with Java. HtmlUnit looks like a testing tool, and I am not sure what to do with it. Is there an example that explains auto login and web scraping with HtmlUnit or HttpClient?

I'm a Java developer. Can anyone who works closely with these libraries share any ideas?
Comments (1)
Your problem can be broken down into two parts: logging in to the website, and scraping the data.
So, for the first part:

1. Install the Live HTTP Headers Firefox addon and read all the HTTP headers that your browser sends and receives while you try to log in.
2. Try to send those same headers from your Java code; basically, you have to emulate an HTTP POST request. For that, google "make post request from java".
3. After you have logged in to the website, scrape the data using the API of your choice.
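The login step described above can be sketched with the JDK's built-in `java.net.http.HttpClient` (Java 11+), without any extra libraries. The URL and the form field names (`username`, `password`) below are placeholders; replace them with what you actually see in the headers captured by Live HTTP Headers:

```java
import java.net.CookieManager;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class LoginPost {

    // application/x-www-form-urlencoded encoding of the login form fields
    static String formEncode(Map<String, String> params) {
        return params.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        // Hypothetical field names -- replace with the ones your browser sent.
        Map<String, String> form = new LinkedHashMap<>();
        form.put("username", "alice");
        form.put("password", "secret");

        HttpRequest login = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/login")) // placeholder URL
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formEncode(form)))
                .build();

        // A cookie-aware client keeps the session cookie across requests,
        // so later GETs for the pages you want to scrape stay logged in.
        HttpClient client = HttpClient.newBuilder()
                .cookieHandler(new CookieManager())
                .build();
        try {
            HttpResponse<String> response =
                    client.send(login, HttpResponse.BodyHandlers.ofString());
            System.out.println("login status: " + response.statusCode());
        } catch (java.io.IOException | InterruptedException e) {
            System.out.println("request failed: " + e.getMessage());
        }
    }
}
```

If you need extra headers beyond `Content-Type` (e.g. `Referer` or a CSRF token), add them with further `.header(...)` calls, copying the values from the capture.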
I personally use HtmlCleaner. To scrape the data, you can use XPath expressions with HtmlCleaner. Take a look at XPath + HtmlCleaner examples.
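HtmlCleaner's own API (`TagNode.evaluateXPath`) needs the HtmlCleaner jar on the classpath, but the underlying idea is just XPath over a parsed document. As a dependency-free sketch of that idea, here is the JDK's built-in XPath engine querying a small well-formed page (real-world HTML is usually not well-formed, which is exactly the problem HtmlCleaner fixes before you query it):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XPathScrape {

    // Parse a (well-formed) markup string and return the string value
    // of the first node matched by the XPath expression.
    static String scrape(String markup, String expr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                        markup.getBytes(StandardCharsets.UTF_8)));
        return XPathFactory.newInstance().newXPath().evaluate(expr, doc);
    }

    public static void main(String[] args) throws Exception {
        // Toy page standing in for the HTML you downloaded after login.
        String page = "<html><body>"
                + "<div class=\"price\">42.50</div>"
                + "</body></html>";
        System.out.println(scrape(page, "//div[@class='price']/text()"));
    }
}
```

With HtmlCleaner the flow is the same, except you would call `new HtmlCleaner().clean(html)` first and run the XPath against the resulting `TagNode`.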
You can also use JSoup instead of HtmlCleaner. The advantage of JSoup is that it can handle both the login (POST request) and the data scraping. Take a look here: http://pastebin.com/E0WzpuhF

I know it seems like a lot of work, and I have given you two alternative solutions to your problem, but divide the problem into smaller chunks and then try to solve each one.
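The JSoup flow just mentioned (POST the login form, then scrape with the session cookies) looks roughly like this. It requires the jsoup jar on the classpath, and the URLs, form field names, and CSS selector are placeholders for whatever the target site actually uses:

```java
import java.util.Map;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupLoginScrape {
    public static void main(String[] args) throws Exception {
        // 1. POST the login form; execute() exposes the response cookies.
        Connection.Response login = Jsoup.connect("https://example.com/login") // placeholder
                .data("username", "alice")  // placeholder field names
                .data("password", "secret")
                .method(Connection.Method.POST)
                .execute();
        Map<String, String> cookies = login.cookies();

        // 2. Fetch a protected page, sending the session cookies back.
        Document page = Jsoup.connect("https://example.com/account") // placeholder
                .cookies(cookies)
                .get();

        // 3. Scrape with CSS selectors (JSoup's alternative to XPath).
        for (Element row : page.select("table.orders tr")) { // placeholder selector
            System.out.println(row.text());
        }
    }
}
```

The `connect`/`data`/`method`/`execute`/`cookies` calls are JSoup's standard form-login pattern; only the site-specific values are made up here.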