Generic article extraction from web pages
I am about to begin working on article extraction.
The task is to extract hotel reviews posted on different web pages (e.g. 1. http://www.tripadvisor.ca/Hotel_Review-g32643-d1097955-Reviews-San_Mateo_County_Memorial_Park_Campground-Loma_Mar_California.html, 2. http://www.travelpod.com/hotel/Comfort_Suites_Sfo_Airport-San_Mateo.html )
I need to do this in Java, and I have only been working with Java for the past couple of months.
Here are my questions:
Is it possible to extract only the reviews from different web pages in a generic way?
Please let me know if there are any APIs that support this task in Java.
Also, please share any thoughts or sources that would help me accomplish the task described above.
UPDATE
If any related examples are available online, please post them, since they could be of great use.
Comments (1)
You probably need a screen-scraping utility for Java such as TagSoup or NekoHTML. JSoup is also popular.
However, there is also a bigger legal consideration when extracting data from a third-party website like TripAdvisor. Does their policy allow it?
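To give a rough idea of what the scraping step looks like in Java, here is a minimal sketch. To stay dependency-free it uses the lenient HTML parser that ships with the JDK (javax.swing.text.html) as a stand-in for TagSoup, NekoHTML or JSoup; it is not any of those libraries' APIs. The class name ReviewScraper, the method extractByClass, the "review" class name and the sample markup are all hypothetical. Real pages use their own markup, so the class to look for has to be found by inspecting each site, and the site's terms of use still apply.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text of every element carrying a given class.
class ReviewScraper {

    static List<String> extractByClass(String html, String cssClass) {
        List<String> blocks = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private int depth = 0;                       // > 0 while inside a matching element
            private final StringBuilder text = new StringBuilder();

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (depth > 0) {
                    depth++;                             // nested tag inside a match
                } else if (hasClass(attrs, cssClass)) {
                    depth = 1;                           // entering a matching element
                    text.setLength(0);
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (depth > 0 && --depth == 0) {         // leaving the matching element
                    blocks.add(text.toString().trim());
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (depth > 0) {
                    text.append(data).append(' ');       // join text chunks with a space
                }
            }
        };

        try {
            // true = ignore any charset declared in the markup; the input is already a String
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return blocks;
    }

    // True if the tag's class attribute contains cssClass as one of its space-separated names.
    private static boolean hasClass(MutableAttributeSet attrs, String cssClass) {
        Object value = attrs.getAttribute(HTML.Attribute.CLASS);
        return value != null
                && Arrays.asList(value.toString().trim().split("\\s+")).contains(cssClass);
    }

    public static void main(String[] args) {
        // Made-up markup; a real page would be downloaded first and looks different per site.
        String html = "<html><body>"
                + "<div class=\"review\"><p>Clean rooms.</p></div>"
                + "<div class=\"ad\"><p>Buy now</p></div>"
                + "<div class=\"review\"><p>Noisy at night.</p></div>"
                + "</body></html>";
        System.out.println(extractByClass(html, "review"));
    }
}
```

Because every site marks up its reviews differently, a sketch like this only becomes "generic" by passing in a different class name (or a richer rule) per site. A dedicated library such as JSoup would likely make the matching step shorter and more robust than this hand-rolled callback.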