How do I extract text content from a web page?
I'm developing an application in Java that takes textual information from different web pages and summarizes it into one page. For example, suppose the same news story appears on different sites such as Hindu, Times of India, Statesman, etc. My application is supposed to extract the important points from each of these pages and put them together as a single news item. The application is based on concepts from web content mining. As a beginner in this field, I don't know where to start. I have gone through research papers that describe noise removal as the first step in building this application.
So, given a news web page, the very first step is to extract the main news from the page, excluding hyperlinks, advertisements, useless images, etc. My question is: how can I do this? Please point me to some good tutorials that explain how to implement this kind of application using web content mining, or at least give me a hint on how to accomplish it.
Comments (1)
You can use readability or boilerpipe, two open-source tools built for this task. For a tutorial, read the code and documentation of those two projects.
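To give a feel for what "noise removal" means before you dig into those projects, here is a minimal sketch in plain Java. It is not the algorithm used by readability or boilerpipe; it only assumes a much simpler heuristic: split the page into blocks at block-level tags, then keep blocks that are reasonably long and contain little link text. The class name `MainTextSketch`, the method `extractMainBlocks`, and the thresholds used below are all hypothetical, and regex-based HTML handling like this will break on messy real-world markup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of link-density filtering; not taken from any library.
class MainTextSketch {
    // Script and style elements are dropped together with their contents.
    private static final Pattern SCRIPT_STYLE =
        Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    // Block-level tags are treated as boundaries between text blocks.
    private static final Pattern BLOCK_TAG = Pattern.compile(
        "(?i)</?(p|div|li|ul|ol|h[1-6]|td|tr|table|br|section|article|nav|header|footer)\\b[^>]*>");
    // Anchor elements, used to measure how much of a block is link text.
    private static final Pattern ANCHOR =
        Pattern.compile("(?is)<a\\b[^>]*>(.*?)</a\\s*>");
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    static int countWords(String s) {
        String t = s.trim();
        return t.isEmpty() ? 0 : t.split("\\s+").length;
    }

    static String stripTags(String s) {
        return ANY_TAG.matcher(s).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    // Keeps blocks with at least minWords words whose share of words
    // inside <a> elements is at most maxLinkDensity.
    static List<String> extractMainBlocks(String html, int minWords, double maxLinkDensity) {
        String cleaned = SCRIPT_STYLE.matcher(html).replaceAll(" ");
        List<String> kept = new ArrayList<>();
        for (String block : BLOCK_TAG.split(cleaned)) {
            String text = stripTags(block);
            int words = countWords(text);
            if (words < minWords) {
                continue; // too short to be article text
            }
            int linkWords = 0;
            Matcher m = ANCHOR.matcher(block);
            while (m.find()) {
                linkWords += countWords(stripTags(m.group(1)));
            }
            if ((double) linkWords / words <= maxLinkDensity) {
                kept.add(text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String html = "<body><ul><li><a href='/'>Home page and top stories today</a></li></ul>"
            + "<p>The city council approved the new transit budget on Monday after a long public debate.</p>"
            + "</body>";
        // Thresholds (5 words, 30% link text) are arbitrary values chosen for illustration.
        for (String block : extractMainBlocks(html, 5, 0.3)) {
            System.out.println(block);
        }
    }
}
```

With the sample page in `main`, the navigation item is discarded because all of its words sit inside a link, and only the article paragraph is printed. Real extractors use richer features than these two, so treat this only as a starting point for understanding the idea.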