What should I use to crawl many news articles?
I have a natural language processing project, but for it I need to crawl many web articles from sources like Yahoo News, Google News, or blogs...
I'm a Java developer (so I'd rather use Java tools). I guess I could parse each source website myself and extract the articles with HttpClient / XPath, but I'm a bit lazy :) Is there a way to avoid writing a parser per source?
(I'm not only interested in new articles, but also in articles from 2000 to the present.)
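For context, this is roughly the per-source approach I had in mind: fetch a page with Apache HttpClient, then run an extraction rule over the HTML. It is only an untested sketch, the URL is a placeholder, and the extraction step is left as a comment because that is exactly the per-site work I'd like to avoid:

```java
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class ArticleFetcher {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; in practice this would come from a list of article links.
        String url = "https://news.example.com/some-article.html";

        try (CloseableHttpClient client = HttpClients.createDefault();
             CloseableHttpResponse response = client.execute(new HttpGet(url))) {
            String html = EntityUtils.toString(response.getEntity());

            // This is the part I'd rather not write once per source:
            // clean the HTML into a DOM and apply a site-specific XPath/selector
            // to pull out just the article text.
            System.out.println("Fetched " + html.length() + " characters");
        }
    }
}
```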
The hardest part of NLP is getting data you can use. Everything else is just math.
It may be hard to find a large collection of news articles other than on each news source's website because of all the copyright issues involved. If you don't need recent news, your best bet is probably to look at the Linguistic Data Consortium's English Gigaword corpus; if you are at a university, there may already be an existing relationship for you to use the data for free.
If you need to actually crawl and parse websites, for now you'll probably find you have to write specific parsers for the various news websites to make sure you get the right text. However, once more websites start using HTML5, it will be easier to pull out the relevant text through the use of the article tag.
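As a rough illustration of that, here is a minimal sketch that pulls the text of every article element on a page. It assumes jsoup as the HTML parser (my choice for the example, not something the approach requires) and uses a placeholder URL:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ArticleTextExtractor {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; any HTML5 page that wraps its story in <article> will do.
        String url = "https://news.example.com/some-article.html";

        Document doc = Jsoup.connect(url).get();

        // On HTML5 pages the main story is usually wrapped in <article>,
        // so one generic rule can replace a hand-written parser per source.
        for (Element article : doc.select("article")) {
            System.out.println(article.text());
        }
    }
}
```

For older, pre-HTML5 pages you would still be writing per-site rules.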
To do the actual crawling, this previous question can point you in some useful directions.
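If you do end up writing the crawler yourself, the loop itself is fairly small. The sketch below uses jsoup again purely for convenience, starts from a placeholder seed URL, and deliberately ignores things a real crawler must handle (robots.txt, politeness delays, duplicate content):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TinyCrawler {
    public static void main(String[] args) throws Exception {
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add("https://news.example.com/"); // placeholder seed
        Set<String> visited = new HashSet<>();
        int limit = 100; // stop early for this sketch

        while (!frontier.isEmpty() && visited.size() < limit) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue; // already crawled
            }
            try {
                Document doc = Jsoup.connect(url).get();
                // ...hand `doc` to the article extraction step here...
                for (Element link : doc.select("a[href]")) {
                    String next = link.absUrl("href");
                    if (next.startsWith("https://news.example.com/")) {
                        frontier.add(next); // stay on the same site
                    }
                }
            } catch (Exception e) {
                // skip pages that fail to download or parse
            }
        }
    }
}
```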