Efficient web crawling
What is a good way in Java to implement the following approach I have come up with for fetching NEW web data for my database? Would comparing a bunch of array elements be the way to go? Some ideas would be great.
Crawler imdbCrawler = new Crawler(files.getLocalTitles("C:\\Movies"));
// add these titles to the database
// query the existing DB titles, get the directory titles, and crawl only the directory titles that are not already in the DB
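The last comment above describes crawling only the titles that are on disk but not yet in the database, which amounts to a set difference rather than an element-by-element array comparison. A minimal sketch of that idea follows, assuming titles are plain strings that match exactly; `TitleDiff` and `titlesToCrawl` are hypothetical names, not part of the `Crawler` code above.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of the original Crawler code.
class TitleDiff {

    // Returns the local titles that are not yet in the database,
    // preserving the order in which the local titles were given.
    static Set<String> titlesToCrawl(Collection<String> localTitles,
                                     Collection<String> dbTitles) {
        // Copy the DB titles into a HashSet so each lookup is O(1) on average.
        Set<String> known = new HashSet<>(dbTitles);
        Set<String> pending = new LinkedHashSet<>(localTitles);
        pending.removeAll(known); // set difference: local minus known
        return pending;
    }

    public static void main(String[] args) {
        // Sample data, assumed purely for the example.
        List<String> local = List.of("Alien", "Blade Runner", "Heat");
        List<String> inDb = List.of("Alien");
        System.out.println(titlesToCrawl(local, inDb)); // prints [Blade Runner, Heat]
    }
}
```

The result could then be passed to the crawler in place of the full directory listing. If titles may differ in case or whitespace, they would need normalizing before the comparison.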
You know, IMDB offers its database for free, though with some caveats for commercial use, of course.
Second, using some kind of structure/Collection would be best: if an object in the collection HAS data, you've already crawled it. If it does not, it still needs crawling. If you find a new link, just add it to the collection (with no data), and your data collection thread will find it later.
Dekker
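The collection idea in the answer above could be sketched roughly as follows. This is only one possible reading of it, assuming the crawled data is a plain string; the `CrawlFrontier` class and its methods are hypothetical names introduced for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: a collection where an entry without data still needs crawling.
class CrawlFrontier {

    // Key: link or title. Value: crawled data, or Optional.empty() if not yet crawled.
    private final Map<String, Optional<String>> entries = new LinkedHashMap<>();

    // Registers a newly found link with no data; never overwrites existing data.
    synchronized void addLink(String link) {
        entries.putIfAbsent(link, Optional.empty());
    }

    // Stores the crawled data, which marks the link as done.
    synchronized void store(String link, String data) {
        entries.put(link, Optional.of(data));
    }

    // True only if the link is known and already has data.
    synchronized boolean isCrawled(String link) {
        Optional<String> value = entries.get(link);
        return value != null && value.isPresent();
    }

    // Returns the first link that has no data yet, if any (simple linear scan).
    synchronized Optional<String> nextUncrawled() {
        for (Map.Entry<String, Optional<String>> e : entries.entrySet()) {
            if (!e.getValue().isPresent()) {
                return Optional.of(e.getKey());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        CrawlFrontier frontier = new CrawlFrontier();
        frontier.addLink("Alien");
        frontier.addLink("Heat");
        frontier.store("Alien", "placeholder data"); // sample value, assumed for the example
        frontier.addLink("Alien");                   // already crawled, so nothing changes
        System.out.println(frontier.nextUncrawled()); // prints Optional[Heat]
    }
}
```

The methods are synchronized because the answer mentions a separate data collection thread. The linear scan in `nextUncrawled` keeps the sketch short; a separate queue of pending links would likely scale better for large crawls.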