How do I search a website for broken links using Java?
I would like to scan some websites looking for broken links, preferably using Java. Any hint on how I can start doing this?
(I know there are some websites that do this, but I want to make my own personalized log file)
3 Answers
Writing a web crawler isn't as simple as just reading static HTML: if the page uses JavaScript to modify the DOM, it gets complex. You will also need to watch out for pages you've already visited, aka spider traps. If the site is pure static HTML, then go for it, but if the site uses jQuery and is large, expect it to be complex.
If your site is all static, small, and has little or no JS, then use the answers already listed.
Or
You could use Heritrix and then parse its crawl.log for 404s (see the Heritrix documentation on crawl.log).
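For example, here is a rough sketch of scanning that log. It assumes the usual crawl.log layout in which the fetch status code is the second whitespace-separated field and the URI is the fourth; verify the field order against the Heritrix documentation for your version before relying on the indices.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class CrawlLogScanner {
    public static void main(String[] args) throws IOException {
        // Assumed column order: timestamp, fetch status code, size, URI, ...
        try (Stream<String> lines = Files.lines(Paths.get("crawl.log"))) {
            lines.map(line -> line.trim().split("\\s+"))
                 .filter(f -> f.length > 3 && f[1].equals("404"))
                 .forEach(f -> System.out.println("Broken link: " + f[3]));
        }
    }
}
```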
Or, if you must write your own:
You could use something like HTMLUnit (it has a JavaScript engine) to load the page, then query the DOM for links. Place each link in an "unvisited" queue, then pull links from the unvisited queue to get your next URL to load; if a page fails to load, report it.
To avoid duplicate pages (spider traps), you could hash each link and keep a hash table of visited pages (see CityHash). Before placing a link into the unvisited queue, check it against the visited hash table.
To avoid leaving your site, check that the URL is in a safe domain list before adding it to the unvisited queue. If you want to confirm that the off-domain links are good, keep them in an offDomain queue, then later load each link from this queue using URL.getContent(url) to see if it works (faster than using HTMLUnit, and you don't need to parse those pages anyway).
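Pulling those pieces together, here is a minimal sketch of the queue-based crawl. It assumes HtmlUnit 2.x on the classpath (package com.gargoylesoftware.htmlunit) and a version recent enough for WebClient to be AutoCloseable; the class name LinkChecker and the safeDomain parameter are my own, and a plain HashSet of URL strings stands in for the CityHash table described above.

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

public class LinkChecker {

    public static void crawl(String startUrl, String safeDomain) {
        Queue<String> unvisited = new ArrayDeque<>(); // URLs waiting to be loaded
        Set<String> visited = new HashSet<>();        // URLs already queued or loaded
        unvisited.add(startUrl);
        visited.add(startUrl);

        try (WebClient client = new WebClient()) {
            client.getOptions().setThrowExceptionOnScriptError(false);

            while (!unvisited.isEmpty()) {
                String url = unvisited.poll();
                HtmlPage page;
                try {
                    page = client.getPage(url);       // also runs the page's JavaScript
                } catch (Exception e) {
                    System.out.println("BROKEN: " + url + " (" + e.getMessage() + ")");
                    continue;
                }
                for (HtmlAnchor anchor : page.getAnchors()) {
                    String href;
                    try {
                        href = page.getFullyQualifiedUrl(anchor.getHrefAttribute()).toString();
                    } catch (Exception e) {
                        continue;                     // skip malformed hrefs
                    }
                    // stay on our own domain and never queue the same URL twice
                    if (href.contains(safeDomain) && visited.add(href)) {
                        unvisited.add(href);
                    }
                }
            }
        }
    }

    public static void main(String[] args) {
        crawl("https://example.com/", "example.com");
    }
}
```

An offDomain queue for checking external links could be bolted onto the same loop; only what gets enqueued changes.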
Write a function which recursively checks links.
Pseudo Code:
Depending on the link, you may have to complete the URL passed to the next recursion by resolving it relative to the current URL.
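The pseudocode block itself isn't shown above, so here is one possible shape for such a recursive check in Java. Using jsoup to pull out the a[href] values and new URL(base, href) to resolve relative links are my own choices, and the sketch only handles http(s) links.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;

public class RecursiveLinkChecker {

    private final Set<String> visited = new HashSet<>(); // pages already checked
    private final String ownHost;

    public RecursiveLinkChecker(String ownHost) {
        this.ownHost = ownHost;
    }

    public void check(URL page) {
        if (!visited.add(page.toString())) {
            return;                                   // already processed this page
        }
        Elements anchors;
        try {
            anchors = Jsoup.connect(page.toString()).get().select("a[href]");
        } catch (IOException e) {
            System.out.println("BROKEN PAGE: " + page);
            return;
        }
        for (Element a : anchors) {
            try {
                // resolve the href relative to the current page's URL
                URL link = new URL(page, a.attr("href"));
                HttpURLConnection conn = (HttpURLConnection) link.openConnection();
                conn.setRequestMethod("HEAD");        // the status code is enough here
                if (conn.getResponseCode() >= 400) {
                    System.out.println("BROKEN: " + link + " linked from " + page);
                }
                if (link.getHost().equals(ownHost)) {
                    check(link);                      // recurse only within our own site
                }
            } catch (Exception e) {
                System.out.println("COULD NOT CHECK: " + a.attr("href"));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        new RecursiveLinkChecker("example.com").check(new URL("https://example.com/"));
    }
}
```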
Find each <a> tag, get its content, and attempt to connect to it. If necessary, repeat recursively if the URL from the <a> belongs to your site. Make sure to store URLs that you have already processed in a map so you don't do it more than once.
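Here is a small sketch of just the connect-and-remember part of this approach; the method name isAlive and the use of a HashMap as the "already processed" store are assumptions on my part.

```java
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

public class LinkStatusCache {

    // remembers URLs we already checked so we never connect to the same one twice
    private final Map<String, Boolean> processed = new HashMap<>();

    public boolean isAlive(String url) {
        return processed.computeIfAbsent(url, u -> {
            try {
                HttpURLConnection conn = (HttpURLConnection) new URL(u).openConnection();
                conn.setRequestMethod("HEAD");   // we only need the status, not the body
                return conn.getResponseCode() < 400;
            } catch (Exception e) {
                return false;                    // could not connect: treat as broken
            }
        });
    }

    public static void main(String[] args) {
        LinkStatusCache cache = new LinkStatusCache();
        System.out.println(cache.isAlive("https://example.com/"));
        System.out.println(cache.isAlive("https://example.com/")); // answered from the map
    }
}
```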