Checking for broken links
I am trying to find all the broken links in a web page using Java. Here is the code:
import java.net.HttpURLConnection;
import java.net.URL;

public class BrokenLinkChecker {

    // Returns true if the link (after following any Location header) answers with HTTP 200.
    private static boolean isLive(String link) {
        HttpURLConnection urlconn = null;
        try {
            URL url = new URL(link);
            urlconn = (HttpURLConnection) url.openConnection();
            urlconn.setConnectTimeout(10000);
            urlconn.setRequestMethod("GET");
            urlconn.connect();
            // Follow a redirect manually when the server names a different target.
            String redirlink = urlconn.getHeaderField("Location");
            System.out.println(urlconn.getHeaderFields());
            if (redirlink != null && !url.toExternalForm().equals(redirlink))
                return isLive(redirlink);
            else
                return urlconn.getResponseCode() == HttpURLConnection.HTTP_OK;
        } catch (Exception e) {
            // Malformed URLs, DNS failures and timeouts all count as "not live".
            System.out.println(e.getMessage());
            return false;
        } finally {
            if (urlconn != null)
                urlconn.disconnect();
        }
    }

    public static void main(String[] s) {
        String link = "http://www.somefakesite.net";
        System.out.println(isLive(link));
    }
}
The code is adapted from http://nscraps.com/Java/146-program-code-broken-link-checker.htm.
This code reports an HTTP 200 status for every web page, including the broken ones. For example,
http://www.somefakesite.net/ returns the following header fields:
{null=[HTTP/1.1 200 OK], Date=[Sun, 15 May 2011 18:51:29 GMT], Transfer-Encoding=[chunked], Keep-Alive=[timeout=4, max=100], Connection=[Keep-Alive], Content-Type=[text/html], Server=[Apache/2.2.15 (Win32) PHP/5.2.12], X-Powered-By=[PHP/5.2.9-1]}
Such sites do not exist, so how can I classify them as broken links?
Comments (1)
Maybe the issue is that many web servers and DNS providers now detect these "broken" links and redirect you to their own "not found" pages.
Test the code against a URL that you know returns a 404 status code (one for which the browser shows its original error message).
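One way to run that test is to look at the raw status code of a single request, without following redirects, so that a provider's substitute "not found" page cannot hide the first response. The following is only a minimal sketch under that assumption: the class name LinkProbe, the LinkStatus categories and the helper names are made up for illustration, and the default URL is just the placeholder from the question. Pass a URL that you know returns 404 as the first argument.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

class LinkProbe {

    // Hypothetical categories, introduced only for this sketch.
    enum LinkStatus { OK, REDIRECT, CLIENT_ERROR, SERVER_ERROR, UNREACHABLE }

    // Maps a raw HTTP status code to a category; anything outside 200-599
    // (including -1) is treated as "no usable response".
    static LinkStatus classify(int code) {
        if (code >= 200 && code < 300) return LinkStatus.OK;
        if (code >= 300 && code < 400) return LinkStatus.REDIRECT;
        if (code >= 400 && code < 500) return LinkStatus.CLIENT_ERROR;
        if (code >= 500 && code < 600) return LinkStatus.SERVER_ERROR;
        return LinkStatus.UNREACHABLE;
    }

    // Sends one GET request without following redirects and returns the
    // status code of that first response, or -1 if no response was obtained.
    static int fetchStatus(String link) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) new URL(link).openConnection();
            conn.setInstanceFollowRedirects(false); // keep 3xx visible
            conn.setConnectTimeout(10000);
            conn.setReadTimeout(10000);
            conn.setRequestMethod("GET");
            return conn.getResponseCode();
        } catch (IOException | ClassCastException e) {
            // Malformed URL, DNS failure, timeout, or a non-HTTP scheme.
            return -1;
        } finally {
            if (conn != null) conn.disconnect();
        }
    }

    public static void main(String[] args) {
        // Placeholder default taken from the question; replace it with a
        // URL that is known to answer 404.
        String link = args.length > 0 ? args[0] : "http://www.somefakesite.net/";
        int code = fetchStatus(link);
        System.out.println(link + " -> " + code + " (" + classify(code) + ")");
    }
}
```

If a known-404 URL is reported as CLIENT_ERROR while the fake host still comes back as OK or REDIRECT, that would suggest the 200 comes from a substitute page rather than from a bug in the status check itself.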
EDIT to answer the comment by the author (as it is too long to fit in a comment):
I do not see an easy answer to your problem, but there are several different types of failure: