Checking for broken links

Posted on 2024-11-07 12:53:25

I am trying to find all the broken links on a web page using Java. Here is the code:

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class BrokenLinkChecker {

        // Returns true if the link, after following any Location header,
        // answers with HTTP 200; any exception counts as "not live".
        private static boolean isLive(String link) {
            HttpURLConnection urlconn = null;
            try {
                URL url = new URL(link);
                urlconn = (HttpURLConnection) url.openConnection();
                urlconn.setConnectTimeout(10000);
                urlconn.setRequestMethod("GET");
                urlconn.connect();
                String redirlink = urlconn.getHeaderField("Location");
                System.out.println(urlconn.getHeaderFields());
                // Follow the redirect target unless it points back at the same URL.
                if (redirlink != null && !url.toExternalForm().equals(redirlink))
                    return isLive(redirlink);
                else
                    return urlconn.getResponseCode() == HttpURLConnection.HTTP_OK;
            } catch (Exception e) {
                System.out.println(e.getMessage());
                return false;
            } finally {
                if (urlconn != null)
                    urlconn.disconnect();
            }
        }

        public static void main(String[] s) {
            String link = "http://www.somefakesite.net";
            System.out.println(isLive(link));
        }
    }

Code referred from http://nscraps.com/Java/146-program-code-broken-link-checker.htm.

This code gives an HTTP 200 status for all web pages, including the broken ones. For example,
http://www.somefakesite.net/ gives the following header fields:

{null=[HTTP/1.1 200 OK], Date=[Sun, 15 May 2011 18:51:29 GMT], Transfer-Encoding=[chunked], Keep-Alive=[timeout=4, max=100], Connection=[Keep-Alive], Content-Type=[text/html], Server=[Apache/2.2.15 (Win32) PHP/5.2.12], X-Powered-By=[PHP/5.2.9-1]}

Such a site does not exist, so how can I classify it as a broken link?

Comments (1)

偏爱你一生 2024-11-14 12:53:25

Maybe the issue is that many web servers and DNS providers now detect those "broken" links and redirect you to their own "not found" pages.

Test it against a URL that you know sends the 404 code (one for which the browser shows its original error message).
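
If no public URL with a guaranteed 404 is at hand, one option is to serve one locally. The following is only a sketch under that assumption: the class name `Known404Check`, the helpers `statusOf` and `probeLocal404`, and the path `/missing-page` are made up for illustration. It uses the JDK's built-in `com.sun.net.httpserver.HttpServer` to answer every request with 404 and disables redirect following so the raw status code is visible.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;

class Known404Check {

    // Returns the raw status code of a single GET, without following redirects.
    static int statusOf(String link) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(link).openConnection();
        conn.setInstanceFollowRedirects(false);
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    // Starts a throwaway server on a free loopback port that answers 404 to
    // every request, queries it once, and returns the status the client saw.
    static int probeLocal404() {
        HttpServer server = null;
        try {
            server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                exchange.sendResponseHeaders(404, -1); // -1 means no response body
                exchange.close();
            });
            server.start();
            int port = server.getAddress().getPort();
            // "/missing-page" is an arbitrary, hypothetical path.
            return statusOf("http://127.0.0.1:" + port + "/missing-page");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (server != null)
                server.stop(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(probeLocal404());
    }
}
```

If this reports 404 while the original check still reports 200 for nonexistent sites, the 200 is most likely coming from whatever answers in place of the missing site rather than from the checking code.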


EDIT to answer the author's comment (it is too long to fit in a comment):
I do not see an easy answer to your problem, but there are several different types of failure:

  • DNS failures that are redirected (the DNS cannot find the URL, and you get redirected to another page). All such redirections (if you are redirected) will likely go to the same page (provided by your ISP/DNS provider), so you can check for that. Of course, if you try with another ISP/DNS provider the page might be different. If you are not redirected, you will get a connection error.
  • A server with valid DNS entries that is not working (for example, google.com goes down). There should be a connection error.
  • A resource ("page") missing on a server. This is more difficult. A 404 means the link is broken, but if the server does not send it there is little more to do. A redirection might be useful to flag a link as dubious, but it should be checked manually later, because redirections are not only used for catching missing links (for example, www.google.com redirects me to www.google.es).
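
The three cases above could be sketched roughly as follows. This is an illustration, not a complete checker: `LinkClassifier`, `LinkStatus`, `classify`, `check`, and the `catchAllUrl` parameter (the address of your ISP/DNS provider's "not found" page, which you would have to find out yourself) are all hypothetical names. Note that a catch-all page served directly with a 200 and no redirect is not detected by this approach.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

class LinkClassifier {

    enum LinkStatus { OK, BROKEN, SUSPECT_REDIRECT, UNREACHABLE }

    // Pure decision step: maps a status code and Location header to a verdict.
    // catchAllUrl is an assumed, user-supplied prefix of the provider's
    // "not found" page; pass null if it is unknown.
    static LinkStatus classify(int status, String location, String catchAllUrl) {
        if (status >= 300 && status < 400) {
            boolean toCatchAll = location != null
                    && catchAllUrl != null
                    && location.startsWith(catchAllUrl);
            // Other redirects are only flagged for a manual check later.
            return toCatchAll ? LinkStatus.BROKEN : LinkStatus.SUSPECT_REDIRECT;
        }
        if (status >= 200 && status < 300)
            return LinkStatus.OK;
        return LinkStatus.BROKEN; // 404 and every other non-success code
    }

    // Network step: one GET without following redirects, then classify.
    static LinkStatus check(String link, String catchAllUrl) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) new URL(link).openConnection();
            conn.setInstanceFollowRedirects(false);
            conn.setConnectTimeout(10000);
            conn.setReadTimeout(10000);
            return classify(conn.getResponseCode(),
                    conn.getHeaderField("Location"), catchAllUrl);
        } catch (IOException e) {
            // DNS failures (UnknownHostException), refused connections,
            // and timeouts all end up here.
            return LinkStatus.UNREACHABLE;
        } finally {
            if (conn != null)
                conn.disconnect();
        }
    }

    public static void main(String[] args) {
        // Offline demonstration of the decision step only.
        System.out.println(classify(404, null, null));
        System.out.println(classify(302, "http://www.google.es/", null));
    }
}
```

Keeping the decision in `classify` separate from the network call makes it possible to test the rules without depending on what a particular ISP or DNS provider returns.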