Need help getting the HTML of a website in Java

Posted 2024-09-12 15:47:12


I got some code from java httpurlconnection cutting off html and I am using pretty much the same code to fetch HTML from websites in Java.
Except for one particular website, which I am unable to make this code work with:

I am trying to get HTML from this website:

http://www.geni.com/genealogy/people/William-Jefferson-Blythe-Clinton/6000000001961474289

But I keep getting junk characters, although it works very well with any other website such as http://www.google.com.

And this is the code that I am using:

public static String PrintHTML(){
    URL url = null;
    try {
        url = new URL("http://www.geni.com/genealogy/people/William-Jefferson-Blythe-Clinton/6000000001961474289");
    } catch (MalformedURLException e1) {
        // TODO Auto-generated catch block
        e1.printStackTrace();
    }
    HttpURLConnection connection = null;
    try {
        connection = (HttpURLConnection) url.openConnection();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6");
    try {
        System.out.println(connection.getResponseCode());
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    String line;
    StringBuilder builder = new StringBuilder();
    BufferedReader reader = null;
    try {
        reader = new BufferedReader(new InputStreamReader(connection.getInputStream()));
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    try {
        while ((line = reader.readLine()) != null) {
            builder.append(line);
            builder.append("\n"); 
        }
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    String html = builder.toString();
    System.out.println("HTML " + html);
    return html;
}

I don't understand why it doesn't work with the URL that I mentioned above.

Any help will be appreciated.


Comments (1)

何以笙箫默 2024-09-19 15:47:12


That site is incorrectly gzipping the response regardless of the client's capabilities. Normally a server should only gzip the response whenever the client supports it (by Accept-Encoding: gzip). You need to ungzip it using GZIPInputStream.

reader = new BufferedReader(new InputStreamReader(new GZIPInputStream(connection.getInputStream()), "UTF-8"));

Note that I also added the right charset to the InputStreamReader constructor. Normally you'd like to extract it from the Content-Type header of the response.
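
A minimal sketch of the whole idea, assuming a UTF-8 fallback when no charset is declared (the class and method names below are placeholders, not from the original answer):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

public class HtmlFetcher {

    public static String fetch(String address) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(address).openConnection();
        connection.setRequestProperty("User-Agent", "Mozilla/5.0");
        // Tell the server we can handle gzip, so a compressed response is legitimate.
        connection.setRequestProperty("Accept-Encoding", "gzip");

        InputStream in = connection.getInputStream();
        // Unwrap only when the response is actually gzip-compressed.
        if ("gzip".equalsIgnoreCase(connection.getContentEncoding())) {
            in = new GZIPInputStream(in);
        }

        // Take the charset from the Content-Type header, e.g. "text/html; charset=UTF-8".
        Charset charset = StandardCharsets.UTF_8; // assumed fallback
        String contentType = connection.getContentType();
        if (contentType != null) {
            for (String param : contentType.split(";")) {
                param = param.trim();
                if (param.toLowerCase().startsWith("charset=")) {
                    charset = Charset.forName(param.substring("charset=".length()));
                }
            }
        }

        // Read the (possibly decompressed) stream line by line into a String.
        StringBuilder builder = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                builder.append(line).append('\n');
            }
        }
        return builder.toString();
    }
}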

For more hints, see also How to use URLConnection to fire and handle HTTP requests? If all you ultimately want is to parse/extract information from the HTML, then I strongly recommend using an HTML parser like Jsoup instead.
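
A short Jsoup sketch, assuming the jsoup library is on the classpath (the CSS selector here is only a placeholder):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupExample {
    public static void main(String[] args) throws Exception {
        // Jsoup negotiates gzip and detects the charset of the response for you.
        Document doc = Jsoup.connect("http://www.geni.com/genealogy/people/William-Jefferson-Blythe-Clinton/6000000001961474289")
                .userAgent("Mozilla/5.0")
                .get();
        System.out.println(doc.title());
        // Extract elements with CSS selectors instead of manual string handling.
        System.out.println(doc.select("h1").text());
    }
}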
