Reading website content into a String

Posted on 2024-11-05 05:44:46

I'm currently working on a class that reads the contents of the website specified by a URL. I'm just getting started with java.io and java.net, so I'd like some feedback on my design.

Usage:

TextURL url = new TextURL(urlString);
String contents = url.read();

My code:

package pl.maciejziarko.util;

import java.io.*;
import java.net.*;

public final class TextURL
{
    private static final int BUFFER_SIZE = 1024 * 10;
    private static final int ZERO = 0;
    private final byte[] dataBuffer = new byte[BUFFER_SIZE];
    private final URL urlObject;

    public TextURL(String urlString) throws MalformedURLException
    {
        this.urlObject = new URL(urlString);
    }

    public String read() 
    {
        final StringBuilder sb = new StringBuilder();

        try
        {
            final BufferedInputStream in =
                    new BufferedInputStream(urlObject.openStream());

            int bytesRead = ZERO;

            while ((bytesRead = in.read(dataBuffer, ZERO, BUFFER_SIZE)) >= ZERO)
            {
                sb.append(new String(dataBuffer, ZERO, bytesRead));
            }
        }
        catch (UnknownHostException e)
        {
            return null;
        }
        catch (IOException e)
        {
            return null;
        }

        return sb.toString();
    }

    //Usage:
    public static void main(String[] args)
    {
        try
        {
            TextURL url = new TextURL("http://www.flickr.com/explore/interesting/7days/");
            String contents = url.read();

            if (contents != null)
                System.out.println(contents);
            else
                System.out.println("ERROR!");
        }
        catch (MalformedURLException e)
        {
            System.out.println("Check you the url!");
        }
    }
}

My question is:
Is it a good way to achieve what I want? Are there any better solutions?

I particularly don't like sb.append(new String(dataBuffer, ZERO, bytesRead)); but I wasn't able to express it in a different way. Is it good to create a new String on every iteration? I suppose not.

Any other weak points?

Thanks in advance!

Comments (6)

内心激荡 2024-11-12 05:44:46

Consider using URLConnection instead. Furthermore you might want to leverage IOUtils from Apache Commons IO to make the string reading easier too. For example:

URL url = new URL("http://www.example.com/");
URLConnection con = url.openConnection();
InputStream in = con.getInputStream();
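// NOTE: getContentEncoding() is the wrong call here. It returns the Content-Encoding
// header (for example "gzip"), not the character set. The charset is carried in
// con.getContentType(), which returns something like "text/html; charset=UTF-8",
// so that value must be parsed to extract the actual encoding.
String encoding = con.getContentEncoding();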
encoding = encoding == null ? "UTF-8" : encoding;
String body = IOUtils.toString(in, encoding);
System.out.println(body);

If you don't want to use IOUtils, I'd probably rewrite the IOUtils.toString line above as something like:

ByteArrayOutputStream baos = new ByteArrayOutputStream();
byte[] buf = new byte[8192];
int len = 0;
while ((len = in.read(buf)) != -1) {
    baos.write(buf, 0, len);
}
String body = new String(baos.toByteArray(), encoding);
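
The comment in the first snippet says the charset has to be parsed out of the Content-Type value. A minimal sketch of that parsing step follows; the class name `ContentTypeCharset` and its `parse` method are hypothetical helpers introduced here for illustration, and the sketch assumes the header has the usual `type/subtype; charset=...` shape.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;
import java.util.Locale;

// Hypothetical helper (not part of the answer above): pulls the charset
// parameter out of a Content-Type header value.
final class ContentTypeCharset {

    private ContentTypeCharset() {}

    /**
     * Returns the charset named in a value such as "text/html; charset=UTF-8".
     * Falls back to the given default when the header is null, carries no
     * charset parameter, or names a charset this JVM does not know.
     */
    static Charset parse(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim();
                // The value may be quoted, as in charset="utf-8".
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(parse("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(parse("text/html", StandardCharsets.UTF_8));
    }
}
```

With a helper like this, the decoding step could become `new String(baos.toByteArray(), ContentTypeCharset.parse(con.getContentType(), StandardCharsets.UTF_8))`. Pages that declare their charset only in an HTML meta tag would still need separate handling.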

逆流 2024-11-12 05:44:46

I highly recommend using a dedicated library, like HtmlParser:

Parser parser = new Parser (url);
NodeList list = parser.parse (null);
System.out.println (list.toHtml ());

Writing your own HTML parser is a waste of time. HtmlParser is available as a Maven dependency; see its JavaDoc to dig into its features.

Looking at the following sample should be convincing:

Parser parser = new Parser(url);
NodeList movies = parser.extractAllNodesThatMatch(
    new AndFilter(new TagNameFilter("div"),
    new HasAttributeFilter("class", "movie")));

烟雨凡馨 2024-11-12 05:44:46

Unless this is some sort of exercise that you want to code for the sake of learning ... I would not reinvent the wheel and I would use HttpURLConnection.

HttpURLConnection provides good encapsulation for dealing with the HTTP protocol. For instance, your code doesn't work with HTTP redirects; HttpURLConnection would handle that for you.

对你再特殊 2024-11-12 05:44:46

You can wrap your InputStream in an InputStreamReader and use its read() method to read character data directly (note that you should specify the encoding when creating the Reader, but finding out the encoding of arbitrary URLs is non-trivial). Then simply call sb.append() with the char[] you just read (and the correct offset and length).
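
The Reader-based loop described above might look like the following minimal sketch. The class name `ReaderSketch` and the method `readAll` are hypothetical, and the caller is assumed to already know the charset. Besides avoiding a new String per iteration, this sidesteps a subtle problem with decoding each byte chunk on its own: a multi-byte character that straddles two chunks would be decoded incorrectly, whereas the Reader carries decoder state across reads.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating the InputStreamReader approach.
final class ReaderSketch {

    private ReaderSketch() {}

    /** Reads the whole stream as text in the given charset and closes it. */
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[8192];
        // try-with-resources closes the reader (and the wrapped stream),
        // which the code in the question never does.
        try (Reader reader = new InputStreamReader(in, charset)) {
            int n;
            while ((n = reader.read(buf, 0, buf.length)) != -1) {
                // Append the decoded chars directly; no intermediate String.
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stream stands in for urlObject.openStream().
        byte[] bytes = "h\u00e9llo w\u00f6rld".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8));
    }
}
```

In the question's class, `read()` could delegate to something like `readAll(urlObject.openStream(), charset)` and either propagate the IOException or translate it, instead of returning null.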

要走干脆点 2024-11-12 05:44:46

I know this is an old question, but I'm sure other people will find it too.

If you don't mind an additional dependency, here's a very simple way:

Jsoup.connect("http://example.com/").get().toString()

You'll need the Jsoup library, which you can quickly add with Maven or Gradle. It also lets you manipulate the contents of the page and find specific nodes.

池予 2024-11-12 05:44:46

Hey, please use these lines of code; they should help you.

<%@ page import="java.io.*, java.net.*" %>
<!DOCTYPE html>
<html>
    <head>
        <script src="http://ajax.googleapis.com/ajax/libs/jquery/1.10.2/jquery.min.js"></script>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <title>JSP Page</title>
    </head>
    <body>
        <h1>Hello World!</h1>
<%
        // Fetch the page and collect it line by line (assumes the remote page is UTF-8)
        URL uri = new URL("Your url");
        URLConnection ec = uri.openConnection();
        BufferedReader in = new BufferedReader(new InputStreamReader(
                ec.getInputStream(), "UTF-8"));
        String inputLine;
        StringBuilder a = new StringBuilder();
        while ((inputLine = in.readLine()) != null)
            a.append(inputLine).append('\n');  // readLine() strips the line terminator
        in.close();

        out.println(a.toString());
%>
    </body>
</html>