Reading website contents into a string

Posted 2024-11-05 05:44:46

Currently I'm working on a class that can be used to read the contents of the website specified by a URL. I'm just beginning my adventures with java.io and java.net, so I'd like some feedback on my design.

Usage:

TextURL url = new TextURL(urlString);
String contents = url.read();

My code:

package pl.maciejziarko.util;

import java.io.*;
import java.net.*;

public final class TextURL
{
    private static final int BUFFER_SIZE = 1024 * 10;
    private static final int ZERO = 0;
    private final byte[] dataBuffer = new byte[BUFFER_SIZE];
    private final URL urlObject;

    public TextURL(String urlString) throws MalformedURLException
    {
        this.urlObject = new URL(urlString);
    }

    public String read() 
    {
        final StringBuilder sb = new StringBuilder();

        try
        {
            final BufferedInputStream in =
                    new BufferedInputStream(urlObject.openStream());

            int bytesRead = ZERO;

            while ((bytesRead = in.read(dataBuffer, ZERO, BUFFER_SIZE)) >= ZERO)
            {
                sb.append(new String(dataBuffer, ZERO, bytesRead));
            }
        }
        catch (UnknownHostException e)
        {
            return null;
        }
        catch (IOException e)
        {
            return null;
        }

        return sb.toString();
    }

    //Usage:
    public static void main(String[] args)
    {
        try
        {
            TextURL url = new TextURL("http://www.flickr.com/explore/interesting/7days/");
            String contents = url.read();

            if (contents != null)
                System.out.println(contents);
            else
                System.out.println("ERROR!");
        }
        catch (MalformedURLException e)
        {
            System.out.println("Check you the url!");
        }
    }
}

My question is: is this a good way to achieve what I want? Are there any better solutions?

I particularly don't like sb.append(new String(dataBuffer, ZERO, bytesRead)); but I wasn't able to express it in a different way. Is it good to create a new String on every iteration? I suppose not.

Any other weak points?

Thanks in advance!

内心激荡 2024-11-12 05:44:46

Consider using URLConnection instead. Furthermore, you might want to use IOUtils from Apache Commons IO to make reading the string easier. For example:

URL url = new URL("http://www.example.com/");
URLConnection con = url.openConnection();
InputStream in = con.getInputStream();
// ** WRONG: getContentEncoding() returns the Content-Encoding header (e.g. "gzip"), not the charset.
// Use con.getContentType() instead; it returns something like "text/html; charset=UTF-8",
// so the value must be parsed to extract the actual encoding.
String encoding = con.getContentEncoding();
encoding = encoding == null ? "UTF-8" : encoding;
String body = IOUtils.toString(in, encoding);
System.out.println(body);

If you don't want to use IOUtils, I'd probably rewrite the IOUtils.toString line above as something like:

ByteArrayOutputStream baos = new ByteArrayOutputStream();
byte[] buf = new byte[8192];
int len = 0;
while ((len = in.read(buf)) != -1) {
    baos.write(buf, 0, len);
}
String body = new String(baos.toByteArray(), encoding);
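
The comment in the first snippet says the charset has to be parsed out of the Content-Type header. Below is a minimal sketch of that parsing step, assuming a header of the form "text/html; charset=UTF-8". The class name ContentTypeCharset and the method charsetOf are hypothetical names chosen for illustration, and the UTF-8 fallback simply mirrors the default used in the answer.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original answer.
final class ContentTypeCharset {

    // Returns the charset named in a Content-Type value such as
    // "text/html; charset=UTF-8", or UTF-8 if none is given or it is unknown.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String param : contentType.split(";")) {
                String p = param.trim();
                // Case-insensitive check for the "charset=" parameter.
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    String name = p.substring(8).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // illegal or unsupported name: use the fallback
                    }
                }
            }
        }
        return StandardCharsets.UTF_8; // assumed default, as in the answer
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1"));
        System.out.println(charsetOf("text/html"));
    }
}
```

With a helper like this, the encoding line could become something like charsetOf(con.getContentType()), with the result passed to new String(baos.toByteArray(), charset).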

逆流 2024-11-12 05:44:46

I highly recommend using a dedicated library, like HtmlParser:

Parser parser = new Parser (url);
NodeList list = parser.parse (null);
System.out.println (list.toHtml ());

Writing your own HTML parser is such a waste of time. Here is its Maven dependency. Look at its JavaDoc to dig into its features.

The following sample should be convincing:

Parser parser = new Parser(url);
NodeList movies = parser.extractAllNodesThatMatch(
    new AndFilter(new TagNameFilter("div"),
    new HasAttributeFilter("class", "movie")));

烟雨凡馨 2024-11-12 05:44:46

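Unless this is some sort of exercise that you want to code for the sake of learning, I would not reinvent the wheel; I would use HttpURLConnection.

HttpURLConnection provides good encapsulation mechanisms for dealing with the HTTP protocol. For instance, your code doesn't work with HTTP redirects; HttpURLConnection would handle that for you.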
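
A minimal sketch of what using HttpURLConnection directly might look like is given below. The class name HttpFetch, the 5-second timeouts, and the UTF-8 decoding are assumptions made for illustration, not part of the answer. One caveat worth knowing: HttpURLConnection follows redirects by default, but not across protocols (for example from http to https), so checking the status code is still useful.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original answer.
final class HttpFetch {

    static String fetch(URL url) throws IOException {
        URLConnection raw = url.openConnection();
        if (!(raw instanceof HttpURLConnection)) {
            throw new IllegalArgumentException("Not an HTTP(S) URL: " + url);
        }
        HttpURLConnection con = (HttpURLConnection) raw;
        con.setInstanceFollowRedirects(true); // already the default; stated for clarity
        con.setConnectTimeout(5000);          // assumed timeouts, in milliseconds
        con.setReadTimeout(5000);
        try {
            int status = con.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("Unexpected HTTP status " + status + " for " + url);
            }
            try (InputStream in = con.getInputStream()) {
                ByteArrayOutputStream baos = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                int len;
                while ((len = in.read(buf)) != -1) {
                    baos.write(buf, 0, len);
                }
                // Assumption: the body is UTF-8; a real client would inspect Content-Type.
                return new String(baos.toByteArray(), StandardCharsets.UTF_8);
            }
        } finally {
            con.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        String address = args.length > 0 ? args[0] : "http://www.example.com/";
        System.out.println(fetch(new URL(address)));
    }
}
```

Unlike the read() method in the question, this reports failures as exceptions instead of returning null, so the caller can tell a bad status code from a network error.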

对你再特殊 2024-11-12 05:44:46

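You can wrap your InputStream in an InputStreamReader and use its read() method to read character data directly (note that you should specify the encoding when creating the Reader, but finding out the encoding of an arbitrary URL is not trivial). Then just call sb.append() with the char[] you just read (and the correct offset and length).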
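
A minimal sketch of this Reader-based loop follows, assuming the caller already knows the charset. The class name ReaderSketch and the method readAll are hypothetical. Besides avoiding a new String per iteration, decoding through a Reader also avoids a subtle problem in the question's code: a multi-byte character split across two buffer reads would be decoded incorrectly by new String(dataBuffer, ZERO, bytesRead).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original answer.
final class ReaderSketch {

    // Reads the whole stream as text, decoding with the given charset.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[8192];
        try (Reader reader = new InputStreamReader(in, charset)) {
            int n;
            while ((n = reader.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n); // append chars directly, no intermediate String
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for urlObject.openStream().
        byte[] bytes = "na\u00efve caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8));
    }
}
```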

要走干脆点 2024-11-12 05:44:46

I know this is an old question, but I'm sure other people will find it too.

If you don't mind an additional dependency, here's a very simple way:

Jsoup.connect("http://example.com/").get().toString()

You'll need the Jsoup library, but you can quickly add it with Maven or Gradle, and it also lets you manipulate the contents of the page and find specific nodes.

池予 2024-11-12 05:44:46

Hey, please use these lines of code; they should help.

 <%@ page import="java.io.*, java.net.*" %>
 <!DOCTYPE html>
    <html>
        <head>
            <script src="http://ajax.googleapis.com/ajax/libs/jquery/1.10.2/jquery.min.js"></script>
            <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
            <title>JSP Page</title>
        </head>
        <body>
            <h1>Hello World!</h1>
 <%
        URL uri = new URL("Your url");
        URLConnection ec = uri.openConnection();
        BufferedReader in = new BufferedReader(new InputStreamReader(
                ec.getInputStream(), "UTF-8"));
        String inputLine;
        StringBuilder a = new StringBuilder();
        // readLine() strips line terminators, so put a newline back after each line
        while ((inputLine = in.readLine()) != null)
            a.append(inputLine).append('\n');
        in.close();

        out.println(a.toString());
 %>
        </body>
    </html>
