How to fetch a page's source in the UTF-8 charset using Java

Posted on 2024-12-29 18:32:11

I want to fetch a web page's source as UTF-8 so that I can extract the page's title, description, and keywords.

My code works for about 9 out of 10 URLs, but it fails for some sites, such as Twitter.

I have searched Google many times but could not find a complete solution.

I use the code shown below:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitDesKey
{
    public static void main(String[] args) throws IOException
    {
        String inputLine, result_tit = null, result_des = null, result_key = null;
        // Start from an empty buffer: concatenating onto a null String
        // would prepend the literal text "null" to the page source.
        StringBuilder source = new StringBuilder();
        try
        {
            URL url = new URL("http://www.twitter.com");

            URLConnection conn = url.openConnection();
            conn.setRequestProperty("User-Agent", "Mozilla/5.0 (X11; U; Linux x86_64; en-GB; rv:1.8.1.6) Gecko/20070723 Iceweasel/2.0.0.6 (Debian-2.0.0.6-0etch1)");

            // Decode the response bytes as UTF-8; try-with-resources closes the reader.
            try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream(), "utf-8")))
            {
                // Only the <head> section is needed, so stop reading after </head>.
                while ((inputLine = in.readLine()) != null)
                {
                    source.append(' ').append(inputLine);
                    if (inputLine.contains("</head>"))
                    {
                        break;
                    }
                }
            }
        }
        catch (MalformedURLException e)
        {
            System.out.println("Please enter the right information: " + e.getMessage());
        }
        catch (IOException e)
        {
            System.out.println("Please enter the right information: " + e.getMessage());
        }

        // Title data
        Pattern PATTERN_tit = Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

        Matcher m_tit = PATTERN_tit.matcher(source);
        while (m_tit.find())
        {
            result_tit = m_tit.group(1);
            result_tit = result_tit.replace("/", "").trim();
            System.out.println(result_tit);
        }

        // Description data
        Pattern Pattern_dis = Pattern.compile("<meta name=\"description\" content=(.*?)>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

        Matcher m_dis = Pattern_dis.matcher(source);
        while (m_dis.find())
        {
            result_des = m_dis.group(1);
            result_des = result_des.replace("/", "").trim();
            System.out.println(result_des);
        }

        // Keyword data
        Pattern Pattern_key = Pattern.compile("<meta name=\"keywords\" content=(.*?)>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

        Matcher m_key = Pattern_key.matcher(source);
        while (m_key.find())
        {
            result_key = m_key.group(1);
            result_key = result_key.replace("/", "").trim();
            System.out.println(result_key);
        }
    }
}

This gives me output in some ISO-8859 encoding. I also used the InputStreamReader constructor overload that takes a charset name, passing "utf-8", which gives me output like "??????".

Please suggest a solution for this.

Thank you. :)

可可 2025-01-05 18:32:11

I tested your approach, and it works for me. Here is the code I used:

// Needs imports from java.io (BufferedReader, InputStreamReader)
// and java.net (URL, URLConnection).
public static void main(String[] args) {

    String inputLine;

    try {
        URL url = new URL("http://www.twitter.com");

        URLConnection conn = url.openConnection();
        conn.setRequestProperty(
            "User-Agent",
            "Mozilla/5.0 (X11; U; Linux x86_64; en-GB; rv:1.8.1.6) Gecko/20070723 Iceweasel/2.0.0.6 (Debian-2.0.0.6-0etch1)");

        // Decode the response as UTF-8; try-with-resources closes the reader.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "utf-8"))) {

            while ((inputLine = in.readLine()) != null) {
                System.out.println(inputLine);

                // Fail if any line contains two or more consecutive question marks
                // (assertions are only checked when the JVM runs with -ea)
                assert !inputLine.contains("??");
            }
        }
    }
    catch (Exception e) {
        e.printStackTrace();
    }

}

Can you update your question with an example of the incorrectly decoded output you are getting?
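One way to produce such an example is to look at what the server declares and at the code points that were actually decoded, instead of relying on what the console prints. The following is only a minimal sketch: the class name CharsetDiagnostics and both helper methods are hypothetical names introduced for illustration, and falling back to UTF-8 when no charset is declared is an assumption, not something the question specifies.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetDiagnostics {

    // Matches the charset parameter of a Content-Type value,
    // e.g. "text/html; charset=utf-8".
    private static final Pattern CHARSET_PARAM =
        Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the declared charset,
    // or UTF-8 (an assumed default) when none is declared or it is unknown.
    public static Charset charsetFromContentType(String contentType) {
        if (contentType != null) {
            Matcher m = CHARSET_PARAM.matcher(contentType);
            if (m.find()) {
                try {
                    return Charset.forName(m.group(1));
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the default below.
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Hypothetical helper: renders each code point as U+XXXX so the decoded
    // text can be inspected independently of the console's encoding.
    public static String describeCodePoints(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints "ISO-8859-1"
        System.out.println(charsetFromContentType("text/html; charset=ISO-8859-1"));
        // prints "U+0063 U+0061 U+0066 U+00E9"
        System.out.println(describeCodePoints("caf\u00e9"));
    }
}
```

With the connection from the code above, you could pass conn.getContentType() to charsetFromContentType and the extracted title to describeCodePoints. If the dump shows the expected code points while the console still prints "??????", the console's output encoding is the more likely culprit; if it shows U+FFFD, the bytes were probably not valid UTF-8 in the first place.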
