How to get page source in the UTF-8 charset using Java
I want to fetch a web page's source in the UTF-8 charset so that I can extract the page's title, description, and keywords.
This works for about 9 out of 10 URLs, but I cannot get a correct result for some sites, such as Twitter.
I have searched Google many times but could not find a complete solution.
I use the code shown below:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitDesKey
{
    public static void main(String[] args) throws IOException
    {
        String inputLine, result_tit = null, result_des = null, result_key = null;
        // Collect the page head here; a StringBuilder avoids the literal "null" prefix
        // that concatenating onto a null String would produce
        StringBuilder source = new StringBuilder();
        try
        {
            URL url = new URL("http://www.twitter.com");
            URLConnection conn = url.openConnection();
            conn.setRequestProperty("User-Agent", "Mozilla/5.0 (X11; U; Linux x86_64; en-GB; rv:1.8.1.6) Gecko/20070723 Iceweasel/2.0.0.6 (Debian-2.0.0.6-0etch1)");
            // Decode the response bytes as UTF-8
            BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream(), "utf-8"));
            while ((inputLine = in.readLine()) != null)
            {
                source.append(" ").append(inputLine);
                // Stop reading once the <head> section has ended
                if (inputLine.contains("</head>"))
                {
                    break;
                }
            }
            in.close();
        }
        catch (MalformedURLException e)
        {
            System.out.println("Please Enter Write Information");
        }
        catch (IOException e)
        {
            System.out.println("Please Enter Write Information");
        }
        // Title Data
        Pattern PATTERN_tit = Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m_tit = PATTERN_tit.matcher(source);
        while (m_tit.find())
        {
            result_tit = m_tit.group(1);
            result_tit = result_tit.replace("/", "").trim();
            System.out.println(result_tit);
        }
        // Description Data
        Pattern Pattern_dis = Pattern.compile("<meta name=\"description\" content=(.*?)>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m_dis = Pattern_dis.matcher(source);
        while (m_dis.find())
        {
            result_des = m_dis.group(1);
            result_des = result_des.replace("/", "").trim();
            System.out.println(result_des);
        }
        // Keyword Data
        Pattern Pattern_key = Pattern.compile("<meta name=\"keywords\" content=(.*?)>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m_key = Pattern_key.matcher(source);
        while (m_key.find())
        {
            result_key = m_key.group(1);
            result_key = result_key.replace("/", "").trim();
            System.out.println(result_key);
        }
    }
}
This gives me output in some ISO-8859 encoding. I also tried the InputStreamReader constructor overload that takes a charset name, passing "utf-8", which gives me output like "??????".
Please suggest a solution for this.
Thank you. :)
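One thing that may be worth ruling out is a mismatch between the hardcoded "utf-8" and the charset the server actually declares in its Content-Type header. The following is only a minimal sketch under that assumption: the class name `CharsetFromHeader` and the helper `charsetOf` are hypothetical, and falling back to UTF-8 is an illustrative choice rather than a rule.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharsetFromHeader {
    // Matches the charset parameter of a Content-Type value,
    // e.g. "text/html; charset=utf-8" or "text/html; charset=\"utf-8\""
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the declared charset, or UTF-8 (an assumed default) when the
    // header is missing, has no charset parameter, or names an unknown charset
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            Matcher m = CHARSET.matcher(contentType);
            if (m.find()) {
                try {
                    return Charset.forName(m.group(1));
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the default below
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1"));
        System.out.println(charsetOf("text/html"));
    }
}
```

With the code from the question, the result could be passed to the reader as `new InputStreamReader(conn.getInputStream(), charsetOf(conn.getContentType()))`. This does not cover pages that declare their charset only in a `<meta>` tag.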
Comments (1)
I tested your approach, and it works for me. Here is the code I used:
Can you update your question with an example of the incorrectly decoded output you are getting?
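When collecting such an example, it may help to separate a decoding problem from a console problem, since "?" characters can also appear when a correctly decoded string is written to an output that cannot represent it. The sketch below is illustrative only: the class `DecodeCheck`, its helpers, and the sample title are hypothetical and not taken from the question.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeCheck {
    // Renders each UTF-16 code unit as U+XXXX, so the result does not
    // depend on what the console is able to display
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) c));
        }
        return sb.toString();
    }

    // Encodes and decodes with the same charset; characters the charset
    // cannot represent are replaced, which for ISO-8859-1 means '?'
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Hypothetical title that was decoded correctly from the page
        String title = "\u3053\u3093\u306B\u3061\u306F";
        // The code units show the string itself is intact
        System.out.println(codeUnits(title));
        // Simulates an output limited to ISO-8859-1: every character becomes '?'
        System.out.println(roundTrip(title, StandardCharsets.ISO_8859_1));
        // Writing through an explicit UTF-8 PrintStream avoids that replacement,
        // provided the terminal itself is set to UTF-8
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(title);
    }
}
```

If `codeUnits` shows the expected characters but the plain `System.out.println` shows question marks, the page was probably decoded correctly and the loss happens on output.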