Reading web page content

Posted on 2024-11-10 16:08:22

Hi,
I want to read the content of a web page that contains German characters using Java. Unfortunately, the German characters appear as strange characters.
Any help, please?
Here is my code:

String link = "some german link";

URL url = new URL(link);
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
    System.out.println(inputLine);
}

Comments (4)

余罪 2024-11-17 16:08:23

You need to specify the character set for your InputStreamReader, for example:

new InputStreamReader(url.openStream(), "UTF-8")
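
To illustrate why the charset matters, here is a minimal sketch (not the asker's code; the class name `MojibakeDemo` and the sample word are made up for illustration). It encodes a German word as UTF-8 and then decodes the same bytes twice, once with the wrong charset and once with the right one. It assumes the page is served as UTF-8, which may not be true for the real link.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows how decoding with the wrong charset garbles text.
class MojibakeDemo {
    // Turn raw bytes into a String using an explicitly chosen charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // A German word containing u-umlaut and sharp s, written with Unicode
        // escapes so the encoding of this source file does not matter.
        String original = "Gr\u00fc\u00dfe";
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        // Wrong charset: every byte becomes its own character, so the two
        // non-ASCII letters each turn into two unrelated characters.
        String wrong = decode(utf8Bytes, StandardCharsets.ISO_8859_1);
        // Matching charset: the text round-trips unchanged.
        String right = decode(utf8Bytes, StandardCharsets.UTF_8);

        System.out.println("original length: " + original.length());
        System.out.println("decoded as ISO-8859-1: " + wrong.length() + " chars");
        System.out.println("decoded as UTF-8 equals original: " + right.equals(original));
    }
}
```

The same thing happens when `InputStreamReader` is created without a charset: it falls back to the JVM default, which may differ from the encoding the server actually used.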
若水微香 2024-11-17 16:08:23

You have to set the correct encoding. You can find the encoding in the HTTP header:

Content-Type: text/html; charset=ISO-8859-1

This may be overridden in the (X)HTML document itself; see HTML character encodings.
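
As a rough sketch of how the header value could be turned into a charset with the plain JDK: the helper below (the class `ContentTypeCharset` and the method `charsetOf` are hypothetical names, not part of any library) pulls the `charset` parameter out of a Content-Type string and falls back to a caller-supplied default when the parameter is missing or unknown. It looks only at the header and ignores any encoding declared inside the document.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper: extracts the charset parameter from a Content-Type value.
class ContentTypeCharset {
    // Returns the charset named in the header value, or the fallback if the
    // header is null, has no charset parameter, or names an unknown charset.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                // Strip the parameter name and any surrounding quotes.
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Both illegal and unsupported charset names end up here.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(charsetOf("text/html", StandardCharsets.UTF_8));
    }
}
```

With `java.net.URLConnection`, the header value is available from `getContentType()`, so the result of `charsetOf` could be passed to the `InputStreamReader` constructor.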

I can imagine that you have to consider many additional issues to parse a web page without errors. There are several HTTP client libraries available for Java that handle this, e.g. org.apache.httpcomponents. The code will look like this:

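import java.io.IOException;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

DefaultHttpClient httpclient = new DefaultHttpClient();
HttpGet httpGet = new HttpGet("http://www.spiegel.de");

try {
  HttpResponse response = httpclient.execute(httpGet);
  HttpEntity entity = response.getEntity();
  if (entity != null) {
    // EntityUtils.toString decodes the body using the charset from the Content-Type header
    System.out.println(EntityUtils.toString(entity));
  }
} catch (ClientProtocolException e) {
  e.printStackTrace();
} catch (IOException e) {
  e.printStackTrace();
}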

This is the Maven artifact:

<dependency>
  <groupId>org.apache.httpcomponents</groupId>
  <artifactId>httpclient</artifactId>
  <version>4.1.1</version>
  <type>jar</type>
  <scope>compile</scope>
</dependency>
花开雨落又逢春i 2024-11-17 16:08:23

Try setting a Charset.

new BufferedReader(new InputStreamReader(url.openStream(), Charset.forName("UTF-8")));
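
A slightly fuller sketch of the same idea, assuming Java 7 or later: `StandardCharsets.UTF_8` avoids the string lookup, and try-with-resources closes the reader. The class `ExplicitCharsetReader` and the method `readAll` are illustrative names only, and a `ByteArrayInputStream` stands in for `url.openStream()` so that the example runs without a network connection.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reads a whole stream as text using an explicit charset.
class ExplicitCharsetReader {
    // Reads all lines from the stream, decoding bytes with the given charset.
    // Line terminators are normalized to '\n'.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            // Rethrow unchecked to keep the sketch's signature simple.
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for a UTF-8 encoded page; the sharp s is a Unicode escape
        // so the encoding of this source file does not matter.
        byte[] page = "<p>Stra\u00dfe</p>\n".getBytes(StandardCharsets.UTF_8);
        String text = readAll(new ByteArrayInputStream(page), StandardCharsets.UTF_8);
        System.out.println("decoded correctly: " + text.contains("Stra\u00dfe"));
    }
}
```

For a real page, `readAll(url.openStream(), StandardCharsets.UTF_8)` would be the equivalent call, provided the page really is UTF-8.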
小女人ら 2024-11-17 16:08:23

First, verify that the font you are using supports the particular German characters you are trying to display. Many fonts don't carry all characters, and it is a big pain to look for other causes when it's a simple "missing character" issue.

If that's not the issue, then either your input or your output is in the wrong character set. Character sets determine how the numbers representing characters get mapped to glyphs (the pictures representing the characters). Java strings are UTF-16 internally, so the text is only damaged where bytes are converted to or from characters. The input stream is the more likely culprit, because it is decoded with the default charset unless you specify one. Check the input stream first.
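
If the input turns out to be fine, the output side can be pinned down the same way. The sketch below is only an illustration (the class `ExplicitCharsetOutput` and the method `writerFor` are made-up names): it wraps an output stream in a writer with an explicit charset and shows that the same character becomes different bytes depending on that choice. Whether the console actually expects UTF-8 is an assumption that depends on the terminal.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: writes text with an explicitly chosen output charset.
class ExplicitCharsetOutput {
    // Wraps a byte stream in a PrintWriter that encodes with the given charset.
    static PrintWriter writerFor(OutputStream out, Charset charset) {
        return new PrintWriter(new OutputStreamWriter(out, charset), true);
    }

    // Encodes the text through writerFor and returns the resulting bytes.
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintWriter writer = writerFor(buffer, charset);
        writer.print(text);
        writer.flush();
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        String umlaut = "\u00fc"; // u-umlaut as a Unicode escape
        System.out.println("UTF-8 bytes: " + encode(umlaut, StandardCharsets.UTF_8).length);
        System.out.println("ISO-8859-1 bytes: " + encode(umlaut, StandardCharsets.ISO_8859_1).length);

        // Assumption: the console expects UTF-8. If it does not, this line
        // will itself show strange characters.
        writerFor(System.out, StandardCharsets.UTF_8).println("Gr\u00fc\u00dfe");
    }
}
```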
