Reading web page content
Hi,
I want to read the content of a web page that contains German characters using Java. Unfortunately, the German characters appear as strange characters.
Any help, please?
Here is my code:
String link = "some german link";
URL url = new URL(link);
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
    System.out.println(inputLine);
}
Comments (4)
You need to specify the character set for your InputStreamReader.
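A minimal sketch of what that could look like. It assumes the page is encoded as UTF-8, which the question does not state; swap in ISO-8859-1 or another charset if the page uses one. The class name `CharsetAwareReader` and the `readAll` helper are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetAwareReader {

    // Decodes the stream with the given charset and returns its lines, each followed by '\n'.
    static String readAll(InputStream stream, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        // Passing the charset explicitly avoids falling back to the platform default.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(stream, charset))) {
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                text.append(inputLine).append('\n');
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java CharsetAwareReader.java <page address>");
            return;
        }
        // Assumption: the page is UTF-8. Replace with the page's real encoding if it differs.
        Charset assumedCharset = StandardCharsets.UTF_8;
        try (InputStream stream = URI.create(args[0]).toURL().openStream()) {
            System.out.print(readAll(stream, assumedCharset));
        }
    }
}
```

The only change from the code in the question is the second argument to `InputStreamReader`; without it, the reader decodes with the JVM's default charset, which may not match the page.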
You have to set the correct encoding. You can find the encoding in the HTTP Content-Type header.
This may be overridden in the (X)HTML document itself; see HTML character encodings.
I can imagine that you have to consider many additional issues to parse a web page without errors. But there are various HTTP client libraries available for Java, e.g. org.apache.httpcomponents, which is available as a Maven artifact.
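As a rough illustration of the header lookup, here is a sketch that uses only the standard library rather than org.apache.httpcomponents. It extracts the `charset` parameter from a Content-Type value and falls back to a caller-supplied default. The class name `ContentTypeCharset` and the method `fromContentType` are hypothetical; in real use the header value could come from `URLConnection.getContentType()`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class ContentTypeCharset {

    // Returns the charset named in a Content-Type value such as
    // "text/html; charset=ISO-8859-1", or the fallback if none is usable.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type; the parameters follow it.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim();
                // The value may be quoted, e.g. charset="utf-8".
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset charset = fromContentType("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8);
        System.out.println(charset.name());
    }
}
```

The resulting `Charset` can then be passed to `InputStreamReader`. This sketch does not look at any encoding declared inside the HTML document, which may override the header.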
Try setting a Charset.
First, verify that the font you are using supports the particular German characters you are trying to display. Many fonts don't carry all characters, and it is a real pain to hunt for other causes when the problem is simply a missing character.
If that's not the issue, then either your input or your output is in the wrong character set. A character set determines how the numbers representing a character are mapped to that character, and from there to its glyph (the picture representing the character). Java strings are UTF-16 internally and can hold any German character, so the output stream is less likely to be the issue. Check the input stream.
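To see what a wrong input character set does, here is a small sketch that encodes "ä" as UTF-8 and then decodes those bytes as ISO-8859-1. The class and method names are made up for illustration, and the choice of these two charsets is an assumption about a typical mismatch, not something stated in the question.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {

    // Encodes the text as UTF-8, then deliberately decodes the bytes as ISO-8859-1.
    static String decodeUtf8BytesAsLatin1(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "\u00e4"; // the single character "ä"
        String garbled = decodeUtf8BytesAsLatin1(original);
        // Print code points rather than glyphs so the console encoding cannot interfere.
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
        // prints:
        // U+00C3
        // U+00A4
    }
}
```

One character turning into two unrelated ones ("Ã¤" in this case) is the usual sign that UTF-8 bytes were decoded with a single-byte charset.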