Leading "?" symbol in UTF-8
I have a file, index.html, saved as UTF-8:
<html>
<head></head>
<body>
<h1> THE TITLE </h1>
Please click <a href="url"> here </a>
<br> ... Some text... <br>
Image: <img src="nature.png"/>
<br> ... Some another text... <br>
Image2: <img src="nature2.png" />
</body>
</html>
I need to fetch everything contained in the BODY tag, modify it, and save it. This is what I do:
File input = new File("html/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "");
Elements body = doc.select("BODY");
//do some manipulations with the data and print it
System.out.println(body.html());
The result is:
?
<h1> THE TITLE </h1> Please click
<a href="url"> here </a>
...
It's fine, except for the question mark at the beginning. How can I avoid it?
Of course, I could just delete it from the result string, but I would like to understand what is going on.
Comments (1)
First of all, you need to make a PrintStream that understands UTF-8. Then try redirecting the output to a file and see whether there is still garbage when you read it back as UTF-8. If not, then your console simply isn't UTF-8 and doesn't know how to handle it.
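The answer's original snippet did not survive on this page, so the following is only a minimal sketch of what a UTF-8 PrintStream might look like. The class name Utf8Output, the helper utf8Stream, and the sample string are hypothetical and do not come from the original post. In the asker's code, the string would be the result of body.html().

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class Utf8Output {
    // Hypothetical helper: wraps any OutputStream in a PrintStream that
    // encodes characters as UTF-8 and flushes automatically on println.
    static PrintStream utf8Stream(OutputStream target) throws UnsupportedEncodingException {
        return new PrintStream(target, true, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for body.html(); an assumption for illustration only.
        String html = "<h1> THE TITLE </h1> Please click";

        // Print through the UTF-8 stream instead of calling System.out.println directly.
        PrintStream out = utf8Stream(System.out);
        out.println(html);
    }
}
```

Running it as `java Utf8Output > out.txt` and opening out.txt in an editor set to UTF-8 is one way to do the file check described above. If the text looks right in the file but not on screen, the console encoding is the likely culprit.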