Leading "?" symbol in UTF-8

Posted 2025-01-05 07:06:18

I have a file, index.html (saved as UTF-8):

<html>
  <head></head>
  <body>
       <h1> THE TITLE </h1>     
       Please click <a href="url"> here </a>
       <br>  ... Some text... <br>    
       Image: <img  src="nature.png"/>    
       <br> ... Some another text... <br>    
        Image2: <img  src="nature2.png" />
   </body>
</html>

I need to fetch everything inside the BODY tag, modify it, and save it. This is what I do:

    File input = new File("html/input.html");
    Document doc = Jsoup.parse(input, "UTF-8", "");     
    Elements body = doc.select("BODY");

    //do some manipulations with the data and print it
    System.out.println(body.html());

The result is:

?   
<h1> THE TITLE </h1> Please click 
<a href="url"> here </a> 
...

It's fine, except for the question mark at the beginning. How can I avoid it?
Of course, I could just delete it from the result string, but I would like to understand what is going on.

Comments (1)

江南烟雨〆相思醉 2025-01-12 07:06:18

First of all, you need to create a PrintStream that encodes its output as UTF-8:

 PrintStream out = new PrintStream(System.out, true, "UTF-8");
 out.println(body.html());

Then try redirecting the output to a file and check whether there is still garbage when you read that file as UTF-8.

If there isn't, then your console simply isn't using UTF-8 and doesn't know how to display the character.
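
To see what the "?" really is before blaming the console, it can help to print the code points of the first few characters. The following is only a rough sketch of that check, not something taken from the question: the class name `Utf8OutputCheck`, both helper methods, and the sample string are made up for illustration, and the sample starts with U+FEFF purely as an example of a character many consoles cannot display. Whether that is the character in the asker's file is an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical helper class, not part of jsoup or of the original question.
class Utf8OutputCheck {

    // Sends the text through a UTF-8 PrintStream and returns the bytes it produced.
    static byte[] printAsUtf8(String text) throws UnsupportedEncodingException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(buffer, true, "UTF-8");
        out.print(text);
        out.flush();
        return buffer.toByteArray();
    }

    // Lists the first `limit` code points of the text in U+XXXX notation.
    static String describeCodePoints(String text, int limit) {
        StringBuilder sb = new StringBuilder();
        int count = 0;
        for (int i = 0; i < text.length() && count < limit; count++) {
            int cp = text.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Assumed stand-in for body.html(); replace with the real string.
        String html = "\uFEFF<h1> THE TITLE </h1>";

        // Shows which characters are actually at the start of the string.
        System.out.println(describeCodePoints(html, 3));

        // Same approach as in the answer: write through a UTF-8 PrintStream.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(html);
    }
}
```

If the first code point is not one you expected in the markup, the character is really in the parsed string and the console is only failing to display it.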
