jTidy returns nothing after tidying HTML

Posted on 2024-12-27 03:04:08

I have come across a very annoying problem when using jTidy (on Android). I have found jTidy works on every HTML Document I have tested it against, except the following:

    <!DOCTYPE html>
      <html lang="en">
       <head>
        <meta charset="utf-8" />

         <!-- Always force latest IE rendering engine & Chrome Frame 
              Remove this if you use the .htaccess -->
         <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />

         <title>templates</title>
         <meta name="description" content="" />
         <meta name="author" content="" />

         <meta name="viewport" content="width=device-width; initial-scale=1.0" />

         <!-- Replace favicon.ico & apple-touch-icon.png in the root of your domain and delete these references -->
      <link rel="shortcut icon" href="/favicon.ico" />
      <link rel="apple-touch-icon" href="/apple-touch-icon.png" />
   </head>

 <body>
   <div>
     <header>
       <h1>Page Heading</h1>
     </header>
     <nav>
       <p><a href="/">Home</a></p>
       <p><a href="/contact">Contact</a></p>
     </nav>

     <div>

     </div>

     <footer>
      <p>&copy; Copyright</p>
     </footer>
   </div>
 </body>
 </html>

But after tidying it, jTidy returns nothing: if the String containing the tidied HTML is called result, then result.equals("") is true.

I have noticed something very interesting though: if I remove everything in the body part of the HTML, jTidy works perfectly. Is there something in the <body></body> that jTidy doesn't like?

Here is the Java code I am using:

 public String tidy(String sourceHTML) {
   StringReader reader = new StringReader(sourceHTML);

   ByteArrayOutputStream baos = new ByteArrayOutputStream();
   Tidy tidy = new Tidy();
   tidy.setMakeClean(true);
   tidy.setQuiet(false);
   tidy.setIndentContent(true);
   tidy.setSmartIndent(true);

   tidy.parse(reader, baos);

   try {
     return baos.toString(mEncoding);
   } catch (UnsupportedEncodingException e) {
     return null;
   }
 }

Is there something wrong with my Java? Is this an error with jTidy? Is there any way I can make jTidy not do this? (I cannot change the HTML). If this absolutely cannot be fixed, are there any other good HTML Tidiers? Thanks very much!

Comments (2)

嘿哥们儿 2025-01-03 03:04:08

Try this:

tidy.setForceOutput(true);

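There are probably parse errors. By default, jTidy produces no output when it encounters errors; setForceOutput(true) tells it to write the output anyway.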
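
One likely source of such parse errors is the HTML5 elements in the body (`<header>`, `<nav>`, `<footer>`), which HTML 4-era tidiers generally do not recognize; that would also fit the observation that the document tidies fine once the body is emptied. Below is a minimal sketch, using only the standard library, for checking whether a document contains such elements before handing it to jTidy. The class name `Html5TagScan`, the method `findHtml5Tags`, and the tag list are illustrative assumptions, not part of jTidy:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of jTidy): lists HTML5-only elements used in a document. */
final class Html5TagScan {

    // Illustrative subset of elements introduced in HTML5; extend as needed.
    private static final Set<String> HTML5_TAGS = new HashSet<>(Arrays.asList(
            "header", "nav", "footer", "section", "article",
            "aside", "main", "figure", "figcaption"));

    // Captures the name of an opening tag such as "<nav" or "<h1".
    // Closing tags ("</nav>") and comments ("<!--") do not match.
    private static final Pattern OPEN_TAG =
            Pattern.compile("<([A-Za-z][A-Za-z0-9-]*)");

    /**
     * Returns the HTML5 element names found in the input, in order of first appearance.
     * This is a rough textual scan, not a parser: tag-like text inside comments
     * or scripts is matched as well.
     */
    static Set<String> findHtml5Tags(String html) {
        Set<String> found = new LinkedHashSet<>();
        Matcher m = OPEN_TAG.matcher(html);
        while (m.find()) {
            String name = m.group(1).toLowerCase(Locale.ROOT);
            if (HTML5_TAGS.contains(name)) {
                found.add(name);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<body><div><header><h1>Page Heading</h1></header>"
                + "<nav><p><a href=\"/\">Home</a></p></nav>"
                + "<footer><p>&copy; Copyright</p></footer></div></body>";
        System.out.println(findHtml5Tags(html)); // prints [header, nav, footer]
    }
}
```

If the scan reports any elements, the diagnostics jTidy prints when `setQuiet(false)` is set, as in the question, should confirm whether those elements are what it is rejecting.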

逆蝶 2025-01-03 03:04:08

Check out Jsoup; it's my recommendation for any kind of Java HTML processing (I've used HtmlCleaner too, but then switched to jsoup).

Cleaning HTML with Jsoup:

final String yourHtml = ...

String output = Jsoup.clean(yourHtml, Whitelist.relaxed());

That's all! (In newer jsoup releases, Whitelist has been replaced by Safelist.)

Or, if you want to change, remove, or parse something:

Document doc = Jsoup.parse(<file/string/website>, null);

String output = doc.toString();