Comments getting escaped by NekoHTML (or JTidy) + XOM

Posted on 2024-12-16 01:05:08

I'm using NekoHTML to clean up some HTML, and then feeding it to XOM to get an object model. Somewhere in the course of this, comments are getting escaped.

Here's a relevant example of the input HTML (most of the <head> cut for clarity):

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<head>
    <script type="text/JavaScript">
        <!-- // Hide the JS
        startTimeout(6000000, "/");
        // -->
    </script>

Here's the code:

// XOMSafeSAXParser is the Neko SAXParser extended to allow 
// XOM to set the (unnecessary in this case) features
// external-general-entities and external-parameter-entities
XMLReader reader = new XOMSafeSAXParser();

Builder xomBuilder = new Builder(reader);
Reader input = ...; // file, resource, etc.
Document doc = xomBuilder.build(input);

Serializer s = new Serializer(System.out, "UTF-8");
s.setIndent(4);
s.setMaxLength(200);
s.write(doc);
s.flush();

Here's the corresponding output:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<HTML lang="en">
    <HEAD>
        <SCRIPT type="text/JavaScript"> &lt;!-- // Hide the JS startTimeout(6000000, "/"); // --&gt; </SCRIPT>
    </HEAD>

When I extract the script element from the XOM document, it looks like it's already been mangled (the SCRIPT element has one Text node as a child, not the sequence of Texts and Comments I would expect), so I don't think it's the Serializer that's going wrong.

Now, I don't expect the line breaks to be preserved, and in fact I'm going to throw the script tags out anyway, but there are other places where I'd like comments to be preserved, or at minimum where I'd like to be able to get the text without escaped comments embedded in it.

Any ideas?


Update: NekoHTML was mangling some tags, so I switched to JTidy, and I have the same problem. Interestingly, though, it's only a problem for the script tag in the header; other comments come through fine. And there are weird extra JavaScript comments that I suspect (hope and pray) are JTidy's fault.

    <script type="text/JavaScript"> // &lt;!-- // Hide the JS startTimeout(6000000, "/"); // --&gt; // </script>

It looks as though what JTidy's doing is converting <script> contents to CDATA; when I send JTidy's raw output to stdout, I get this:

<script type="text/JavaScript">
//<![CDATA[
        <!-- // Hide the JS
        startTimeout(6000000, "/");
        // -->
    //]]>
</script>

Comments (1)

橘和柠 2024-12-23 01:05:08

All right. I seem to have found the explanation at least for the JTidy case:

the basic issue is that browser scripts will often contain special XML
characters: '&', '<', ']]>' and '<' + '/' + Letter. If these are escaped to make XML processors happy, it will break the
script. The agreed solution is to place source within a CDATA
section. This is now done for both <script> and <style> tags. So far,
so good. But there are a number open issues and possible unintended
consequences. ... script source is often embedded in HTML
comments to prevent parsing by older browsers that do not support
Javascript.

HTML comments in general are okay; it's just HTML comments inside <script> tags that get mangled, because they're turned into (and escaped within) CDATA. XOM, in turn, merges CDATA into Text.
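The merging step can be reproduced without XOM or JTidy. The following is a minimal sketch that uses the JDK's own DOM parser as a stand-in, on the assumption that `setCoalescing(true)` approximates how XOM folds CDATA into Text; the class name `CdataMergeDemo` and its helper methods are made up for illustration. It lists the kinds of child nodes under the root element, so a comment wrapped in CDATA can be compared with a real comment.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; it uses the JDK DOM parser, not XOM.
final class CdataMergeDemo {

    // Parses a small XML string. setCoalescing(true) converts CDATA sections
    // to text and merges them with adjacent text nodes (assumed here to be
    // comparable to what XOM does when it builds its tree).
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setCoalescing(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse: " + xml, e);
        }
    }

    // Lists the node names of the root element's children, comma-separated.
    static String childKinds(String xml) {
        StringBuilder kinds = new StringBuilder();
        Node child = parse(xml).getDocumentElement().getFirstChild();
        for (; child != null; child = child.getNextSibling()) {
            if (kinds.length() > 0) {
                kinds.append(',');
            }
            kinds.append(child.getNodeName());
        }
        return kinds.toString();
    }

    public static void main(String[] args) {
        // Shaped like JTidy's raw output: the HTML comment sits inside CDATA.
        String wrapped =
                "<script>//<![CDATA[\n<!-- // Hide the JS\n// -->\n//]]></script>";
        System.out.println(childKinds(wrapped));
        // A comment outside CDATA survives as its own node.
        System.out.println(childKinds("<p>a<!-- real comment -->b</p>"));
    }
}
```

In this sketch the CDATA-wrapped script yields a single `#text` child, while the plain paragraph yields `#text,#comment,#text`, which matches the difference described above. Once the comment markers are ordinary character data, any serializer has to write them as `&lt;!--` and `--&gt;`.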

Technically, I think this means JTidy is broken, but it's good enough for my purposes since I don't need the <script> tags at all.

Still, if anybody has a solution that gets me out what I put in, I'd still like to hear it.
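One partial workaround for the "text without embedded comments" part is to post-process the text of the script node. The sketch below is only an illustration, not a fix for NekoHTML or JTidy: `ScriptCommentStripper` and `stripHidingComment` are hypothetical names, and it assumes the string is the already-unescaped text value of the node with its line breaks still present. If the line breaks have been collapsed, the first pattern would remove the whole script.

```java
import java.util.regex.Pattern;

// Hypothetical helper; not part of XOM, NekoHTML, or JTidy.
final class ScriptCommentStripper {

    // Opening marker: leading whitespace, "<!--", and the rest of that line.
    private static final Pattern OPEN = Pattern.compile("^\\s*<!--[^\\n]*\\n?");

    // Closing marker: an optional "//", then "-->", then trailing whitespace.
    private static final Pattern CLOSE = Pattern.compile("(?://)?\\s*-->\\s*$");

    // Removes an old-style "hide the JS" comment wrapper from script text.
    // Text without such a wrapper is returned trimmed but otherwise unchanged.
    static String stripHidingComment(String scriptText) {
        String withoutOpen = OPEN.matcher(scriptText).replaceFirst("");
        String withoutClose = CLOSE.matcher(withoutOpen).replaceFirst("");
        return withoutClose.trim();
    }

    public static void main(String[] args) {
        String text = "\n        <!-- // Hide the JS\n"
                + "        startTimeout(6000000, \"/\");\n"
                + "        // -->\n    ";
        System.out.println(stripHidingComment(text));
    }
}
```

This does not restore the Comment nodes, so it only helps where the script body itself is what's wanted. JTidy's extra `//` lines around the CDATA section are not handled here and would need their own pattern.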
