Comments getting escaped by NekoHTML (or JTidy) + XOM
I'm using NekoHTML to clean up some HTML, and then feeding it to XOM to get an object model. Somewhere in the course of this, comments are getting escaped.
Here's a relevant example of the input HTML (most of the <head> cut for clarity):
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<head>
<script type="text/JavaScript">
<!-- // Hide the JS
startTimeout(6000000, "/");
// -->
</script>
Here's the code:
// XOMSafeSAXParser is the Neko SAXParser extended to allow
// XOM to set the (unnecessary in this case) features
// external-general-entities and external-parameter-entities
XMLReader reader = new XOMSafeSAXParser();
Builder xomBuilder = new Builder(reader);
Reader input = ...; // file, resource, etc.
Document doc = xomBuilder.build(input);
Serializer s = new Serializer(System.out, "UTF-8");
s.setIndent(4);
s.setMaxLength(200);
s.write(doc);
s.flush();
Here's the corresponding output:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<HTML lang="en">
<HEAD>
<SCRIPT type="text/JavaScript"> &lt;!-- // Hide the JS startTimeout(6000000, "/"); // --&gt; </SCRIPT>
</HEAD>
When I extract the script element from the XOM document, it looks like it's already been mangled (the SCRIPT element has one Text node as a child, not the sequence of Texts and Comments I would expect), so I don't think it's the Serializer that's going wrong.
Now, I don't expect the line breaks to be preserved, and in fact I'm going to throw the script tags out anyway, but there are other places where I'd like comments to be preserved or, at minimum, to be able to get text without escaped comments embedded in it.
Any ideas?
Update: NekoHTML was mangling some tags, so I switched to JTidy, and I have the same problem. Interestingly, though, it's only a problem for the script tag in the header; other comments come through fine. And there are weird extra JavaScript comments that I suspect (hope and pray) are JTidy's fault.
<script type="text/JavaScript"> // &lt;!-- // Hide the JS startTimeout(6000000, "/"); // --&gt; // </script>
It looks as though what JTidy's doing is converting <script> contents to CDATA; when I send JTidy's raw output to stdout, I get this:
<script type="text/JavaScript">
//<![CDATA[
<!-- // Hide the JS
startTimeout(6000000, "/");
// -->
//]]>
</script>
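The effect of that CDATA wrapper can be checked without XOM or JTidy. The following is a minimal sketch that uses the JDK's built-in DOM parser as a stand-in (an assumption: its node types do not map one-to-one onto XOM's, which has no separate CDATA node). The class name `CdataCommentDemo` and the two sample constants are made up for illustration; the samples are the script element as written in the input and as JTidy emits it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class: JDK DOM used in place of XOM for illustration only.
final class CdataCommentDemo {

    // The script element as it appears in the input HTML.
    static final String PLAIN =
        "<script type=\"text/JavaScript\">\n"
        + "<!-- // Hide the JS\n"
        + "startTimeout(6000000, \"/\");\n"
        + "// -->\n"
        + "</script>";

    // The same element with the body wrapped in CDATA, as in JTidy's raw output.
    static final String WRAPPED =
        "<script type=\"text/JavaScript\">\n"
        + "//<![CDATA[\n"
        + "<!-- // Hide the JS\n"
        + "startTimeout(6000000, \"/\");\n"
        + "// -->\n"
        + "//]]>\n"
        + "</script>";

    /** Returns the node-type names of the root element's children, in document order. */
    static List<String> childTypes(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList children = doc.getDocumentElement().getChildNodes();
            List<String> types = new ArrayList<>();
            for (int i = 0; i < children.getLength(); i++) {
                switch (children.item(i).getNodeType()) {
                    case Node.COMMENT_NODE: types.add("Comment"); break;
                    case Node.CDATA_SECTION_NODE: types.add("CDATA"); break;
                    case Node.TEXT_NODE: types.add("Text"); break;
                    default: types.add("Other"); break;
                }
            }
            return types;
        } catch (Exception e) {
            // Wrapped so callers need no checked-exception handling in this sketch.
            throw new IllegalStateException("could not parse sample", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("plain:   " + childTypes(PLAIN));
        System.out.println("wrapped: " + childTypes(WRAPPED));
    }
}
```

With the JDK's default parser, the first sample should report a comment among the children and the second should not, since everything between the CDATA markers is character data and the `<!--` inside it is never treated as markup.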
All right. I seem to have found the explanation, at least for the JTidy case:
HTML comments in general are okay; it's just HTML comments inside <script> tags that get mangled, because they're turned into (and escaped within) CDATA. XOM, in turn, merges CDATA into Text.
Technically, I think this means JTidy is broken, but it's good enough for my purposes since I don't need the <script> tags at all.
Still, if anybody has a solution that gets me out what I put in, I'd still like to hear it.
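For the weaker goal of getting text without comments embedded in it, one possible workaround is to post-process the string taken from the text node. This is only a sketch, not a fix for the parser: it assumes the text has already been read out of the node as a plain string (so the markers appear as `<!--` and `-->`, not as entities), it does not turn anything back into Comment nodes, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper: removes comment-like spans from already-extracted text.
final class EmbeddedCommentStripper {

    // DOTALL lets the span cross line breaks; the reluctant quantifier
    // stops at the first closing "-->" so separate comments stay separate.
    private static final Pattern COMMENT = Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    /** Returns the text with every "<!-- ... -->" span removed. */
    static String stripEmbeddedComments(String text) {
        return COMMENT.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String text = "before <!-- // Hide the JS\nstartTimeout(6000000, \"/\");\n// --> after";
        System.out.println(stripEmbeddedComments(text));
    }
}
```

An unterminated `<!--` is left untouched, and for a `<script>` body this discards the hidden JavaScript along with the markers, which only suits cases where the script is going to be thrown away anyway.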