Tidy breaks links containing non-Latin characters
I use the Java library Tidy to sanitize HTML. Some of the input contains links with Russian letters, for example:
<a href="http://example.com/Русский">link with Russian letters</a>
I understand that "Русский" must be escaped, but I receive this HTML from users, and my job is to convert it to XHTML.
I think Tidy tries to escape the non-Latin letters, but the result is
<a href="http://example.com/%420%443%441%441%43A%438%439">link with Russian letters</a>
This is not correct. The correct version is
<a href="http://example.com/%D0%A0%D1%83%D1%81%D1%81%D0%BA%D0%B8%D0%B9">link with Russian letters</a>
The Java code is
// Shared, lazily created Tidy instance (field implied by the original snippet).
private static Tidy tidy;

private static Tidy getTidy() {
    if (null == tidy) {
        tidy = new Tidy();
        tidy.setQuiet(true);
        tidy.setShowErrors(0);
        tidy.setShowWarnings(false);
        tidy.setXHTML(true);
        tidy.setOutputEncoding("UTF-8");
    }
    return tidy;
}

public static String sanitizeHtml(String html, URI pageUri) {
    boolean escapeMedia = false;
    String ret = "";
    try {
        Document doc = getTidy().parseDOM(new StringReader("<body>" + html + "</body>"), null);
        // here I make some processing
        // string output
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Node node = doc.getElementsByTagName("body").item(0);
        getTidy().pprint(node, out);
        // Decode with the same charset Tidy was told to write, not the platform default.
        ret = out.toString("UTF-8").trim();
    }
    catch (Exception e) {
        // Fall back to the unmodified input if Tidy fails.
        ret = html;
        e.printStackTrace();
    }
    return ret;
}
Comments (1)
This behaviour is hard-coded and is probably a bug. Tidy escapes non-ASCII characters in URLs using their UTF-16 code units when it should use their UTF-8 bytes. See org/w3c/tidy/AttrCheckImpl.java.
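To make the difference concrete, here is a minimal sketch of the two escaping schemes. The class `HrefEscaper` and both method names are hypothetical and are not part of Tidy: `escapeAsCodeUnits` only mimics the output observed in the question (a percent sign followed by the hex value of each UTF-16 code unit), and `escapeAsUtf8` shows the expected percent-encoding of the UTF-8 bytes. Where such a helper would be called, for example on href values before the HTML reaches Tidy, is an assumption and is not shown.

```java
import java.nio.charset.StandardCharsets;

final class HrefEscaper {

    // Mimics the output seen in the question: '%' plus the hex value of each
    // non-ASCII UTF-16 code unit. This is an illustration, not Tidy's actual code.
    static String escapeAsCodeUnits(String url) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < url.length(); i++) {
            char c = url.charAt(i);
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append('%').append(Integer.toHexString(c).toUpperCase());
            }
        }
        return sb.toString();
    }

    // Percent-encodes each non-ASCII character as its UTF-8 bytes and leaves
    // every ASCII character untouched.
    static String escapeAsUtf8(String url) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < url.length()) {
            int cp = url.codePointAt(i);
            int len = Character.charCount(cp);
            if (cp < 0x80) {
                sb.append((char) cp);
            } else {
                byte[] bytes = url.substring(i, i + len).getBytes(StandardCharsets.UTF_8);
                for (byte b : bytes) {
                    sb.append(String.format("%%%02X", b & 0xFF));
                }
            }
            i += len;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String url = "http://example.com/Русский";
        System.out.println(escapeAsCodeUnits(url)); // the broken form from the question
        System.out.println(escapeAsUtf8(url));      // the expected form
    }
}
```

For the URL in the question, the first method reproduces the broken link and the second yields the correct one. Note that this sketch deliberately leaves ASCII characters alone, so it does not escape spaces or other reserved characters.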