Jsoup - 如何通过转义而不删除不需要的 html 来清理 html?

发布于 2024-12-09 10:32:17 字数 652 浏览 0 评论 0原文

有没有办法让 jsoup 通过转义不需要的 HTML 而不是完全删除它来清理包含 HTML 的字符串?我的例子:

String dirty = "This is <b>REALLY</b> dirty code from <a href="www.rubbish.url.zzzz">haxors-r-us</a>
String clean = Jsoup.clean(dirty, new Whitelist().addTags("a").addAttributes("a", "href", "name", "rel", "target"));

这给出了一个“干净”的字符串:

This is    REALLY    dirty code from <a href="www.rubbish.url.zzzz">haxors-r-us</a>

我想要的是“干净”的字符串:

"This is &lt;b&gt;REALLY&lt;/b&gt; dirty code from <a href="www.rubbish.url.zzzz">haxors-r-us</a>

Is there a way of getting jsoup to clean a string with HTML in it by escaping the unwanted HTML rather than removing it completely? My example:

String dirty = "This is <b>REALLY</b> dirty code from <a href="www.rubbish.url.zzzz">haxors-r-us</a>
String clean = Jsoup.clean(dirty, new Whitelist().addTags("a").addAttributes("a", "href", "name", "rel", "target"));

This gives a "clean" string of:

This is    REALLY    dirty code from <a href="www.rubbish.url.zzzz">haxors-r-us</a>

What I am wanting is the "clean" string to be:

"This is <b>REALLY</b> dirty code from <a href="www.rubbish.url.zzzz">haxors-r-us</a>

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

宁愿没拥抱 2024-12-16 10:32:17

假设正在解析字符串而不是 HTML 文档(根据您的问题),此方法将起作用:

public String escapeHtml(String source) {
    Document doc = Jsoup.parseBodyFragment(source);
    Elements elements = doc.select("b");
    for (Element element : elements) {
        element.replaceWith(new TextNode(element.toString(),""));
    }
    return Jsoup.clean(doc.body().toString(), new Whitelist().addTags("a").addAttributes("a", "href", "name", "rel", "target"));
}

您可以将“b”标记作为参数来传递您希望转义的标记列表。

关联的通过 JUnit 测试:

@Test
public void testHtmlEscaping() throws Exception {
    String source = "This is <b>REALLY</b> dirty code from <a href=\"www.rubbish.url.zzzz\">haxors-r-us</a>";
    String expected = "This is <b>REALLY</b> dirty code from \n<a href=\"www.rubbish.url.zzzz\">haxors-r-us</a>";
    String transformed = transformer.escapeHtml(source);
    assertEquals(transformed, expected);
}

请注意,我在测试的“预期”字符串中的“a”标记之前添加了一行 return“\n”,因为 JSoup 格式化了页面。

Assuming String rather than HTML documents are being parsed (as per your question) this method will work:

public String escapeHtml(String source) {
    Document doc = Jsoup.parseBodyFragment(source);
    Elements elements = doc.select("b");
    for (Element element : elements) {
        element.replaceWith(new TextNode(element.toString(),""));
    }
    return Jsoup.clean(doc.body().toString(), new Whitelist().addTags("a").addAttributes("a", "href", "name", "rel", "target"));
}

You could make the "b" tag an argument to pass in a list of tags you wish to escape.

The associated passing JUnit test:

@Test
public void testHtmlEscaping() throws Exception {
    String source = "This is <b>REALLY</b> dirty code from <a href=\"www.rubbish.url.zzzz\">haxors-r-us</a>";
    String expected = "This is <b>REALLY</b> dirty code from \n<a href=\"www.rubbish.url.zzzz\">haxors-r-us</a>";
    String transformed = transformer.escapeHtml(source);
    assertEquals(transformed, expected);
}

Note that I added a line return "\n" before your "a" tag in my test's "expected" String because JSoup formats the page.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文