How to remove all HTML attributes from HTML tags in a string

Posted on 2025-01-08 06:27:05

I am trying to take a string that contains HTML, strip out some tags entirely (img, object), and strip the attributes from all other HTML tags. For example:

<div id="someId" style="color: #000000">
   <p class="someClass">Some Text</p>
   <img src="images/someimage.jpg" alt="" />
   <a href="somelink.html">Some Link Text</a>
</div>

Would become:

<div>
   <p>Some Text</p>
   Some Link Text
</div>

I am trying:

string.replaceAll("<\/?[img|object](\s\w+(\=\".*\")?)*\>", ""); //REMOVE img/object

I am not sure how to strip all attributes inside a tag though.

Any help would be appreciated.

Thanks.

Comments (4)

青朷 2025-01-15 06:27:05

I would not recommend regex for this if you want to filter specific tags. It is going to be a hell of a job and will never be fully reliable. Use a normal HTML parser like Jsoup. It offers the Whitelist API to clean up HTML. See also this cookbook document.

Here's a kickoff example using Jsoup. It allows only <div> and <p> tags in addition to the standard tag set of the chosen Whitelist, which is Whitelist#simpleText() in the example below.

import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist; // Note: jsoup 1.14.2 and later rename this class to Safelist.

String html = "<div id='someId' style='color: #000000'><p class='someClass'>Some Text</p><img src='images/someimage.jpg' alt='' /><a href='somelink.html'>Some Link Text</a></div>";
Whitelist whitelist = Whitelist.simpleText(); // Whitelist.simpleText() allows b, em, i, strong, u. Use Whitelist.none() instead if you want to start clean.
whitelist.addTags("div", "p"); // Allow <div> and <p> as well; no attributes are allowed on them.
String clean = Jsoup.clean(html, whitelist);
System.out.println(clean);

This results in

<div>
   <p>Some Text</p>Some Link Text
</div>

三寸金莲 2025-01-15 06:27:05

You can remove all attributes like this:

string.replaceAll("(<\\w+)[^>]*(>)", "$1$2");

This expression matches an opening tag, but captures only its header <div and the closing > as groups 1 and 2. replaceAll uses references to these groups to join them back in the output as $1$2. This cuts out the attributes in the middle of the tag.
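As a minimal sketch of the replacement above wrapped in a runnable class (the class name `AttributeStripper` and the method `stripAttributes` are made up for illustration, not part of the answer): it assumes no attribute value contains a literal `>`, and it also collapses a self-closing tag such as `<img ... />` to `<img>`.

```java
import java.util.regex.Pattern;

public class AttributeStripper {
    // Group 1 is "<" plus the tag name; group 2 is the closing ">".
    // Everything between them (the attributes) is matched but not captured.
    private static final Pattern OPENING_TAG = Pattern.compile("(<\\w+)[^>]*(>)");

    public static String stripAttributes(String html) {
        // Rejoin the two captured groups, dropping the attributes in between.
        return OPENING_TAG.matcher(html).replaceAll("$1$2");
    }

    public static void main(String[] args) {
        String html = "<div id=\"someId\" style=\"color: #000000\">"
                + "<p class=\"someClass\">Some Text</p></div>";
        System.out.println(stripAttributes(html));
    }
}
```

Closing tags such as `</p>` are left alone because `\w+` cannot match the `/` that follows `<`.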

血之狂魔 2025-01-15 06:27:05

/<(/?\w+) .*?>/<\1>/ might work: it takes the tag name (the capture group), skips any attributes up to the closing bracket, and replaces the whole match with just the brackets and the tag name.
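The substitution above is written in sed-style notation. A possible Java translation, assuming `String.replaceAll` is the tool in use as in the question, might look like the sketch below; `SedStyleStripper` and `stripAttributes` are hypothetical names. Note that the pattern requires a space after the tag name, so tags without attributes are not touched, and like any regex approach it breaks if an attribute value contains `>`.

```java
public class SedStyleStripper {
    public static String stripAttributes(String html) {
        // Java uses $1 rather than \1 for group references in the replacement string.
        // "<(/?\\w+)" captures the tag name (with an optional leading slash),
        // " .*?>" lazily consumes a space and everything up to the next ">".
        return html.replaceAll("<(/?\\w+) .*?>", "<$1>");
    }

    public static void main(String[] args) {
        String html = "<div id=\"someId\" style=\"color: #000000\">"
                + "<a href=\"somelink.html\">Some Link Text</a></div>";
        System.out.println(stripAttributes(html));
    }
}
```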

情未る 2025-01-15 06:27:05

It would probably be much easier if you used a SAX or DOM parser: take each node's name and value, and drop all of its attributes.
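A rough sketch of the DOM route using only the JDK's built-in XML APIs is shown below. It assumes the input is well-formed XML (as the XHTML-style snippet in the question is); ordinary tag-soup HTML would need a real HTML parser instead. The class name `DomAttributeStripper`, the method `clean`, and the `DROPPED` set are illustrative choices, and the sketch only removes img/object elements and strips attributes; it does not unwrap other tags such as `<a>`.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DomAttributeStripper {
    // Elements that are removed together with their content (assumption: img and object).
    private static final Set<String> DROPPED = Set.of("img", "object");

    public static String clean(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            strip(doc.getDocumentElement());

            // Serialize the modified tree back to a string without the XML declaration.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    private static void strip(Element element) {
        // The attribute map is live, so keep removing the first entry until it is empty.
        NamedNodeMap attributes = element.getAttributes();
        while (attributes.getLength() > 0) {
            element.removeAttribute(attributes.item(0).getNodeName());
        }
        // Walk the children, remembering the next sibling before any removal.
        Node child = element.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling();
            if (child instanceof Element) {
                Element childElement = (Element) child;
                if (DROPPED.contains(childElement.getTagName())) {
                    element.removeChild(childElement);
                } else {
                    strip(childElement);
                }
            }
            child = next;
        }
    }

    public static void main(String[] args) {
        String xhtml = "<div id=\"someId\"><p class=\"someClass\">Some Text</p>"
                + "<img src=\"images/someimage.jpg\" alt=\"\" /></div>";
        System.out.println(clean(xhtml));
    }
}
```

Working on the parsed tree avoids the quoting and nesting pitfalls of the regex approaches, at the cost of requiring well-formed input.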
