Java class for removing HTML tags generated by MSWord

Posted on 2024-09-30 21:19:13

Some HTML forms are filled in by users who copy and paste from MSWord into FCKeditor or other editors.
This produces nasty tags that keep other tools from working properly.
Is there a way for the server to clean the incoming parameters so that these HTML tags are removed?

Of course, regular expressions are of little use here, since users can write anything.

I am looking for a Java class that specializes in this job.

For example, all of the following could be replaced by an empty string.

<p><!--[if gte mso 9]><xml>
<w:WordDocument>
<w:View>Normal</w:View>
<w:Zoom>0</w:Zoom>
<w:HyphenationZone>21</w:HyphenationZone>
<w:PunctuationKerning />
<w:ValidateAgainstSchemas />
<w:SaveIfXMLInvalid>false</w:SaveIfXMLInvalid>
<w:IgnoreMixedContent>false</w:IgnoreMixedContent>
<w:AlwaysShowPlaceholderText>false</w:AlwaysShowPlaceholderText>
<w:Compatibility>
<w:BreakWrappedTables />
<w:SnapToGridInCell />
<w:WrapTextWithPunct />
<w:UseAsianBreakRules />
<w:DontGrowAutofit />
</w:Compatibility>
<w:BrowserLevel>MicrosoftInternetExplorer4</w:BrowserLevel>
</w:WordDocument>
</xml><![endif]--><!--[if gte mso 9]><xml>
<w:LatentStyles DefLockedState="false" LatentStyleCount="156">
</w:LatentStyles>
</xml><![endif]--><!--[if gte mso 10]>
<style>
/* Style Definitions */
table.MsoNormalTable
{mso-style-name:"Tabla normal";
mso-tstyle-rowband-size:0;
mso-tstyle-colband-size:0;
mso-style-noshow:yes;
mso-style-parent:"";
mso-padding-alt:0cm 5.4pt 0cm 5.4pt;
mso-para-margin:0cm;
mso-para-margin-bottom:.0001pt;
mso-pagination:widow-orphan;
font-size:10.0pt;
font-family:"Times New Roman";
mso-ansi-language:#0400;
mso-fareast-language:#0400;
mso-bidi-language:#0400;}
</style>
<![endif]--></p>

Comments (5)

ま昔日黯然 2024-10-07 21:19:14

docx4j produces clean HTML, which is specifically intended to round trip through CKEditor.

染年凉城似染瑾 2024-10-07 21:19:13

FCKEditor has a "paste from word" button that works very well.
Could you ask your users to use this functionality?

太阳男子 2024-10-07 21:19:13

You could try JTidy. It's a Java port of HTMLtidy, which can do the type of cleanup you're looking for. Caveat emptor: I haven't used JTidy and I have no idea how well it works.

猥︴琐丶欲为 2024-10-07 21:19:13

Use https://code.google.com/p/owasp-java-html-sanitizer/ to build an "accept only" HTML policy. It gets rid of everything except the things you say to include. Not only does this remove the Word HTML garbage, it also protects your HTML input from XSS.

import org.owasp.html.HtmlPolicyBuilder;
import org.owasp.html.PolicyFactory;
import org.owasp.html.Sanitizers;

// Allow table markup and a global style attribute, then add the prebuilt policies.
PolicyFactory policy = new HtmlPolicyBuilder()
        .allowElements("table", "tr", "td", "th")
        .allowAttributes("style").globally()
        .toFactory();
policy = policy.and(Sanitizers.FORMATTING).and(Sanitizers.BLOCKS).and(Sanitizers.IMAGES).and(Sanitizers.LINKS);

// html is the incoming request parameter to clean.
String safeHtml = policy.sanitize(html);

The problem with JTidy is that it can be quite slow. The HTML sanitizer is incredibly fast in comparison.

倾城泪 2024-10-07 21:19:13

The latest version of CKEditor supports automatic detection when you paste from Word, which means users would not have to use the button, even though the button is still there. It detects content pasted from Word and offers to clean it up or convert it to plain text.
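
CKEditor's detection runs in the browser. If the server also needs to notice Word markup in an incoming parameter, a rough heuristic along the following lines might serve as a first check. This is only a sketch: the class name `WordPasteCheck`, its methods, and the marker list are made up for illustration and are not part of CKEditor or any library mentioned in this thread. It assumes Java 9+ and uses only the standard library. The regular expression targets just the `<!--[if ...]> ... <![endif]-->` blocks seen in the question's sample; it is not a general sanitizer and does not replace an allow-list policy.

```java
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of any library named in this thread. */
final class WordPasteCheck {

    // Assumed list of markers that tend to appear in markup pasted from Word; extend as needed.
    private static final List<String> MARKERS =
            List.of("<!--[if gte mso", "<w:worddocument", "class=\"mso", "mso-style-");

    // Matches one conditional comment block: <!--[if ...]> ... <![endif]-->
    private static final Pattern CONDITIONAL_COMMENT =
            Pattern.compile("<!--\\[if[^\\]]*\\]>.*?<!\\[endif\\]-->", Pattern.DOTALL);

    private WordPasteCheck() {}

    /** Returns true if the input contains any of the assumed Word markers (case-insensitive). */
    static boolean looksLikeWordHtml(String html) {
        if (html == null) {
            return false;
        }
        String lower = html.toLowerCase(Locale.ROOT);
        return MARKERS.stream().anyMatch(lower::contains);
    }

    /** Removes conditional comment blocks and leaves all other markup untouched. */
    static String stripConditionalComments(String html) {
        return CONDITIONAL_COMMENT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String pasted = "<p><!--[if gte mso 9]><xml><w:WordDocument>"
                + "</w:WordDocument></xml><![endif]-->Hello</p>";
        System.out.println(looksLikeWordHtml(pasted));
        System.out.println(stripConditionalComments(pasted));
    }
}
```

A check like this could decide whether to run a heavier cleanup or to warn the user. The output would still need to go through a real sanitizer before being stored or rendered.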
