Java class for removing HTML tags generated by MSWord
Some HTML forms are filled in by users who copy and paste from MSWord into FCKeditor or other editors.
This generates nasty tags that keep other tools from working properly.
Is there a way for the server to clean the incoming parameters so that these nasty HTML tags are removed?
Of course, regular expressions are of little use, since users can write anything.
I am looking for a Java class that specializes in this job.
For example, all of the following could be replaced by an empty string.
<p><!--[if gte mso 9]><xml> <w:WordDocument> <w:View>Normal</w:View> <w:Zoom>0</w:Zoom> <w:HyphenationZone>21</w:HyphenationZone> <w:PunctuationKerning /> <w:ValidateAgainstSchemas /> <w:SaveIfXMLInvalid>false</w:SaveIfXMLInvalid> <w:IgnoreMixedContent>false</w:IgnoreMixedContent> <w:AlwaysShowPlaceholderText>false</w:AlwaysShowPlaceholderText> <w:Compatibility> <w:BreakWrappedTables /> <w:SnapToGridInCell /> <w:WrapTextWithPunct /> <w:UseAsianBreakRules /> <w:DontGrowAutofit /> </w:Compatibility> <w:BrowserLevel>MicrosoftInternetExplorer4</w:BrowserLevel> </w:WordDocument> </xml><![endif]--><!--[if gte mso 9]><xml> <w:LatentStyles DefLockedState="false" LatentStyleCount="156"> </w:LatentStyles> </xml><![endif]--><!--[if gte mso 10]> <style> /* Style Definitions */ table.MsoNormalTable {mso-style-name:"Tabla normal"; mso-tstyle-rowband-size:0; mso-tstyle-colband-size:0; mso-style-noshow:yes; mso-style-parent:""; mso-padding-alt:0cm 5.4pt 0cm 5.4pt; mso-para-margin:0cm; mso-para-margin-bottom:.0001pt; mso-pagination:widow-orphan; font-size:10.0pt; font-family:"Times New Roman"; mso-ansi-language:#0400; mso-fareast-language:#0400; mso-bidi-language:#0400;} </style> <![endif]--></p>
Comments (5)
docx4j produces clean HTML, which is specifically intended to round trip through CKEditor.
FCKEditor has a "Paste from Word" button that works very well.
Could you ask your users to use this functionality?
You could try JTidy. It's a Java port of HTML Tidy, which can do the type of cleanup you're looking for. Caveat emptor: I haven't used JTidy and I have no idea how well it works.
Use https://code.google.com/p/owasp-java-html-sanitizer/ to build an accept-only HTML policy. This gets rid of everything except the things you say to include. Not only does this remove Word HTML garbage, it also protects your HTML input from XSS.
The problem with JTidy is that it can be quite slow. The HTML sanitizer is incredibly fast in comparison.
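To illustrate what "accept only" means, here is a minimal JDK-only sketch. It is not the OWASP sanitizer and not a substitute for it: the class name `AllowListSketch`, the `clean` method, and the list of allowed tags are all hypothetical, and the hand-rolled scanner ignores many HTML corner cases. It only shows the idea that comments (including Word's conditional comments) and unknown tags are dropped, while a short list of tags survives with their attributes removed.

```java
import java.util.Locale;
import java.util.Set;

// Illustrative allow-list filter; for real input prefer a maintained library.
class AllowListSketch {
    // Hypothetical allow-list: only these element names are kept.
    private static final Set<String> ALLOWED =
            Set.of("p", "b", "i", "em", "strong", "br", "ul", "ol", "li");

    static String clean(String html) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            if (html.startsWith("<!--", i)) {
                // Comment (this covers Word's <!--[if gte mso 9]> blocks): skip it whole.
                int end = html.indexOf("-->", i + 4);
                i = (end < 0) ? html.length() : end + 3;
            } else if (c == '<' && looksLikeTag(html, i)) {
                int end = html.indexOf('>', i + 1);
                if (end < 0) {
                    // Unterminated tag: conservatively drop the rest of the input.
                    i = html.length();
                } else {
                    String inside = html.substring(i + 1, end).trim();
                    boolean closing = inside.startsWith("/");
                    String name = tagName(closing ? inside.substring(1) : inside);
                    if (ALLOWED.contains(name)) {
                        // Re-emit the bare tag; every attribute is discarded.
                        out.append(closing ? "</" : "<").append(name).append('>');
                    }
                    i = end + 1;
                }
            } else {
                // Plain text: escape stray angle brackets, copy everything else.
                if (c == '<') out.append("&lt;");
                else if (c == '>') out.append("&gt;");
                else out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    // A '<' starts a tag only when followed by a letter or '/'.
    private static boolean looksLikeTag(String s, int i) {
        if (i + 1 >= s.length()) return false;
        char next = s.charAt(i + 1);
        return next == '/' || Character.isLetter(next);
    }

    // Leading run of letters/digits, lower-cased; "w:View" yields "w".
    private static String tagName(String s) {
        int n = 0;
        while (n < s.length() && Character.isLetterOrDigit(s.charAt(n))) n++;
        return s.substring(0, n).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String pasted = "<p><!--[if gte mso 9]><xml> <w:View>Normal</w:View> </xml><![endif]-->"
                + "Hello <b onclick=\"x()\">world</b></p>";
        System.out.println(clean(pasted)); // prints <p>Hello <b>world</b></p>
    }
}
```

Note that the sketch removes disallowed tags but keeps the text between them, so a `<style>` block outside a comment would leave its CSS behind as plain text. Handling such cases correctly is exactly what a dedicated sanitizer library is for.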
The latest version of CKEditor supports auto-detection when you paste from Word, which means users wouldn't have to use the button, even though the button is there. It detects pasting from Word and offers to clean it up or convert it to plain text.