HTML whitelist in C#
I spent about 30 minutes on SO looking for a definitive solution to this problem.
This question seems to have been asked a lot of times but...
- Most solutions use regular expressions.
- There are a lot of posts saying that regular expressions should not be used to process HTML.
- There are lots of answers simply giving a link to the HTMLAgilityPack (on Codeplex) but no real examples of how to use this pack to meet the stated requirements.
So I am looking for the best solution to meet the following requirements.
- I want to provide an allowed list of HTML tags.
- Any tags not in the allowed list should be removed along with their attributes and contents.
- Any tags in the allowed list should be preserved with attributes and contents.
- The solution should cope with different localisations - users may write in languages and character sets other than English.
- [Added] The solution should handle text such as a forum post as opposed to a full HTML page - so tags such as b, u, i, etc. would be allowed, but script, div, etc. are not allowed and should be removed.
I am looking for a C# solution, and if it is best to use a RegEx then I am happy to do so. If there is an existing library that can do this I am also happy to use it. I would appreciate some example code where possible.
I am looking for a definitive and tried and tested method of solving this problem as opposed to extensive debate + closed posts etc :) :)
Thanks in advance.
Comments (1)
You can use the Html Agility Pack for parsing the HTML. Then you can work with the elements the way you like and write it back to HTML again.
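As a rough illustration of that approach, here is a minimal sketch, not a tried-and-tested sanitizer. It assumes the HtmlAgilityPack NuGet package is referenced; the class name HtmlWhitelist and the method name Sanitize are made up for this example. It parses a fragment, removes every element whose tag name is not in the allowed set (together with its attributes and contents), and writes the rest back out. Attributes on allowed tags are kept as the question asks, which means attributes such as onclick would also survive, so attribute filtering would have to be added separately if the input is untrusted.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper for illustration; not part of Html Agility Pack itself.
public static class HtmlWhitelist
{
    public static string Sanitize(string html, ISet<string> allowedTags)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // accepts fragments such as a forum post, not only full pages

        // Take a snapshot with ToList() first, because removing nodes
        // while enumerating Descendants() would modify the tree mid-iteration.
        var disallowed = doc.DocumentNode
            .Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Element
                        && !allowedTags.Contains(n.Name))
            .ToList();

        foreach (var node in disallowed)
        {
            // Removes the element together with its attributes and child nodes.
            node.ParentNode?.RemoveChild(node);
        }

        // Text nodes and allowed elements (with their attributes) are left as they were.
        return doc.DocumentNode.InnerHtml;
    }
}

public static class Program
{
    public static void Main()
    {
        // Case-insensitive set, so "B" and "b" are treated the same.
        var allowed = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "b", "u", "i" };

        string post = "<b>bold</b> text <script>alert(1)</script><div>gone <i>too</i></div>";
        Console.WriteLine(HtmlWhitelist.Sanitize(post, allowed));
    }
}
```

Because .NET strings are Unicode, non-English text in the post passes through the parser as ordinary text nodes. Note that an allowed tag nested inside a disallowed one (the i inside the div above) is removed along with its parent, which matches the stated requirement that disallowed tags are dropped together with their contents.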