How do I sanitize input on an HTML page using C#?

Posted 2024-07-08 03:15:02

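Is there a library or an accepted method for sanitizing the input to an HTML page?

In this case I have a form with just a name, a phone number, and an email address.

The code must be C#.

For example:

"<script src='bobs.js'>John Doe</script>" should become "John Doe"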

薄荷港 2024-07-15 03:15:02

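We are using the HtmlSanitizer .NET library, which:

is open source (MIT) and hosted on GitHub
is fully customizable, e.g. you can configure which elements should be removed (see the wiki)
is actively maintained
doesn't have the problems of the Microsoft Anti-XSS library
is unit tested with the OWASP XSS Filter Evasion Cheat Sheet
is built specifically for this purpose (unlike HTML Agility Pack, which is a parser, not a sanitizer)
doesn't use regular expressions (HTML isn't a regular language!)

It is also available on NuGet.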
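As a rough illustration of how the library is typically called, here is a minimal sketch. It assumes the NuGet package HtmlSanitizer is referenced and that the `Ganss.Xss` namespace applies to the installed version (older releases used `Ganss.XSS`). The class name `SanitizerDemo`, the helper `Clean`, and the sample markup are made up for this example.

```csharp
using System;
using Ganss.Xss; // assumption: recent package versions; older ones use Ganss.XSS

public static class SanitizerDemo
{
    // Hypothetical helper: runs untrusted markup through the library's whitelist.
    public static string Clean(string untrustedHtml)
    {
        var sanitizer = new HtmlSanitizer();

        // Optional customization: tighten the default whitelist by
        // disallowing <img> in addition to whatever is blocked by default.
        sanitizer.AllowedTags.Remove("img");

        return sanitizer.Sanitize(untrustedHtml ?? string.Empty);
    }

    public static void Main()
    {
        string input = "<p onclick=\"alert(1)\">Hello <b>John</b></p><script>alert('xss')</script>";

        // Elements and attributes outside the whitelist are stripped from the result.
        Console.WriteLine(Clean(input));
    }
}
```

For a form that only collects a name, phone number, and email address, no markup is expected at all, so plain HTML encoding on output may be the simpler choice; a sanitizer earns its keep when users are allowed to submit some HTML.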

菩提树下叶撕阳。 2024-07-15 03:15:02

Based on the comment you made to this answer, you might find some useful info in this question:
https://stackoverflow.com/questions/72394/what-should-a-developer-know-before-building-a-public-web-site

Here's a parameterized query example. Instead of this:

string sql = "UPDATE UserRecord SET FirstName='" + txtFirstName.Text + "' WHERE UserID=" + UserID;

Do this:

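// Requires: using System.Data; using System.Data.SqlClient;
// The values travel as parameters, so they are never spliced into the SQL text.
SqlCommand cmd = new SqlCommand("UPDATE UserRecord SET FirstName = @FirstName WHERE UserID = @UserID");
cmd.Parameters.Add("@FirstName", SqlDbType.VarChar, 50).Value = txtFirstName.Text;
cmd.Parameters.Add("@UserID", SqlDbType.Int).Value = UserID; // the enum member is SqlDbType.Int; SqlDbType.Integer does not exist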

Edit: Since there was no injection, I removed the portion of the answer dealing with that. I left the basic parameterized query example, since that may still be useful to anyone else reading the question.
--Joel

慕烟庭风 2024-07-15 03:15:02

It sounds like you have users that submit content but you cannot fully trust them, and yet you still want to render the content they provide as super safe HTML. Here are three techniques: HTML encode everything, HTML encode and/or remove just the evil parts, or use a DSL that compiles to HTML you are comfortable with.

1. HTML encode everything. Should the input become "John Doe"? I would HTML encode that string and let the user, "John Doe" (if indeed that is his real name...), have the silly-looking name <script src='bobs.js'>John Doe</script>. He shouldn't have wrapped his name in script tags, or any tags, in the first place. This is the approach I use in all cases unless there is a really good business case for one of the other techniques.
2. Accept HTML from the user and then sanitize it (on output) using a whitelist approach, like the sanitization method @Bryant mentioned. Getting this right is (extremely) hard, and I defer that to greater minds. Note that some sanitizers HTML encode the evil parts, whereas others remove the offending bits completely.
3. Use a DSL that "compiles" to HTML. Make sure to whitehat (attack-test) your DSL compiler, because some (like MarkdownSharp) will let arbitrary HTML such as <script> tags and evil attributes through unencoded (which, by the way, is perfectly reasonable but may not be what you need or expect). If that is the case, you will need to use technique #2 and sanitize what your compiler outputs.
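The first technique, HTML encoding everything, can be sketched with the framework's own encoder. This is a minimal sketch using `System.Net.WebUtility.HtmlEncode`; the class name `NameEncodingDemo` and the helper `EncodeForHtml` are hypothetical names chosen for the example.

```csharp
using System;
using System.Net;

public static class NameEncodingDemo
{
    // Hypothetical helper: turns untrusted text into HTML-safe text so the
    // browser displays markup characters literally instead of interpreting them.
    public static string EncodeForHtml(string untrusted)
    {
        // WebUtility.HtmlEncode returns null for null input, so normalize first.
        return WebUtility.HtmlEncode(untrusted ?? string.Empty);
    }

    public static void Main()
    {
        string submitted = "<script src='bobs.js'>John Doe</script>";

        // The angle brackets come back as &lt; and &gt;, so no script element
        // is created when this value is written into a page.
        Console.WriteLine(EncodeForHtml(submitted));
    }
}
```

The user ends up with an odd-looking name on the page, but nothing they typed is ever executed.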

Closing thoughts:

If there is not a strong business case for technique #2 or #3, then reduce risk and save yourself effort and worry: go with technique #1.
Don't assume you're safe because you used a DSL. For example, the original implementation of Markdown allows HTML through, unencoded: "For any markup that is not covered by Markdown’s syntax, you simply use HTML itself. There’s no need to preface it or delimit it to indicate that you’re switching from Markdown to HTML; you just use the tags."
Encode when you output. You can also encode input, but doing so can put you in a bind. If you encoded incorrectly and saved that, how will you get the original input back so that you can re-encode after fixing the faulty encoder?
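The last point, storing the raw value and encoding only at render time, might look like the sketch below. The `Contact` class stands in for a database row, and `RenderNameCell` is a hypothetical rendering helper; both are assumptions made for illustration. The second half of `Main` shows the bind you can end up in when a value was already encoded before it was saved.

```csharp
using System;
using System.Net;

public static class EncodeOnOutputDemo
{
    // Hypothetical stand-in for a stored record: keeps exactly what the user typed.
    public sealed class Contact
    {
        public string RawName { get; }
        public Contact(string rawName) { RawName = rawName; }
    }

    // Encoding happens here, at output time, for the HTML context only.
    public static string RenderNameCell(Contact contact)
    {
        return "<td>" + WebUtility.HtmlEncode(contact.RawName) + "</td>";
    }

    public static void Main()
    {
        var contact = new Contact("Tom & Jerry <b>");

        // Raw value stored, encoded once on the way out.
        Console.WriteLine(RenderNameCell(contact));

        // Contrast: a value encoded at input time and encoded again at output
        // time is double-encoded, and the original text is no longer stored anywhere.
        string storedEncoded = WebUtility.HtmlEncode("Tom & Jerry <b>");
        Console.WriteLine(WebUtility.HtmlEncode(storedEncoded));
    }
}
```

Because the stored value is untouched, a fixed or replaced encoder can simply be applied to the same data again.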