jQuery: sanitizing comments and linkifying URLs

Posted on 2024-10-16 15:10:10

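In terms of jQuery (or JavaScript), what happens behind the scenes when someone posts a comment on Facebook, Twitter, or a blog?

For example, do they sanitize the text first and then pattern-match URLs into actual links? Apart from the checks done on the back end, is there anything else the client side should check?

I have found a few regular expressions for turning URLs into links, but I am not sure whether there are better solutions.

I am trying to wrap my head around this problem, but I am having a hard time knowing where to start. Any guidance you can provide is greatly appreciated!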


恰似旧人归 2024-10-23 15:10:10

This is a matter of opinion (in my opinion) so I'll CW this answer. Here's my opinion as a bona-fide citizen of the Internet:

1. There are two broad kinds of "sanitization": one is semantic sanitization, where input is checked to make sure it's what it's supposed to be (phone number, postal code, currency amount, whatever). The other is defensive sanitization, which is (again, in my opinion) a generally misguided, user-hostile activity.
2. Really, input is never really scary until it touches something: the database server, an HTML renderer, a JavaScript interpreter, and so on. The list is long.
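
To make the "semantic" kind concrete, here is a minimal sketch. The field formats (a US-style ZIP code, an amount with at most two decimals) and the function names are assumptions picked for illustration, not rules from any particular site:

```javascript
// Semantic checks: does the value look like what the field is meant to hold?
// Both formats below are assumptions chosen for illustration only.

// Hypothetical rule: a US-style ZIP code, 5 digits with an optional +4 part.
function isPostalCode(value) {
  return /^\d{5}(-\d{4})?$/.test(value);
}

// Hypothetical rule: a non-negative amount with at most two decimal places.
function isCurrencyAmount(value) {
  return /^\d+(\.\d{1,2})?$/.test(value);
}

console.log(isPostalCode("90210"));        // true
console.log(isPostalCode("<b>90210</b>")); // false
console.log(isCurrencyAmount("19.99"));    // true
console.log(isCurrencyAmount("19.999"));   // false
```

Note that the markup-laden value is rejected because it is not a postal code, not because "<" is considered dangerous.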

As to point 1, I think that defensive sanitization is misguided because it ignores point 2 above: without knowing what environment you're defending from malicious input, you can't really sanitize it without greatly restricting the input alphabet, and even then the process may be fighting against itself. It's user-hostile because it needlessly restricts what legitimate users can do with the data they want to keep in their account. Who is to say that I shouldn't be allowed to include, in my "comments" or "nickname" or "notes" fields, characters that look like XML, or SQL, or any other language's special characters? If there's no semantic reason to filter inputs, why do that to your users?

Point 2 is really the crux of this. User input can be dangerous because server-side code (or client-side code, for that matter) can hand it over directly to unsuspecting interpretation environments where meta-characters important to each distinct environment can cause unexpected behavior. If you hand untouched user input directly to SQL by pasting it directly into a query template, then special SQL meta-characters like quotes can be used by a malicious user to control the database in ways you definitely don't want. However, that alone is no reason to prevent me from telling you that my name is "O'Henry".
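
A small sketch of that difference, using plain strings so it runs without a database. The `users` table, the `name` column, and the `?` placeholder style are assumptions for illustration; real drivers differ in their placeholder syntax:

```javascript
// Naive approach: paste the input straight into the query template.
function pasteIntoQuery(name) {
  return "SELECT * FROM users WHERE name = '" + name + "'";
}

// Parameterized shape: the SQL text is fixed and the values travel
// separately, so quotes in the input are never parsed as SQL.
// (Hypothetical shape; check your driver's documentation for the real one.)
function parameterizedQuery(name) {
  return { text: "SELECT * FROM users WHERE name = ?", values: [name] };
}

const honest = "O'Henry";
const hostile = "x' OR '1'='1";

console.log(pasteIntoQuery(honest));
// SELECT * FROM users WHERE name = 'O'Henry'      (broken: stray quote)
console.log(pasteIntoQuery(hostile));
// SELECT * FROM users WHERE name = 'x' OR '1'='1' (matches every row)
console.log(parameterizedQuery(honest).values[0]);
// O'Henry                                         (stored exactly as typed)
```

The honest user and the attacker both break the pasted query; only the second approach lets "O'Henry" through unchanged while keeping the query text out of the user's hands.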

The key issue with point 2 is that there are many different interpretation environments, and each of them is completely distinct as far as the threat posed by user input. Let's list a few:

SQL - quote marks in user input are a big potential problem; specific DB servers may have other exploitable syntax conventions
HTML - when user input is dropped straight into HTML, the browser's HTML parser will happily obey whatever embedded markup tells it to do, including running scripts, loading tracker images, and whatever else. The key meta-characters are "<", ">", and "&" (the last not so much because of attacks, but because of the mess it causes). It's probably also good to worry about quotes here, because user input may need to go inside HTML element attributes.
JavaScript - if a page template needs to put some user input directly into some running JavaScript code, the things to worry about are probably quotes (if the input is to be treated as a JavaScript string). If the user input needs to go into a regular expression, then a lot more scrubbing is necessary.
Logfiles - yes, logfiles. How do you look at logfiles? I do it on a simple command-line window on my Linux box. Such command-line "console" applications generally obey ancient "escape sequences" that date back to old ASCII terminals, for controlling cursor position and various other things. Well, embedded escape sequences in cleverly crafted user input can be used for crazy attacks that leverage those escape sequences; the general idea is to have some user input get dropped into some log file (maybe as part of a page error log) and trick an administrator into scrolling through the logfile in an xterm window. Wild, huh?
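
The last two environments in the list can be sketched as follows. Both helper names are hypothetical, and the log helper only covers ASCII control characters, which is an assumption about what the terminal in question reacts to:

```javascript
// Escape characters that have special meaning in a JavaScript regular
// expression, so user input is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Replace ASCII control characters (including ESC, 0x1b, and newlines)
// with a visible \xNN form so a terminal never interprets them.
function neutralizeForLog(text) {
  return text.replace(/[\x00-\x1f\x7f]/g, (ch) =>
    "\\x" + ch.charCodeAt(0).toString(16).padStart(2, "0")
  );
}

const search = "1+1=2?";
const pattern = new RegExp(escapeRegExp(search));
console.log(pattern.test("is 1+1=2? yes")); // true

const hostile = "login failed for \x1b[2Jadmin";
console.log(neutralizeForLog(hostile));
// login failed for \x1b[2Jadmin   (printed as literal text, screen not cleared)
```

Neither helper would be of any use for SQL or HTML, which is the point: each environment needs its own treatment.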

The key point here is that the exact techniques necessary to protect those environments from malformed or malicious input differ significantly from one to the next. Protecting your SQL server from malicious quotes is a completely different problem from guarding those quotes in HTML or JavaScript (and note that both of those are totally different from each other too!).

The bottom line: my opinion, therefore, is that the proper focus of attention when worrying about potentially malformed or malicious input is the process of writing user data, not reading it. As each fragment of user-supplied data is used by your software in cooperation with each interpreting environment, a "quoting" or "escaping" operation has to be done, and it has to be an operation specific to the target environment. How exactly that's arranged may vary all over the place. Traditionally in SQL, for example, one uses prepared statements, though there are times when the deficiencies of prepared statements make that approach difficult. When spitting out HTML, most server-side frameworks have all sorts of built-in hooks for HTML or XML escaping with entity notation (like &amp;amp; for "&"). Nowadays, the simplest way to protect things for JavaScript is to leverage a JSON serializer, though of course there are other ways to go.
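
As a rough illustration of escaping at write time, here is the same comment prepared for two different targets. The function names are hypothetical, and the extra `\u003c` step is an assumption about the value being placed inside an inline script block:

```javascript
// Escape the HTML meta-characters, plus quotes so the result is also
// safe inside a quoted attribute value.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Serialize a value as a JavaScript literal for an inline <script> block.
// JSON.stringify handles quotes and backslashes; "<" is additionally
// written as \u003c so the text cannot close the script element early.
function toScriptLiteral(value) {
  return JSON.stringify(value).replace(/</g, "\\u003c");
}

const comment = `<script>alert("hi")</script> & O'Henry`;

console.log(escapeHtml(comment));
// &lt;script&gt;alert(&quot;hi&quot;)&lt;/script&gt; &amp; O&#39;Henry
console.log(toScriptLiteral(comment));
// "\u003cscript>alert(\"hi\")\u003c/script> & O'Henry"
```

The stored comment is never altered; each output path applies the escaping its own environment needs, and parsing the script literal gives back the original string.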
