What should be done when a non-escapable character is escaped?
When designing a (mini-)language in which certain characters must be escaped to lose their special meaning (like quotes in some programming languages), what should be done, especially from a security perspective, when a character that is not escapable (e.g. a normal character that never has special meaning) is escaped? Should an error be raised, should the character be discarded, or should the output be the same as if the character had not been escaped?
Example:
In a simple language where strings are delimited by double quotes (") and any quotes inside a string are escaped with a backslash (\), given the input "We \said, \"We want Moshiach Now\"", what should be done with the letter s in said, which is escaped?
Comments (4)
I prefer the lexer to whine when this occurs. A lexer/parser should be tight about syntax; one can always loosen it up later. If you are sloppy, you'll find you can't retract a decision you didn't think you made.
Assume that you initially decide to treat "backslash not-an-escape" as that pair of characters, and that "T" is not an escape today. Sometime later you decide to extend the language, want "\T" to mean something special, and change your language. You'll find an angry mob of programmers storming your design castle, because for them "\T" means "\" "T" (or "T", depending on your default decision), and you just broke their code. You hang your head in shame, retract the decision, and then realize... oops, there are no more available escape characters!
This lesson goes for any piece of syntax that isn't well defined in your language. If it isn't explicitly legal, it should be implicitly illegal, and your compiler should check it. Otherwise you'll never be able to extend your successful language.
If your language isn't going to be successful, you may not care as much.
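As a rough illustration of the "tight" approach, here is a minimal sketch of a strict unescaper for the toy string language from the question. The names ESCAPES, LexError and unescape_strict are made up for this example, and the escape table (only \" and \\) is an assumption about what the language would define.

```python
# Hypothetical escape table for the toy language: only \" and \\ are legal.
ESCAPES = {'"': '"', '\\': '\\'}


class LexError(ValueError):
    """Raised when the string body violates the (assumed) escape rules."""


def unescape_strict(body: str) -> str:
    """Decode the text between the delimiting quotes, rejecting unknown escapes."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == '"':
            # A bare quote inside the body would have ended the string.
            raise LexError(f"unescaped quote at offset {i}")
        if ch != '\\':
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(body):
            raise LexError("dangling backslash at end of string")
        nxt = body[i + 1]
        if nxt not in ESCAPES:
            # Not explicitly legal, so treat it as illegal.
            raise LexError(f"unknown escape \\{nxt} at offset {i}")
        out.append(ESCAPES[nxt])
        i += 2
    return "".join(out)
```

With this sketch, the body We said, \"We want Moshiach Now\" decodes normally, while We \said raises LexError, which leaves \s free to be given a meaning in a later version of the language.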
Well, one way to solve the problem is for the backslash to just mean backslash when it precedes a non-escapable character. That's what Python does:
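The example that originally followed is not present here, so what follows is only a small sketch of the behaviour being described. The helper eval_string_literal is a made-up name; it evaluates string-literal source text with ast.literal_eval and silences warnings, because recent Python versions warn about unrecognized escapes like \s (a DeprecationWarning since 3.6, a SyntaxWarning since 3.12) and the language reference says they are eventually meant to become errors.

```python
import ast
import warnings


def eval_string_literal(source: str) -> str:
    """Evaluate Python string-literal source text, ignoring escape warnings."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return ast.literal_eval(source)


# The question's input, written as raw text so that the backslashes
# reach the literal evaluator unchanged.
value = eval_string_literal(r'"We \said, \"We want Moshiach Now\""')
print(value)  # We \said, "We want Moshiach Now"

# \" was recognized and collapsed to a quote; \s was not recognized,
# so both the backslash and the s were kept.
print(len(eval_string_literal(r'"\s"')))  # 2
```

The warning that newer versions attach to this behaviour suggests the permissive default is not free of cost, which is the concern raised in the previous answer.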
Obviously, most systems take the escape character to mean "take the next character verbatim", so escaping a "non-escapable" character is usually harmless. The problem comes later, when you get to comparisons and the like, where the literal text does not represent the actual value (that's where you see a lot of security issues, especially with things like URLs).
So one option is to accept only a limited set of escaped characters. In that sense you have an "escape sequence" rather than an escaped character (the \x is the entire sequence, rather than a \ followed by an x). That is about the safest mechanism, and it's not really burdensome to write.
The other option is to ensure that you canonicalize everything you compare, through some ruleset. This typically means properly removing all of the escape sequences up front, before comparison, and comparing only the final values rather than the literals.
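To make the comparison problem concrete, here is a small sketch using URL percent-escapes, where %61 is an escaped "a" that never needed escaping. BLOCKED_PATH and the two check functions are hypothetical names for this example; urllib.parse.unquote is the standard-library decoder.

```python
from urllib.parse import unquote

BLOCKED_PATH = "/admin"  # hypothetical path that a filter wants to reject


def naive_is_blocked(raw_path: str) -> bool:
    """Compare the literal text, escapes and all."""
    return raw_path == BLOCKED_PATH


def canonical_is_blocked(raw_path: str) -> bool:
    """Decode escapes first, then compare the resulting value."""
    return unquote(raw_path) == BLOCKED_PATH


escaped = "/%61dmin"  # same value as "/admin", different literal text

print(naive_is_blocked(escaped))      # False: the literal comparison is fooled
print(canonical_is_blocked(escaped))  # True: the decoded value matches
```

This sketch decodes exactly once; whatever consumes the path afterwards has to apply the same decoding rules, otherwise the check and the consumer can still disagree about the value.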
Most systems interpret the backslash as Will Hartung says, except before alphanumerics, which are variously used as aliases for control codes, character classes, word boundaries, the start of hex sequences, case region markers, hex or octal digits, etc.
\s in particular often means whitespace in Perl 5 style regexes. JavaScript, which interprets it as 's' in one context and as whitespace in another, suffers from subtle bugs because of this choice. Consider /foo\sbar/ vs new RegExp('foo\sbar').