RegEx to parse or validate Base64 data
Is it possible to use a RegEx to validate, or sanitize Base64 data? That's the simple question, but the factors that drive this question are what make it difficult.
I have a Base64 decoder that cannot fully rely on the input data to follow the RFC specs. So the issues I face include Base64 data that may not be broken up into 78-character lines (I think it's 78; I'd have to double check the RFC, so don't ding me if the exact number is wrong), or lines that may not end in CRLF: they may end in only a CR, or only an LF, or maybe neither.
So, I've had a hell of a time parsing Base64 data formatted as such. Due to this, examples like the following become impossible to decode reliably. I will only display partial MIME headers for brevity.
Content-Transfer-Encoding: base64
VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu
OK, so parsing that is no problem, and is exactly the result we would expect. And in 99% of the cases, using any code to at least verify that each char in the buffer is a valid base64 char works perfectly. But the next example throws a wrench into the mix.
Content-Transfer-Encoding: base64
http://www.stackoverflow.com
VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu
This is a version of Base64 encoding that I have seen in some viruses and other things that attempt to take advantage of some mail readers' desire to parse MIME at all costs, versus ones that go strictly by the book, or rather by the RFC, if you will.
My Base64 decoder decodes the second example to the following data stream. And keep in mind here, the original stream is all ASCII data!
[0x]86DB69FFFC30C2CB5A724A2F7AB7E5A307289951A1A5CC81A5CC81CDA5B5C1B19481054D0D
2524810985CD94D8D08199BDC8814DD1858DAD3DD995C999B1BDDC8195E1B585C1B194B8
Does anyone have a good way to solve both problems at once? I'm not sure it's even possible, outside of doing two transforms on the data with different rules applied, and comparing the results. However, if you took that approach, which output do you trust? It seems that ASCII heuristics is about the best solution, but how much more code, execution time, and complexity would that add to something as complicated as a virus scanner, which this code is actually involved in? How would you train the heuristics engine to learn what is acceptable Base64, and what isn't?
UPDATE:
Due to the number of views this question continues to receive, I've decided to post the simple regex that I've been using in a C# application for 3 years now, with hundreds of thousands of transactions. Honestly, I like the answer given by Gumbo the best, which is why I picked it as the selected answer. But to anyone using C# and looking for a very quick way to at least detect whether a string or byte[] contains valid Base64 data or not, the following regex patterns work very well for me.
^[-A-Za-z0-9+/=]|=[^=]|={3,}$
Or a more simplified pattern as suggested by kael:
^[-A-Za-z0-9+/]*={0,3}$
And yes, this is just for a STRING of Base64 data, NOT a properly formatted RFC1341 message. So, if you are dealing with data of this type, please take that into account before attempting to use the above regex. If you are dealing with Base16, Base32, Radix or even Base64 for other purposes (URLs, file names, XML encoding, etc.), then it is highly recommended that you read RFC4648, which Gumbo mentioned in their answer, as you need to be well aware of the charset and terminators used by the implementation before attempting to use the suggestions in this question/answer set.
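For readers outside C#, here is a minimal Python sketch of the same quick check, using the simplified pattern quoted above. The function name `looks_like_base64` and the bytes variant of the pattern are assumptions added for illustration; this only screens characters and trailing `=` signs and does not prove the data will decode.

```python
import re

# The simplified pattern quoted above, compiled once for text and once for bytes.
LOOSE_TEXT = re.compile(r"^[-A-Za-z0-9+/]*={0,3}$")
LOOSE_BYTES = re.compile(rb"^[-A-Za-z0-9+/]*={0,3}$")


def looks_like_base64(data) -> bool:
    """Cheap screen: only Base64-ish characters, then at most three '=' signs."""
    is_binary = isinstance(data, (bytes, bytearray))
    pattern = LOOSE_BYTES if is_binary else LOOSE_TEXT
    return pattern.match(data) is not None


print(looks_like_base64("VGhpcyBpcyBzaW1wbGUu"))          # plain Base64 text
print(looks_like_base64(b"VGhpcw=="))                      # same check on bytes
print(looks_like_base64("http://www.stackoverflow.com"))   # ':' and '.' are rejected
```

Because the pattern does not look at length or padding position, it is best treated as a pre-filter in front of a real decoder.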
Neither a ":" nor a "." will show up in valid Base64, so I think you can unambiguously throw away the http://www.stackoverflow.com line. In Perl, say, something that skips every line containing a non-Base64 character before decoding might be what you want. It produces
This is simple ASCII Base64 for StackOverflow example.
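The same idea can be sketched in Python rather than Perl. This is only an illustration under stated assumptions: the helper name `decode_lenient` and the per-line pattern are made up here, and a junk line that happens to consist solely of Base64 alphabet characters would still slip through.

```python
import base64
import re

# One line of Base64 text: alphabet characters, then at most two '=' pads.
BASE64_LINE = re.compile(r"[A-Za-z0-9+/]+={0,2}")


def decode_lenient(body: str) -> bytes:
    """Drop lines containing non-Base64 characters, then decode what is left."""
    kept = [
        line.strip()
        for line in body.splitlines()  # handles CRLF, lone CR, and lone LF
        if BASE64_LINE.fullmatch(line.strip())
    ]
    return base64.b64decode("".join(kept))


sample = (
    "http://www.stackoverflow.com\r\n"
    "VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu\n"
)
print(decode_lenient(sample).decode("ascii"))
```

Splitting with `splitlines()` also sidesteps the line-ending problem from the question, since it does not care which terminator the sender used.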
The best regexp I have found so far is the one used, in its current version, by this package:
https://www.npmjs.com/package/base64-regex
The shortest regex to check RFC-4648 compliance while enforcing canonical encoding (i.e. all pad bits set to 0):
Actually, this is a mix of this and that answer.
My simplified version of Base64 regex:
^[A-Za-z0-9+/]*={0,2}$
The simplification is that it doesn't check that the length is a multiple of 4. If you need that, use one of the other answers. Mine focuses on simplicity.
To test it: https://regex101.com/r/zdtGSH/1
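To make the trade-off concrete, here is a small Python sketch comparing the simplified pattern with a strict decode. The helper names are hypothetical; `base64.b64decode(..., validate=True)` is used only as a reference for what a strict decoder accepts.

```python
import base64
import binascii
import re

# The simplified pattern from this answer.
SIMPLE = re.compile(r"^[A-Za-z0-9+/]*={0,2}$")


def passes_simple_check(text: str) -> bool:
    """True when the text has only Base64 characters and at most two pads."""
    return SIMPLE.match(text) is not None


def decodes_strictly(text: str) -> bool:
    """True when Python's strict decoder accepts the text."""
    try:
        base64.b64decode(text, validate=True)
        return True
    except binascii.Error:
        return False


for candidate in ("VGhpcw==", "VGhpcw"):
    print(candidate, passes_simple_check(candidate), decodes_strictly(candidate))
```

The second candidate has lost its padding, so the pattern accepts it while the decoder does not. That gap is exactly the length check this answer leaves out.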
I found a solution that works very well
It will match the following strings
while it won't match any of these invalid ones
From RFC 4648:
So whether the data should be considered dangerous depends on what the encoded data is going to be used for.
But if you’re just looking for a regular expression to match Base64 encoded words, you can use the following:
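As a hedged illustration, one common way to write such an expression is to require whole groups of four characters, optionally followed by a final group padded with one or two `=` signs. The pattern and the name `is_base64_word` below are assumptions for the sketch, not necessarily the exact expression this answer had in mind.

```python
import re

B64 = "[A-Za-z0-9+/]"  # standard Base64 alphabet from RFC 4648

# Zero or more full 4-character groups, then optionally one padded final group.
STRICT_BASE64 = re.compile(
    rf"(?:{B64}{{4}})*(?:{B64}{{2}}==|{B64}{{3}}=)?"
)


def is_base64_word(text: str) -> bool:
    """True when the whole string is well-formed, padded standard Base64."""
    return STRICT_BASE64.fullmatch(text) is not None


for candidate in ("VGhpcw==", "VGhpcw=", "VGhp", "V==="):
    print(candidate, is_base64_word(candidate))
```

Using `fullmatch` avoids a Python quirk where `$` also matches just before a trailing newline. Note that this form accepts the empty string.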
This one is good, but it will match an empty string
This one does not match the empty string:
The answers presented so far fail to check that the Base64 string has all pad bits set to 0, as required for it to be the canonical representation of Base64 (which is important in some environments, see https://www.rfc-editor.org/rfc/rfc4648#section-3.5) and therefore, they allow aliases that are different encodings for the same binary string. This could be a security problem in some applications.
Here is the regexp that verifies that the given string is not just valid base64, but also the canonical base64 string for the binary data:
The cited RFC considers the empty string valid (see https://www.rfc-editor.org/rfc/rfc4648#section-10), so the above regex does too.
The equivalent regular expression for base64url (again, refer to the above RFC) is:
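The pad-bit rule can be derived mechanically, which the following Python sketch does instead of hard-coding character lists. It is an illustration under assumptions: the function name `canonical_base64_regex` is hypothetical, and the resulting pattern is one way to express the rule, not necessarily character-for-character what this answer used. With one `=`, the last data character carries 2 unused bits, so its alphabet index must be a multiple of 4; with two `=`, it carries 4 unused bits, so the index must be a multiple of 16.

```python
import re
import string

STD_ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
URL_ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "-_"


def canonical_base64_regex(alphabet: str = STD_ALPHABET) -> re.Pattern:
    """Build a pattern that accepts only canonical (zero pad bits) encodings."""
    any_char = "[" + re.escape(alphabet) + "]"
    # Index multiple of 4: the low 2 bits of the last character are zero.
    low2_zero = "[" + re.escape(alphabet[::4]) + "]"
    # Index multiple of 16: the low 4 bits of the last character are zero.
    low4_zero = "[" + re.escape(alphabet[::16]) + "]"
    return re.compile(
        f"(?:{any_char}{{4}})*"
        f"(?:{any_char}{{2}}{low2_zero}=|{any_char}{low4_zero}==)?"
    )


CANONICAL = canonical_base64_regex()
print(bool(CANONICAL.fullmatch("VGhpcw==")))  # canonical form
print(bool(CANONICAL.fullmatch("VGhpcx==")))  # same shape, but pad bits are not zero
```

Passing `URL_ALPHABET` gives the base64url counterpart; the zero-bit character sets come out the same because the two alphabets differ only at indices 62 and 63.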
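Here's an alternative regular expression:
It satisfies the following conditions:
(?=^(.{4})*$) : the string length must be a multiple of four
[A-Za-z0-9+/]* : the content must be letters, digits, "+" or "/"
={0,2} : it can end with up to two padding ("=") characters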
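Putting the three fragments together gives something like the pattern in the Python sketch below. The exact assembly and the helper name `is_valid` are assumptions for illustration. The lookahead checks the overall length first, and the rest of the pattern then checks the characters.

```python
import re

# Lookahead for "length is a multiple of 4", then alphabet, then up to two pads.
ALTERNATIVE = re.compile(r"^(?=(.{4})*$)[A-Za-z0-9+/]*={0,2}$")


def is_valid(text: str) -> bool:
    """True when the text satisfies all three conditions listed above."""
    return ALTERNATIVE.match(text) is not None


for candidate in ("VGhpcw==", "VGhpcw=", "VG==", "V==="):
    print(candidate, is_valid(candidate))
```

Like several other patterns in this thread, this one also accepts the empty string, since zero is a multiple of four.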
To validate a base64 image we can use this regex