RegEx to parse or validate Base64 data

Posted on 2024-07-12 13:33:25

Is it possible to use a RegEx to validate, or sanitize Base64 data? That's the simple question, but the factors that drive this question are what make it difficult.

I have a Base64 decoder that cannot fully rely on the input data to follow the RFC specs. So, the issues I face are things like Base64 data that may not be broken up into 78-character lines (I think it's 78; I'd have to double-check the RFC, so don't ding me if the exact number is wrong), or lines that may not end in CRLF: they may end in only a CR, or only an LF, or in neither.

So, I've had a hell of a time parsing Base64 data formatted as such. Due to this, examples like the following become impossible to decode reliably. I will only display partial MIME headers for brevity.

Content-Transfer-Encoding: base64

VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu

OK, so parsing that is no problem, and is exactly the result we would expect. And in 99% of the cases, using any code to at least verify that each char in the buffer is a valid base64 char works perfectly. But the next example throws a wrench into the mix.

Content-Transfer-Encoding: base64

http://www.stackoverflow.com
VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu

This is a version of Base64 encoding that I have seen in some viruses and other things that attempt to take advantage of some mail readers' desire to parse MIME at all costs, as opposed to readers that go strictly by the book, or rather by the RFC.

My Base64 decoder decodes the second example to the following data stream. And keep in mind here, the original stream is all ASCII data!

[0x]86DB69FFFC30C2CB5A724A2F7AB7E5A307289951A1A5CC81A5CC81CDA5B5C1B19481054D0D
2524810985CD94D8D08199BDC8814DD1858DAD3DD995C999B1BDDC8195E1B585C1B194B8

Does anyone have a good way to solve both problems at once? I'm not sure it's even possible, outside of doing two transforms on the data with different rules applied, and comparing the results. However, if you took that approach, which output do you trust? It seems that ASCII heuristics is about the best solution, but how much more code, execution time, and complexity would that add to something as complicated as a virus scanner, which this code is actually involved in? How would you train the heuristics engine to learn what is acceptable Base64, and what isn't?

UPDATE:

Due to the number of views this question continues to receive, I've decided to post the simple regex that I've been using in a C# application for 3 years now, with hundreds of thousands of transactions. Honestly, I like the answer given by Gumbo the best, which is why I picked it as the selected answer. But for anyone using C# and looking for a very quick way to at least detect whether a string or byte[] contains valid Base64 data, the following regex patterns work very well for me.

[^-A-Za-z0-9+/=]|=[^=]|={3,}$

Or a more simplified pattern as suggested by kael:

^[-A-Za-z0-9+/]*={0,3}$

And yes, this is just for a STRING of Base64 data, NOT a properly formatted RFC1341 message. So, if you are dealing with data of this type, please take that into account before attempting to use the above regex. If you are dealing with Base16, Base32, Radix or even Base64 for other purposes (URLs, file names, XML Encoding, etc.), then it is highly recommended that you read RFC4648 that Gumbo mentioned in their answer as you need to be well aware of the charset and terminators used by the implementation before attempting to use the suggestions in this question/answer set.
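To get a feel for how loose the simplified pattern is, here is a small Python sketch. Python and the helper name `looks_like_base64` are my own choices for illustration and are not part of the original C# code. The pattern only checks the alphabet and the trailing run of `=`, so some strings that a decoder would refuse still pass.

```python
import re

# The simplified pattern quoted above, unchanged.
SIMPLE = re.compile(r"^[-A-Za-z0-9+/]*={0,3}$")

def looks_like_base64(text: str) -> bool:
    """Quick screen only: allowed characters plus at most three trailing '='."""
    # fullmatch() so that '$' cannot silently accept a trailing newline.
    return SIMPLE.fullmatch(text) is not None

print(looks_like_base64("VGhpcw=="))    # well-formed input
print(looks_like_base64("abc"))         # length is not a multiple of 4
print(looks_like_base64("a==="))        # padding that no encoder produces
print(looks_like_base64("http://x.y"))  # contains ':' and '.'
```

Only the last call is rejected, which is why this kind of pattern is best treated as a fast pre-filter rather than as validation.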

爱殇璃 2024-07-19 13:33:26

Neither a ":" nor a "." will show up in valid Base64, so I think you can unambiguously throw away the http://www.stackoverflow.com line. In Perl, say, something like

use feature 'say';
use MIME::Base64;

my $sanitized_str = join q{}, grep {!/[^A-Za-z0-9+\/=]/} split /\n/, $str;

say decode_base64($sanitized_str);

might be what you want. It produces

This is simple ASCII Base64 for StackOverflow example.
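The same line-filtering idea can be sketched in Python, this time also coping with bare CR or bare LF line endings as described in the question. The function name `decode_clean_lines` is hypothetical, and this is only one way to do it.

```python
import base64
import re

# Any character outside the standard alphabet (plus '=') disqualifies a line.
_NOT_B64 = re.compile(r"[^A-Za-z0-9+/=]")

def decode_clean_lines(body: str) -> bytes:
    """Drop lines containing non-Base64 characters, then decode what is left."""
    # splitlines() treats CR, LF and CRLF alike.
    kept = [ln for ln in body.splitlines() if ln and not _NOT_B64.search(ln)]
    return base64.b64decode("".join(kept), validate=True)

body = (
    "http://www.stackoverflow.com\r"
    "VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu\n"
)
print(decode_clean_lines(body).decode("ascii"))
```

One caveat: a junk line made up purely of alphabet characters (for example a line reading just "http") would survive this filter, so the approach helps with the example in the question but is not a general defence.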

被翻牌 2024-07-19 13:33:26

The best regexp I have found so far is here:
https://www.npmjs.com/package/base64-regex

In the current version it looks like this:

module.exports = function (opts) {
  opts = opts || {};
  var regex = '(?:[A-Za-z0-9+\/]{4}\\n?)*(?:[A-Za-z0-9+\/]{2}==|[A-Za-z0-9+\/]{3}=)';

  return opts.exact ? new RegExp('(?:^' + regex + '$)') :
                    new RegExp('(?:^|\\s)' + regex, 'g');
};
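One property of this pattern is easy to miss: the padded final group is not optional, so in `exact` mode an unpadded string is rejected even when its length is a multiple of four. The Python sketch below ports only the `exact` branch, assuming Python's `re` treats this simple syntax the same way as JavaScript's `RegExp`. The name `matches_exact` is made up for the example.

```python
import re

# Same pattern as in the package source above, written as a Python raw string.
CORE = r"(?:[A-Za-z0-9+/]{4}\n?)*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)"
EXACT = re.compile(CORE)

def matches_exact(text: str) -> bool:
    """Rough equivalent of opts.exact: the whole string must match."""
    return EXACT.fullmatch(text) is not None

print(matches_exact("YW55IGNhcm5hbCBwbGVhcw=="))  # padded, accepted
print(matches_exact("YW55\nIGNhcw=="))            # LF between groups is tolerated
print(matches_exact("YW55"))                      # no padding, rejected
```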

许久 2024-07-19 13:33:26

The shortest regex to check RFC 4648 compliance while enforcing canonical encoding (i.e. all pad bits set to 0):

^(?=(.{4})*$)[A-Za-z0-9+/]*([AQgw]==|[AEIMQUYcgkosw048]=)?$

This is actually a mix of two other answers in this thread.

靑春怀旧 2024-07-19 13:33:26

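A simplified version of my Base64 regex:

^[A-Za-z0-9+/]*={0,2}$

The simplification is that it does not check that the length is a multiple of 4. If you need that, use one of the other answers. My focus is simplicity.

To test it: https://regex101.com/r/zdtGSH/1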

刘备忘录 2024-07-19 13:33:26

I found a solution that works very well

^(?:([a-z0-9A-Z+\/]){4})*(?1)(?:(?1)==|(?1){2}=|(?1){3})$

It will match the following strings

VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu
YW55IGNhcm5hbCBwbGVhcw==
YW55IGNhcm5hbCBwbGVhc3U=
YW55IGNhcm5hbCBwbGVhc3Vy

while it won't match any of these invalid ones

YW5@IGNhcm5hbCBwbGVhcw==
YW55IGNhc=5hbCBwbGVhcw==
YW55%%%%IGNhcm5hbCBwbGVhc3V
YW55IGNhcm5hbCBwbGVhc3
YW55IGNhcm5hbCBwbGVhc
YW***55IGNhcm5hbCBwbGVh=
YW55IGNhcm5hbCBwbGVhc==
YW55IGNhcm5hbCBwbGVhc===
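The `(?1)` construct is a subroutine call as found in PCRE and Perl, and Python's built-in `re` module does not support it. As a sketch for engines without that feature, the pattern can be expanded by hand. The names `X`, `PATTERN` and `matches` are mine, and the sample lists are the ones from this answer.

```python
import re

X = "[A-Za-z0-9+/]"
# (?1) expanded by hand: any number of full 4-character groups, then one more
# character followed by one of the three possible endings of the last group.
PATTERN = re.compile(rf"(?:{X}{{4}})*{X}(?:{X}==|{X}{{2}}=|{X}{{3}})")

def matches(text: str) -> bool:
    return PATTERN.fullmatch(text) is not None

VALID = [
    "VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu",
    "YW55IGNhcm5hbCBwbGVhcw==",
    "YW55IGNhcm5hbCBwbGVhc3U=",
    "YW55IGNhcm5hbCBwbGVhc3Vy",
]
INVALID = [
    "YW5@IGNhcm5hbCBwbGVhcw==",
    "YW55IGNhc=5hbCBwbGVhcw==",
    "YW55%%%%IGNhcm5hbCBwbGVhc3V",
    "YW55IGNhcm5hbCBwbGVhc3",
    "YW55IGNhcm5hbCBwbGVhc",
    "YW***55IGNhcm5hbCBwbGVh=",
    "YW55IGNhcm5hbCBwbGVhc==",
    "YW55IGNhcm5hbCBwbGVhc===",
]

print(all(matches(s) for s in VALID), any(matches(s) for s in INVALID))
```

Because at least one final group is required, this pattern also rejects the empty string.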

榆西 2024-07-19 13:33:25

From the RFC 4648:

Base encoding of data is used in many situations to store or transfer data in environments that, perhaps for legacy reasons, are restricted to US-ASCII data.

So whether the data should be considered dangerous depends on what the encoded data is going to be used for.

But if you’re just looking for a regular expression to match Base64 encoded words, you can use the following:

^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$
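Since the question is about bodies whose lines may end in CR, LF, CRLF or nothing at all, one possible way to apply this pattern is to remove all line breaks and whitespace first and validate what remains. The Python sketch below does that. The function name `is_base64_body` and the choice to discard spaces and tabs as well are assumptions of mine.

```python
import re

# The pattern from this answer, unchanged.
B64 = re.compile(r"^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$")

def is_base64_body(text: str) -> bool:
    """Ignore how the body was folded into lines, then validate the rest."""
    unfolded = re.sub(r"[\r\n \t]+", "", text)
    return B64.fullmatch(unfolded) is not None

print(is_base64_body("VGhp\r\ncw=="))  # CRLF line ending
print(is_base64_body("VGhp\rcw=="))    # bare CR
print(is_base64_body("VGhpcw=="))      # no line break at all
print(is_base64_body("http://www.stackoverflow.com\nVGhpcw=="))  # stray URL line
```

The first three calls succeed and the last one fails, because ':' and '.' remain in the unfolded text.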

空城缀染半城烟沙 2024-07-19 13:33:25

^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$

This one is good, but it will match an empty string.

This one does not match the empty string:

^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{4})$
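A quick side-by-side in Python shows the difference between the two patterns. This is only a sketch, and the constant names are made up for the example.

```python
import re

ALLOWS_EMPTY = re.compile(
    r"^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$"
)
NON_EMPTY = re.compile(
    r"^(?:[A-Za-z0-9+/]{4})*"
    r"(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{4})$"
)

# Print each sample with the verdict of the first and the second pattern.
for sample in ["", "YW55", "YW55IGNhcw==", "YW5"]:
    first = ALLOWS_EMPTY.fullmatch(sample) is not None
    second = NON_EMPTY.fullmatch(sample) is not None
    print(repr(sample), first, second)
```

The two patterns disagree only on the empty string.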

长不大的小祸害 2024-07-19 13:33:25

The answers presented so far fail to check that the Base64 string has all pad bits set to 0, as required for it to be the canonical representation of Base64 (which is important in some environments, see https://www.rfc-editor.org/rfc/rfc4648#section-3.5) and therefore, they allow aliases that are different encodings for the same binary string. This could be a security problem in some applications.

Here is the regexp that verifies that the given string is not just valid base64, but also the canonical base64 string for the binary data:

^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/][AQgw]==|[A-Za-z0-9+/]{2}[AEIMQUYcgkosw048]=)?$

The cited RFC considers the empty string valid (see https://www.rfc-editor.org/rfc/rfc4648#section-10), so the above regex accepts it as well.

The equivalent regular expression for base64url (again, refer to the above RFC) is:

^(?:[A-Za-z0-9_-]{4})*(?:[A-Za-z0-9_-][AQgw]==|[A-Za-z0-9_-]{2}[AEIMQUYcgkosw048]=)?$
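To make the aliasing concrete: the single byte "a" encodes canonically as "YQ==", while "YR==" carries the same eight data bits with a non-zero pad bit. The Python sketch below checks the standard-alphabet pattern from this answer against a decode-then-re-encode round trip, which is another way to test for canonical form. The names `is_canonical` and `round_trips` are hypothetical.

```python
import base64
import re

# The standard-alphabet pattern from this answer, split across two lines.
CANONICAL = re.compile(
    r"^(?:[A-Za-z0-9+/]{4})*"
    r"(?:[A-Za-z0-9+/][AQgw]==|[A-Za-z0-9+/]{2}[AEIMQUYcgkosw048]=)?$"
)

def is_canonical(text: str) -> bool:
    return CANONICAL.fullmatch(text) is not None

def round_trips(text: str) -> bool:
    """Alternative check: decode, re-encode, and compare with the input."""
    try:
        raw = base64.b64decode(text, validate=True)
    except ValueError:  # binascii.Error is a subclass of ValueError
        return False
    return base64.b64encode(raw).decode("ascii") == text

for sample in ["YQ==", "YR==", "YWI=", "YWJ=", "YWJj"]:
    print(sample, is_canonical(sample), round_trips(sample))
```

On these samples the regex and the round trip agree: "YR==" and "YWJ=" are flagged as non-canonical, the others pass.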

聽兲甴掵 2024-07-19 13:33:25

Here's an alternative regular expression:

^(?=(.{4})*$)[A-Za-z0-9+/]*={0,2}$

It satisfies the following conditions:

The string length must be a multiple of four - (?=(.{4})*$)
The content must be alphanumeric characters or + or / - [A-Za-z0-9+/]*
It can have up to two padding (=) characters on the end - ={0,2}
It accepts empty strings
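The conditions above describe the same set of strings as the grouped pattern `^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$` given elsewhere in this thread. As a sanity check rather than a proof, the Python sketch below compares the two patterns on every string of up to eight characters drawn from a tiny test alphabet. The alphabet "A=@" and the name `disagreements` are arbitrary choices of mine.

```python
import itertools
import re

LOOKAHEAD = re.compile(r"^(?=(.{4})*$)[A-Za-z0-9+/]*={0,2}$")
GROUPED = re.compile(
    r"^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$"
)

def disagreements(alphabet: str = "A=@", max_len: int = 8) -> list:
    """Return every test string on which the two patterns give different verdicts."""
    found = []
    for length in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=length):
            text = "".join(chars)
            a = LOOKAHEAD.fullmatch(text) is not None
            b = GROUPED.fullmatch(text) is not None
            if a != b:
                found.append(text)
    return found

print(disagreements())
```

An empty list means the two patterns agreed on all of the generated strings.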

酒几许 2024-07-19 13:33:25

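To validate a Base64-encoded image data URL, we can use this regex:

/^data:image\/(?:gif|png|jpeg|bmp|webp|svg\+xml)(?:;charset=utf-8)?;base64,(?:[A-Za-z0-9]|[+/])+={0,2}$/

  private validBase64Image(base64Image: string): boolean {
    // Anchored at both ends so that trailing junk after the padding is rejected.
    const regex = /^data:image\/(?:gif|png|jpeg|bmp|webp|svg\+xml)(?:;charset=utf-8)?;base64,(?:[A-Za-z0-9]|[+/])+={0,2}$/;
    // !! keeps the result a strict boolean when the input is an empty string.
    return !!base64Image && regex.test(base64Image);
  }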
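A regex of this kind only checks the shape of the URL. If the payload is going to be used, it may be worth decoding it as well, since the decoder also catches wrong lengths. The Python sketch below mirrors the media types listed above. The function name `decode_image_data_url` and the decision to return `None` on failure are assumptions of mine.

```python
import base64
import binascii
import re

DATA_URL = re.compile(
    r"data:image/(?:gif|png|jpeg|bmp|webp|svg\+xml)(?:;charset=utf-8)?;base64,"
    r"(?P<payload>[A-Za-z0-9+/]+={0,2})"
)

def decode_image_data_url(url: str):
    """Return the decoded bytes, or None if the URL is not acceptable."""
    match = DATA_URL.fullmatch(url)
    if match is None:
        return None
    try:
        # validate=True makes the decoder reject stray characters; a payload
        # of the wrong length fails with an "incorrect padding" error.
        return base64.b64decode(match.group("payload"), validate=True)
    except binascii.Error:
        return None

# "iVBORw0KGgo=" is the Base64 form of the 8-byte PNG file signature.
print(decode_image_data_url("data:image/png;base64,iVBORw0KGgo="))
print(decode_image_data_url("data:image/png;base64,iVBORw0KGgo"))
print(decode_image_data_url("data:text/plain;base64,aGk="))
```

The second call passes the regex but fails in the decoder because the padding is missing, and the third is rejected because the media type is not an image.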
