Decode Base64 until there is no more Base64
My problem is, I think, very simple. I need to decode Base64 until there is no Base64 left. I check with a regex whether the input is Base64, but I have no idea how to keep decoding until no Base64 remains.
In this short code I can decode the Base64 until none is left, but only because my text is defined in advance (the loop runs until the decoded result equals "Hello World").
# Import libraries
from base64 import b64decode
import re

# Expected plain text and the (repeatedly encoded) Base64 string
strText = "Hello World"
strEncode = "VmxSQ2ExWXlUWGxUYTJoUVVqSlNXRlJYY0hOT1ZteHlXa1pLVVZWWE9EbERaejA5Q2c9PQo=".encode("utf-8")

# Check that the input looks like Base64, then decode the first layer
objRgx = re.search('^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$', strEncode.decode("utf-8"))
strDecode = b64decode(objRgx.group(0).encode("utf-8"))
print(strDecode.decode("utf-8"))

# Keep decoding until the result equals the known text
while strDecode != strText.encode("utf-8"):
    strDecode = b64decode(strDecode)
    print(strDecode.decode("utf-8"))
Does anyone have an idea how I can decode the Base64 until only the real text is left (no more Base64)?
P.S. Sorry for my bad English.
Comments (5)
You can't, not in an arbitrary sense. The problem is simply that normal, everyday words can ALSO be valid BASE64, so there is no real way to tell the two apart.
BASE64 has no terminator other than its length. It CAN end with = or ==, but it does not HAVE to: the = characters are just padding, and when no padding is needed there is no =. So it's possible for the BASE64 to end and some text to begin without you being able to detect it.
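To make the two points above concrete, here is a small Python sketch. The helper name `decodes_cleanly` is made up for this illustration; it only wraps the standard `base64.b64decode` with `validate=True`. It shows that some ordinary words are accepted as Base64, and that padding depends only on the length of the input.

```python
import base64
import binascii


def decodes_cleanly(text: str) -> bool:
    """Return True if `text` is accepted as strict Base64 (hypothetical helper)."""
    try:
        base64.b64decode(text, validate=True)
        return True
    except (binascii.Error, ValueError):
        return False


# Ordinary words that use only Base64 characters and whose length is a
# multiple of 4 are accepted; text with a space is rejected.
for word in ["test", "password", "Hello World"]:
    print(repr(word), decodes_cleanly(word))

# Padding depends only on the input length modulo 3.
for raw in [b"abc", b"ab", b"a"]:
    print(raw, "->", base64.b64encode(raw).decode("ascii"))
```

So a regex or a trial decode can only say "this could be Base64", not "this was encoded".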
Edit for "So there is really no way to do what i want?":
No, not deterministically, not reliably. Even with a heuristic, there will be cases where it fails and you end up consuming too many characters, resulting in garbage at the end of your binary block and a loss of characters in the following text stream.
Now this is for an arbitrary BASE64 block. If you KNOW what the binary data is, then perhaps there's hope.
For example, most binary formats "know" when they are "done". I don't know of a valid binary format that says "read until you reach EOF". They're typically laced with internal descriptors saying "this is how much data the next chunk has" or with terminators saying "I'm done".
In these cases you can treat the BASE64 as a stream. BASE64 is basically pretty simple: it takes 3 bytes and converts them into 4 characters.
So, a B64 stream reader needs to simply read 4 chars and return the 3 bytes they represent.
If you have, say, a PNG reader, it can start reading the converted stream. And when it is "done", it "closes" the stream, and your original text is "at the end of the BASE64".
It can also work if you know the size of the original attachment. If someone sent "10,000 bytes", then you use your BASE64 stream decoder and simply read "10,000" bytes from it.
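As a rough sketch of the known-size case, the function below reads 4 characters at a time and stops once the requested number of bytes has been decoded. The name `read_bytes` and the sample payload are invented for this example, and it assumes the encoder padded the final group with = where needed.

```python
import base64


def read_bytes(stream: str, n_bytes: int) -> tuple[bytes, str]:
    """Decode exactly n_bytes from the front of `stream`.

    Returns (decoded bytes, remaining text). Assumes a padded encoding.
    """
    out = bytearray()
    pos = 0
    while len(out) < n_bytes:
        group = stream[pos:pos + 4]
        if len(group) < 4:
            raise ValueError("stream ended before n_bytes were decoded")
        # Each 4-character group yields up to 3 bytes (fewer if padded).
        out += base64.b64decode(group, validate=True)
        pos += 4
    return bytes(out[:n_bytes]), stream[pos:]


payload = b"10 bytes!!"
stream = base64.b64encode(payload).decode("ascii") + " and then plain text"
data, rest = read_bytes(stream, len(payload))
print(data, repr(rest))
```

A format-aware reader (a PNG parser, say) would play the same role as the explicit byte count here: it decides when to stop pulling groups from the stream.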
More often than not, you will have BASE64 with a = or == terminator. It's the cases where you don't that are the problem. The stream decoder works either way.
If you don't know the original size of the attachment, or the format of the encoded binary, then you're pretty much out of luck.
As a heuristic, you could compute the average word length in the result. Natural language will have short words like "As a heuristic, you could look at word length." A string that is still Base64 encoded will have few if any spaces and long strings between the spaces.
As another heuristic, you could calculate the proportion of vowels (a, e, i, o, u) to consonants, or count the capital letters in the middle of words.
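These heuristics could be sketched in Python roughly as follows. All function names and the thresholds in `looks_like_text` are assumptions picked for illustration, not tuned values, and the vowel measure is computed as a fraction of all letters.

```python
def average_word_length(text: str) -> float:
    """Mean length of whitespace-separated words (0.0 for empty input)."""
    words = text.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0


def vowel_fraction(text: str) -> float:
    """Fraction of alphabetic characters that are a, e, i, o or u."""
    letters = [c for c in text.lower() if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c in "aeiou" for c in letters) / len(letters)


def mid_word_capitals(text: str) -> int:
    """Count capital letters that are not the first character of a word."""
    return sum(1 for w in text.split() for c in w[1:] if c.isupper())


def looks_like_text(text: str) -> bool:
    """Guess whether `text` is natural language (thresholds are assumptions)."""
    return average_word_length(text) <= 12 and mid_word_capitals(text) <= 1


for sample in ["Hello World", "SGVsbG8gV29ybGQ="]:
    print(repr(sample), average_word_length(sample),
          vowel_fraction(sample), mid_word_capitals(sample),
          looks_like_text(sample))
```

Short inputs, acronyms, or camelCase identifiers can fool a check like this, so it is best treated as a hint rather than a decision.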
So you're dealing with a block of data that may have been repeatedly base64-encoded? Why not just loop the string through b64decode() until it errors, then?
Also, I think you probably don't need to sprinkle quite so many .encode("utf-8") calls around.
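A minimal sketch of that loop might look like this. The function name `decode_until_error` and the `max_rounds` safety cap are additions for the example. It uses the default, lenient `b64decode`, which silently discards characters outside the Base64 alphabet.

```python
import base64
import binascii


def decode_until_error(data: bytes, max_rounds: int = 50) -> tuple[bytes, int]:
    """Strip Base64 layers until b64decode() raises; return (result, rounds)."""
    rounds = 0
    while rounds < max_rounds:
        try:
            decoded = base64.b64decode(data)
        except binascii.Error:
            break  # not decodable any more: treat `data` as the final result
        if not decoded:
            break  # empty output: stop rather than loop on nothing
        data = decoded
        rounds += 1
    return data, rounds


# Wrap a sample text in three layers of Base64, then peel them off again.
wrapped = b"Hello World"
for _ in range(3):
    wrapped = base64.b64encode(wrapped)
print(decode_until_error(wrapped))
```

The caveat from the other answers still applies: if the real text happens to be decodable itself (for example b"test"), this loop goes one layer too far.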
I see two valuable answers here, one referring to average word length (Mark Lutton) and one to the byte size of the original data (Will Hartung). Another useful thing: look for expected dictionary words, meaningful numbers, and/or dates.
You can use a regular expression to check whether a string is a valid Base64 encoding. The character set for Base64 encoding is [A-Z, a-z, 0-9, and + /]. If the final group is shorter than 4 characters, it is padded with = characters.
If the content contains a character that cannot appear in Base64, consider that it is no longer Base64.
This solution is not 100% reliable either: if the real data happens to contain only characters from the Base64 alphabet, the check never fails and the loop keeps decoding past the real text.
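One way to sketch this regex-guarded loop in Python is shown below. The pattern is the one from the question. The function name `peel_base64`, the round limit, and the extra rule "stop if the decoded bytes are not valid UTF-8" are assumptions added here to reduce over-decoding; they are not part of the answer above.

```python
import base64
import re

# Same pattern as in the question, used with fullmatch instead of ^...$.
BASE64_RE = re.compile(
    r"(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?"
)


def peel_base64(text: str, max_rounds: int = 20) -> str:
    """Decode while `text` still has the shape of Base64 (sketch only)."""
    for _ in range(max_rounds):
        candidate = text.strip()
        if not candidate or not BASE64_RE.fullmatch(candidate):
            break  # contains a non-Base64 character: assume real text
        try:
            text = base64.b64decode(candidate, validate=True).decode("utf-8")
        except UnicodeDecodeError:
            break  # decoded bytes are not text: keep the previous layer
    return text


wrapped = "Hello World"
for _ in range(3):
    wrapped = base64.b64encode(wrapped.encode("utf-8")).decode("ascii")
print(peel_base64(wrapped))
```

The UTF-8 check only helps when the final payload is known to be text; for binary payloads it would have to be replaced by a size or format check.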