Drastic difference in base64 output length: PowerShell vs GNU coreutils
I'm trying to figure out why there's a huge difference in output size when base64-encoding a file in PowerShell vs GNU coreutils. Depending on the encoding chosen (UTF8 vs Unicode), the PowerShell output ranges from about 240MB to 318MB. With coreutils base64 (under Cygwin, in this case), the output is about 80MB. The original file is about 58MB. So, two questions:
- Why is there such a drastic difference?
- How can I get PowerShell to give the smaller output that the GNU tool gives?
Here are the specific commands I used:
PowerShell (the smaller of the two outputs):
$input = "C:\Users\my.user\myfile.pdf"
$filecontent = get-content $input
$converted = [System.Text.Encoding]::UTF8.GetBytes($filecontent)
$encodedtext = [System.Convert]::ToBase64String($converted)
$encodedtext | Out-File "C:\Users\my.user\myfile.pdf.via_ps.base64"
The larger PowerShell output came from simply replacing "UTF8" with "Unicode". It will be obvious that I'm pretty new to PowerShell; I'm sure someone only slightly better with it could combine that into a couple of simple lines.
Coreutils (via Cygwin) base64:
base64.exe -w0 myfile.pdf > myfile.pdf.via_cygwin.base64
Because you're doing something wildly different in PowerShell.
How do you get the smaller output? By doing what base64 does :)
Let's have a look at what base64 ... > ... actually does:
- base64: opens the input file, reads its raw bytes, and converts every 3 bytes of input into a 4-byte base64-encoded output fragment
- The > redirection: writes the resulting byte stream to disk
Since the 4-byte output fragments only contain byte values that correspond to 64 printable ASCII characters, the command never actually does any "string manipulation". The values it operates on just happen to also be printable as ASCII strings, and the resulting file is therefore indistinguishable from a "text file".
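As a rough check of that 3-to-4 expansion, here is a minimal sketch. The 58MB figure is the approximate size quoted in the question, and the variable names are only illustrative.

```powershell
# Three input bytes always become four base64 characters.
$fragment = [Convert]::ToBase64String([Text.Encoding]::ASCII.GetBytes('Man'))
$fragment                      # TWFu

# Expected size of an unwrapped base64 file for a source of roughly 58MB
# (the approximate figure from the question; 58MB here means 58 * 1024 * 1024).
$sourceBytes   = 58MB
$expectedBytes = 4 * [Math]::Ceiling($sourceBytes / 3)
$expectedBytes                 # 81089880, i.e. about 77 MiB
```

That lands close to the roughly 80MB the coreutils tool produced, which suggests its output is just the plain 4/3 expansion with nothing else added.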
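Your PowerShell script, on the other hand, does lots of string manipulation:
- Get-Content $input: opens the input file, reads its raw bytes, and decodes them into text strings using a default text encoding
- [Encoding]::UTF8.GetBytes(): re-encodes the resulting string using UTF8
- [Convert]::ToBase64String(): converts every 3 bytes of input into a 4-character base64-encoded output fragment
- Out-File: encodes the input string as little-endian UTF-16 (the default in Windows PowerShell) and writes it to disk
The three additional string encoding steps (the decoding in Get-Content, the re-encoding in GetBytes(), and the UTF-16 encoding in Out-File) result in a much-inflated byte stream, which is why you're seeing the output size double or triple.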
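To see the inflation on a small scale, the sketch below pushes the same 3000 bytes through both paths. It assumes ISO-8859-1 (code page 28591) as a stand-in for whatever single-byte encoding Get-Content picks on a given machine, and it computes the UTF-16 size in memory instead of calling Out-File (which would also add a byte order mark and a trailing newline). The exact ratio for a real PDF will differ.

```powershell
# 3000 sample bytes cycling through every value 0..255 (a stand-in for binary data).
$raw = [byte[]](0..2999 | ForEach-Object { $_ % 256 })

# Path 1: what base64 does. Raw bytes go straight into the encoder.
$direct = [Convert]::ToBase64String($raw)

# Path 2: what the script in the question does.
$latin1    = [Text.Encoding]::GetEncoding(28591)          # assumed decoding step
$text      = $latin1.GetString($raw)                      # bytes -> string
$utf8      = [Text.Encoding]::UTF8.GetBytes($text)        # bytes >= 0x80 now take 2 bytes each
$viaString = [Convert]::ToBase64String($utf8)
$onDisk    = [Text.Encoding]::Unicode.GetByteCount($viaString)   # UTF-16: 2 bytes per character

$direct.Length      # 4000 bytes when written as ASCII
$viaString.Length   # 5952 characters
$onDisk             # 11904 bytes, roughly 3x the direct result
```

The roughly threefold gap matches the 240MB vs 80MB reported in the question.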
How to base64-encode files then?
The trick here is to read the raw bytes from disk and pass those directly to [Convert]::ToBase64String().
It is technically possible to just read the entire file into an array at once, but I'd strongly recommend against doing so for files larger than a few kilobytes.
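For completeness, a minimal sketch of that whole-file approach. The function name ConvertTo-Base64FileInMemory is hypothetical, and the sketch assumes both arguments are full paths, since .NET file APIs resolve relative paths against the process working directory, not the current PowerShell location.

```powershell
# Hypothetical helper: holds the whole file in memory, so only suitable for small files.
function ConvertTo-Base64FileInMemory {
    param(
        [Parameter(Mandatory = $true)][string]$SourceFilePath,
        [Parameter(Mandatory = $true)][string]$TargetFilePath
    )

    # Read the raw bytes; no text decoding takes place.
    $bytes = [System.IO.File]::ReadAllBytes($SourceFilePath)

    # Encode the bytes directly.
    $b64String = [System.Convert]::ToBase64String($bytes)

    # Base64 text is pure ASCII, so write one byte per character.
    [System.IO.File]::WriteAllText($TargetFilePath, $b64String, [System.Text.Encoding]::ASCII)
}
```

Both the byte array and the encoded string live in memory at the same time here, which is why this does not scale to a 58MB input.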
Instead, for file transformation in general you'll want to use streams. In this particular case, you'll want to use a CryptoStream with a ToBase64Transform to re-encode a file stream as base64.
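The streaming approach can be sketched as follows. This is one possible shape for such a function, not necessarily the code originally posted: the name New-Base64File and its parameter names are assumptions, while CryptoStream, ToBase64Transform and Stream.CopyTo are standard .NET types and methods.

```powershell
function New-Base64File {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory = $true, Position = 0)][string]$SourceFilePath,
        [Parameter(Mandatory = $true, Position = 1)][string]$TargetFilePath
    )

    # Resolve to full paths, because .NET file APIs use the process working
    # directory and not the current PowerShell location.
    $source = (Resolve-Path -LiteralPath $SourceFilePath).ProviderPath
    $target = $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($TargetFilePath)

    $transform = New-Object -TypeName System.Security.Cryptography.ToBase64Transform
    $reader = $null
    $writer = $null
    $crypto = $null
    try {
        $reader = [System.IO.File]::OpenRead($source)
        $writer = [System.IO.File]::Create($target)

        # Everything written to $crypto is base64-encoded and passed on to $writer.
        $mode   = [System.Security.Cryptography.CryptoStreamMode]::Write
        $crypto = New-Object -TypeName System.Security.Cryptography.CryptoStream -ArgumentList $writer, $transform, $mode

        # Copy the raw bytes in chunks; the file is never held in memory as a whole.
        $reader.CopyTo($crypto)

        # Emit the final (possibly padded) 4-byte fragment.
        $crypto.FlushFinalBlock()
    }
    finally {
        if ($crypto) { $crypto.Dispose() }
        if ($writer) { $writer.Dispose() }
        if ($reader) { $reader.Dispose() }
        $transform.Dispose()
    }
}

# Usage, with the path from the question:
# $inputPath = "C:\Users\my.user\myfile.pdf"
# New-Base64File $inputPath "$inputPath.base64"
```

Like base64 -w0, this writes a single unwrapped line of ASCII bytes with no trailing newline.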
With a function built this way, you can expect an output the same size as with base64.