Drastic PowerShell vs GNU coreutils base64 output length difference

Posted 2025-01-17 19:01:18

I'm trying to figure out why there's a huge difference in output size when base64-encoding a file in PowerShell vs GNU coreutils. Depending on options (UTF8 vs Unicode), the PowerShell output ranges from about 240MB to 318MB. Using coreutils base64 (in Cygwin, in this case), the output is about 80MB. The original file size is about 58MB. So, two questions:

  1. Why is there such a drastic difference?
  2. How can I get PowerShell to give the smaller output that the GNU tool gives?

Here are the specific commands I used:

PowerShell smaller output:

$input = "C:\Users\my.user\myfile.pdf"
$filecontent = get-content $input
$converted = [System.Text.Encoding]::UTF8.GetBytes($filecontent)
$encodedtext = [System.Convert]::ToBase64String($converted)
$encodedtext | Out-File "C:\Users\my.user\myfile.pdf.via_ps.base64"

The larger PowerShell output came from simply replacing "UTF8" with "Unicode". It will be obvious that I'm pretty new to PowerShell; I'm sure someone only slightly better with it could combine that into a couple of simple lines.

Coreutils (via Cygwin) base64:

base64.exe -w0 myfile.pdf > myfile.pdf.via_cygwin.base64

寄风 2025-01-24 19:01:18

Why is there such a drastic difference?

Because you're doing something wildly different in PowerShell

How can I get Powershell to give the smaller output that the GNU tool gives?

By doing what base64 does :)


Let's have a look at what base64 ... > ... actually does:

  • base64:
    • Opens file handle to input file
    • Reads raw byte stream from disk
    • Converts every 3-byte group to a 4-byte base64-encoded output string-fragment
  • >:
    • Writes raw byte stream to disk

Since the 4-byte output fragments only contain byte values that correspond to 64 printable ASCII characters, the command never actually does any "string manipulation". The values on which it operates just happen to also be printable as ASCII strings, and the resulting file is therefore indistinguishable from a "text file".
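To make the 3-to-4 mapping concrete, here is a minimal sketch. The helper name Get-Base64Length is hypothetical (it is not a built-in cmdlet), and treating "about 58MB" as 58,000,000 bytes is an assumption made only for the arithmetic.

```powershell
# Hypothetical helper: number of Base64 characters produced for $ByteCount input bytes.
function Get-Base64Length {
    param([long]$ByteCount)
    # Every started group of 3 input bytes becomes exactly 4 output characters
    return 4 * [long][math]::Ceiling($ByteCount / 3.0)
}

# Three input bytes ("Man" in ASCII) become four output characters
[System.Convert]::ToBase64String([byte[]](77, 97, 110))   # TWFu

# A final group of one or two bytes is padded with '=' to four characters
[System.Convert]::ToBase64String([byte[]](77))            # TQ==
[System.Convert]::ToBase64String([byte[]](77, 97))        # TWE=

# Assumed size: 58,000,000 bytes stands in for the "about 58MB" file
Get-Base64Length 58000000                                 # 77333336
```

Under that assumption the expected output is roughly 77 MB, which is in line with the "about 80MB" that coreutils produced.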

Your PowerShell script on the other hand does lots of string manipulation:

  • Get-Content $input:
    • Opens file handle to input file
    • Reads raw byte stream from disk
    • Decodes the byte stream according to some chosen encoding scheme (in Windows PowerShell, most likely your system's ANSI code page; in PowerShell 7, UTF-8)
  • [Encoding]::UTF8.GetBytes():
    • Re-encodes the resulting string using UTF8
  • [Convert]::ToBase64String():
    • Converts every 3-byte group to a 4-byte base64-encoded output string-fragment
  • Out-File:
    • Encodes input string as little-endian UTF16 (the default in Windows PowerShell)
    • Writes to disk

The three additional string encoding steps (decoding in Get-Content, re-encoding in GetBytes(), and UTF-16 encoding in Out-File) inflate the byte stream considerably, which is why your output is three to four times the size of the coreutils result.
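The inflation can be reproduced on a small scale. The sketch below is only an approximation of the original script: it uses 256 sample bytes in place of a real file, and it assumes ISO-8859-1 as a stand-in for whatever code page Get-Content would actually pick. It also ignores the line splitting that Get-Content performs.

```powershell
# 256 sample bytes covering every possible byte value (stand-in for binary file content)
[byte[]]$raw = 0..255

# What base64 does: encode the raw bytes directly
$direct = [System.Convert]::ToBase64String($raw)

# What the script does: decode the bytes as text first.
# ISO-8859-1 is an assumption here, standing in for the real code page.
$latin1 = [System.Text.Encoding]::GetEncoding('iso-8859-1')
$text   = $latin1.GetString($raw)

# Re-encode the text as UTF-8 or UTF-16 ("Unicode"), then Base64-encode that
$viaUtf8    = [System.Convert]::ToBase64String([System.Text.Encoding]::UTF8.GetBytes($text))
$viaUnicode = [System.Convert]::ToBase64String([System.Text.Encoding]::Unicode.GetBytes($text))

# Bytes on disk: the direct result is plain ASCII, while Out-File in
# Windows PowerShell stores the Base64 string as UTF-16 (2 bytes per character)
$sizes = [ordered]@{
    Direct              = $direct.Length
    Utf8ThenOutFile     = [System.Text.Encoding]::Unicode.GetByteCount($viaUtf8)
    UnicodeThenOutFile  = [System.Text.Encoding]::Unicode.GetByteCount($viaUnicode)
}
$sizes
```

For this sample the three sizes work out to 344, 1024 and 1368 bytes (not counting the byte order mark), so roughly 3x and 4x the direct result. That is the same proportion as the 240MB and 318MB outputs compared with the 80MB from coreutils.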


How to base64-encode files then?

The trick here is to read the raw bytes from disk and pass those directly to [convert]::ToBase64String()

It is technically possible to just read the entire file into an array at once:

$bytes = Get-Content path\to\file.ext -Encoding Byte # Windows PowerShell only
# or
$bytes = [System.IO.File]::ReadAllBytes($(Convert-Path path\to\file.ext))

$b64String = [convert]::ToBase64String($bytes)

Set-Content path\to\output.base64 -Value $b64String -Encoding Ascii

... but I'd strongly recommend against doing so for files larger than a few kilobytes, since the whole file and its encoded string are then held in memory at the same time.

Instead, for file transformation in general you'll want to use streams. In this particular case, you'll want to use a CryptoStream with a ToBase64Transform to re-encode a file stream as base64:

function New-Base64File {
    [CmdletBinding(DefaultParameterSetName = 'ByPath')]
    param(
        [Parameter(Mandatory = $true, ParameterSetName = 'ByPath', Position = 0)]
        [string]$Path,

        [Parameter(Mandatory = $true, ParameterSetName = 'ByPSPath')]
        [Alias('PSPath')]
        [string]$LiteralPath,

        [Parameter(Mandatory = $true, Position = 1)]
        [string]$Destination
    )

    # Create destination file if it doesn't exist
    if (-not(Test-Path -LiteralPath $Destination -PathType Leaf)) {
        $outFile = New-Item -Path $Destination -ItemType File
    }
    else {
        $outFile = Get-Item -LiteralPath $Destination
    }

    [void]$PSBoundParameters.Remove('Destination')

    try {
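        # Open a writable file stream to the output file,
        # truncating any existing content so no stale bytes remain
        $outStream = $outFile.Open([System.IO.FileMode]::Create, [System.IO.FileAccess]::Write)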

        # Wrap output file stream in a CryptoStream.
        #
        # Anything that we write to the crypto stream is automatically 
        # base64-encoded and then written through to the output file stream 
        $transform = [System.Security.Cryptography.ToBase64Transform]::new()
        $cryptoStream = [System.Security.Cryptography.CryptoStream]::new($outStream, $transform, 'Write')

        foreach ($file in Get-Item @PSBoundParameters) {
            try {
                # Open readable input file stream
                $inStream = $file.OpenRead()

                # Copy input bytes to crypto stream
                # - which in turn base64-encodes and writes to output file
                $inStream.CopyTo($cryptoStream)
            }
            finally {
                # Clean up the input file stream
                $inStream | ForEach-Object Dispose
            }
        }
    }
    finally {
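        # Clean up the output streams; the CryptoStream goes first so that
        # it can flush its final Base64 block before the transform is disposed
        $cryptoStream, $transform, $outStream | ForEach-Object Dispose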
    }
}

Now you can do:

$inputPath = "C:\Users\my.user\myfile.pdf"

New-Base64File $inputPath -Destination "C:\Users\my.user\myfile.pdf.via_ps.base64"

And expect an output the same size as with base64
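One way to check that claim is to compare the encoded file's size with the size Base64 predicts for the source. The helper below is a hypothetical sketch (Test-Base64FileSize is not a built-in cmdlet), and it assumes the encoded file contains nothing but Base64 characters: no line wrapping, no trailing newline and no byte order mark.

```powershell
# Hypothetical helper: checks that an encoded file has the size Base64 predicts for its source.
function Test-Base64FileSize {
    param(
        [string]$SourcePath,
        [string]$EncodedPath
    )
    $sourceLength  = (Get-Item -LiteralPath $SourcePath).Length
    $encodedLength = (Get-Item -LiteralPath $EncodedPath).Length

    # 4 output bytes for every started group of 3 input bytes
    $expected = 4 * [long][math]::Ceiling($sourceLength / 3.0)
    return $encodedLength -eq $expected
}
```

Calling Test-Base64FileSize with the input path and the .via_ps.base64 path from above should return True if the file was written as described. It only compares sizes, so it does not prove the content is identical to what base64 -w0 writes.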
