Encoding byte data as digits

Posted on 2024-09-04 08:18:25

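Is there a common method to encode and decode arbitrary data so that the encoded result consists of digits only, like base64_encode but without the letters?

Fictitious example: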

$encoded = numbers_encode("Mary had a little lamb");

echo $encoded; // outputs e.g. 12238433742239423742322 (fictitious result)

$decoded = numbers_decode("12238433742239423742322");

echo $decoded; // outputs "Mary had a little lamb"


遗失的美好 2024-09-11 08:18:25

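You can think of a (single-byte character) string as a base-256 encoded number, where "\x00" represents 0, ' ' (space, i.e. "\x20") represents 32, and so on up to "\xFF", which represents 255.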

A representation that uses only the digits 0-9 can be obtained simply by converting that number to base 10.

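Note that "base64 encoding" is not actually a base conversion of the whole input. base64 breaks the input into groups of 3 bytes (24 bits) and does the base conversion on each group individually. This works well because a 24-bit number fits exactly in four base-64 digits (2^24 = 64^4).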
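To make the grouping concrete, here is a minimal sketch of what happens to a single 3-byte group. The helper name sixBitGroups and the $alphabet variable are made up for illustration; only base64_encode is a real PHP function, and padding of incomplete groups is left out.

```php
<?php
// Hypothetical helper: split one 3-byte group into four 6-bit values.
function sixBitGroups(string $threeBytes): array {
    if (strlen($threeBytes) !== 3) {
        throw new InvalidArgumentException('expected exactly 3 bytes');
    }
    // Pack the three bytes into one 24-bit integer, most significant byte first.
    $n = (ord($threeBytes[0]) << 16) | (ord($threeBytes[1]) << 8) | ord($threeBytes[2]);
    // Peel off four base-64 digits (6 bits each), most significant first.
    return [($n >> 18) & 63, ($n >> 12) & 63, ($n >> 6) & 63, $n & 63];
}

// Standard base64 alphabet: the digit value is the index into this string.
$alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

$digits  = sixBitGroups('Man');   // [19, 22, 5, 46]
$encoded = '';
foreach ($digits as $d) {
    $encoded .= $alphabet[$d];
}
echo $encoded, "\n";              // TWFu, same as base64_encode('Man')
```

Because 24 bits split evenly into four 6-bit digits, no digit is left partly used, which is exactly the alignment that base 10 cannot offer.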

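This is more or less what el.pescado does: he splits the input into 8-bit pieces and converts each piece to base 10. However, this technique has one disadvantage relative to base64 encoding: the digit groups do not align with the byte boundary. To represent an 8-bit number (0-255 when unsigned) we need three base-10 digits, but the left-most digit carries less information than the others, because it can only be 0, 1 or 2.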

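A base-10 digit stores log(10)/log(2) ≈ 3.32 bits. Since no positive power of 10 is also a power of 2, no chunk size will ever make the decimal representation line up with 8-bit bytes (in the sense of "aligning" described in the previous paragraph). Consequently, the most compact representation is a full base conversion, which you can think of as a "base encoding" with a single big chunk.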
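The effect of the chunk size can be checked with a few lines of arithmetic. This is only a sketch; digitsForChunk is a hypothetical helper that counts the decimal digits needed to hold any value of a k-byte chunk.

```php
<?php
// Decimal digits needed for the largest k-byte value (256^k - 1).
// 256^k is never a power of 10, so this equals floor(8k * log10(2)) + 1.
function digitsForChunk(int $k): int {
    return (int) floor($k * 8 * log10(2)) + 1;
}

foreach ([1, 2, 3, 4, 7] as $k) {
    $d = digitsForChunk($k);
    printf("%d byte(s): %2d digits, %.3f digits per byte\n", $k, $d, $d / $k);
}
// Theoretical limit, approached only as the chunk grows to the whole input.
printf("lower bound: %.3f digits per byte\n", 8 * log10(2));
```

One-byte chunks cost 3 digits per byte, two-byte chunks 2.5, seven-byte chunks about 2.43, and the ratio never reaches the 2.408 bound exactly, which is why a single big chunk (a plain base conversion) is the most compact option.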

Here is an example with bcmath.

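bcscale(0);

// Interprets $string as a little-endian base-256 number (the first byte is the
// least significant) and returns its decimal digits.
// Caveat: trailing "\x00" bytes are most-significant zeros and are lost on a round trip.
function base256ToBase10(string $string): string {
    $result = "0";
    for ($i = strlen($string) - 1; $i >= 0; $i--) {
        // bcmath works on numeric strings, so cast the integers explicitly.
        $term = bcmul((string) ord($string[$i]), bcpow("256", (string) $i));
        $result = bcadd($result, $term);
    }
    return $result;
}

// Converts a decimal digit string back into little-endian bytes.
function base10ToBase256(string $number): string {
    $result = "";
    $n = $number;
    do {
        $remainder = bcmod($n, "256");
        $n = bcdiv($n, "256");
        $result .= chr((int) $remainder);
    } while (bccomp($n, "0") > 0); // compare as arbitrary-precision numbers, not floats

    return $result;
}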

For

$string = "Mary had a little lamb";
$base10 = base256ToBase10($string);
echo $base10,"\n";
$base256 = base10ToBase256($base10);
echo $base256;

we get

36826012939234118013885831603834892771924668323094861
Mary had a little lamb

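Since each digit encodes only log(10)/log(2) ≈ 3.32193 bits, expect the digit string to be roughly 140% longer than the input (8/3.32193 ≈ 2.41 digits per byte), not 200% longer as with el.pescado's answer.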
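As an alternative sketch, the same base conversion can be written with the GMP extension, assuming it is installed. The function names digitsEncode and digitsDecode are made up for this example. A "\x01" sentinel byte is prepended so that leading "\x00" bytes in the data survive the round trip, which a plain conversion cannot guarantee because most-significant zeros disappear.

```php
<?php
// Sketch assuming ext-gmp is loaded; both function names are hypothetical.
function digitsEncode(string $data): string {
    // gmp_import reads the bytes most significant first by default.
    // The sentinel keeps leading NUL bytes from vanishing as leading zeros.
    return gmp_strval(gmp_import("\x01" . $data), 10);
}

function digitsDecode(string $digits): string {
    if (!ctype_digit($digits)) {
        throw new InvalidArgumentException('expected decimal digits only');
    }
    $bytes = gmp_export(gmp_init($digits, 10));
    if ($bytes === '' || $bytes[0] !== "\x01") {
        throw new InvalidArgumentException('sentinel byte missing');
    }
    // Drop the sentinel to recover the original data.
    return (string) substr($bytes, 1);
}

$encoded = digitsEncode("\x00\x00Mary had a little lamb");
echo $encoded, "\n";
var_dump(digitsDecode($encoded) === "\x00\x00Mary had a little lamb"); // bool(true)
```

The sentinel costs at most three extra digits regardless of the input length, so the output stays close to 2.41 digits per byte.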


那请放手 2024-09-11 08:18:25

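Well, that would be "base 8" encoding rather than base 64. It is better known as octal.

All base64 does is split the bit stream into 6-bit chunks (0-63) and assign each chunk a character from a 64-character set. Octal uses 3-bit chunks, 0-7. It could just as well use ABCDEFGH, but uses 0-7 instead. You can't (easily) use 0-9, because a decimal digit carries more than 3 bits but less than 4 (about 3.32), so it never maps to a whole number of bits. That is what makes decimal a poor encoding for binary data.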
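A rough sketch of that octal idea follows; octal_encode and octal_decode are hypothetical names, and the decoder does not validate its input. The data is treated as a bit stream and each 3-bit group becomes one digit 0-7.

```php
<?php
// Hypothetical helpers: emit one octal digit per 3 bits of input.
function octal_encode(string $data): string {
    $bits = '';
    for ($i = 0; $i < strlen($data); $i++) {
        $bits .= sprintf('%08b', ord($data[$i]));
    }
    // Pad with zero bits so the length is a multiple of 3.
    $bits .= str_repeat('0', (3 - strlen($bits) % 3) % 3);
    $out = '';
    for ($i = 0; $i < strlen($bits); $i += 3) {
        $out .= bindec(substr($bits, $i, 3));
    }
    return $out;
}

function octal_decode(string $digits): string {
    $bits = '';
    for ($i = 0; $i < strlen($digits); $i++) {
        $bits .= sprintf('%03b', (int) $digits[$i]);
    }
    // At most 2 padding bits were added, so keeping only whole bytes drops them.
    $out = '';
    for ($i = 0; $i + 8 <= strlen($bits); $i += 8) {
        $out .= chr(bindec(substr($bits, $i, 8)));
    }
    return $out;
}

echo octal_encode('A'), "\n";                  // 202
echo octal_decode(octal_encode('Mary')), "\n"; // Mary
```

This costs 8/3 ≈ 2.67 digits per byte: better than three decimal digits per byte, but a little worse than a full base-10 conversion, since the digits 8 and 9 are never used.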


聊慰 2024-09-11 08:18:25

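Regardless of how you encode, you always end up in a smaller base, so the output gets longer. It may be possible to shrink the resulting integer a little with some dechex() conversions, but ultimately you'll only save a few characters. That being said, the number really balloons the moment you start representing multi-byte characters with 0-9.

I have to wonder whether integer IDs that stand for words or complete strings wouldn't provide a smaller footprint. Not really a direct encoding, but a viable option.

@el.pescado gets credit for the first half, but he did challenge the reader. So I responded (mainly because I wanted to understand what's happening).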

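// Encode each byte as a zero-padded three-digit decimal number.
function pekka_encode($s) {
    $out = '';
    for ($i = 0; $i < strlen($s); $i++) {
        $out .= sprintf("%03d", ord($s[$i]));
    }
    return $out;
}

// Decode by reading the digit string three characters at a time.
function pekka_decode($s) {
    $out = '';
    for ($i = 0; $i < strlen($s); $i += 3) {
        // substr() stays within bounds; the cast makes the string-to-int conversion explicit.
        $out .= chr((int) substr($s, $i, 3));
    }
    return $out;
}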


久伴你 2024-09-11 08:18:25

A very simple example: it represents every input byte as a 3-digit decimal number:

function data2numbers ($data) {
    $out = "";
    for ($i = 0; $i < strlen ($data); $i++) {
        $out .= sprintf ("%03d", ord ($data[$i]));
    }
    return $out;
}

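The downside is that it triples the size of any input data (every input byte is represented as three output bytes).

The decoding function is left as an exercise for the reader ;)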
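One possible take on that exercise, as a sketch only: numbers2data is a made-up name, and the validation rules (length a multiple of three, digits only, each group at most 255) are assumptions about what a careful decoder would want to reject.

```php
<?php
// Hypothetical counterpart to data2numbers(): turns "077097..." back into bytes.
function numbers2data(string $digits): string {
    if ($digits === '') {
        return '';
    }
    if (strlen($digits) % 3 !== 0 || !ctype_digit($digits)) {
        throw new InvalidArgumentException('expected groups of three decimal digits');
    }
    $out = '';
    foreach (str_split($digits, 3) as $group) {
        $value = (int) $group;
        if ($value > 255) {
            throw new RangeException("group $group does not fit in one byte");
        }
        $out .= chr($value);
    }
    return $out;
}

echo numbers2data('077097114121'), "\n"; // Mary
```

The range check matters because three decimal digits can express 256-999, values that no input byte ever produces.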
