How efficient are the encode/decode algorithms of the Base64 class in Java?

Posted 2024-11-15 21:40:00

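I am going to use an algorithm to encode a variable-length but very long String field retrieved from an XML file. The encoded data should then be persisted in the database.

Later, when I receive a second file, I need to fetch the previously stored encoded data from the database, decode it, and validate it against the new data to check for duplicates.

I tried the org.apache.commons.codec.binary.Base64 class, which has two methods: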

encodeBase64(byte[] barray)
decodeBase64(String str)

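They work fine and solve my problem, but they convert a 55-character string into a string of just 6 characters.

So I wonder whether there is any case where these algorithms encode 2 Strings that are very large and differ by only 1 char (for example) into the same encoded byte array.

I don't know much about the Base64 class, so any help would be much appreciated.

If you can suggest any other algorithm that shortens a large String to a fixed length and serves my purpose, I will be happy to use it.

Thanks in advance.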


孤者何惧 2024-11-22 21:40:00

Not very efficient.

Also, relying on sun.misc classes makes an application non-portable, because they are internal JDK classes rather than part of the public Java API.

Check out the following performance comparisons from MiGBase64:

[Image: MiGBase64 performance comparison]

"So I wonder if there is any case where these algorithms encode 2 Strings which are very large and have only 1 char mismatch (for example) into the same encoded byte arrays."

Base64 isn't a hashing algorithm; it's an encoding and must therefore be bi-directional. Collisions can't be allowed, by necessity: otherwise decoding would be ambiguous. Base64 is designed to represent arbitrary binary data as an ASCII string. It emits 4 ASCII characters for every 3 input bytes, so the encoded form is always about a third longer than the bytes it encodes and never shorter. Encoding a Unicode string also requires first choosing a character encoding, and non-ASCII characters take multiple bytes, which lengthens the result further. The Base64 representation of a Unicode string therefore varies depending on the encoding (UTF-8, UTF-16) used. For example:

Base64( UTF8( "test" ) ) => "dGVzdA=="
Base64( UTF16( "test" ) ) => "/v8AdABlAHMAdA=="
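
The two results above can be reproduced with a short sketch. It uses java.util.Base64 from the standard library (Java 8+) instead of the Commons Codec class mentioned in the question, and the class and method names are illustrative only. It also shows that two strings differing in a single character encode differently, and that decoding restores the original.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Illustrative class; the names here are assumptions, not from the original post.
class Base64EncodingDemo {

    // Base64 of the string's UTF-8 bytes.
    static String encodeUtf8(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    // Base64 of the string's UTF-16 bytes (Java writes a big-endian byte-order mark first).
    static String encodeUtf16(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_16));
    }

    // Reverse of encodeUtf8: Base64 text back to the original string.
    static String decodeUtf8(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(encodeUtf8("test"));   // dGVzdA==
        System.out.println(encodeUtf16("test"));  // /v8AdABlAHMAdA==

        // One differing character always yields a different encoding.
        System.out.println(encodeUtf8("testA").equals(encodeUtf8("testB"))); // false

        // Decoding is lossless.
        System.out.println(decodeUtf8(encodeUtf8("test"))); // test
    }
}
```

Because decoding must return exactly the bytes that went in, comparing two Base64 strings is equivalent to comparing the original byte arrays.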

Solution 1

Use lossless compression

GZip( UTF8( "test" ) )

Here you are converting the string to a byte array and using lossless compression to reduce the number of bytes you have to store. You can vary the character encoding and compression algorithm to reduce the number of bytes, depending on the Strings you will be storing (e.g. if they are mostly ASCII, then UTF-8 will probably be best).

Pros: No collisions; the original string can be recovered
Cons: The number of bytes required to store the value is variable, and generally larger than a hash
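
A minimal sketch of Solution 1, assuming the standard java.util.zip GZIP streams and UTF-8; the class name, method names, and sample input are hypothetical. Note that very short inputs such as "test" can come out larger than the original because of the GZIP header, so the benefit only appears on long, repetitive text such as XML.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Illustrative class; the names here are assumptions, not from the original post.
class GzipStringDemo {

    // GZip( UTF8( text ) ): compress the UTF-8 bytes of the string.
    static byte[] compress(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The stream is closed at this point, so the GZIP trailer has been written.
        return buffer.toByteArray();
    }

    // Inverse operation: decompress and decode the bytes as UTF-8.
    static String decompress(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gzip.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical long, repetitive input standing in for the XML field.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("<item>value</item>");
        }
        String original = sb.toString();

        byte[] compressed = compress(original);
        System.out.println("original bytes:   " + original.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("compressed bytes: " + compressed.length);
        System.out.println("round trip ok:    " + original.equals(decompress(compressed)));
    }
}
```

The compressed bytes would need a binary column in the database, or a further Base64 step if only text columns are available.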

Solution 2

Use a hashing algorithm

SHA256( UTF8( "test" ) )

Here you are converting the string to a fixed-length set of bytes with a hashing function. Hashing is uni-directional, and by its nature collisions are possible. However, based on the profile and number of Strings that you expect to process, you can select a hash function that minimises the likelihood of collisions.

Pros: The number of bytes required to store the value is fixed and small
Cons: Collisions possible, no ability to recover original string
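
A minimal sketch of Solution 2, assuming java.security.MessageDigest with SHA-256 over the UTF-8 bytes; the class and method names are hypothetical. The digest is rendered as 64 hexadecimal characters so it can be stored in a fixed-width text column and compared directly when checking for duplicates.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative class; the names here are assumptions, not from the original post.
class Sha256FingerprintDemo {

    // SHA256( UTF8( text ) ), rendered as lowercase hex (always 64 characters).
    static String sha256Hex(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                // Mask to 0..255 so negative bytes format as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is not expected.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        String stored = sha256Hex("a very long string from the first file");
        String incoming = sha256Hex("a very long string from the first file");
        String changed = sha256Hex("a very long string from the first fil3");

        System.out.println(stored.length());          // 64
        System.out.println(stored.equals(incoming));  // true: same input, same hash
        System.out.println(stored.equals(changed));   // false: one character differs
    }
}
```

Equal hashes only indicate that the inputs are very probably identical. If a false duplicate would be costly, the full strings can still be compared whenever the hashes match.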
