How efficient is the encode/decode algorithm of the BASE64 class in Java?

Published 2024-11-15 21:40:00


I am about to use an algorithm to encode a variable length but very long String field retrieved from an XML file, then that encoded data should be persisted in the database.

Later, when I receive a second file I need to fetch the encoded data from the database (previously stored), then decode it and validate it against the new data to check for duplicates.

I tried org.apache.commons.codec.binary.Base64 class
it has 2 methods:

  1. encodeBase64(byte[] barray)
  2. decodeBase64(String str)

which work perfectly fine and solve my problem.
But it converts a 55-char String into just a 6-char String.
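For reference, the same round trip can be sketched with the JDK's built-in java.util.Base64 (Java 8+) instead of Commons Codec; the field value below is a made-up placeholder:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class EncodeDecodeSketch {
    public static void main(String[] args) {
        String xmlField = "some long field pulled from the XML file"; // placeholder input
        // Encode: bytes -> Base64 text (safe to persist in a VARCHAR column)
        String encoded = Base64.getEncoder().encodeToString(xmlField.getBytes(StandardCharsets.UTF_8));
        // Decode: Base64 text -> original bytes
        String decoded = new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
        System.out.println(decoded.equals(xmlField)); // true - the encoding is fully reversible
    }
}
```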

So I wonder if there is any case where this algorithm encodes 2 Strings which are very large and have only 1 char mismatch (for example) into the same encoded byte arrays.

I do not know much about the Base64 class, but if anyone can help me out it will be really helpful.

If you can suggest any other algorithm which shortens a large String to a fixed length and serves my purpose, I will be happy to use it.

Thanks in advance.

Comments (2)

孤者何惧 2024-11-22 21:40:00


Not very efficient.

Also, using sun.misc classes gives a non-portable application.

Check out the following performance comparisons from MiGBase64:

[Image: MiGBase64 performance comparison chart]


> So I wonder if there is any case where these algorithm encodes 2 Strings which are very large and have only 1 char mismatch (for example) into same encoded byte arrays.

Base64 isn't a hashing algorithm, it's an encoding and must therefore be bi-directional. Collisions can't be allowed by necessity - otherwise decoding would be non-deterministic. Base64 is designed to represent arbitrary binary data in an ASCII string. Encoding a Unicode string as Base64 will often increase the number of code points required since the Unicode character set requires multiple bytes. The Base64 representation of a Unicode string will vary depending on the encoding (UTF-8, UTF-16) used. For example:

Base64( UTF8( "test" ) ) => "dGVzdA=="
Base64( UTF16( "test" ) ) => "/v8AdABlAHMAdA=="
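The two example values above can be reproduced with the JDK's java.util.Base64 (the Base64 output itself is standard, so Commons Codec would produce the same strings):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class EncodingDemo {
    public static void main(String[] args) {
        String s = "test";
        // UTF-8: one byte per ASCII char -> 4 bytes
        String utf8 = Base64.getEncoder().encodeToString(s.getBytes(StandardCharsets.UTF_8));
        // UTF-16: byte-order mark plus 2 bytes per char -> 10 bytes
        String utf16 = Base64.getEncoder().encodeToString(s.getBytes(StandardCharsets.UTF_16));
        System.out.println(utf8);  // dGVzdA==
        System.out.println(utf16); // /v8AdABlAHMAdA==
    }
}
```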

Solution 1

Use lossless compression

GZip( UTF8( "test" ) )

Here you are converting the string to a byte array and using lossless compression to reduce the number of bytes you have to store. You can vary the char encoding and compression algorithm to reduce the number of bytes, depending on the Strings you will be storing (i.e., if they are mostly ASCII then UTF-8 will probably be best).

Pros: no collisions, ability to recover original string
Cons: Bytes required to store value is variable; bytes required to store value is larger
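A minimal sketch of Solution 1 with the JDK's java.util.zip GZIP streams (the sample input is made up; readAllBytes requires Java 9+):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRoundTrip {

    // Compress a string into bytes suitable for a BLOB column.
    static byte[] compress(String s) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(s.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e); // in-memory streams should not fail
        }
        return bos.toByteArray();
    }

    // Decompress the stored bytes back to the original string for comparison.
    static String decompress(byte[] data) {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String original = "a very long, repetitive XML field ".repeat(100);
        byte[] packed = compress(original);
        System.out.println(packed.length < original.length()); // true for repetitive input
        System.out.println(decompress(packed).equals(original)); // true - lossless round trip
    }
}
```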

Solution 2

Use a hashing algorithm

SHA256( UTF8( "test" ) )

Here you are converting the string to a fixed-length set of bytes with a hashing function. Hashing is uni-directional, and by its nature collisions are possible. However, based on the profile and number of Strings that you expect to process, you can select a hash function to minimise the likelihood of collisions.

Pros: Bytes required to store value is fixed; bytes required to store value is small
Cons: Collisions possible, no ability to recover original string
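A minimal sketch of Solution 2 using the JDK's MessageDigest (the inputs are made-up examples; String.repeat requires Java 11+):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class HashDemo {

    // Hash a string of any length down to a fixed 32-byte digest.
    static byte[] sha256(String s) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is mandated on every JVM
        }
    }

    public static void main(String[] args) {
        byte[] d1 = sha256("a".repeat(10_000));      // very long input
        byte[] d2 = sha256("a".repeat(9_999) + "b"); // differs in a single char
        System.out.println(d1.length);               // 32, regardless of input size
        System.out.println(Arrays.equals(d1, d2));   // false - one char changes the digest
    }
}
```

Storing the 32-byte digest (or its 64-char hex form) gives a fixed-width column to compare against for duplicates, at the cost of not being able to recover the original text.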

a√萤火虫的光℡ 2024-11-22 21:40:00


I just saw your comment - it seems you're actually looking for compression rather than hashing as I initially thought. Though in that case, you won't be able to get fixed length output for arbitrary input (think about it, an infinite number of inputs cannot map bijectively to a finite number of outputs), so I hope that wasn't a strong requirement.

In any case, the performance of your chosen compression algorithm will depend on the characteristics of the input text. In the absence of further information, DEFLATE compression (as used by the Zip input streams, IIRC) is a good general-purpose algorithm to start with, and at least use as a basis for comparison. For ease of implementation, though, you can use the Deflater class built into the JDK, which uses ZLib compression.
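The JDK class is java.util.zip.Deflater (paired with Inflater for decompression); a minimal round-trip sketch with made-up sample input:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class DeflateDemo {

    // Compress raw bytes with the DEFLATE algorithm (zlib framing).
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Restore the original bytes from the compressed form.
    static byte[] inflate(byte[] input) {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished()) {
                out.write(buf, 0, inflater.inflate(buf));
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid DEFLATE data", e);
        }
        inflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String s = "<record><field>repetitive xml content</field></record>".repeat(50);
        byte[] packed = deflate(s.getBytes(StandardCharsets.UTF_8));
        String restored = new String(inflate(packed), StandardCharsets.UTF_8);
        System.out.println(restored.equals(s)); // true - compression is lossless
    }
}
```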

If your input strings have particular patterns, then different compression algorithms may be more or less efficient. In one respect it doesn't matter which one you use, if you don't intend the compressed data to be read by any other processes - so long as you can compress and decompress yourself, it'll be transparent to your clients.

