Shortening an already short string in Java

Posted on 2024-12-04 02:06:43

I'm looking for a way to shorten an already short string as much as possible.

The string is a hostname:port combo and could look like "my-domain.se:2121" or "123.211.80.4:2122".

I know regular compression is pretty much out of the question on strings this short, due to the overhead needed and the lack of repetition, but I have an idea of how to do it.

Because the alphabet is limited to 39 characters ([a-z][0-9]-:.), every character could fit in 6 bits. This reduces the length by up to 25% compared to ASCII stored at 8 bits per character. So my suggestion is something along these lines:

  1. Encode the string to a byte array using some kind of custom encoding
  2. Decode the byte array to a UTF-8 or ASCII string (this string will obviously not make any sense).

And then reverse the process to get the original string.
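
A minimal sketch of the 6-bit packing idea, not a definitive implementation. The class name `SixBitPacker` and its methods are hypothetical. It assumes lowercase input and reserves code 0 as padding so that the decoder can tell where the string ends (an assumption added here; the text above does not say how the length would be recovered). The result is kept as a `byte[]`: arbitrary bytes are generally not valid UTF-8, so the conversion back to a String in step 2 is left out.

```java
class SixBitPacker {
    // Hypothetical alphabet from the question; code = position + 1, code 0 is padding.
    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-:.";

    /** Packs each character into 6 bits, most significant bit first. */
    static byte[] pack(String s) {
        byte[] out = new byte[(s.length() * 6 + 7) / 8];
        int bitPos = 0;
        for (int i = 0; i < s.length(); i++) {
            int code = ALPHABET.indexOf(s.charAt(i)) + 1;
            if (code == 0) {
                throw new IllegalArgumentException("unsupported character: " + s.charAt(i));
            }
            for (int b = 5; b >= 0; b--, bitPos++) {
                if (((code >> b) & 1) != 0) {
                    out[bitPos >> 3] |= (byte) (0x80 >> (bitPos & 7));
                }
            }
        }
        return out;
    }

    /** Reverses pack(); assumes the bytes were produced by pack(). */
    static String unpack(byte[] data) {
        StringBuilder sb = new StringBuilder();
        int totalBits = data.length * 8;
        for (int bitPos = 0; bitPos + 6 <= totalBits; bitPos += 6) {
            int code = 0;
            for (int b = 0; b < 6; b++) {
                int p = bitPos + b;
                code = (code << 1) | ((data[p >> 3] >> (7 - (p & 7))) & 1);
            }
            if (code == 0) {
                break; // padding reached
            }
            sb.append(ALPHABET.charAt(code - 1));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "my-domain.se:2121";
        byte[] packed = pack(s);
        System.out.println(s.length() + " chars -> " + packed.length + " bytes -> " + unpack(packed));
    }
}
```

With this layout, "my-domain.se:2121" (17 characters, 102 bits) occupies 13 bytes.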

So to my questions:

  1. Could this work?
  2. Is there a better way?
  3. How?

Comments (6)

虐人心 2024-12-11 02:06:43

You could encode the string in base 40, which is more compact than base 64. Twelve such tokens fit into a 64-bit long (40^12 ≈ 1.68 × 10^19 is just under 2^64 ≈ 1.84 × 10^19, so the long has to be treated as unsigned). The 40th token could serve as an end-of-string marker to give you the length (as the data will no longer be a whole number of bytes).

If you use arithmetic coding, the result could be much smaller, but you would need a table of frequencies for each token (built from a long list of representative examples).

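import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

class Encoder {
  public static final int BASE = 40;
  // Alphabet: position 0 is reserved ('\0'), followed by a-z, 0-9 and "-:.".
  StringBuilder chars = new StringBuilder(BASE);
  // Maps a character code to its position in the alphabet; -1 marks unsupported characters.
  byte[] index = new byte[256];

  {
    chars.append('\0');
    for (char ch = 'a'; ch <= 'z'; ch++) chars.append(ch);
    for (char ch = '0'; ch <= '9'; ch++) chars.append(ch);
    chars.append("-:.");
    Arrays.fill(index, (byte) -1);
    for (byte i = 0; i < chars.length(); i++)
      index[chars.charAt(i)] = i;
  }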

  public byte[] encode(String address) {
    try {
      ByteArrayOutputStream baos = new ByteArrayOutputStream();
      DataOutputStream dos = new DataOutputStream(baos);
      for (int i = 0; i < address.length(); i += 3) {
        switch (Math.min(3, address.length() - i)) {
          case 1: // last one.
            byte b = index[address.charAt(i)];
            dos.writeByte(b);
            break;

          case 2:
            char ch = (char) ((index[address.charAt(i+1)]) * 40 + index[address.charAt(i)]);
            dos.writeChar(ch);
            break;

          case 3:
            char ch2 = (char) ((index[address.charAt(i+2)] * 40 + index[address.charAt(i + 1)]) * 40 + index[address.charAt(i)]);
            dos.writeChar(ch2);
            break;
        }
      }
      return baos.toByteArray();
    } catch (IOException e) {
      throw new AssertionError(e);
    }
  }

  public static void main(String[] args) {
    Encoder encoder = new Encoder();
    for (String s : "twitter.com:2122,123.211.80.4:2122,my-domain.se:2121,www.stackoverflow.com:80".split(",")) {
      System.out.println(s + " (" + s.length() + " chars) encoded is " + encoder.encode(s).length + " bytes.");
    }
  }
}

prints

twitter.com:2122 (16 chars) encoded is 11 bytes.
123.211.80.4:2122 (17 chars) encoded is 12 bytes.
my-domain.se:2121 (17 chars) encoded is 12 bytes.
www.stackoverflow.com:80 (24 chars) encoded is 16 bytes.

I leave decoding as an exercise. ;)
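
One possible take on that exercise, as a hedged sketch rather than the answerer's own solution. The class name `Base40Codec` is hypothetical. Its `encode` re-implements the scheme above (three characters per big-endian 16-bit word, a lone trailing character as a single byte) so that the block is self-contained, and `decode` assumes it is given bytes produced by that scheme.

```java
import java.io.ByteArrayOutputStream;

class Base40Codec {
    // Same alphabet as the Encoder above: index 0 is reserved, then a-z, 0-9, "-:.".
    private static final String CHARS = "\0abcdefghijklmnopqrstuvwxyz0123456789-:.";
    private static final int BASE = 40;

    /** Three characters per 16-bit word (high byte first); a lone trailing character takes one byte. */
    static byte[] encode(String address) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < address.length(); i += 3) {
            int n = Math.min(3, address.length() - i);
            int value = 0;
            for (int j = n - 1; j >= 0; j--) {
                int idx = CHARS.indexOf(address.charAt(i + j));
                if (idx <= 0) {
                    throw new IllegalArgumentException("unsupported character: " + address.charAt(i + j));
                }
                value = value * BASE + idx;
            }
            if (n > 1) {
                out.write(value >>> 8); // high byte first, like DataOutputStream.writeChar
            }
            out.write(value & 0xFF);
        }
        return out.toByteArray();
    }

    /** Reverses encode(); assumes well-formed input. */
    static String decode(byte[] data) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        for (; i + 1 < data.length; i += 2) {
            int value = ((data[i] & 0xFF) << 8) | (data[i + 1] & 0xFF);
            // The least significant base-40 digit is the first character of the group.
            // Valid digits are never 0, so a 2-character group stops after two digits.
            for (int j = 0; j < 3 && value > 0; j++) {
                sb.append(CHARS.charAt(value % BASE));
                value /= BASE;
            }
        }
        if (i < data.length) {
            sb.append(CHARS.charAt(data[i] & 0xFF)); // odd length: lone trailing character
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"twitter.com:2122", "123.211.80.4:2122", "my-domain.se:2121"}) {
            byte[] packed = encode(s);
            System.out.println(s + " -> " + packed.length + " bytes -> " + decode(packed));
        }
    }
}
```

The decoder relies on index 0 being reserved: because no real character maps to 0, a two-byte word below 1600 can only be a two-character group.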

拿命拼未来 2024-12-11 02:06:43

First of all, IPv4 addresses are designed to fit into 4 bytes and port numbers into 2. The ASCII representation is only for humans to read, so it doesn't make sense to do compression on that.

Your idea for compressing domain name strings is doable.
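
As a rough illustration of the 4 + 2 byte layout, here is a sketch that packs a dotted IPv4 address and port into 6 bytes. The class name `HostPortBytes` and the byte order (address first, then a big-endian port) are assumptions made for this example only.

```java
import java.nio.ByteBuffer;

class HostPortBytes {
    /** Packs "a.b.c.d:port" into 6 bytes: 4 address bytes, then a big-endian 16-bit port. */
    static byte[] pack(String ipv4AndPort) {
        int colon = ipv4AndPort.lastIndexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("missing ':' in " + ipv4AndPort);
        }
        String[] octets = ipv4AndPort.substring(0, colon).split("\\.");
        int port = Integer.parseInt(ipv4AndPort.substring(colon + 1));
        if (octets.length != 4 || port < 0 || port > 0xFFFF) {
            throw new IllegalArgumentException("expected a.b.c.d:port, got " + ipv4AndPort);
        }
        ByteBuffer buf = ByteBuffer.allocate(6);
        for (String octet : octets) {
            int v = Integer.parseInt(octet);
            if (v < 0 || v > 255) {
                throw new IllegalArgumentException("bad octet: " + octet);
            }
            buf.put((byte) v);
        }
        buf.putShort((short) port);
        return buf.array();
    }

    /** Reverses pack(). */
    static String unpack(byte[] packed) {
        ByteBuffer buf = ByteBuffer.wrap(packed);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 4; i++) {
            if (i > 0) {
                sb.append('.');
            }
            sb.append(buf.get() & 0xFF); // mask to read the byte as unsigned
        }
        return sb.append(':').append(buf.getShort() & 0xFFFF).toString();
    }

    public static void main(String[] args) {
        byte[] packed = pack("123.211.80.4:2122");
        System.out.println(packed.length + " bytes -> " + unpack(packed));
    }
}
```

The 17-character example "123.211.80.4:2122" then takes 6 bytes, which no character-level encoding of the text can match.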

心凉 2024-12-11 02:06:43

In your case, I would use an algorithm specialized for your use case. Recognize that you can store something other than strings. For an IPv4 address:port pair, you would have a class that captures 6 bytes: 4 for the address and 2 for the port. Another type would handle alphanumeric hostnames. The port would always be stored in two bytes. The hostname part itself could also have specialized support for .com, for example. So a sample hierarchy may be:

    HostPort
       |
  +----+--------+
  |             |
IPv4        HostnamePort
                |
           DotComHostnamePort


public interface HostPort extends CharSequence { }


public final class HostPorts {
  public static HostPort parse(String hostPort) {
    ...
  }
}

In this case, DotComHostnamePort allows you to drop .com from the host name and save 4 chars/bytes, depending on whether you store hostnames in Punycode form or in UTF-16 form.

铃予 2024-12-11 02:06:43

The first two bytes could contain the port number. If you always start with this fixed-length port number, you don't need to include the separator ":". Instead, use a single bit to indicate whether an IP address (see Karl Bielefeldt's solution) or a host name follows.

人生百味 2024-12-11 02:06:43

You could encode them using the CDC Display code. This encoding was used back in the old days when bits were scarce and programmers were nervous.

伪装你 2024-12-11 02:06:43

What you are suggesting is similar to Base64 encoding/decoding, and there might be some mileage in looking at some of those implementations (Base64 uses 6 bits per encoded character).

As a starter, if you use Apache's Base64 library:

String x = new String(Base64.decodeBase64("my-domain.se:2121".getBytes()));
String y = new String(Base64.encodeBase64(x.getBytes()));
System.out.println("x = " + x);
System.out.println("y = " + y);

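It will shorten your string by a few chars. This obviously does not work, as what you end up with is not what you started with: "-", "." and ":" are not part of the standard Base64 alphabet, so the decode step cannot preserve them.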
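
For comparison, here is a sketch of a Base64 round trip in the usual direction (encode first, then decode). It uses the JDK's `java.util.Base64` rather than the Apache library from the snippet above, and the class name `Base64LengthDemo` is hypothetical. It shows that standard Base64 preserves the string but makes it longer, which is why it only serves as a reference for the 6-bit technique and not as a way to shorten text.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class Base64LengthDemo {
    /** Encodes the ASCII bytes of s as standard Base64 text. */
    static String encode(String s) {
        return Base64.getEncoder().encodeToString(s.getBytes(StandardCharsets.US_ASCII));
    }

    /** Decodes standard Base64 text back to the original ASCII string. */
    static String decode(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String original = "my-domain.se:2121";
        String encoded = encode(original);
        // 17 input bytes become 4 * ceil(17 / 3) = 24 Base64 characters.
        System.out.println(original.length() + " chars -> " + encoded.length() + " Base64 chars");
        System.out.println("round trip ok: " + decode(encoded).equals(original));
    }
}
```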
