Shortening an already short string in Java

Posted on 2024-12-04 02:06:43

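I'm looking for a way to shorten an already short string as much as possible.

The string is a hostname:port combination and might look like "my-domain.se:2121" or "123.211.80.4:2122".

I know that regular compression is pretty much out of the question for strings this short, because of the overhead it needs and the lack of repetition, but I have an idea of how to do it.

Because the alphabet is limited to 39 characters ([a-z][0-9]-:.), every character fits in 6 bits. Compared to ASCII at 8 bits per character, that reduces the length by up to 25%. So my suggestion is something like this:

Encode the string to a byte array using some kind of custom encoding
Decode the byte array to a UTF-8 or ASCII string (this string will obviously not make any sense)

Then reverse the process to get the original string back.

So my questions are:

Could this work?
Is there a better way?
How?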


虐人心 2024-12-11 02:06:43

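You could encode the string in base 40, which is more compact than base 64. That packs 12 such tokens into a 64-bit long. The 40th token could serve as an end-of-string marker to give you the length (since the result is no longer a whole number of bytes).

If you use arithmetic coding, the result could be much smaller, but you would need a table of frequencies for each token (built from a long list of possible examples).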

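import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

class Encoder {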
  public static final int BASE = 40;
  StringBuilder chars = new StringBuilder(BASE);
  byte[] index = new byte[256];

  {
    chars.append('\0');
    for (char ch = 'a'; ch <= 'z'; ch++) chars.append(ch);
    for (char ch = '0'; ch <= '9'; ch++) chars.append(ch);
    chars.append("-:.");
    Arrays.fill(index, (byte) -1);
    for (byte i = 0; i < chars.length(); i++)
      index[chars.charAt(i)] = i;
  }

  public byte[] encode(String address) {
    try {
      ByteArrayOutputStream baos = new ByteArrayOutputStream();
      DataOutputStream dos = new DataOutputStream(baos);
      for (int i = 0; i < address.length(); i += 3) {
        switch (Math.min(3, address.length() - i)) {
          case 1: // last one.
            byte b = index[address.charAt(i)];
            dos.writeByte(b);
            break;

          case 2:
            char ch = (char) ((index[address.charAt(i+1)]) * 40 + index[address.charAt(i)]);
            dos.writeChar(ch);
            break;

          case 3:
            char ch2 = (char) ((index[address.charAt(i+2)] * 40 + index[address.charAt(i + 1)]) * 40 + index[address.charAt(i)]);
            dos.writeChar(ch2);
            break;
        }
      }
      return baos.toByteArray();
    } catch (IOException e) {
      throw new AssertionError(e);
    }
  }

  public static void main(String[] args) {
    Encoder encoder = new Encoder();
    for (String s : "twitter.com:2122,123.211.80.4:2122,my-domain.se:2121,www.stackoverflow.com:80".split(",")) {
      System.out.println(s + " (" + s.length() + " chars) encoded is " + encoder.encode(s).length + " bytes.");
    }
  }
}

prints

twitter.com:2122 (16 chars) encoded is 11 bytes.
123.211.80.4:2122 (17 chars) encoded is 12 bytes.
my-domain.se:2121 (17 chars) encoded is 12 bytes.
www.stackoverflow.com:80 (24 chars) encoded is 16 bytes.

I leave decoding as an exercise. ;)
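For readers who want to check the exercise, here is one possible decoder, offered as a sketch rather than the answerer's own solution. It relies on the layout the Encoder above produces: every group of three characters becomes one big-endian 16-bit value (40^3 = 64000 fits below 65536), a trailing pair also takes two bytes, and a single trailing character takes one byte. The class name Base40Codec and the compact encode method are hypothetical, included only so the round trip is self-contained; the sketch assumes lowercase input from the 39-character alphabet.

```java
import java.io.ByteArrayOutputStream;

final class Base40Codec {
    // Index 0 is reserved for "no character", mirroring the '\0' entry in the Encoder above.
    private static final String ALPHABET = "\0abcdefghijklmnopqrstuvwxyz0123456789-:.";

    // Same byte layout as the Encoder above: up to three base-40 digits per 16-bit value.
    static byte[] encode(String address) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < address.length(); i += 3) {
            int n = Math.min(3, address.length() - i);
            int value = 0;
            for (int j = n - 1; j >= 0; j--) {      // first character ends up as the lowest digit
                int digit = ALPHABET.indexOf(address.charAt(i + j));
                if (digit <= 0) {
                    throw new IllegalArgumentException("unsupported character: " + address.charAt(i + j));
                }
                value = value * 40 + digit;
            }
            if (n > 1) out.write(value >>> 8);      // high byte first, as DataOutputStream.writeChar does
            out.write(value & 0xFF);
        }
        return out.toByteArray();
    }

    static String decode(byte[] data) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        for (; i + 1 < data.length; i += 2) {
            int value = ((data[i] & 0xFF) << 8) | (data[i + 1] & 0xFF);
            // Digits are never 0, so the value reaching 0 marks the end of a short group.
            for (int k = 0; k < 3 && value > 0; k++) {
                sb.append(symbol(value % 40));
                value /= 40;
            }
        }
        if (i < data.length) sb.append(symbol(data[i] & 0xFF)); // lone trailing character
        return sb.toString();
    }

    private static char symbol(int digit) {
        if (digit <= 0 || digit >= ALPHABET.length()) {
            throw new IllegalArgumentException("corrupt input");
        }
        return ALPHABET.charAt(digit);
    }
}
```

Calling Base40Codec.decode(Base40Codec.encode("my-domain.se:2121")) returns the original string, and the encoded form is 12 bytes, matching the output listed above.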


拿命拼未来 2024-12-11 02:06:43

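First of all, an IP address is designed to fit in 4 bytes and a port number in 2 bytes. The ASCII representation exists only for humans to read, so it makes no sense to compress that.

Your idea for compressing domain name strings is doable.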
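To make the 4 + 2 byte point concrete, here is a minimal sketch that packs a dotted-quad IPv4 address and port into 6 bytes and unpacks it again. The class and method names (Ipv4PortCodec, pack, unpack) are hypothetical, and the sketch assumes the input is always of the form a.b.c.d:port.

```java
import java.nio.ByteBuffer;

final class Ipv4PortCodec {
    // Packs "a.b.c.d:port" into 4 address bytes followed by a 2-byte big-endian port.
    static byte[] pack(String hostPort) {
        int colon = hostPort.lastIndexOf(':');
        if (colon < 0) throw new IllegalArgumentException("expected host:port");
        String[] octets = hostPort.substring(0, colon).split("\\.");
        int port = Integer.parseInt(hostPort.substring(colon + 1));
        if (octets.length != 4 || port < 0 || port > 65535) {
            throw new IllegalArgumentException("not an IPv4 address and port: " + hostPort);
        }
        ByteBuffer buf = ByteBuffer.allocate(6);
        for (String octet : octets) {
            int v = Integer.parseInt(octet);
            if (v < 0 || v > 255) throw new IllegalArgumentException("bad octet: " + octet);
            buf.put((byte) v);
        }
        buf.putShort((short) port);
        return buf.array();
    }

    // Reverses pack(): masks each byte back to an unsigned value before printing it.
    static String unpack(byte[] packed) {
        if (packed.length != 6) throw new IllegalArgumentException("expected 6 bytes");
        ByteBuffer buf = ByteBuffer.wrap(packed);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 4; i++) {
            if (i > 0) sb.append('.');
            sb.append(buf.get() & 0xFF);
        }
        return sb.append(':').append(buf.getShort() & 0xFFFF).toString();
    }
}
```

With this layout, "123.211.80.4:2122" shrinks from 17 characters to 6 bytes, which is smaller than any character-level encoding of the same text.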


心凉 2024-12-11 02:06:43

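In your case, I would use a specialized algorithm for your use case. Recognize that you can store something other than strings. So for an IPv4 address:port, you would have a class that captures 6 bytes: 4 for the address and 2 for the port. Another type would handle alphanumeric hostnames. The port would always be stored in two bytes. The hostname part itself could also have specialized support for .com, for example. So a sample hierarchy might be: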

    HostPort
       |
  +----+--------+
  |             |
IPv4        HostnamePort
                |
           DotComHostnamePort


public interface HostPort extends CharSequence { }


public class HostPorts {
  public static HostPort parse(String hostPort) {
    ...
  }
}

In this case, DotComHostnamePort lets you drop .com from the hostname and save 4 characters (4 bytes if you store hostnames in Punycode/ASCII form, 8 bytes in UTF-16 form).


铃予 2024-12-11 02:06:43

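The first two bytes could hold the port number. If you always start with this fixed-length port number, there is no need to include the separator :. Instead, use one bit to indicate whether an IP address (see Karl Bielefeldt's solution) or a hostname follows.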
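One way this layout might look is sketched below. It is an illustration under assumptions, not the answerer's code: the class name PortFirstCodec is hypothetical, the flag is stored in a whole tag byte for simplicity (only its lowest bit is used), and hostnames are stored as plain ASCII bytes, although they could be packed further with a denser encoding.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class PortFirstCodec {
    private static final int TAG_HOSTNAME = 0; // flag bit clear: ASCII hostname follows
    private static final int TAG_IPV4 = 1;     // flag bit set: 4 raw address bytes follow

    static byte[] encode(String hostPort) {
        int colon = hostPort.lastIndexOf(':');
        if (colon < 0) throw new IllegalArgumentException("expected host:port");
        String host = hostPort.substring(0, colon);
        int port = Integer.parseInt(hostPort.substring(colon + 1));
        if (port < 0 || port > 65535) throw new IllegalArgumentException("bad port: " + port);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(port >>> 8);   // fixed-length port first, so no ':' separator is stored
        out.write(port & 0xFF);
        if (host.matches("\\d{1,3}(\\.\\d{1,3}){3}")) {
            out.write(TAG_IPV4);
            for (String octet : host.split("\\.")) {
                int v = Integer.parseInt(octet);
                if (v > 255) throw new IllegalArgumentException("bad octet: " + octet);
                out.write(v);
            }
        } else {
            out.write(TAG_HOSTNAME);
            byte[] name = host.getBytes(StandardCharsets.US_ASCII);
            out.write(name, 0, name.length);
        }
        return out.toByteArray();
    }

    static String decode(byte[] data) {
        int port = ((data[0] & 0xFF) << 8) | (data[1] & 0xFF);
        String host;
        if ((data[2] & 1) == TAG_IPV4) {
            host = (data[3] & 0xFF) + "." + (data[4] & 0xFF) + "."
                 + (data[5] & 0xFF) + "." + (data[6] & 0xFF);
        } else {
            host = new String(data, 3, data.length - 3, StandardCharsets.US_ASCII);
        }
        return host + ":" + port;
    }
}
```

Under these assumptions "123.211.80.4:2122" takes 7 bytes and "my-domain.se:2121" takes 15 bytes instead of 17 characters. Folding the flag into a spare bit elsewhere would save the extra tag byte.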


人生百味 2024-12-11 02:06:43

You could encode them using CDC Display Code. This encoding was used back in the old days when bits were scarce and programmers were nervous.


伪装你 2024-12-11 02:06:43

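What you are suggesting is similar to Base64 encoding/decoding, and there might be some mileage in looking at some of those implementations (Base64 encoding uses 6 bits per character).

As a starter, if you use Apache's Base64 library: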

String x = new String(Base64.decodeBase64("my-domain.se:2121".getBytes()));
String y = new String(Base64.encodeBase64(x.getBytes()));
System.out.println("x = " + x);
System.out.println("y = " + y);

It will shorten your string by a few characters. This obviously does not work, because what you end up with is not what you started with.
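For comparison, here is a small sketch of a lossless Base64 round trip. It swaps in the JDK's java.util.Base64 for the Apache library used above, and the class name Base64LengthDemo is hypothetical. Run in the lossless direction (encode first, then decode), Base64 turns every 3 input bytes into 4 output characters, so the string gets longer rather than shorter.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class Base64LengthDemo {
    // Encodes the ASCII bytes of s as Base64 text (3 input bytes become 4 output characters).
    static String encode(String s) {
        return Base64.getEncoder().encodeToString(s.getBytes(StandardCharsets.US_ASCII));
    }

    // Recovers the original text from its Base64 form.
    static String decode(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.US_ASCII);
    }
}
```

For "my-domain.se:2121" (17 characters) the encoded form is 24 characters long, and decoding it returns the original string. The 6-bit idea only saves space when it is applied in the other direction, packing a restricted alphabet into raw bytes.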
