Best alternative for a String flyweight implementation in Java

Posted on 2024-09-02 17:06:18, 335 words, 11 views, 0 comments

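My application is multithreaded and does intensive String processing. We are seeing excessive memory consumption, and profiling shows that it is caused by String data. I think memory consumption would benefit greatly from some kind of flyweight pattern implementation, or even a cache (I am sure that Strings are often duplicated, although I have no hard data on that).

I have looked at the Java constant pool and String.intern, but it seems that it can provoke PermGen problems.

What would be the best alternative for implementing an application-wide, multithreaded pool of Strings in Java?

EDIT: Also see my previous, related question: How does Java implement the flyweight pattern for String under the hood?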

如何视而不见 2024-09-09 17:06:18

Note: This answer uses examples that might not be relevant in modern runtime JVM libraries. In particular, the substring example is no longer an issue in OpenJDK/Oracle 7+.

I know it goes against what people often tell you, but sometimes explicitly creating new String instances can significantly reduce memory usage.

Because Strings are immutable, several methods take advantage of that fact and share the backing character array to save memory. Occasionally, however, this actually increases memory use, because it prevents the unused parts of those arrays from being garbage collected.

For example, assume you were parsing the message IDs of a log file to extract warning IDs. Your code would look something like this:

// Format:
// ID: [WARNING|ERROR|DEBUG] Message...
String testLine = "5AB729: WARNING Some really really really long message";

Matcher matcher = Pattern.compile("([A-Z0-9]*): WARNING.*").matcher(testLine);
if (matcher.matches()) {
    String id = matcher.group(1);
    //...do something with id...
}

But look at the data actually being stored:

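    //...
    // Reflection on String.value only works as shown on older JVMs
    // (Java 8 and earlier), where the field is a char[].
    // Requires java.lang.reflect.Field and java.util.Arrays.
    String id = matcher.group(1);
    Field valueField = String.class.getDeclaredField("value");
    valueField.setAccessible(true);

    char[] data = (char[]) valueField.get(id);
    System.out.println("Actual data stored for string \"" + id + "\": " + Arrays.toString(data));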

It's the whole test line, because the matcher just wraps a new String instance around the same character data. Compare the results when you replace String id = matcher.group(1); with String id = new String(matcher.group(1));.
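
The example above can be packaged as a small helper. This is only a sketch: the class name WarningIdExtractor and the method extractWarningId are made up for illustration, and the defensive copy only pays off on older JVMs where a substring shares its parent's character array. On newer JVMs the copy is redundant.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper built from the log-parsing example in this answer.
final class WarningIdExtractor {
    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    private static final Pattern WARNING_LINE = Pattern.compile("([A-Z0-9]*): WARNING.*");

    /** Returns the ID of a WARNING line, or null if the line is not a warning. */
    static String extractWarningId(String line) {
        Matcher matcher = WARNING_LINE.matcher(line);
        if (!matcher.matches()) {
            return null;
        }
        // Explicit copy, as the answer suggests: on JVMs where group(1) shares
        // the backing array of the whole line, this keeps only the ID characters.
        return new String(matcher.group(1));
    }

    public static void main(String[] args) {
        String line = "5AB729: WARNING Some really really really long message";
        System.out.println(extractWarningId(line));                           // 5AB729
        System.out.println(extractWarningId("5AB730: ERROR Something else")); // null
    }
}
```

Keeping the pattern in a static field avoids recompiling it for every log line, which matters when many threads parse lines concurrently.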

鸢与 2024-09-09 17:06:18

This is already done at the JVM level. You only need to ensure that you aren't creating new Strings every time, either explicitly or implicitly.

I.e. don't do:

String s1 = new String("foo");
String s2 = new String("foo");

This would create two instances in the heap. Instead, do this:

String s1 = "foo";
String s2 = "foo";

This will create one instance in the heap, and both variables will refer to it (as evidence, s1 == s2 will return true here).
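
The identity claim above is easy to check, and the same idea can be extended to Strings that are built at run time, which the JVM does not pool automatically. The sketch below is only an illustration: StringPool and canonical are hypothetical names, not JDK APIs, and the pool is assumed to be small enough to keep every entry alive, since it holds strong references and never evicts.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical application-level pool (not part of the JDK).
final class StringPool {
    private final ConcurrentMap<String, String> pool = new ConcurrentHashMap<>();

    /** Returns the pooled instance equal to s, adding s if no equal String is pooled yet. */
    String canonical(String s) {
        // putIfAbsent is atomic, so concurrent callers agree on one instance.
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    int size() {
        return pool.size();
    }
}

final class StringIdentityDemo {
    public static void main(String[] args) {
        String a = "foo";
        String b = "foo";
        String c = new String("foo");
        System.out.println(a == b);          // true: both literals refer to one pooled instance
        System.out.println(a == c);          // false: new String always creates a new object
        System.out.println(a == c.intern()); // true: intern() returns the pooled instance

        StringPool pool = new StringPool();
        String x = pool.canonical(new String("bar"));
        String y = pool.canonical(new String("bar"));
        System.out.println(x == y);          // true: the second call returns the first instance
    }
}
```

Because entries are never removed, a pool like this only helps when the set of distinct values is bounded. Otherwise it becomes a memory leak of its own.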

Also don't use += to concatenate strings (in a loop):

String s = "";
for (/* some loop condition */) {
    s += "new";
}

The += implicitly creates a new String in the heap every time. Instead, do this:

StringBuilder sb = new StringBuilder();
for (/* some loop condition */) {
    sb.append("new");
}
String s = sb.toString();

If you can, use StringBuilder or its synchronized sibling StringBuffer instead of String for "intensive String processing". They offer useful methods for exactly that purpose, such as append(), insert(), delete(), etc. Also see the javadoc.

望她远 2024-09-09 17:06:18

Java 7/8

If you are doing what the accepted answer says and are using Java 7 or newer, you are not doing what it says you are.

The implementation of substring() has changed.

Never write code that relies on an implementation detail that can change drastically. If you rely on the old behavior, you may make things worse.

// Listing as quoted in the answer. The offset and count fields and the
// String(offset, count, value) constructor belong to the older String layout
// (Java 6 and Java 7 before update 6), in which a substring shares its
// parent's char[].
public String substring(int beginIndex, int endIndex) {
    if (beginIndex < 0) {
        throw new StringIndexOutOfBoundsException(beginIndex);
    }
    if (endIndex > count) {
        throw new StringIndexOutOfBoundsException(endIndex);
    }
    if (beginIndex > endIndex) {
        throw new StringIndexOutOfBoundsException(endIndex - beginIndex);
    }
    return ((beginIndex == 0) && (endIndex == count)) ? this :
        new String(offset + beginIndex, endIndex - beginIndex, value);
}

So if you follow the accepted answer on Java 7 or newer, the extra new String(...) only adds memory usage and garbage that needs to be collected.

笨死的猪 2024-09-09 17:06:18

Efficiently pack Strings in memory! I once wrote a highly memory-efficient Set class in which Strings were stored as a tree. If a leaf was reached by traversing the letters, the entry was contained in the set. It was fast to work with, too, and ideal for storing a large dictionary.

And don't forget that Strings are often the largest part in memory in nearly every app I profiled, so don't care for them if you need them.

Illustration:

You have 3 Strings: Beer, Beans and Blood. You can create a tree structure like this:

B
+-e
  +-er
  +-ans
+-lood

This is very efficient for, e.g., a list of street names. It is obviously most reasonable with a fixed dictionary, because insertion cannot be done efficiently. In fact, the structure should be created once, then serialized, and afterwards just loaded.
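
A tree-backed set along these lines might look roughly like the sketch below. It is a simplified illustration, not the class described in this answer: TrieStringSet is a made-up name, each node holds a single character instead of a merged run such as "lood", and a terminal flag is used so that one entry may be a prefix of another. The per-node HashMap is chosen for readability; a truly memory-efficient version would use a more compact child representation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified trie-based set of Strings (one character per node).
final class TrieStringSet {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal; // true if the path from the root to this node spells an entry
    }

    private final Node root = new Node();
    private int size;

    /** Adds s to the set; returns false if it was already present. */
    boolean add(String s) {
        Node node = root;
        for (int i = 0; i < s.length(); i++) {
            node = node.children.computeIfAbsent(s.charAt(i), k -> new Node());
        }
        if (node.terminal) {
            return false;
        }
        node.terminal = true;
        size++;
        return true;
    }

    /** Returns true only if s was added as a whole entry (shared prefixes alone do not count). */
    boolean contains(String s) {
        Node node = root;
        for (int i = 0; i < s.length(); i++) {
            node = node.children.get(s.charAt(i));
            if (node == null) {
                return false;
            }
        }
        return node.terminal;
    }

    int size() {
        return size;
    }
}
```

With Beer, Beans and Blood added, the three entries share the node for B, and Beer and Beans also share the node for e. contains("Be") is false because no entry ends there.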
