Best alternative for a String flyweight implementation in Java

Posted 2024-09-02 17:06:18

My application is multithreaded with intensive String processing. We are experiencing excessive memory consumption and profiling has demonstrated that this is due to String data. I think that memory consumption would benefit greatly from using some kind of flyweight pattern implementation or even cache (I know for sure that Strings are often duplicated, although I don't have any hard data in that regard).

I have looked at the Java constant pool and String.intern(), but it seems they can cause PermGen problems.

What would be the best alternative for implementing an application-wide, multithreaded pool of Strings in Java?

EDIT: Also see my previous, related question: How does java implement flyweight pattern for string under the hood?

Comments (5)

如何视而不见 2024-09-09 17:06:18

Note: The examples in this answer may not apply to modern JVM runtime libraries. In particular, the substring example is no longer an issue in OpenJDK/Oracle Java 7 and later (the behavior changed in Java 7 update 6).

I know it goes against what people often tell you, but sometimes explicitly creating new String instances can be a significant way to reduce memory use.

Because Strings are immutable, several methods take advantage of that fact and share the backing character array to save memory. However, this can occasionally increase memory use instead, by preventing garbage collection of the unused parts of those arrays.

For example, assume you were parsing the message IDs of a log file to extract warning IDs. Your code would look something like this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

//Format:
//ID: [WARNING|ERROR|DEBUG] Message...
String testLine = "5AB729: WARNING Some really really really long message";

Matcher matcher = Pattern.compile("([A-Z0-9]*): WARNING.*").matcher(testLine);
if ( matcher.matches() ) {
    String id = matcher.group(1);
    //...do something with id...
}

But look at the data actually being stored:

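    //...
    // Needs java.lang.reflect.Field and java.util.Arrays.
    // Assumes an older JVM where String.value is a char[] that is reachable via reflection.
    String id = matcher.group(1);
    Field valueField = String.class.getDeclaredField("value");
    valueField.setAccessible(true);

    // Read the backing array that the String actually holds on to.
    char[] data = ((char[])valueField.get(id));
    System.out.println("Actual data stored for string \"" + id + "\": " + Arrays.toString(data) );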

It's the whole test line, because the matcher just wraps a new String instance around the same character data. Compare the results when you replace String id = matcher.group(1); with String id = new String(matcher.group(1));.

鸢与 2024-09-09 17:06:18

This is already done at the JVM level. You only need to ensure that you aren't creating new Strings every time, either explicitly or implicitly.

That is, don't do this:

String s1 = new String("foo");
String s2 = new String("foo");

This would create two instances in the heap. Instead, do this:

String s1 = "foo";
String s2 = "foo";

This will create one instance in the heap, and both variables will refer to it (as evidence, s1 == s2 will return true here).
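As a small illustration of the point above, the following sketch compares references for literals, for an explicitly constructed String, and for the result of String.intern(), which the question mentions. The class name LiteralIdentityDemo and the helper sameInstance are made up for this example. Note that only compile-time constants are pooled automatically; Strings built at runtime (for example, parsed from input) are separate instances unless they are interned or pooled by the application.

```java
final class LiteralIdentityDemo {
    // Hypothetical helper: true only if both references point to the same object.
    static boolean sameInstance(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        String s1 = "foo";
        String s2 = "foo";             // same literal, same pooled instance
        String s3 = new String("foo"); // explicitly a distinct object

        System.out.println(sameInstance(s1, s2));          // true
        System.out.println(sameInstance(s1, s3));          // false
        System.out.println(sameInstance(s1, s3.intern())); // true
        System.out.println(s1.equals(s3));                 // true
    }
}
```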

Also don't use += to concatenate strings (in a loop):

String s = "";
for (/* some loop condition */) {
    s += "new";
}

The += implicitly creates a new String in the heap every time. Instead, do this:

StringBuilder sb = new StringBuilder();
for (/* some loop condition */) {
    sb.append("new");
}
String s = sb.toString();

If you can, rather use StringBuilder or its synchronized brother StringBuffer instead of String for "intensive String processing". It offers useful methods for exactly those purposes, such as append(), insert(), delete(), etc. Also see its javadoc.

望她远 2024-09-09 17:06:18

Java 7/8

If you are doing what the accepted answer says and using Java 7 or newer, your code is not doing what that answer says it does.

The implementation of substring() has changed.

Never write code that relies on an implementation detail that can change drastically; if you depend on the old behavior, such a change can make things worse.

// The older implementation: the package-private String(offset, count, value)
// constructor reuses the same backing char[] instead of copying it.
public String substring(int beginIndex, int endIndex) {
    if (beginIndex < 0) {
        throw new StringIndexOutOfBoundsException(beginIndex);
    }
    if (endIndex > count) {
        throw new StringIndexOutOfBoundsException(endIndex);
    }
    if (beginIndex > endIndex) {
        throw new StringIndexOutOfBoundsException(endIndex - beginIndex);
    }
    return ((beginIndex == 0) && (endIndex == count)) ? this :
        new String(offset + beginIndex, endIndex - beginIndex, value);
}

So if you use the accepted answer with Java 7 or newer, you are using twice as much memory and creating extra garbage that needs to be collected.

笨死的猪 2024-09-09 17:06:18

Efficiently pack Strings in memory! I once wrote a hyper-memory-efficient Set class in which Strings were stored as a tree. If traversing the letters led to a leaf, the entry was contained in the set. It was fast to work with, too, and ideal for storing a large dictionary.

And don't forget that Strings were the largest part of memory in nearly every app I have profiled, so don't worry about them if you really need them.

Illustration:

You have 3 Strings: Beer, Beans and Blood. You can create a tree structure like this:

B
+-e
  +-er
  +-ans
+-lood

This is very efficient for, e.g., a list of street names. It is obviously most reasonable with a fixed dictionary, because inserts cannot be done efficiently. In fact, the structure should be created once, then serialized, and afterwards just loaded.
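The idea can be sketched roughly as follows. This is not the Set class described above, only a plain per-character trie: it does not merge suffixes such as "lood" into a single node the way the illustration does, and a HashMap per node is unlikely to save memory in practice. The name TrieStringSet and its methods are made up for this example, and a terminal flag is used so that a word which is a prefix of another can still be stored.

```java
import java.util.HashMap;
import java.util.Map;

final class TrieStringSet {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal; // true if a stored String ends at this node
    }

    private final Node root = new Node();
    private int size;

    // Adds s to the set; returns false if it was already present.
    boolean add(String s) {
        Node node = root;
        for (int i = 0; i < s.length(); i++) {
            node = node.children.computeIfAbsent(s.charAt(i), c -> new Node());
        }
        if (node.terminal) {
            return false;
        }
        node.terminal = true;
        size++;
        return true;
    }

    // Walks the letters of s; the entry is contained only if the walk
    // ends on a node marked as the end of a stored String.
    boolean contains(String s) {
        Node node = root;
        for (int i = 0; i < s.length(); i++) {
            node = node.children.get(s.charAt(i));
            if (node == null) {
                return false;
            }
        }
        return node.terminal;
    }

    int size() {
        return size;
    }
}
```

With "Beer", "Beans" and "Blood" added, the shared prefixes "B" and "Be" are stored once, and contains("Be") is false because no stored String ends there.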

时光倒影 2024-09-09 17:06:18

First, decide how much your application and developers would suffer if you eliminated some of that parsing. A faster application does you no good if you double your employee turnover rate in the process! I think based on your question we can assume you passed this test already.

Second, if you can't eliminate creating an object, then your next goal should be to ensure it doesn't survive an Eden collection. And parse-lookup can solve that problem. However, a cache "implemented properly" (I disagree with that basic premise, but I won't bore you with the attendant rant) usually brings thread contention. You'd be replacing one kind of memory pressure with another.

There's a variation of the parse-lookup idiom that suffers less from the sort of collateral damage you usually get from full-on caching, and that's a simple precalculated lookup table (see also "memoization"). The pattern you usually see for this is the Type Safe Enumeration (TSE). With the TSE, you parse the String, pass it to the TSE to retrieve the associated enumerated type, and then you throw the String away.
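For reference, the kind of application-wide cache the question asks about could be sketched as below. This is only a minimal illustration built on ConcurrentHashMap.putIfAbsent, not something proposed in this answer; the class name StringPool and the method canonical are made up. Entries are held strongly and never evicted, so the pool itself can become the memory problem if the set of distinct Strings is large.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class StringPool {
    private final ConcurrentMap<String, String> pool = new ConcurrentHashMap<>();

    // Returns the one shared instance equal to s, registering s if it is the first.
    String canonical(String s) {
        String previous = pool.putIfAbsent(s, s);
        return previous != null ? previous : s;
    }

    int size() {
        return pool.size();
    }
}
```

Callers would replace each freshly parsed String with pool.canonical(parsed), so that the duplicate becomes garbage soon after it is created.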

Is the text you're processing free-form, or does the input have to follow a rigid specification? If a lot of your text renders down to a fixed set of possible values, then a TSE could help you here, and serves a greater master: Adding context/semantics to your information at the point of creation, instead of at the point of use.
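A minimal sketch of such a lookup table, assuming the fixed set of values is the WARNING, ERROR and DEBUG levels from the log format shown in another answer on this page. The enum name LogLevel and the method parse are made up for this example. The map is filled once during class initialization and only read afterwards, so lookups from several threads need no locking.

```java
import java.util.HashMap;
import java.util.Map;

enum LogLevel {
    WARNING, ERROR, DEBUG;

    // Precalculated lookup table from the textual form to the constant.
    private static final Map<String, LogLevel> BY_NAME = new HashMap<>();

    static {
        for (LogLevel level : values()) {
            BY_NAME.put(level.name(), level);
        }
    }

    // Returns the matching constant, or null for an unknown token.
    static LogLevel parse(String token) {
        return BY_NAME.get(token);
    }
}
```

After the call to LogLevel.parse(token), the caller keeps only the enum constant, and the parsed token can be collected while it is still young.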
