How can I avoid String.intern() contention while keeping memory footprint low?

Published 2024-11-26 15:50:47


I am parsing a rather large (200 MB) XML file that results in a tree of objects each defining a bunch of parameters (key=value). This data structure is running in a Tomcat webapp and used to lookup those parameters.

Months ago we discovered a heap memory issue on this server. We could solve it by interning the parameter keys and values (most of them being very redundant) which reduced the memory footprint from over 150 MB to as little as 20 MB.

Today I am revisiting the server because people are complaining about startup times. I am profiling the server and seeing that parsing the XML with XPP3 takes 40 seconds, of which String.intern() takes more than 30 seconds.

I know this is a tradeoff. And I know I could do the interning myself. As parsing the XML is single-threaded, a simple HashMap might do the job as well. But you know, this feels kind of odd.

Did anybody crunch the numbers to see if it's worth dropping String.intern in favor of a different solution?

So the question is: how can I get contention as low as possible for such problems?

Thanks,
Stefan

Comments (4)

百善笑为先 2024-12-03 15:50:47


Add an extra indirection step: Have a second HashMap that keeps the keys, and look up the keys there first before inserting them in the in-memory structures. This will give you much more flexibility than String#intern().

However, if you need to parse that 200 MB XML file on every Tomcat startup, and the extra 10 seconds make people grumble (are they restarting Tomcat that often?), that raises a red flag. Have you considered using a database, even Apache Derby, to persist the parsed data?
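A minimal sketch of that second-HashMap interner (the class name MapInterner is made up for illustration). Since the XML parsing is single-threaded, a plain HashMap needs no synchronization, and the whole pool can be discarded after parsing, so it costs nothing at lookup time:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical single-threaded interner: one canonical instance per distinct string. */
final class MapInterner {
    private final Map<String, String> pool = new HashMap<>();

    /** Returns the canonical instance for s, storing s itself if it is new. */
    String intern(String s) {
        String canonical = pool.putIfAbsent(s, s);
        return canonical != null ? canonical : s;
    }
}
```

Unlike String.intern(), the pool here is an ordinary heap object: once parsing finishes you drop the MapInterner reference and the map itself becomes garbage, while the deduplicated strings live on in your parameter tree.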

我为君王 2024-12-03 15:50:47


It appears that String.intern() doesn't scale very well as you add more and more Strings; it appears to be O(n) in the number of Strings in the pool.

Random rand = new Random();
for(int i=0;i<100;i++) {
    long start = System.nanoTime();
    for(int j=0;j<100000;j++)
        Long.toString(rand.nextLong()).intern();
    long time = System.nanoTime() - start;
    System.out.printf("Took %,d ns on average to intern() a random string%n", time/100000);
}

prints

Took 1,586 ns on average to intern() a random string
Took 3,843 ns on average to intern() a random string
Took 7,551 ns on average to intern() a random string
Took 13,436 ns on average to intern() a random string
Took 20,226 ns on average to intern() a random string
Took 27,609 ns on average to intern() a random string
Took 35,098 ns on average to intern() a random string
Took 42,439 ns on average to intern() a random string
Took 50,801 ns on average to intern() a random string
Took 20,975 ns on average to intern() a random string
Took 4,634 ns on average to intern() a random string
Took 10,512 ns on average to intern() a random string
Took 16,914 ns on average to intern() a random string
Took 23,601 ns on average to intern() a random string
Took 30,230 ns on average to intern() a random string
Took 36,184 ns on average to intern() a random string
Took 43,266 ns on average to intern() a random string

Instead I use an array as a string pool.

private static void testHashArray(String[] strings2, int size) {
    String[] pool = new String[size];
    int hit=0, miss=0;
    long start2 = System.nanoTime();
    for (String s : strings2) {
        int hash = (s.hashCode() & 0x7fffffff) % pool.length;
        String s2 = pool[hash];
        if (s.equals(s2)) {
            hit++;
        } else {
            miss++;
        }
        if (s2 != s)
            pool[hash] = s;
    }
    long time2 = System.nanoTime() - start2;
    System.out.printf("Hash size: %,d took %.3f second. Hit/miss %,d/%,d %n", size, time2 / 1e9, hit, miss);
}

public static void main(String... args) {
    Random rand = new Random();

    // a million unique strings.
    String[] strings = new String[1000 * 1000];
    for (int i = 0; i < strings.length; i++)
        strings[i] = String.valueOf(rand.nextLong());
    // random selection of Strings
    String[] strings2 = new String[10 * 1000 * 1000];
    int totalSize = 0;
    for (int i = 0; i < strings2.length; i++) {
        int idx = (int) Math.pow(strings.length, rand.nextFloat());
        String s = strings[idx];
        strings2[i] = s;
        totalSize += s.length() + 16; // with overhead
    }
    System.out.printf("Original size %,d%n", totalSize);

    Set<String> uniqueStrings = Collections.newSetFromMap(new IdentityHashMap<String, Boolean>());
    uniqueStrings.addAll(Arrays.asList(strings2));
    System.out.printf("Unique strings %,d%n", uniqueStrings.size());

    long start = System.nanoTime();
    HashMap<String, String> map = new HashMap<>();
    for(String s: strings2)
        map.put(s,s);
    long time = System.nanoTime() - start;
    System.out.printf("Took %.3f second to map strings%n", time/1e9);

    testHashArray(strings2, 10192);
    testHashArray(strings2, 101929);
    testHashArray(strings2, 1019291);
}

prints

Original size 353,293,201
Unique strings 766,222
Took 0.979 second to map strings
Hash size: 10,192 took 0.357 second. Hit/miss 5,213,210/4,786,790 
Hash size: 101,929 took 0.309 second. Hit/miss 7,202,094/2,797,906 
Hash size: 1,019,291 took 0.254 second. Hit/miss 8,789,382/1,210,618 

If interning during the load is slow, how about performing it afterwards in a background thread? Once the server has loaded, you can intern() strings as duplicates are found.

Do you really need to save 130 MB? I know it sounds great, but would the memory be used for something else anyway?

If you want a faster form of intern(), you can use a fixed-size array, as above.

一腔孤↑勇 2024-12-03 15:50:47


We had a problem with Strings being parsed into verified 'Name' objects. This was done all over the place in the application and needed to be optimized for both memory and speed.

After a few test runs we eventually ended up with a solution that processes char arrays, both during parsing and in the implementation of Name.

Use String.toCharArray() to retrieve the string's characters, or String.charAt(pos). For quick copying between arrays we used System.arraycopy.

Parsing was actually quicker than using a cache for lookup.
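A rough sketch of such a char[]-backed Name; the class shape and constructor are assumptions for illustration, not the poster's actual code:

```java
import java.util.Arrays;

/** Hypothetical Name value object copied out of a parse buffer via System.arraycopy. */
final class Name {
    private final char[] chars;

    /** Copies a slice of the parse buffer; avoids creating an intermediate String. */
    Name(char[] source, int offset, int length) {
        this.chars = new char[length];
        System.arraycopy(source, offset, chars, 0, length);
    }

    int length() {
        return chars.length;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof Name && Arrays.equals(chars, ((Name) o).chars);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(chars);
    }

    @Override
    public String toString() {
        return new String(chars);
    }
}
```

The point is that equality and hashing work directly on the char[] contents, so no String objects (and no interning) are needed on the parsing hot path.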

︶ ̄淡然 2024-12-03 15:50:47


Here's another thought, though it may sound a bit on the kooky side. Have you thought of writing a code generator that parses your XML file and spits out Java code which populates a map using actual string literals (those get interned at compile time)?

Something like this

public final class ConfigurationData {
  public static String get(String key) {
    return MAP.get(key);
  }
  private static final Map<String,String> MAP;
  static {
    MAP = new HashMap<String,String>([[[ number of records to load up ]]]);
    MAP.put([[[key 1]]], [[[ value 1 ]]]);
    MAP.put([[[key 2]]], [[[ value 2 ]]]);
    ...
  }
}

This follows the same concept as precompiled JSPs, trading the first-user penalty for another build step, and it requires a redeployment whenever the configuration file changes (which should be controlled anyway).
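The generator side of this idea can be sketched as follows. ConfigCodeGen and generatePuts are hypothetical names, and only the body of the static initializer is emitted; the class skeleton shown above would wrap it:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical generator emitting MAP.put lines for the ConfigurationData skeleton. */
final class ConfigCodeGen {
    /** Renders one MAP.put statement per parsed parameter as Java source text. */
    static String generatePuts(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            // String literals in the generated source are interned by the compiler for free.
            sb.append("    MAP.put(\"").append(escape(e.getKey()))
              .append("\", \"").append(escape(e.getValue())).append("\");\n");
        }
        return sb.toString();
    }

    /** Escapes backslashes and quotes so values survive as Java string literals. */
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

Running this once at build time over the parsed 200 MB XML would move both the parse cost and the interning cost out of server startup entirely.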
