HashMap stored on disk is very slow to read back from disk

I have a HashMap that stores external uids, and it then stores a different id (internal to our app) that has been set for the given uid.

e.g.:

  • 123.345.432=00001
  • 123.354.433=00002

The map is checked by uid to make sure the same internal id will be used if something is resent to the application.

DICOMUID2StudyIdentiferMap is defined as follows:

private static Map DICOMUID2StudyIdentiferMap = Collections.synchronizedMap(new HashMap());
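
For concreteness, the uid check described above might look something like this (a hypothetical sketch, not code from the question; getNextInternalId() is an assumed helper):

// Reuse the internal id already assigned to this uid if present;
// otherwise assign a fresh one and remember it.
String internalId = (String) DICOMUID2StudyIdentiferMap.get(externalUid);
if (internalId == null) {
    internalId = getNextInternalId(); // assumed helper
    DICOMUID2StudyIdentiferMap.put(externalUid, internalId);
}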

The load, however, will overwrite it if we load successfully; otherwise it will use the default empty HashMap.

It's read back from disk by doing:

FileInputStream f = new FileInputStream( studyUIDFile );  
ObjectInputStream s = new ObjectInputStream( f );

Map loadedMap = ( Map )s.readObject();
DICOMUID2StudyIdentiferMap = Collections.synchronizedMap( loadedMap );

The HashMap is written to disk using:

FileOutputStream f = new FileOutputStream( studyUIDFile );
ObjectOutputStream s = new ObjectOutputStream( f );

s.writeObject(DICOMUID2StudyIdentiferMap);

The issue I have is: running locally in Eclipse, performance is fine, but when the application is in normal use on a machine, the HashMap takes several minutes to load from disk. Once loaded, it also takes a long time to check for a previous value, e.g. by seeing whether DICOMUID2StudyIdentiferMap.put(..., ...) returns a value.

I load the same map object in both cases; it's a ~400 KB file. The HashMap it contains has about 3,000 key-value pairs.

Why is it so slow on one machine, but not in Eclipse?

The machine is a VM running XP. It has only recently started becoming slow to read the HashMap, so it must be related to its size; however, 400 KB isn't very big, I don't think.

Any advice welcome, TIA

Comments (6)

梦旅人picnic 2024-11-26 23:21:53

As @biziclop comments, you should start by using a profiler to see where your application is spending all of its time.

If that doesn't give you any results, here are a couple of theories.

  • It could be that your application is getting close to running out of heap. As the JVM gets close to running out of heap, it can spend nearly all of its time garbage collecting in a vain attempt to keep going. This will show up if you enable GC logging.

  • It could be that the ObjectInputStream and ObjectOutputStream are doing huge numbers of small read syscalls. Try wrapping the file streams with buffered streams and see if it speeds things up noticeably, as sketched below.
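
For example, a minimal sketch of the buffered variant (reusing the file and map names from the question):

import java.io.*;
import java.util.Map;

// Load: BufferedInputStream batches the underlying file reads, so the
// ObjectInputStream no longer triggers a syscall for every few bytes.
ObjectInputStream in = new ObjectInputStream(
        new BufferedInputStream(new FileInputStream(studyUIDFile)));
Map loadedMap = (Map) in.readObject();
in.close();

// Save: same idea on the write side; close() also flushes the buffer.
ObjectOutputStream out = new ObjectOutputStream(
        new BufferedOutputStream(new FileOutputStream(studyUIDFile)));
out.writeObject(DICOMUID2StudyIdentiferMap);
out.close();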

Why is it so slow on one machine, but not in eclipse?

The "full heap" theory could explain that. The default heap size for Eclipse is a lot bigger than for an application launched using java ... with no heap size options.

面如桃花 2024-11-26 23:21:53

Not sure that serialising your Map is the best option. If the Map is disk-based for persistence, why not use a lib that's designed for disk? Check out Kyoto Cabinet. It's actually written in C++ but there is a Java API. I've used it several times; it's very easy to use, very fast and can scale to a huge size.

This is an example I'm copy/pasting for Tokyo Cabinet, the predecessor of Kyoto Cabinet, but it's basically the same:

import java.io.IOException;

import tokyocabinet.HDB;

public class HashExample { // illustrative class name; wrapper added so the snippet compiles
    public static void main(String[] args) throws IOException {
        String dir = "/path/to/my/dir/";
        HDB hash = new HDB();

        // open the hash for read/write, create if it does not exist on disk
        if (!hash.open(dir + "unigrams.tch", HDB.OWRITER | HDB.OCREAT)) {
            throw new IOException("Unable to open " + dir + "unigrams.tch: " + hash.errmsg());
        }

        // Add something to the hash
        hash.put("blah", "my string");

        // Close it
        hash.close();
    }
}
感悟人生的甜 2024-11-26 23:21:53

Here is a list of 122 NoSQL databases you could use as an alternative.

You have two expensive operations here: one is the serialization of objects and the second is disk access. You can speed up access by only reading/writing the data you need. You can speed up the serialization by using a custom format.

You could also change the structure of your data to make it more efficient. If you want to reload/rewrite the whole map each time, I would suggest the following approach.


import java.io.*;
import java.util.*;

public class Main {
    private Map<Integer, Integer> mapping = new LinkedHashMap<Integer, Integer>();

    public void saveTo(File file) throws IOException {
        DataOutputStream dos = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(file)));
        dos.writeInt(mapping.size());
        for (Map.Entry<Integer, Integer> entry : mapping.entrySet()) {
            dos.writeInt(entry.getKey());
            dos.writeInt(entry.getValue());
        }
        dos.close();
    }

    public void loadFrom(File file) throws IOException {
        DataInputStream dis = new DataInputStream(new BufferedInputStream(new FileInputStream(file)));
        mapping.clear();
        int len = dis.readInt();
        for (int i = 0; i < len; i++)
            mapping.put(dis.readInt(), dis.readInt());
        dis.close();
    }

    public static void main(String[] args) throws IOException {
        Random rand = new Random();
        Main main = new Main();
        for (int i = 1; i <= 3000; i++) {
            // uids in the range 100,000,000 to 999,999,999
            int uid = 100000000 + rand.nextInt(900000000);
            main.mapping.put(uid, i);
        }
        final File file = File.createTempFile("deleteme", "data");
        file.deleteOnExit();
        for (int i = 0; i < 10; i++) {
            long start = System.nanoTime();
            main.saveTo(file);
            long mid = System.nanoTime();
            new Main().loadFrom(file);
            long end = System.nanoTime();
            // save duration is (mid - start), load duration is (end - mid)
            System.out.printf("Took %.3f ms to save and %.3f ms to load %,d entries.%n",
                    (mid - start) / 1e6, (end - mid) / 1e6, main.mapping.size());
        }
    }
}

prints

Took 1.706 ms to save and 1.203 ms to load 3,000 entries.
Took 1.203 ms to save and 1.209 ms to load 3,000 entries.
Took 0.966 ms to save and 0.961 ms to load 3,000 entries.

Using TIntIntHashMap instead is about 10% faster.
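
A sketch of that change (assuming Trove 3 on the classpath; only the field and loadFrom need to differ from the class above):

import gnu.trove.map.hash.TIntIntHashMap; // Trove 2 names this gnu.trove.TIntIntHashMap

// Primitive int keys and values avoid boxing every entry into Integer objects.
private TIntIntHashMap mapping = new TIntIntHashMap();

public void loadFrom(File file) throws IOException {
    DataInputStream dis = new DataInputStream(new BufferedInputStream(new FileInputStream(file)));
    mapping.clear();
    int len = dis.readInt();
    for (int i = 0; i < len; i++)
        mapping.put(dis.readInt(), dis.readInt());
    dis.close();
}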

Increasing the size of the Map to 1 million entries prints

Took 62.009 ms to save and 412.718 ms to load 1,000,000 entries.
Took 61.756 ms to save and 403.135 ms to load 1,000,000 entries.
Took 61.816 ms to save and 399.431 ms to load 1,000,000 entries.
活泼老夫 2024-11-26 23:21:53

Maybe you should look for alternatives that work similarly to a Map, e.g. SimpleDB, BerkeleyDB or Google BigTable.
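
For instance, a minimal sketch with Berkeley DB Java Edition (the environment directory and database name are illustrative; the calls are from the com.sleepycat.je API):

import java.io.File;

import com.sleepycat.je.*;

// Open (or create) an on-disk environment and database, then store one mapping.
// (Inside a method that handles DatabaseException.)
EnvironmentConfig envConfig = new EnvironmentConfig();
envConfig.setAllowCreate(true);
Environment env = new Environment(new File("/path/to/env/dir"), envConfig);

DatabaseConfig dbConfig = new DatabaseConfig();
dbConfig.setAllowCreate(true);
Database db = env.openDatabase(null, "uidMap", dbConfig);

db.put(null, new DatabaseEntry("123.345.432".getBytes()),
        new DatabaseEntry("00001".getBytes()));

db.close();
env.close();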

桃扇骨 2024-11-26 23:21:53

Voldemort is a popular open-source key-value store by LinkedIn. I advise you to have a look at the source code to see how they did things. Right now I am looking at the serialization part at https://github.com/voldemort/voldemort/blob/master/src/java/voldemort/serialization/ObjectSerializer.java. Looking at the code, they are using ByteArrayOutputStream, which I assume is a more efficient way to read/write to/from disk.
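
A minimal sketch of that idea (not Voldemort's actual code): serialize the map into an in-memory buffer first, then write the whole buffer to disk in one call.

import java.io.*;
import java.util.Map;

// Serialize into memory, then hit the disk once with the complete buffer.
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
ObjectOutputStream oos = new ObjectOutputStream(buffer);
oos.writeObject(map); // "map" stands in for DICOMUID2StudyIdentiferMap
oos.close();

FileOutputStream fos = new FileOutputStream(file);
fos.write(buffer.toByteArray());
fos.close();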

Why is it so slow on one machine, but not in Eclipse?

Not really clear from your question, but is Eclipse running in a VM (VirtualBox?)? Because if so, it might be faster because the complete VM is stored in memory, which is a lot faster than accessing the disk.

℡寂寞咖啡 2024-11-26 23:21:53

I think this may be a hashing problem. What is the type of the key you are using in the Map, and does it have an efficient hashCode() method that spreads out the keys well?
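
For illustration, a hypothetical key wrapper (not from the question) that delegates to the uid string, so keys spread across buckets the way String keys do:

// A missing or constant hashCode() here would push every entry into one
// bucket and turn each lookup into a linear scan.
public final class DicomUid {
    private final String uid;

    public DicomUid(String uid) { this.uid = uid; }

    @Override public boolean equals(Object o) {
        return o instanceof DicomUid && uid.equals(((DicomUid) o).uid);
    }

    @Override public int hashCode() {
        return uid.hashCode(); // String already spreads keys well
    }
}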
