Huge performance difference between Vector and HashSet

Posted 2024-09-08 04:55:27

I have a program which fetches records from a database (using Hibernate) and fills them into a Vector. There was an issue with the performance of the operation, so I ran a test with the Vector replaced by a HashSet. With 300,000 records, the speed gain is immense - 45 mins to 2 mins!

So my question is, what is causing this huge difference? Is it just the point that all methods in Vector are synchronized or the point that internally Vector uses an array whereas HashSet does not? Or something else?

The code is running in a single thread.

EDIT:
The code is only inserting the values in the Vector (and in the other case, HashSet).

Comments (7)

弃爱 2024-09-15 04:55:45

According to Dr Heinz Kabutz, he said this in one of his newsletters.

The old Vector class implements serialization in a naive way. It simply does the default serialization, which writes the entire internal Object[] as-is into the stream. Thus if we insert a bunch of elements into the List, then clear it, the difference between Vector and ArrayList is enormous.

import java.util.*;
import java.io.*;

public class VectorWritingSize {
  public static void main(String[] args) throws IOException {
    test(new LinkedList<String>());
    test(new ArrayList<String>());
    test(new Vector<String>());
  }

  public static void test(List<String> list) throws IOException {
    insertJunk(list);
    for (int i = 0; i < 10; i++) {
      list.add("hello world");
    }
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    ObjectOutputStream out = new ObjectOutputStream(baos);
    out.writeObject(list);
    out.close();
    System.out.println(list.getClass().getSimpleName() +
        " used " + baos.toByteArray().length + " bytes");
  }

  private static void insertJunk(List<String> list) {
    for(int i = 0; i<1000 * 1000; i++) {
      list.add("junk");
    }
    list.clear();
  }
}

When we run this code, we get the following output:

LinkedList used 107 bytes
ArrayList used 117 bytes
Vector used 1310926 bytes

Vector can use a staggering amount of bytes when being serialized. The lesson here? Don't ever use Vector as Lists in objects that are Serializable. The potential for disaster is too great.
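If a Vector really has to live inside a Serializable object, one mitigation (a sketch of mine, not from the newsletter - the class and method names here are made up for illustration) is to call trimToSize() just before serializing, so the backing array is shrunk to the element count:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.Vector;

public class VectorTrimDemo {
  // Serialize an object into a byte array and report its length.
  static int serializedSize(Object o) throws IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    ObjectOutputStream out = new ObjectOutputStream(baos);
    out.writeObject(o);
    out.close();
    return baos.toByteArray().length;
  }

  // Returns {bytes before trimToSize, bytes after trimToSize}.
  static int[] sizes() throws IOException {
    Vector<String> vector = new Vector<String>();
    for (int i = 0; i < 1000 * 1000; i++) {
      vector.add("junk"); // grow the internal array to ~1M slots
    }
    vector.clear();       // size is 0 again, but the capacity remains
    vector.add("hello world");

    int before = serializedSize(vector);
    vector.trimToSize();  // shrink the backing array down to size()
    int after = serializedSize(vector);
    return new int[] {before, after};
  }

  public static void main(String[] args) throws IOException {
    int[] s = sizes();
    System.out.println("before trimToSize: " + s[0] + " bytes");
    System.out.println("after trimToSize:  " + s[1] + " bytes");
  }
}
```

Note this only fixes the snapshot at serialization time; the next burst of inserts will grow the array again.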

一笔一画续写前缘 2024-09-15 04:55:43
import java.util.*;

public class Test {
  public static void main(String[] args) {
    long start = System.currentTimeMillis();
    Vector<String> vector = new Vector<String>();
    for (int i = 0; i < 300000; i++) {
      String value = "dummy value " + i;
      if (!vector.contains(value)) { // linear scan of the whole vector on every insert
        vector.add(value);
      }
    }
    long end = System.currentTimeMillis();
    System.out.println("Time taken: " + (end - start) + "ms");
  }
}

If you check for a duplicate before inserting each element into the Vector, it will take more and more time as the Vector grows. The better approach for performance is a HashSet, because a HashSet does not allow duplicates and there is no need to scan for them before inserting.
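The shape of that cost can be made concrete with a rough micro-benchmark sketch (the class name and value strings here are made up; absolute timings will vary by machine). Each Vector.contains() is a linear scan, so the deduplicating loop is O(n²) overall, while HashSet.add rejects duplicates in amortized O(1):

```java
import java.util.HashSet;
import java.util.Set;
import java.util.Vector;

public class DedupTest {
  // Deduplicate via Vector.contains(): each call scans the whole vector,
  // so inserting n distinct values costs O(n^2) comparisons in total.
  static long timeVector(int n) {
    long start = System.currentTimeMillis();
    Vector<String> vector = new Vector<String>();
    for (int i = 0; i < n; i++) {
      String value = "value " + i;
      if (!vector.contains(value)) {
        vector.add(value);
      }
    }
    return System.currentTimeMillis() - start;
  }

  // HashSet.add() already rejects duplicates in amortized O(1),
  // so the same job is O(n) overall.
  static long timeHashSet(int n) {
    long start = System.currentTimeMillis();
    Set<String> set = new HashSet<String>();
    for (int i = 0; i < n; i++) {
      set.add("value " + i);
    }
    return System.currentTimeMillis() - start;
  }

  public static void main(String[] args) {
    int n = 30000;
    System.out.println("Vector with contains(): " + timeVector(n) + "ms");
    System.out.println("HashSet:                " + timeHashSet(n) + "ms");
  }
}
```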

楠木可依 2024-09-15 04:55:42

Under normal circumstances, it is totally implausible that inserting 300,000 records into a Vector will take 43 minutes longer than inserting the same records into a HashSet.

However, I think there is a possible explanation of what might be going on.

First, the records coming out of the database must have a very high proportion of duplicates. Or at least, they must be duplicates according to the semantics of the equals/hashcode methods of your record class.

Next, I think you must be pushing very close to filling up the heap.

So the reason that the HashSet solution is so much faster is that most of the records are being discarded as duplicates by the set.add operation, so the set stays small. By contrast, the Vector solution keeps all of the records, and the JVM spends most of its time trying to squeeze out the last 0.05% of memory by running the GC over, and over, and over.

One way to test this theory is to run the Vector version of the application with a much bigger heap.


Irrespective, the best way to investigate this kind of problem is to run the application using a profiler, and see where all the CPU time is going.
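A cheap way to sanity-check the heap-pressure theory from inside the program (a sketch; -verbose:gc output or a real profiler gives far better evidence) is to log heap usage periodically while the records load:

```java
public class HeapLogger {
  // Used heap in megabytes; cheap enough to call every few thousand records.
  static long usedHeapMb() {
    Runtime rt = Runtime.getRuntime();
    return (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
  }

  public static void main(String[] args) {
    long maxMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
    System.out.println("Used heap: " + usedHeapMb() + " MB of max " + maxMb + " MB");
    // If this figure creeps toward the max while the Vector version is loading,
    // GC thrashing near a full heap becomes a much more plausible explanation.
  }
}
```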

想念有你 2024-09-15 04:55:40

Vector is synchronized by default; HashSet is not. That's my guess. Obtaining a monitor for access takes time.

I don't know if there are reads in your test, but lookups are cheap in both cases: Vector.get() by index and HashSet.contains() are both O(1).

岛歌少女 2024-09-15 04:55:38

Vector is outdated and should not be used anymore. Profile with ArrayList or LinkedList (depends on how you use the list) and you will see the difference (sync vs unsync).
Why are you using Vector in a single threaded application at all?
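As a sketch of the usual modern alternatives (the timings this prints are illustrative, not guaranteed to differ on every machine): a plain ArrayList for single-threaded code, and Collections.synchronizedList only when synchronization is genuinely required:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Vector;

public class ListChoices {
  // Time n appends to any List implementation.
  static long timeAdds(List<String> list, int n) {
    long start = System.currentTimeMillis();
    for (int i = 0; i < n; i++) {
      list.add("dummy value");
    }
    return System.currentTimeMillis() - start;
  }

  public static void main(String[] args) {
    int n = 300000;
    // Single-threaded code: a plain ArrayList, no monitor taken on each add.
    System.out.println("ArrayList: " + timeAdds(new ArrayList<String>(), n) + "ms");
    // Legacy Vector: every method is synchronized.
    System.out.println("Vector:    " + timeAdds(new Vector<String>(), n) + "ms");
    // If a synchronized List is genuinely needed, wrap explicitly instead:
    List<String> synced = Collections.synchronizedList(new ArrayList<String>());
    System.out.println("synchronizedList: " + timeAdds(synced, n) + "ms");
  }
}
```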

胡大本事 2024-09-15 04:55:36

If you are inserting them at the middle or beginning instead of at the end, then the Vector needs to shift all the later elements along. On every insert. The HashSet, on the other hand, doesn't care about position and doesn't have to move anything.

旧故 2024-09-15 04:55:34

If it's trying to use the Vector as a set, and checking for the existence of a record before adding it, then filling the vector becomes an O(n^2) operation, compared with O(n) for HashSet. It would also become an O(n^2) operation if you insert each element at the start of the vector instead of at the end.

If you're just using collection.add(item) then I wouldn't expect to see that sort of difference - synchronization isn't that slow.

If you can try to test it with different numbers of records, you could see how each version grows as n increases - that would make it easier to work out what's going on.

EDIT: If you're just using Vector.add then it sounds like something else could be going on - e.g. your database was behaving differently between your different test runs. Here's a little test application:

import java.util.*;

public class Test {
  public static void main(String[] args) {
    long start = System.currentTimeMillis();
    Vector<String> vector = new Vector<String>();
    for (int i = 0; i < 300000; i++) {
      vector.add("dummy value");
    }
    long end = System.currentTimeMillis();
    System.out.println("Time taken: " + (end - start) + "ms");
  }
}

Output:

Time taken: 38ms

Now obviously this isn't going to be very accurate - System.currentTimeMillis isn't the best way of getting accurate timing - but it's clearly not taking 45 minutes. In other words, you should look elsewhere for the problem, if you really are just calling Vector.add(item).

Now, changing the code above to use

vector.add(0, "dummy value"); // Insert item at the beginning

makes an enormous difference - it takes 42 seconds instead of 38ms. That's clearly a lot worse - but it's still a long way from being 45 minutes - and I doubt that my desktop is 60 times as fast as yours.
