Ground rules for defining high-performance data structures (Java)
I generally use Vectors/ArrayLists, HashMaps/TreeMaps, and other Java collections interchangeably, except when there are functional API requirements (for example, I might need a sorted data set in certain instances).
Lately, however, I've found a need to push Java performance to the limit for some algorithms I'm running.
Is there a set of guidelines for high-performance data structures that I can use as ground rules for my coding?
I'm looking for general rules, but in this context, answers to the following questions might also be very helpful:
1) When should I use multidimensional arrays instead of nested Collections?
2) Vectors vs. ArrayLists - is there truly a performance difference?
3) Do collection APIs like Google's collections, Java tricks (like reflection and casting), and other common Java developer idioms tend to slow down the JVM when it is under heavy load?
4) Does using regular objects instead of primitives (i.e. Double vs. double) slow down the JVM when doing lots of calculations?
5) Are there other important guidelines for dealing with large collections in Java programs which need to be high-performance?
- Note: at this point, I'm not doing any multithreading. I realize that there are other constraints which might apply once I start parallelizing.
Comments (8)
All performance issues should be addressed first by profiling (for both time and memory/object use). Don't optimize things that aren't a factor in the performance of your code. With that caveat, there are some general rules of thumb (all of which should be tested by profiling!):
1) When you don't need the dynamic sizing of Collections and don't need to feed your data to anything that requires a Collection, multidimensional arrays (arrays of arrays, actually) can be faster.
2) Yes. Many methods in Vector are synchronized, which is expensive. If you aren't multithreading, avoid Vector. Even if you are, the granularity of the synchronization is usually wrong and you're better off providing thread safety yourself.
3) Reflection is slow; garbage collection is slow. Anything you can do to avoid those will speed things up.
4) Yes. Autoboxing/unboxing can create huge amounts of garbage very quickly. This all has to be collected, which will also slow down your program.
5) Prefer local method variables to field accesses. You can find many other guidelines by searching the web. The main thing, though, is to profile.
Edit: There's a good collection of performance hints here.
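To make the array-versus-nested-Collections point concrete, here is a minimal sketch. The class name GridSum and its methods are made up for illustration; both methods compute the same sum, but the nested-list version has to unbox a Double for every element. Which one is actually faster in a given program should still be checked with a profiler.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class, not taken from the answer above.
final class GridSum {
    // Sums a fixed-size grid stored as an array of arrays of primitives.
    static double sumArray(double[][] grid) {
        double total = 0.0;
        for (double[] row : grid) {
            for (double v : row) {
                total += v;
            }
        }
        return total;
    }

    // Same sum over nested Collections: each element is a boxed Double.
    static double sumNested(List<List<Double>> grid) {
        double total = 0.0;
        for (List<Double> row : grid) {
            for (Double v : row) {
                total += v; // auto-unboxing happens here
            }
        }
        return total;
    }

    public static void main(String[] args) {
        int rows = 3, cols = 4;
        double[][] array = new double[rows][cols];
        List<List<Double>> nested = new ArrayList<>(rows);
        for (int r = 0; r < rows; r++) {
            List<Double> row = new ArrayList<>(cols);
            for (int c = 0; c < cols; c++) {
                array[r][c] = r * cols + c;
                row.add((double) (r * cols + c)); // autoboxing: double -> Double
            }
            nested.add(row);
        }
        System.out.println(sumArray(array) + " " + sumNested(nested));
    }
}
```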
To answer your 4): yes, Double vs. double definitely affects performance.
When you have collections made up of primitives you certainly can use collections backed by primitives, like the very good Trove API. By avoiding constant primitive-to-object and vice-versa (un)boxing you save both memory and precious time.
Also the Vector class is, by now, pretty much a thing of the past.
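As a rough idea of what "a collection backed by primitives" means, here is a hand-rolled sketch. DoubleList is a hypothetical class written for this illustration and is not the Trove API; a real library offers far more. The point is only that values live in a double[] and are never boxed.

```java
import java.util.Arrays;

/** Hypothetical minimal growable list of primitive doubles (not the Trove API). */
final class DoubleList {
    private double[] data;
    private int size;

    DoubleList(int initialCapacity) {
        data = new double[Math.max(1, initialCapacity)];
    }

    // Appends a value, doubling the backing array when it is full.
    void add(double value) {
        if (size == data.length) {
            data = Arrays.copyOf(data, data.length * 2);
        }
        data[size++] = value;
    }

    double get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return data[index];
    }

    int size() {
        return size;
    }

    // Sums the stored values without creating any Double objects.
    double sum() {
        double total = 0.0;
        for (int i = 0; i < size; i++) {
            total += data[i];
        }
        return total;
    }

    public static void main(String[] args) {
        DoubleList list = new DoubleList(4);
        for (int i = 0; i < 10; i++) {
            list.add(i * 0.5);
        }
        System.out.println(list.size() + " values, sum = " + list.sum());
    }
}
```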
1) If you don't require really dynamic resizing, or you can fit your data inside a small enough "maximum size" container, then you will get better performance on random access from arrays than you do from collections due to the removal of method call overhead and possibly more (depending on the collections used).
2) In my opinion, Vector and Hashtable should be treated almost as if they were deprecated. They are "thread safe", but for most real-world scenarios, simply having the data structure itself be thread safe is not sufficient; usually your application logic also has to be a part of this synchronization. ArrayList and HashMap will perform better because they don't have synchronized blocks, which 99.9% of the time don't get you anything useful anyway.
3) Google's collections APIs are great, no real performance issues. Reflection is definitely slow and should not be in inner loops.
4) Ideally you would like to avoid boxing/unboxing of primitives in inner loops. You can find collections that are specifically tuned to primitives (e.g. Trove collections http://trove.starlight-systems.com/).
5) It depends on the specific use; I wouldn't say that there are any general guidelines. Just be sure to understand what you are doing when transforming collections, etc. For example, when you transform a list to a set or something like that, be sure you know whether it is cloning your entire collection.
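The last point can be sketched with standard JDK calls. In the snippet below (class and method names are made up for illustration), new HashSet<>(list) copies every element, while Collections.unmodifiableList(list) only wraps the existing list, so later changes to the source show through the wrapper but not the copy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration class.
final class CopyVersusView {
    // Copies every element into a new hash table: O(n) time and extra memory.
    static Set<Integer> toSetCopy(List<Integer> list) {
        return new HashSet<>(list);
    }

    // Wraps the list without copying; changes to the backing list show through.
    static List<Integer> readOnlyView(List<Integer> list) {
        return Collections.unmodifiableList(list);
    }

    public static void main(String[] args) {
        List<Integer> source = new ArrayList<>(Arrays.asList(1, 2, 3));
        Set<Integer> copy = toSetCopy(source);
        List<Integer> view = readOnlyView(source);
        source.add(4); // made after the copy and the view were created
        System.out.println("copy sees 4? " + copy.contains(4));
        System.out.println("view sees 4? " + view.contains(4));
    }
}
```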
I believe the only time you should use Vector is when you need it to be synchronized, but you can use a synchronized wrapper around an ArrayList instead, so I'd say Vector isn't needed. Always use ArrayList instead of LinkedList. It departs from common sense, so it has to be Java's implementation, but ArrayList is tons faster. I used to believe in LinkedList, so I created the following test:
import java.util.ArrayList;
import java.util.GregorianCalendar;
import java.util.LinkedList;
import java.util.List;
import java.util.Random;
/**
 * @author thom
 */
public class ListTest {
    // The body of the benchmark is missing from this copy of the post.
}
And it produced the following results:
Someone please verify my code to ensure that I didn't do something stupid, but it shows that ArrayList is far faster than LinkedList for everything I tested.
Reflection is definitely slow.
Primitives are way faster for calculations. Be careful about auto-boxing as it's a performance hit. It's nice, just be sure you understand the costs.
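Since the original test body is not available here, the following is a separate, minimal sketch of how one might compare random reads on the two list types. It is not the author's benchmark; the class name, sizes, and seed are arbitrary assumptions. ArrayList.get(i) is a constant-time array lookup, while LinkedList.get(i) has to walk nodes, so random reads are the case where the gap is expected to be largest. Other operations can behave differently, and a hand-written timing loop like this is only a rough indication.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Random;

// Hypothetical timing sketch, not the benchmark from the answer above.
final class ListAccessTiming {
    // Fills the list with 0..n-1, then reads `reads` pseudo-random positions.
    // Returns a checksum so the reads cannot be optimized away as unused.
    static long randomReads(List<Integer> list, int n, int reads, long seed) {
        for (int i = 0; i < n; i++) {
            list.add(i);
        }
        Random random = new Random(seed);
        long checksum = 0;
        for (int i = 0; i < reads; i++) {
            checksum += list.get(random.nextInt(n));
        }
        return checksum;
    }

    // Wall-clock time for one fill-and-read run, in nanoseconds.
    static long timeNanos(List<Integer> list, int n, int reads) {
        long start = System.nanoTime();
        randomReads(list, n, reads, 42L);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int n = 20_000;
        int reads = 20_000;
        // A few untimed runs first so the JIT has a chance to compile the loops.
        for (int i = 0; i < 3; i++) {
            timeNanos(new ArrayList<Integer>(), n, reads);
            timeNanos(new LinkedList<Integer>(), n, reads);
        }
        System.out.println("ArrayList  ns: " + timeNanos(new ArrayList<Integer>(), n, reads));
        System.out.println("LinkedList ns: " + timeNanos(new LinkedList<Integer>(), n, reads));
    }
}
```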
1) When you know the maximum size, use arrays.
2) Vector has synchronized methods, so it is slower than ArrayList. There is a difference. Lately there is a tendency to use Collections.synchronizedList instead of Vector.
3) There are a few implementations of "fast" collections, e.g. http://labs.carrotsearch.com/hppc.html or Trove; see also the question "What is the most efficient Java Collections library?"
4) If you can, use primitives. Wrappers bring additional overhead.
5) Think about what you have to do and which operations will be performed most often. For example, adding an element to a set is slower than adding to an ArrayList, and iterating through an ArrayList is faster than iterating through a set. However, removing elements from an ArrayList is slower than removing them from a set. When it is possible, use arrays; they will be faster than any other collection. When you have to use a collection but know approximately how many elements will be inserted, use the constructor that takes an initial size.
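For point 2, a minimal sketch of Collections.synchronizedList may help. The class and method names are made up for illustration. Individual calls such as add are synchronized by the wrapper, but iteration has to be guarded by the caller, which is part of why a synchronized data structure alone is often not enough.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration class.
final class SynchronizedListDemo {
    // Two threads each append perThread integers to one shared, wrapped ArrayList.
    static List<Integer> fillFromTwoThreads(int perThread) {
        List<Integer> shared =
                Collections.synchronizedList(new ArrayList<Integer>(2 * perThread));
        Runnable writer = () -> {
            for (int i = 0; i < perThread; i++) {
                shared.add(i); // each add is synchronized by the wrapper
            }
        };
        Thread a = new Thread(writer);
        Thread b = new Thread(writer);
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for writers", e);
        }
        return shared;
    }

    // Iterating a synchronizedList must be wrapped in a manual synchronized block.
    static long sum(List<Integer> shared) {
        long total = 0;
        synchronized (shared) {
            for (int v : shared) {
                total += v;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> shared = fillFromTwoThreads(1000);
        System.out.println(shared.size() + " elements, sum = " + sum(shared));
    }
}
```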
IMHO the first and foremost rule is to pick the right structure for your use case.
Using a map to implement a dictionary might be good for performance (time) but would take a lot of memory (space); use a Trie instead.
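A minimal sketch of a Trie for a word dictionary, assuming keys made only of the letters a-z. The class is hypothetical and deliberately simple; whether it really uses less memory than a map depends on how many prefixes the keys share and on how nodes are represented, so measure before committing to it.

```java
// Hypothetical minimal trie over lowercase ASCII letters.
final class LowercaseTrie {
    private static final int ALPHABET = 26;

    private static final class Node {
        final Node[] children = new Node[ALPHABET];
        boolean terminal; // true if a stored word ends at this node
    }

    private final Node root = new Node();

    // Adds a word, creating nodes only for the part not already shared.
    void insert(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            int slot = slotOf(word.charAt(i));
            if (node.children[slot] == null) {
                node.children[slot] = new Node();
            }
            node = node.children[slot];
        }
        node.terminal = true;
    }

    boolean contains(String word) {
        Node node = find(word);
        return node != null && node.terminal;
    }

    boolean hasPrefix(String prefix) {
        return find(prefix) != null;
    }

    // Walks the trie along s; returns null if the path does not exist.
    private Node find(String s) {
        Node node = root;
        for (int i = 0; i < s.length() && node != null; i++) {
            node = node.children[slotOf(s.charAt(i))];
        }
        return node;
    }

    private static int slotOf(char c) {
        if (c < 'a' || c > 'z') {
            throw new IllegalArgumentException("only a-z supported: " + c);
        }
        return c - 'a';
    }

    public static void main(String[] args) {
        LowercaseTrie trie = new LowercaseTrie();
        trie.insert("car");
        trie.insert("cart");
        System.out.println(trie.contains("car") + " " + trie.contains("ca") + " " + trie.hasPrefix("ca"));
    }
}
```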
Hash search (using HashMap) is good, but if your keys fall in a finite numeric range, an array will do better.
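As a sketch of that idea, the snippet below counts occurrences of integer keys known to lie in [0, range). The class and method names are made up for illustration. The array version indexes directly by key with no hashing or boxing; the HashMap version is shown for comparison and produces the same counts.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class.
final class BoundedKeyCounts {
    // Direct addressing: the key itself is the array index. Keys must be in [0, range).
    static int[] countWithArray(int[] keys, int range) {
        int[] counts = new int[range];
        for (int key : keys) {
            counts[key]++;
        }
        return counts;
    }

    // General-purpose alternative: keys and counts are boxed Integers.
    static Map<Integer, Integer> countWithMap(int[] keys) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int key : keys) {
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] keys = {3, 1, 3, 0, 3};
        int[] byArray = countWithArray(keys, 4);
        Map<Integer, Integer> byMap = countWithMap(keys);
        System.out.println("key 3: array=" + byArray[3] + ", map=" + byMap.get(3));
    }
}
```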
The only rule of thumb I recommend is to design your own data structure when you have to deal with GBs of data and/or response-in-microseconds requirements.
Do you need direct access to the data, and if so, do you know the exact position of the objects? If you loop through the collection all the time to figure out where the object you need is, this takes some time (and therefore direct access would be an advantage).
Also, autoboxing does take time, and since the standard collections can't hold primitive types, primitives will be autoboxed into their wrapper types all the time.
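One way to get direct access when the position is not known is to build an index once and look objects up by key afterwards. The sketch below assumes a hypothetical Item type with an integer id; findByScan is the repeated loop the answer warns about, and indexById builds a HashMap so later lookups avoid the loop.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class.
final class LookupById {
    /** Hypothetical record type used only for this illustration. */
    static final class Item {
        final int id;
        final String name;

        Item(int id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    // Linear scan: cost grows with the size of the list on every lookup.
    static Item findByScan(List<Item> items, int id) {
        for (Item item : items) {
            if (item.id == id) {
                return item;
            }
        }
        return null;
    }

    // Builds the index once; each later get() is a hash lookup.
    static Map<Integer, Item> indexById(List<Item> items) {
        Map<Integer, Item> index = new HashMap<>((int) (items.size() / 0.75f) + 1);
        for (Item item : items) {
            index.put(item.id, item);
        }
        return index;
    }

    public static void main(String[] args) {
        List<Item> items = new ArrayList<>();
        items.add(new Item(7, "seven"));
        items.add(new Item(9, "nine"));
        Map<Integer, Item> index = indexById(items);
        System.out.println(findByScan(items, 9).name + " " + index.get(9).name);
    }
}
```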
Another small trick:
If you work with really big collections, and you know (or can estimate) their size in advance, you should use the constructors that let you specify the initial capacity. This avoids multiple array allocations.
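A minimal sketch of presizing with the standard constructors; the class and method names are made up for illustration. For ArrayList the argument is the element capacity. For HashMap the argument is the table capacity, and with the default load factor of 0.75 the map resizes once it is about three-quarters full, so the sketch asks for n / 0.75 + 1.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class.
final class Presizing {
    // The backing array is allocated once for n elements instead of growing repeatedly.
    static List<Integer> filledList(int n) {
        List<Integer> list = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            list.add(i);
        }
        return list;
    }

    // Capacity chosen so that n entries fit under the default 0.75 load factor.
    static Map<Integer, String> filledMap(int n) {
        int capacity = (int) (n / 0.75f) + 1;
        Map<Integer, String> map = new HashMap<>(capacity);
        for (int i = 0; i < n; i++) {
            map.put(i, "v" + i);
        }
        return map;
    }

    public static void main(String[] args) {
        System.out.println(filledList(1_000_000).size());
        System.out.println(filledMap(1_000_000).size());
    }
}
```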