Will garbage collection hurt the performance of this kind of program?
I'm building a program that will live on an AWS EC2 instance and will (probably) be invoked periodically via a cron job. The program will 'crawl'/'poll' specific websites that we've partnered with, index/aggregate their content, and update our database. I'm thinking Java is a perfect fit for the language to write this application in. Some members of our engineering team are concerned about the performance cost of Java's garbage collection, and are suggesting C++ instead.
Are these valid concerns? This is an application that will be invoked perhaps once every 30 minutes via a cron job, and as long as it finishes its task within that time frame, I would assume the performance is acceptable. I'm not sure whether garbage collection would be a performance issue, since I assume the server will have plenty of memory, and the actual act of tracking how many objects point to an area of memory and then declaring that memory free when the count reaches 0 doesn't seem too detrimental to me.
No, your concerns are most likely unfounded.
GC can be a concern when dealing with large, fragmented heaps (which require a stop-the-world collection) or with medium-lived objects that get promoted to the old generation but are then quickly de-referenced (which causes excessive GC, though it can be fixed by resizing the ratio of new to old space).
A web crawler is very unlikely to fit either of those two profiles: you probably don't need a massive old generation, and you should have relatively short-lived objects (the page representation held in memory while you parse out data), which will be dealt with efficiently by the young-generation collector.
We have an in-house crawler (Java) that happily handles 2 million pages per day, including some additional post-processing per page, on commodity hardware (2 GB of RAM); the main constraint is bandwidth. GC is a non-issue.
As others have mentioned, GC is rarely an issue for throughput-sensitive applications (such as a crawler), but it can (if one is not careful) be an issue for latency-sensitive apps (such as a trading platform).
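If the young generation did turn out to be undersized for a crawl, the standard HotSpot flags can resize it. A sketch, assuming a HotSpot JVM; the class name and sizes are purely illustrative, not a recommendation:

```shell
# Give the crawler a fixed 2 GiB heap and devote a third of it to the
# young generation (-XX:NewRatio=2 means old:young = 2:1), so short-lived
# page objects die in the eden space instead of being promoted.
java -Xms2g -Xmx2g -XX:NewRatio=2 com.example.Crawler

# Alternatively, pin the young generation to an absolute size:
java -Xms2g -Xmx2g -Xmn768m com.example.Crawler
```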
The typical concern C++ programmers have about GC is latency. That is, as a program runs, periodic GCs interrupt the mutator and cause spikes in latency. Back when I ran Java web applications for a living, I had a couple of customers who would see latency spikes in the logs and complain about them, and my job was to tune the GC to minimize the impact of those spikes. There have been some relatively sophisticated advances in GC over the years to make monstrous Java applications run with consistently low latency, and I'm impressed with the work of the engineers at Sun (now Oracle) who made that possible.
However, GC has always been very good at handling tasks with high throughput, where latency is not a concern. This includes cron jobs. Your engineers have unfounded concerns.
The simplest GCs around offer a tradeoff between large heap (high latency, high throughput) and small heap (lower latency, lower throughput). It takes some profiling to get it right for a particular application and workload, but these simple GCs are very forgiving in a large heap / high throughput / high latency configuration.
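That profiling doesn't require special tooling; the JVM will print a log of every collection, including pause times. The flags below assume a HotSpot JVM (and `crawler.jar` is a made-up name); `-Xlog:gc` is the unified-logging form on JDK 9+, while `-verbose:gc` works on older releases:

```shell
# JDK 9+: unified GC logging, one line per collection with pause time
java -Xlog:gc -jar crawler.jar

# JDK 8 and earlier
java -verbose:gc -XX:+PrintGCDetails -jar crawler.jar
```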
Fetching and parsing websites will take far more time than the garbage collector; its impact will probably be negligible. Moreover, automatic memory management is often more efficient than manual memory management via new/delete when dealing with lots of small objects (such as strings). Not to mention that garbage-collected memory is easier to work with.
I don't have any hard numbers to back this up, but code that does a lot of small string manipulations (lots of small allocations and deallocations in a short period of time) should be much faster in a garbage-collected environment.
The reason is that modern GCs "re-pack" the heap on a regular basis, moving objects from an "eden" space to survivor spaces and then to a tenured-object heap, and modern GCs are heavily optimized for the case where many small objects are allocated and then deallocated quickly.
For example, constructing a new string in Java (on any modern JVM) is as fast as a stack allocation in C++. By contrast, unless you're doing fancy string-pooling stuff in C++, you'll be really taxing your allocator with lots of small and quick allocations.
Plus, there are several other good reasons to consider Java for this sort of app: it has better out-of-the-box support for network protocols, which you'll need for fetching website data, and it is much more robust against the possibility of buffer overflows in the face of malicious content.
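A minimal sketch of the allocation pattern in question (the class name and counts are made up for illustration): millions of short strings that are created, used once, and dropped, which is exactly the young-generation fast path described above.

```java
// Sketch: churn through many short-lived strings, the pattern a
// generational collector is optimized for.
public class StringChurn {
    // Builds and discards many small strings; only an int survives.
    static int churn(int iterations) {
        int totalLength = 0;
        for (int i = 0; i < iterations; i++) {
            // Each concatenation allocates a fresh short-lived String;
            // these die young and are reclaimed cheaply from eden.
            String s = "page-" + i + "-token";
            totalLength += s.length();
        }
        return totalLength;
    }

    public static void main(String[] args) {
        System.out.println(StringChurn.churn(1_000_000));
    }
}
```

In C++, the equivalent loop with `std::string` and the default allocator pays a heap allocation and deallocation per iteration, which is the cost being contrasted here.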
Garbage collection (GC) is fundamentally a space-time tradeoff. The more memory you have, the less time your program will need to spend performing garbage collection. As long as you have a lot of memory available relative to the maximum live size (total memory in use), the main performance hit of GC -- whole-heap collections -- should be a rare event. Java's other advantages (notably robustness, security, portability, and an excellent networking library) make this a no-brainer.
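One quick way to sanity-check that headroom on the target instance is to ask the JVM itself (a sketch; the class and method names are mine):

```java
// Reports how much heap the JVM may grow to versus what is in use now,
// to confirm there is plenty of room relative to the live size.
public class HeapHeadroom {
    // Maximum heap size in mebibytes (bounded by -Xmx if set).
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() >> 20;
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long usedMiB = (rt.totalMemory() - rt.freeMemory()) >> 20;
        System.out.println("max heap:  " + maxHeapMiB() + " MiB");
        System.out.println("in use:    " + usedMiB + " MiB");
    }
}
```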
For some hard data to share with your colleagues showing that GC performs as well as malloc/free given plenty of available RAM, see: "Quantifying the Performance of Garbage Collection vs. Explicit Memory Management", Matthew Hertz and Emery D. Berger, OOPSLA 2005.