Tracking down a memory leak / garbage collection issue in Java

This is a problem I have been trying to track down for a couple months now. I have a java app running that processes xml feeds and stores the result in a database. There have been intermittent resource problems that are very difficult to track down.

Background:
On the production box (where the problem is most noticeable), I do not have particularly good access to the box and have been unable to get JProfiler running. That box is a 64-bit quad-core, 8 GB machine running CentOS 5.2, Tomcat 6, and Java 1.6.0_11. It starts with these java-opts:

JAVA_OPTS="-server -Xmx5g -Xms4g -Xss256k -XX:MaxPermSize=256m -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseConcMarkSweepGC -XX:+PrintTenuringDistribution -XX:+UseParNewGC"

The technology stack is the following:

  • Centos 64-bit 5.2
  • Java 6u11
  • Tomcat 6
  • Spring/WebMVC 2.5
  • Hibernate 3
  • Quartz 1.6.1
  • DBCP 1.2.1
  • Mysql 5.0.45
  • Ehcache 1.5.0
  • (and of course a host of other dependencies, notably the jakarta-commons libraries)

The closest I can get to reproducing the problem is a 32-bit machine with lower memory requirements. That one I do have control over. I have probed it to death with JProfiler and fixed many performance problems (synchronization issues, precompiling/caching XPath queries, reducing the thread pool, removing unnecessary Hibernate pre-fetching, and eliminating overzealous "cache-warming" during processing).

In each case, the profiler showed these as taking up huge amounts of resources for one reason or another, and that these were no longer primary resource hogs once the changes went in.

The Problem:
The JVM seems to completely ignore the memory usage settings, fills all memory, and becomes unresponsive. This is an issue for the customer-facing end, which expects a regular poll (on a 5-minute basis with a 1-minute retry), as well as for our operations teams, who are constantly notified that a box has become unresponsive and have to restart it. There is nothing else significant running on this box.

The problem appears to be garbage collection. We are using the ConcurrentMarkSweep collector (as noted above) because the original stop-the-world collector was causing JDBC timeouts and became increasingly slow. The logs show that as memory usage increases, it begins to throw CMS failures and kicks back to the original stop-the-world collector, which then seems to fail to collect properly.

However, when running with JProfiler, the "Run GC" button seems to clean up the memory nicely rather than showing an increasing footprint, but since I cannot connect JProfiler directly to the production box, and resolving proven hotspots doesn't seem to be working, I am left with the voodoo of tuning garbage collection blind.

What I have tried:

  • Profiling and fixing hotspots.
  • Using the STW, Parallel and CMS garbage collectors (a sketch of these flag combinations follows this list).
  • Running with min/max heap sizes at 1/2, 2/4, 4/5, 6/6 increments.
  • Running with permgen space in 256M increments up to 1Gb.
  • Many combinations of the above.
  • I have also consulted the JVM [tuning reference](http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html) , but can't really find anything explaining this behavior or any examples of _which_ tuning parameters to use in a situation like this.
  • I have also (unsuccessfully) tried jprofiler in offline mode, and connecting with jconsole and visualvm, but I can't seem to find anything that will interpret my gc log data.
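For concreteness, the collector and sizing experiments listed above boil down to flag combinations roughly like the following (illustrative values only, not the exact set that was run):

-XX:+UseSerialGC                            (the original stop-the-world collector)
-XX:+UseParallelGC                          (the parallel throughput collector)
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC    (CMS, as in the JAVA_OPTS above)
-Xms2g -Xmx4g -XX:MaxPermSize=512m          (one of the min/max heap and permgen sizing variants)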

Unfortunately, the problem also pops up sporadically. It seems to be unpredictable: it can run for days or even a week without having any problems, or it can fail 40 times in a day, and the only thing I can seem to catch consistently is that garbage collection is acting up.

Can anyone give any advice as to:
a) Why a JVM is using 8 GB of physical memory and 2 GB of swap space when it is configured to max out at less than 6.
b) A reference on GC tuning that actually explains or gives reasonable examples of when and with what kind of settings to use the advanced collectors.
c) A reference to the most common Java memory leaks (I understand unclaimed references, but I mean at the library/framework level, or something more inherent in data structures, like hashmaps).

Thanks for any and all insight you can provide.

EDIT
Emil H:
1) Yes, my development cluster is a mirror of production data, down to the media server. The primary difference is the 32/64bit and the amount of RAM available, which I can't replicate very easily, but the code and queries and settings are identical.

2) There is some legacy code that relies on JAXB, but in reordering the jobs to try to avoid scheduling conflicts, I have generally eliminated that execution, since it runs once a day. The primary parser uses XPath queries which call down to the javax.xml.xpath package. This was the source of a few hotspots: for one, the queries were not being pre-compiled, and two, the references to them were hardcoded strings. I created a threadsafe cache (hashmap) and factored the references to the XPath queries into final static Strings, which lowered resource consumption significantly. The querying is still a large part of the processing, but it should be, because that is the main responsibility of the application.
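As a rough illustration of the precompile-and-cache approach described above (a hypothetical sketch, not the actual class: the query constants are made up, and the cache is kept per-thread here because compiled XPathExpression objects are not thread-safe):

import java.util.HashMap;
import java.util.Map;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

public final class XPathQueryCache {

    // Queries factored out into final static Strings instead of hardcoded literals.
    public static final String ITEM_TITLE = "/feed/item/title";  // hypothetical query
    public static final String ITEM_IMAGE = "/feed/item/image";  // hypothetical query

    // One cache of compiled expressions per worker thread.
    private static final ThreadLocal<Map<String, XPathExpression>> CACHE =
            new ThreadLocal<Map<String, XPathExpression>>() {
                protected Map<String, XPathExpression> initialValue() {
                    return new HashMap<String, XPathExpression>();
                }
            };

    private XPathQueryCache() {
    }

    // Returns a compiled expression for the query, compiling it at most once per thread.
    public static XPathExpression get(String query) throws XPathExpressionException {
        Map<String, XPathExpression> cache = CACHE.get();
        XPathExpression expr = cache.get(query);
        if (expr == null) {
            XPath xpath = XPathFactory.newInstance().newXPath();
            expr = xpath.compile(query);
            cache.put(query, expr);
        }
        return expr;
    }
}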

3) An additional note: the other primary consumer is image operations from JAI (reprocessing images from a feed). I am unfamiliar with Java's graphics libraries, but from what I have found they are not particularly leaky.

(thanks for the answers so far, folks!)

UPDATE:
I was able to connect to the production instance with VisualVM, but it had the GC visualization / run-GC option disabled (though I could view it locally). The interesting thing: the heap allocation of the VM is obeying the JAVA_OPTS, and the actual allocated heap is sitting comfortably at 1-1.5 gigs and doesn't seem to be leaking, but the box-level monitoring still shows a leak pattern that is not reflected in the VM monitoring. There is nothing else running on this box, so I am stumped.

7 Answers

夜还是长夜 2024-08-01 21:36:30

Well, I finally found the issue that was causing this, and I'm posting a detailed answer in case someone else has these issues.

I tried jmap while the process was acting up, but this usually caused the JVM to hang further, and I would have to run it with --force. This resulted in heap dumps that seemed to be missing a lot of data, or at least missing the references between them. For analysis, I tried jhat, which presents a lot of data but not much in the way of how to interpret it. Secondly, I tried the Eclipse-based memory analysis tool ( http://www.eclipse.org/mat/ ), which showed that the heap was mostly classes related to Tomcat.

The issue was that jmap was not reporting the actual state of the application; it was only catching the classes on shutdown, most of which were Tomcat classes.

I tried a few more times, and noticed that there were some very high counts of model objects (actually 2-3x more than were marked public in the database).

Using this I analyzed the slow query logs, and a few unrelated performance problems. I tried extra-lazy loading ( http://docs.jboss.org/hibernate/core/3.3/reference/en/html/performance.html ), as well as replacing a few hibernate operations with direct jdbc queries (mostly where it was dealing with loading and operating on large collections -- the jdbc replacements just worked directly on the join tables), and replaced some other inefficient queries that mysql was logging.

These steps improved pieces of the frontend performance, but still did not address the leak; the app was still unstable and acting unpredictably.

Finally, I found the option -XX:+HeapDumpOnOutOfMemoryError. This finally produced a very large (~6.5GB) hprof file that accurately showed the state of the application. Ironically, the file was so large that jhat could not analyze it, even on a box with 16 GB of RAM. Fortunately, MAT was able to produce some nice-looking graphs and showed some better data.
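For reference, the relevant flags look something like this (the dump path is just an example):

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/dumps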

This time what stuck out was that a single Quartz thread was taking up 4.5GB of the 6GB of heap, and the majority of that was a Hibernate StatefulPersistenceContext ( https://www.hibernate.org/hib_docs/v3/api/org/hibernate/engine/StatefulPersistenceContext.html ). This class is used by Hibernate internally as its primary cache (I had disabled the second-level and query caches backed by EHCache).

This class is used to enable most of the features of Hibernate, so it can't be directly disabled (you can work around it, but Spring doesn't support stateless sessions), and I would be very surprised if a mature product had such a major memory leak. So why was it leaking now?

Well, it was a combination of things:
The Quartz thread pool is instantiated with certain things kept in ThreadLocals, and Spring was injecting a session factory that created a session at the start of each Quartz thread's lifecycle; that session was then reused to run the various Quartz jobs that used the Hibernate session. Hibernate was then caching in the session, which is its expected behavior.

The problem then is that the thread pool was never releasing the session, so Hibernate stayed resident and maintained the cache for the lifecycle of the session. Since this was using Spring's Hibernate template support, there was no explicit use of the sessions (we are using a dao -> manager -> driver -> quartz-job hierarchy; the dao is injected with Hibernate configs through Spring, so the operations are done directly on the templates).

So the session was never being closed, Hibernate was maintaining references to the cached objects, and they were never garbage collected; each time a new job ran it would just keep filling up the cache local to the thread, so there was not even any sharing between the different jobs. Also, since this is a write-intensive job (very little reading), the cache was mostly wasted, and the objects just kept getting created.

The solution: create a dao method that explicitly calls session.flush() and session.clear(), and invoke that method at the beginning of each job.
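A minimal sketch of what such a DAO method can look like under Spring's Hibernate template support (the class and method names here are illustrative, not the actual code):

import org.springframework.orm.hibernate3.support.HibernateDaoSupport;

public class SessionMaintenanceDao extends HibernateDaoSupport {

    // Called at the beginning of each quartz job to drop the session-level cache.
    public void flushAndClearSession() {
        getHibernateTemplate().flush();  // push any pending changes to the database
        getHibernateTemplate().clear();  // evict all entities from the session cache
    }
}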

The app has been running for a few days now with no monitoring issues, memory errors or restarts.

Thanks for everyone's help on this; it was a pretty tricky bug to track down, as everything was doing exactly what it was supposed to, but in the end a 3-line method managed to fix all the problems.

狼性发作 2024-08-01 21:36:30

Can you run the production box with JMX enabled?

-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=<port>
...

Monitoring and Management Using JMX

And then attach with JConsole, VisualVM?

Is it ok to do a heap dump with jmap?

If yes, you could then analyze the heap dump for leaks with JProfiler (which you already have), jhat, VisualVM, or Eclipse MAT. Comparing heap dumps might also help to find leaks/patterns.
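For reference, a full binary heap dump with jmap is typically taken along these lines (the file path and pid are placeholders):

jmap -dump:format=b,file=/tmp/heap.hprof <pid>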

And as you mentioned jakarta-commons: there is a problem when using jakarta-commons-logging related to holding onto the classloader. For a good read on that, check

A day in the life of a memory leak hunter (release(Classloader))

三月梨花 2024-08-01 21:36:30

It seems like memory other than the heap is leaking; you mention that the heap is remaining stable. A classical candidate is permgen (permanent generation), which consists of two things: loaded class objects and interned strings. Since you report having connected with VisualVM, you should be able to see the number of loaded classes and watch whether it increases continuously (important: VisualVM also shows the total number of classes ever loaded, and it's okay if that goes up, but the number of currently loaded classes should stabilize after a certain time).

If it does turn out to be a permgen leak then debugging gets trickier since tooling for permgen analysis is rather lacking in comparison to the heap. Your best bet is to start a small script on the server that repeatedly (every hour?) invokes:

jmap -permstat <pid> > somefile<timestamp>.txt

jmap with that parameter will generate an overview of loaded classes together with an estimate of their size in bytes; this report can help you identify whether certain classes do not get unloaded. (Note: with <pid> I mean the process id, and <timestamp> should be some generated timestamp to distinguish the files.)

Once you have identified certain classes as being loaded and not unloaded, you can figure out mentally where these might be generated; otherwise you can use jhat to analyze dumps generated with jmap -dump. I'll keep that for a future update, should you need the info.

情丝乱 2024-08-01 21:36:30

I would look for directly allocated ByteBuffer.

From the javadoc.

A direct byte buffer may be created by invoking the allocateDirect factory method of this class. The buffers returned by this method typically have somewhat higher allocation and deallocation costs than non-direct buffers. The contents of direct buffers may reside outside of the normal garbage-collected heap, and so their impact upon the memory footprint of an application might not be obvious. It is therefore recommended that direct buffers be allocated primarily for large, long-lived buffers that are subject to the underlying system's native I/O operations. In general it is best to allocate direct buffers only when they yield a measureable gain in program performance.

Perhaps the Tomcat code uses this to do I/O; configure Tomcat to use a different connector.

Failing that you could have a thread that periodically executes System.gc(). "-XX:+ExplicitGCInvokesConcurrent" might be an interesting option to try.
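A minimal sketch of that periodic-System.gc() idea (the interval is arbitrary; with -XX:+ExplicitGCInvokesConcurrent the explicit GC becomes a concurrent CMS cycle rather than a full stop-the-world collection):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PeriodicGcTrigger {

    public static void start() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(new Runnable() {
            public void run() {
                // Explicit GC; concurrent rather than stop-the-world when
                // -XX:+ExplicitGCInvokesConcurrent is set.
                System.gc();
            }
        }, 10, 10, TimeUnit.MINUTES);
    }
}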

半城柳色半声笛 2024-08-01 21:36:30

Any JAXB? I find that JAXB is a perm space stuffer.

Also, I find that visualgc, now shipped with JDK 6, is a great way to see what's going on in memory. It shows the eden, generational, and perm spaces and the transient behavior of the GC beautifully. All you need is the PID of the process. Maybe that will help while you work with JProfiler.

And what about the Spring tracing/logging aspects? Maybe you can write a simple aspect, apply it declaratively, and do a poor man's profiler that way.
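A rough sketch of such a poor-man's-profiler aspect using Spring's AspectJ support (the pointcut package is a placeholder):

import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;

@Aspect
public class TimingAspect {

    // Times every public method under the (hypothetical) feed-processing package.
    @Around("execution(public * com.example.feeds..*.*(..))")
    public Object time(ProceedingJoinPoint pjp) throws Throwable {
        long start = System.currentTimeMillis();
        try {
            return pjp.proceed();
        } finally {
            System.out.println(pjp.getSignature() + " took "
                    + (System.currentTimeMillis() - start) + " ms");
        }
    }
}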

请帮我爱他 2024-08-01 21:36:30

“不幸的是,这个问题也时断时续地出现,似乎不可预测,它可以运行几天甚至一周而没有任何问题,也可以一天失败 40 次,而我唯一能看到的一致地发现垃圾收集正在发挥作用。”

听起来,这与一个每天执行最多 40 次的用例绑定,然后几天就不再执行。 我希望,您不要只追踪症状。 这必须是您可以通过跟踪应用程序参与者(用户、作业、服务)的操作来缩小范围的内容。

如果通过 XML 导入发生这种情况,您应该将第 40 个崩溃日的 XML 数据与在零崩溃日导入的数据进行比较。 也许这是某种逻辑问题,您只是在代码中找不到。

"Unfortunately, the problem also pops up sporadically, it seems to be unpredictable, it can run for days or even a week without having any problems, or it can fail 40 times in a day, and the only thing I can seem to catch consistently is that garbage collection is acting up."

Sounds like this is bound to a use case which is executed up to 40 times a day and then not at all for days. I hope you are not just tracking the symptoms. This must be something that you can narrow down by tracing the actions of the application's actors (users, jobs, services).

If this happens via XML imports, you should compare the XML data from a day with 40 crashes with data that was imported on a zero-crash day. Maybe it's some sort of logic problem that you won't find by looking inside your code alone.

眼眸里的快感 2024-08-01 21:36:30

I had the same problem, with a couple of differences.

My technology is the following:

grails 2.2.4

tomcat7

quartz-plugin 1.0

I use two datasources in my application. That particularity turned out to be a
determining factor in the cause of the bug.

Another thing to consider is that the quartz-plugin injects the Hibernate session into the Quartz threads, just like @liam says, and the Quartz threads stay alive until I shut down the application.

My problem was a bug in the Grails ORM combined with the way the plugin handles sessions and my two datasources.

The Quartz plugin has a listener to init and destroy Hibernate sessions:

public class SessionBinderJobListener extends JobListenerSupport {

    public static final String NAME = "sessionBinderListener";

    private PersistenceContextInterceptor persistenceInterceptor;

    public String getName() {
        return NAME;
    }

    public PersistenceContextInterceptor getPersistenceInterceptor() {
        return persistenceInterceptor;
    }

    public void setPersistenceInterceptor(PersistenceContextInterceptor persistenceInterceptor) {
        this.persistenceInterceptor = persistenceInterceptor;
    }

    public void jobToBeExecuted(JobExecutionContext context) {
        if (persistenceInterceptor != null) {
            persistenceInterceptor.init();
        }
    }

    public void jobWasExecuted(JobExecutionContext context, JobExecutionException exception) {
        if (persistenceInterceptor != null) {
            persistenceInterceptor.flush();
            persistenceInterceptor.destroy();
        }
    }
}

In my case, persistenceInterceptor is an instance of AggregatePersistenceContextInterceptor, which holds a List of HibernatePersistenceContextInterceptors, one for each datasource.

Every operation done on the AggregatePersistenceContextInterceptor is passed on to the HibernatePersistenceContextInterceptors, without any modification or treatment.

When we call init() on a HibernatePersistenceContextInterceptor, it increments the static variable below:

private static ThreadLocal<Integer> nestingCount = new ThreadLocal<Integer>();

I don't know the purpose of that static count. I just know it's incremented twice, once per datasource, because of the AggregatePersistenceContextInterceptor implementation.

Up to here I have just explained the scenario.

The problem comes now...

When my Quartz job finishes, the plugin calls the listener to flush and destroy the Hibernate sessions, as you can see in the source code of SessionBinderJobListener.

The flush happens perfectly, but the destroy does not, because HibernatePersistenceContextInterceptor does a validation before closing the Hibernate session: it examines nestingCount to see whether the value is greater than 1. If it is, it does not close the session.

Simplifying what Hibernate does:

if(--nestingCount.getValue() > 0)
    do nothing;
else
    close the session;

That's the root of my memory leak:
the Quartz threads stay alive with all the objects used in the session, because the Grails ORM does not close the session, due to a bug triggered by my having two datasources.

To solve that, I customized the listener to call clear before destroy, and to call destroy twice (once for each datasource), ensuring my session was cleared and destroyed; and even if the destroy fails, it is at least cleared.
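A sketch of what that customization can look like (a hypothetical subclass of the listener shown above; the import package names are assumed for Quartz and Grails 2.x, and the datasource count is hard-coded purely for illustration):

import org.codehaus.groovy.grails.support.PersistenceContextInterceptor;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;

public class DualDataSourceSessionBinderJobListener extends SessionBinderJobListener {

    private static final int DATASOURCE_COUNT = 2;  // assumption: two datasources, as described above

    @Override
    public void jobWasExecuted(JobExecutionContext context, JobExecutionException exception) {
        PersistenceContextInterceptor interceptor = getPersistenceInterceptor();
        if (interceptor != null) {
            interceptor.flush();
            interceptor.clear();        // clear first, so the session cache is emptied even if destroy is skipped
            for (int i = 0; i < DATASOURCE_COUNT; i++) {
                interceptor.destroy();  // once per datasource, so nestingCount reaches zero
            }
        }
    }
}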
