Possible reasons for a Groovy program, run as a Kubernetes Job, dumping threads during execution
I have a simple Groovy script that leverages the GPars library's withPool functionality to launch HTTP GET requests to two internal API endpoints in parallel.
The script runs fine locally, both directly and in a Docker container. When I deploy it as a Kubernetes Job (in our internal EKS cluster, version 1.20), it runs there as well, but the moment it hits the first withPool call I see a giant thread dump; execution still continues and completes successfully.
NOTE: Containers in our cluster run with the following pod security context:
securityContext:
fsGroup: 2000
runAsNonRoot: true
runAsUser: 1000
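For context, here is a minimal sketch of where that security context sits in the Job manifest; the job name and image below are placeholders, not our actual values:
apiVersion: batch/v1
kind: Job
metadata:
  name: fetch-keywords                # placeholder name
spec:
  template:
    spec:
      securityContext:                # the pod-level context shown above
        fsGroup: 2000
        runAsNonRoot: true
        runAsUser: 1000
      restartPolicy: Never
      containers:
        - name: app
          image: example.com/groovy-runner:latest   # placeholder image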
Environment
# From the k8s job container
groovy@app-271df1d7-15848624-mzhhj:/app$ groovy --version
WARNING: Using incubator modules: jdk.incubator.foreign, jdk.incubator.vector
Groovy Version: 4.0.0 JVM: 17.0.2 Vendor: Eclipse Adoptium OS: Linux
groovy@app-271df1d7-15848624-mzhhj:/app$ ps -ef
UID PID PPID C STIME TTY TIME CMD
groovy 1 0 0 21:04 ? 00:00:00 /bin/bash bin/run-script.sh
groovy 12 1 42 21:04 ? 00:00:17 /opt/java/openjdk/bin/java -Xms3g -Xmx3g --add-modules=ALL-SYSTEM -classpath /opt/groovy/lib/groovy-4.0.0.jar -Dscript.name=/usr/bin/groovy -Dprogram.name=groovy -Dgroovy.starter.conf=/opt/groovy/conf/groovy-starter.conf -Dgroovy.home=/opt/groovy -Dtools.jar=/opt/java/openjdk/lib/tools.jar org.codehaus.groovy.tools.GroovyStarter --main groovy.ui.GroovyMain --conf /opt/groovy/conf/groovy-starter.conf --classpath . /tmp/script.groovy
groovy 116 0 0 21:05 pts/0 00:00:00 bash
groovy 160 116 0 21:05 pts/0 00:00:00 ps -ef
Script (relevant parts)
@Grab('org.codehaus.gpars:gpars:1.2.1')
import static groovyx.gpars.GParsPool.withPool

import groovy.json.JsonSlurper

final def jsl = new JsonSlurper()
//...
while (!(nextBatch = getBatch(batchSize)).isEmpty()) {
    def devThread = Thread.start {
        withPool(poolSize) {
            nextBatch.eachParallel { kw ->
                String url = dev + "&" + "query=$kw"
                try {
                    def response = jsl.parseText(url.toURL().getText(connectTimeout: 10.seconds, readTimeout: 10.seconds,
                            useCaches: true, allowUserInteraction: false))
                    devResponses[kw] = response
                } catch (e) {
                    println("\tFailed to fetch: $url | error: $e")
                }
            }
        }
    }

    def stgThread = Thread.start {
        withPool(poolSize) {
            nextBatch.eachParallel { kw ->
                String url = stg + "&" + "query=$kw"
                try {
                    def response = jsl.parseText(url.toURL().getText(connectTimeout: 10.seconds, readTimeout: 10.seconds,
                            useCaches: true, allowUserInteraction: false))
                    stgResponses[kw] = response
                } catch (e) {
                    println("\tFailed to fetch: $url | error: $e")
                }
            }
        }
    }

    devThread.join()
    stgThread.join()
}
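Not related to the dump itself, but worth noting about the code above: each withPool(poolSize) call creates a fresh fork/join pool for the duration of its closure, so every batch iteration spins up and tears down two pools. If that churn ever becomes a concern, withPool can wrap the loop instead, since the pool stays bound for the whole closure. A minimal sketch for one endpoint, reusing the names from the script above:
import static groovyx.gpars.GParsPool.withPool

// One pool for the lifetime of the whole loop, instead of a new pool per batch
withPool(poolSize) {
    while (!(nextBatch = getBatch(batchSize)).isEmpty()) {
        nextBatch.eachParallel { kw ->
            // ... same fetch/parse logic as above ...
        }
    }
}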
Dockerfile
FROM groovy:4.0.0-jdk17 as builder
USER root
RUN apt-get update && apt-get install -yq bash curl wget jq
WORKDIR /app
COPY bin /app/bin
RUN chmod +x /app/bin/*
USER groovy
ENTRYPOINT ["/bin/bash"]
CMD ["bin/run-script.sh"]
The bin/run-script.sh simply downloads the above Groovy script at runtime and executes it:
wget "$GROOVY_SCRIPT" -O "$LOCAL_FILE"
...
groovy "$LOCAL_FILE"
As soon as the execution hits the first call to withPool(poolSize), there's a giant thread dump, but execution continues.
I'm trying to figure out what could be causing this behavior. Any ideas?
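For what it's worth, a HotSpot JVM prints a full thread dump to stdout when it receives SIGQUIT (and then keeps running), so one way to sanity-check whether something external is signalling the process is to trigger a dump manually and compare. A minimal sketch, run from a second shell inside the pod (12 is the JVM PID from the ps -ef output above):
# SIGQUIT makes HotSpot write a full thread dump to stdout; the process keeps running
kill -3 12
# If the image ships a full JDK, jstack prints the same information on demand
jstack 12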
Answer
For posterity, answering my own question here.
The issue turned out to be this log4j2 JVM hot-patch that we're currently leveraging to fix the recent log4j2 vulnerability (CVE-2021-44228). The agent runs as a DaemonSet and patches all running JVMs in all of our k8s clusters.
This, somehow, causes my OpenJDK 17 based app to thread dump. I found the same issue with an Elasticsearch 8.1.0 deployment as well (which also uses a pre-packaged OpenJDK 17). That one is a long-running service, so I could see a thread dump happening pretty much every half hour! Interestingly, there are other JVM services (and some Solr 8 deployments) that don't have this issue.
Anyway, I worked with our devops team to temporarily exclude the node that deployment was running on, and lo and behold, the thread dumps disappeared!
Balance in the universe has been restored.
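For anyone in the same spot, this is roughly what locating the agent and keeping it off a node can look like; the grep patterns, taint key, and node name here are hypothetical and depend on the cluster:
# Locate the patching DaemonSet and the node the Job pod ran on
kubectl get daemonsets --all-namespaces | grep -i -e log4j -e hotpatch
kubectl get pod app-271df1d7-15848624-mzhhj -o jsonpath='{.spec.nodeName}'
# A custom NoExecute taint evicts DaemonSet pods that do not tolerate it
kubectl taint nodes <node-name> log4j-hotpatch=off:NoExecute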