Zombie processes when using use_multiprocessing=True in Keras model.fit()
I am encountering zombie processes when training a neural network with Keras' model.fit() method. Because of the <defunct> processes, the training does not end, and all affected processes have to be killed with SIGKILL. Restarting the training script does not reliably reproduce the problem; sometimes it completes execution normally. The problem does not occur when multiprocessing is disabled: model.fit(use_multiprocessing=False)
Here is the output of the ps aufx command:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
...
root 774690 0.1 0.0 79032 70048 ? Ss Mai23 17:16 /usr/bin/python3 /usr/bin/tm legacy-worker run mlworker
root 1607844 0.0 0.0 2420 524 ? SNs Jun02 0:00 \_ /bin/sh -c /usr/bin/classifier-train
root 1607845 38.5 4.7 44686436 12505168 ? SNl Jun02 551:05 \_ /opt/venvs/classifier-training-repo/bin/python /usr/bin/classifier-train
root 1639337 0.0 3.7 43834076 10005208 ? SN Jun02 0:00 \_ /opt/venvs/classifier-training-repo/bin/python /usr/bin/classifier-train
root 1639339 0.0 0.0 0 0 ? ZN Jun02 0:00 \_ [classifier-train] <defunct>
root 1639341 0.0 0.0 0 0 ? ZN Jun02 0:00 \_ [classifier-train] <defunct>
root 1639343 0.0 0.0 0 0 ? ZN Jun02 0:00 \_ [classifier-train] <defunct>
root 1639345 0.0 0.0 0 0 ? ZN Jun02 0:00 \_ [classifier-train] <defunct>
root 1639347 0.0 0.0 0 0 ? ZN Jun02 0:00 \_ [classifier-train] <defunct>
root 1639349 0.0 0.0 0 0 ? ZN Jun02 0:00 \_ [classifier-train] <defunct>
Here are the relevant code snippets:
def get_keras_model():
    # some code here
    model = keras.models.Model(
        inputs=(input_layer_1, input_layer_2),
        outputs=prediction_layer,
    )
    model.compile(loss=..., optimizer=..., metrics=...)
    return model

def preprocess(data):
    # some code here to convert string values into numpy arrays of dtype=np.uint32
    return X, y

class DataSequence(keras.utils.Sequence):
    def __init__(self, data, preprocess_func, keys, batch_size=4096):
        self.keys = keys
        self.data = data
        self.batch_size = batch_size
        self.preprocess_func = preprocess_func

    def __len__(self):
        # returns the number of batches
        return int(np.ceil(len(self.keys) / float(self.batch_size)))

    def __getitem__(self, idx):
        keys = self.keys[idx * self.batch_size : (idx + 1) * self.batch_size]
        return self.preprocess_func([self.data[key] for key in keys])

def train(model, data, preprocess):
    train_sequence = DataSequence(data, preprocess, list(data.keys()))
    history = model.fit(
        x=train_sequence,
        epochs=15,
        steps_per_epoch=len(train_sequence),
        verbose=2,
        workers=8,
        use_multiprocessing=True,
    )
    return model, history

data = {
    "key_1": {"name": "black", "y": 0},
    "key_2": {"name": "white", "y": 1},
    # up to 70M docs in this dictionary
}
model = get_keras_model()
model, history = train(model, data, preprocess)  # model training hangs
Log output:
Multiple "Caught signal 15. Terminating." messages are logged, even when the training script finishes execution and does not encounter any zombie processes. The same goes for the "Exception in thread Thread-##" output; it also occurs when model training is not affected by zombie processes and finishes execution normally.
Jun 09 14:16:22 mlworker tm[575915]: 2022-06-09 14:16:22,024 - MainThread - INFO - Start working on fold 1/5
Jun 09 14:16:22 mlworker tm[575915]: 2022-06-09 14:16:22.725522: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instruc>
Jun 09 14:16:22 mlworker tm[575915]: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Jun 09 14:16:23 mlworker tm[575915]: 2022-06-09 14:16:23.439638: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 6882 MB memory: -> device: 0, name: Tesla P4, p>
Jun 09 14:16:23 mlworker tm[575915]: 2022-06-09 14:16:23,709 - MainThread - INFO - Fitting model ...
Jun 09 14:16:24 mlworker tm[575915]: Epoch 1/15
Jun 09 14:16:31 mlworker tm[575915]: 3/3 - 7s - loss: 6.9878 - acc: 1.0908e-04 - 7s/epoch - 2s/step
Jun 09 14:16:31 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:31 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:31 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:31 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:31 mlworker tm[575915]: Epoch 2/15
Jun 09 14:16:34 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:34 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:34 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:34 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:34 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:34 mlworker tm[575915]: 3/3 - 3s - loss: 6.9392 - acc: 0.0055 - 3s/epoch - 1s/step
...
Jun 09 14:16:48 mlworker tm[575915]: Epoch 7/15
Jun 09 14:16:51 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:51 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:51 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:51 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:51 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:51 mlworker tm[575915]: Exception in thread Thread-87:
Jun 09 14:16:51 mlworker tm[575915]: Traceback (most recent call last):
Jun 09 14:16:51 mlworker tm[575915]: File "/usr/lib/python3.9/threading.py", line 954, in _bootstrap_inner
Jun 09 14:16:51 mlworker tm[575915]: self.run()
Jun 09 14:16:51 mlworker tm[575915]: File "/usr/lib/python3.9/threading.py", line 892, in run
Jun 09 14:16:51 mlworker tm[575915]: self._target(*self._args, **self._kwargs)
Jun 09 14:16:51 mlworker tm[575915]: File "/opt/venvs/classifier-training-repo/lib/python3.9/site-packages/keras/utils/data_utils.py", line 759, in _run
Jun 09 14:16:51 mlworker tm[575915]: with closing(self.executor_fn(_SHARED_SEQUENCES)) as executor:
Jun 09 14:16:51 mlworker tm[575915]: File "/opt/venvs/classifier-training-repo/lib/python3.9/site-packages/keras/utils/data_utils.py", line 736, in pool_fn
Jun 09 14:16:51 mlworker tm[575915]: pool = get_pool_class(True)(
Jun 09 14:16:51 mlworker tm[575915]: File "/usr/lib/python3.9/multiprocessing/context.py", line 119, in Pool
Jun 09 14:16:51 mlworker tm[575915]: return Pool(processes, initializer, initargs, maxtasksperchild,
Jun 09 14:16:51 mlworker tm[575915]: File "/usr/lib/python3.9/multiprocessing/pool.py", line 212, in __init__
Jun 09 14:16:51 mlworker tm[575915]: self._repopulate_pool()
Jun 09 14:16:51 mlworker tm[575915]: File "/usr/lib/python3.9/multiprocessing/pool.py", line 303, in _repopulate_pool
Jun 09 14:16:51 mlworker tm[575915]: return self._repopulate_pool_static(self._ctx, self.Process,
Jun 09 14:16:51 mlworker tm[575915]: File "/usr/lib/python3.9/multiprocessing/pool.py", line 326, in _repopulate_pool_static
Jun 09 14:16:51 mlworker tm[575915]: w.start()
Jun 09 14:16:51 mlworker tm[575915]: File "/usr/lib/python3.9/multiprocessing/process.py", line 121, in start
Jun 09 14:16:51 mlworker tm[575915]: self._popen = self._Popen(self)
Jun 09 14:16:51 mlworker tm[575915]: File "/usr/lib/python3.9/multiprocessing/context.py", line 277, in _Popen
Jun 09 14:16:51 mlworker tm[575915]: return Popen(process_obj)
Jun 09 14:16:51 mlworker tm[575915]: File "/usr/lib/python3.9/multiprocessing/popen_fork.py", line 19, in __init__
Jun 09 14:16:51 mlworker tm[575915]: self._launch(process_obj)
Jun 09 14:16:51 mlworker tm[575915]: File "/usr/lib/python3.9/multiprocessing/popen_fork.py", line 73, in _launch
Jun 09 14:16:51 mlworker tm[575915]: os._exit(code)
Jun 09 14:16:51 mlworker tm[575915]: File "/usr/lib/python3/dist-packages/solute/click.py", line 727, in raiser
Jun 09 14:16:51 mlworker tm[575915]: raise Termination(128 + signo)
Jun 09 14:16:51 mlworker tm[575915]: solute.click.Termination: 143
Jun 09 14:16:52 mlworker tm[575915]: 3/3 - 3s - loss: 5.7624 - acc: 0.0726 - 3s/epoch - 1s/step
Jun 09 14:16:51 mlworker tm[575915]: solute.click.Termination: 143
Jun 09 14:16:52 mlworker tm[575915]: 3/3 - 3s - loss: 5.7624 - acc: 0.0726 - 3s/epoch - 1s/step
Jun 09 14:16:52 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:52 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:52 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:52 mlworker tm[575915]: Epoch 8/15
Jun 09 14:16:55 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:55 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:55 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:55 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:55 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:55 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:55 mlworker tm[575915]: 3/3 - 3s - loss: 5.6978 - acc: 0.1000 - 3s/epoch - 1s/step
...
Jun 09 14:17:02 mlworker tm[575915]: Epoch 11/15
Jun 09 14:17:05 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:17:05 mlworker tm[575915]: 3/3 - 3s - loss: 5.5029 - acc: 0.0804 - 3s/epoch - 1s/step
Jun 09 14:17:06 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:17:06 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:17:06 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:17:06 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:17:06 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:17:06 mlworker tm[575915]: Epoch 12/15
Jun 09 14:17:09 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:17:09 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:17:09 mlworker tm[575915]: Caught signal 15. Terminating.
No further log output was seen after the last message. The processes had to be killed with sudo kill -SIGKILL, and the model training had to be restarted.
System information:
I have encountered the same problem on different machines with different GPUs and different Python versions.
- OS Platform and Distribution: Debian GNU/Linux 11 (bullseye), Ubuntu 20.04.4 LTS
- TensorFlow version: v2.9.0-18-gd8ce9f9c301 2.9.1 (Debian 11), v2.9.0-18-gd8ce9f9c301 2.9.1 (Ubuntu LTS)
- Python version: Python 3.9.2 (Debian 11), Python 3.8.10 (Ubuntu LTS)
- GPU model and memory: Tesla T4 (16 GB) on Debian 11, Tesla P4 (8 GB) on another Debian 11 machine, GeForce GTX 1080 Ti (12 GB) on Ubuntu LTS
We solved the problem with the following line at the start of the script:
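The line itself is not shown in the post; a minimal sketch of what it presumably looks like, assuming the intent described in the explanation below (restoring Python's default SIGTERM handling):

```python
import signal

# Restore the interpreter's default SIGTERM behavior (terminate the process),
# replacing any custom Python-level handler installed earlier. Forked worker
# processes inherit the parent's signal handlers, so a custom handler can
# leave pool workers unresponsive to the SIGTERM that Keras sends them.
signal.signal(signal.SIGTERM, signal.SIG_DFL)
```

You can verify the effect with signal.getsignal(signal.SIGTERM), which should then return signal.SIG_DFL.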
Explanation:
We had a custom SIGTERM handler in our script, which was interfering with the SIGTERM signals sent to the threads. This one-liner restores Python's default SIGTERM handler and avoids running into unresponsive subprocesses.
There was no bug in the TensorFlow or Keras code :)