Keras Model.fit throws a segmentation fault with error - libprotobuf FATAL CHECK failed: (value.size()) <= (kint32max)

Posted on 2025-01-24 10:17:58

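I am trying to train a simple TensorFlow model with around 9000 parameters on an EMR cluster. However, when I try to train the model, it throws the following error. I tried increasing the memory and decreasing the batch size, but neither helped.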

libprotobuf FATAL external/com_google_protobuf/src/google/protobuf/wire_format_lite.cc:504] CHECK failed: (value.size()) <= (kint32max):
Segmentation fault
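
For context, the failed CHECK compares the byte length of a single value being serialized by protobuf against kint32max (2^31 - 1 = 2147483647), so it concerns the size of one serialized field rather than available memory. Below is a rough sketch for estimating whether a dense array would cross that limit if it ended up serialized as one field. The function name and the example shape are hypothetical and are not taken from the training script; whether the dataset is actually serialized this way is an assumption to verify.

```python
KINT32MAX = 2**31 - 1  # 2147483647, the largest signed 32-bit integer


def exceeds_protobuf_limit(num_rows, num_cols, bytes_per_value=4):
    """Return (size_in_bytes, too_big) for a dense array stored as one field.

    bytes_per_value defaults to 4, which assumes float32 or int32 values.
    """
    size = num_rows * num_cols * bytes_per_value
    return size, size > KINT32MAX


# Hypothetical shape for illustration: 10 million rows of 64 float32 values.
full = exceeds_protobuf_limit(10_000_000, 64)
# The same data with half the rows.
half = exceeds_protobuf_limit(5_000_000, 64)
print(full)  # (2560000000, True)
print(half)  # (1280000000, False)
```

With these made-up numbers, halving the row count brings the array under the limit, which would be consistent with the first error disappearing after the dataset was reduced.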

Based on one of the suggestions from this, I reduced the dataset by half, but that causes another error:

Traceback (most recent call last):
  File "/home/hadoop/feed_reco/datalake2/Dropoutnet/dropoutnet_training_test.py", line 175, in <module>
    train_tensorflow(data_dir, "/home/hadoop/temp_model/", learning_rate, dropout_rate, epochs)
  File "/home/hadoop/feed_reco/datalake2/Dropoutnet/dropoutnet_training_test.py", line 165, in train_tensorflow
    model.fit(dataset, epochs=epochs)
  File "/home/hadoop/conda/envs/py-env/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 108, in _method_wrapper
    return method(self, *args, **kwargs)
  File "/home/hadoop/conda/envs/py-env/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 1098, in fit
    tmp_logs = train_function(iterator)
  File "/home/hadoop/conda/envs/py-env/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 780, in __call__
    result = self._call(*args, **kwds)
  File "/home/hadoop/conda/envs/py-env/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 840, in _call
    return self._stateless_fn(*args, **kwds)
  File "/home/hadoop/conda/envs/py-env/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2829, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/home/hadoop/conda/envs/py-env/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1848, in _filtered_call
    cancellation_manager=cancellation_manager)
  File "/home/hadoop/conda/envs/py-env/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1924, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/home/hadoop/conda/envs/py-env/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 550, in call
    ctx=ctx)
  File "/home/hadoop/conda/envs/py-env/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError:    indices[0] = -1 is not in [0, 6806177)
     [[{{node embedding_lookup_1}}]]
     [[StatefulPartitionedCall]]
     [[IteratorGetNext]] [Op:__inference_train_function_715]

Function call stack:
train_function -> train_function -> train_function
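
The InvalidArgumentError reports that the node embedding_lookup_1 received index -1, while the valid indices for that table are 0 through 6806176. Below is a minimal sketch for scanning ID values before they reach the model. `find_out_of_range` and the sample IDs are hypothetical; where the -1 originates (for example, a sentinel for missing or unmapped IDs) is an assumption that has to be checked against the actual data.

```python
def find_out_of_range(ids, vocab_size):
    """Return (position, value) pairs for ids outside [0, vocab_size)."""
    return [(pos, value) for pos, value in enumerate(ids)
            if not 0 <= value < vocab_size]


VOCAB_SIZE = 6806177  # exclusive upper bound reported in the error message

sample_ids = [12, -1, 6806176, 6806177]  # made-up values for illustration
bad = find_out_of_range(sample_ids, VOCAB_SIZE)
print(bad)  # [(1, -1), (3, 6806177)]
```

Running a check like this over the ID column that feeds the second embedding would show whether -1 values are present in the reduced dataset.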
