Training a model with tf.data.Dataset and NumPy arrays yields different results

Posted on 2025-02-11 12:15:33

I use the Keras model training API and observed differences when training the model with NumPy arrays (x_train and y_train) and with tf.data.Dataset.from_tensor_slices((x_train, y_train)). A minimal working example is shown below:

import numpy as np
import tensorflow as tf

tf.keras.utils.set_random_seed(0)

n_examples, n_dims = (100, 10)
raw_dataset = np.random.randn(n_examples, n_dims)

model = tf.keras.models.Sequential(
    [
        tf.keras.layers.Dense(
            1024, activation="relu", use_bias=True
        ),
        tf.keras.layers.Dense(
            1, activation="linear", use_bias=True
        ),
    ]
)

model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss="mse",
)

x_train = raw_dataset[:, :-1]
y_train = raw_dataset[:, -1]
dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))

n_epochs = 10
batch_size = 16

use_dataset = True
if use_dataset:
    model.fit(
        dataset.batch(batch_size=batch_size),
        epochs=n_epochs,
    )
else:
    model.fit(
        x=x_train,
        y=y_train,
        batch_size=batch_size,
        epochs=n_epochs,
    )

print("Evaluation:")
model.evaluate(x_train, y_train)
model.evaluate(dataset.batch(batch_size=batch_size))

If I run this code with use_dataset = True, the final performance is:

Evaluation:
4/4 [==============================] - 0s 825us/step - loss: 0.4132
7/7 [==============================] - 0s 701us/step - loss: 0.4132

If I run it with use_dataset = False, I get:

Evaluation:
4/4 [==============================] - 0s 855us/step - loss: 0.4219
7/7 [==============================] - 0s 808us/step - loss: 0.4219

I expected the two training loops to perform identically. Interestingly, the model performance is identical if I set batch_size = n_examples, so the difference seems to be related to the way batches are handled internally. Why is this happening? Is it a bug or a feature?


胡渣熟男 2025-02-18 12:15:33

This behavior is caused by the default argument shuffle=True in model.fit(); it is not a bug. The docs say the following about shuffle:

Boolean (whether to shuffle the training data before each epoch) or str (for 'batch'). This argument is ignored when x is a generator or an object of tf.data.Dataset. 'batch' is a special option for dealing with the limitations of HDF5 data; it shuffles in batch-sized chunks. Has no effect when steps_per_epoch is not None.

So shuffle is ignored when a tf.data.Dataset is passed: the batches of the Dataset arrive in the same order in every epoch, whereas the NumPy arrays are reshuffled before each epoch.
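
To make the ordering visible, the following sketch iterates twice over a small batched Dataset and records the order in which the examples arrive. Each pass plays the role of one epoch as model.fit would see it. The toy data, the helper epoch_order and the variable names are illustrative assumptions, not part of the question's code. The second half shows Dataset.shuffle placed before batch, which is one way to get per-epoch reshuffling on the Dataset side; it should not be expected to reproduce the exact permutations that model.fit draws for NumPy arrays.

```python
import numpy as np
import tensorflow as tf

# Toy data: ten scalar "examples" 0..9 so the order is easy to read.
values = np.arange(10)
dataset = tf.data.Dataset.from_tensor_slices(values)


def epoch_order(ds):
    """Return the flattened order of examples seen in one pass over ds."""
    return [int(v) for batch in ds for v in batch.numpy()]


# Without shuffle, every pass over the batched dataset yields the same order.
plain = dataset.batch(4)
plain_epoch_1 = epoch_order(plain)
plain_epoch_2 = epoch_order(plain)
print("plain:   ", plain_epoch_1, plain_epoch_2)

# With shuffle before batch, the order is re-drawn on each pass
# (reshuffle_each_iteration=True), but every example still appears once.
shuffled = dataset.shuffle(
    buffer_size=len(values), seed=0, reshuffle_each_iteration=True
).batch(4)
shuffled_epoch_1 = epoch_order(shuffled)
shuffled_epoch_2 = epoch_order(shuffled)
print("shuffled:", shuffled_epoch_1, shuffled_epoch_2)
```

Setting buffer_size to the number of examples gives a full shuffle; a smaller buffer only shuffles locally.
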
Here is the code to get the same results for both methods:

import numpy as np
import tensorflow as tf

tf.keras.utils.set_random_seed(0)

n_examples, n_dims = (100, 10)
raw_dataset = np.random.randn(n_examples, n_dims)

model = tf.keras.models.Sequential(
    [
        tf.keras.layers.Dense(
            1024, activation="relu", use_bias=True
        ),
        tf.keras.layers.Dense(
            1, activation="linear", use_bias=True
        ),
    ]
)

model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss="mse",
)

x_train = raw_dataset[:, :-1]
y_train = raw_dataset[:, -1]
dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))

n_epochs = 10
batch_size = 16

use_dataset = False
if use_dataset:
    model.fit(
        dataset.batch(batch_size=batch_size),
        epochs=n_epochs,
    )
else:
    model.fit(
        x=x_train,
        y=y_train,
        batch_size=batch_size,
        shuffle=False,
        epochs=n_epochs,
    )

print("Evaluation:")
model.evaluate(x_train, y_train)
model.evaluate(dataset.batch(batch_size=batch_size))
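
The observation in the question that the difference vanishes for batch_size = n_examples fits this explanation: with a single batch per epoch, the MSE gradient is an average over all examples, so the order of the rows no longer matters (up to floating-point rounding). The sketch below illustrates this with a plain NumPy linear model and hand-written gradient descent. It is not the Keras training loop; the helper train_linear, the learning rate and the other settings are assumptions chosen only for illustration.

```python
import numpy as np


def train_linear(x, y, batch_size, shuffle, n_epochs=10, lr=0.05, seed=0):
    """Mini-batch gradient descent on MSE for y ~ x @ w (illustrative helper)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(x.shape[1])
    n = len(x)
    for _ in range(n_epochs):
        # Reshuffle the example order each epoch, or keep the original order.
        order = rng.permutation(n) if shuffle else np.arange(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = x[idx] @ w - y[idx]
            # Gradient of the mean squared error over this batch.
            grad = 2.0 * x[idx].T @ residual / len(idx)
            w = w - lr * grad
    return w


data = np.random.default_rng(0).standard_normal((100, 10))
x_train, y_train = data[:, :-1], data[:, -1]

# Mini-batches: the sequence of updates depends on the example order.
w_fixed = train_linear(x_train, y_train, batch_size=16, shuffle=False)
w_shuffled = train_linear(x_train, y_train, batch_size=16, shuffle=True)
print("batch_size=16  max |diff|:", np.abs(w_fixed - w_shuffled).max())

# Full batch: each update averages over all rows, so order is irrelevant.
w_fixed_full = train_linear(x_train, y_train, batch_size=100, shuffle=False)
w_shuffled_full = train_linear(x_train, y_train, batch_size=100, shuffle=True)
print("batch_size=100 max |diff|:", np.abs(w_fixed_full - w_shuffled_full).max())
```

In the mini-batch case the two weight vectors differ visibly, while in the full-batch case they agree to within rounding error, mirroring what the question reports for the Keras model.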
