Why is my validation loss/accuracy fluctuating although manual testing shows good results?

Posted 2025-01-13 06:26:03


I am training an EfficientNet Lite (from scratch) on a dataset of ~10,000,000 images (128x128x1) with ~6,500 classes. Both my training loss and my training accuracy are converging.
However, my validation loss/accuracy fluctuate. When I test the CNN manually on some inputs, it looks very good and recognizes (nearly) everything correctly.
Because my GPU has only 8 GB of memory, I am training with batch size 256 and fp16 computation.

Now my question is: why does the validation loss/accuracy fluctuate so much, and is there anything I can do to correct it?

[Plot: loss]

[Plot: accuracy]

Here are some (possibly) important details:

Loading the dataset:

tr_dataset = tf.keras.preprocessing.image_dataset_from_directory(
    DATA_PATH,
    labels="inferred",
    label_mode="categorical",
    interpolation="bilinear",
    color_mode="grayscale",
    batch_size=bs,
    image_size=img_size,
    shuffle=True,
    seed=123,
    validation_split=val_split,
    subset="training"
)
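
The call above builds only the training subset; the matching validation split is not shown in the question. Presumably it is created with the same arguments but subset="validation". Note that seed and validation_split must match between the two calls, otherwise the subsets are not complementary and validation metrics become unreliable. A sketch under that assumption (va_dataset is a hypothetical name):

va_dataset = tf.keras.preprocessing.image_dataset_from_directory(
    DATA_PATH,
    labels="inferred",
    label_mode="categorical",
    interpolation="bilinear",
    color_mode="grayscale",
    batch_size=bs,
    image_size=img_size,
    shuffle=True,
    seed=123,                    # must match the training call
    validation_split=val_split,  # must match the training call
    subset="validation"
)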

My model (using the official TF implementation):

def instantiate_char_cnn(include_augmentation=False, name=NAME):
    eff_net_lite = EfficientNetLiteB0(
        include_top=True,
        weights=None,
        input_shape=(img_size[0], img_size[1], 1),
        classes=len(ls),
        pooling="avg",
        classifier_activation="softmax",
    )

    if include_augmentation:
        model = tf.keras.Sequential([
            tf.keras.layers.InputLayer(input_shape=(None, None, 1)),
            PreprocessTFLayer(),
            img_augmentation,
            eff_net_lite,
        ],
        name=name)
    else:
        model = eff_net_lite

    return model
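
For context, here is a minimal sketch of how this model might be compiled and trained under the fp16 setup described in the question; the optimizer, epoch count, and va_dataset are assumptions, not from the original post:

# enable fp16 compute as described in the question (TF 2.4+ API)
tf.keras.mixed_precision.set_global_policy("mixed_float16")

model = instantiate_char_cnn(include_augmentation=True)
model.compile(
    optimizer=tf.keras.optimizers.Adam(),  # assumed optimizer
    loss="categorical_crossentropy",       # matches label_mode="categorical"
    metrics=["accuracy"],
)
history = model.fit(tr_dataset, validation_data=va_dataset, epochs=10)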

The custom layer for preprocessing:

@tf.function
def preprocess_tf(x):
    """
    Preprocessing for TF Lite.

    Args:
        x: a Tensor (batch_size, height, width, channels) of images to preprocess

    Returns:
        a normalized and resized Tensor of images
    """
    # resize images
    x = tf.image.resize(x, img_size, method=tf.image.ResizeMethod.BILINEAR)

    # scale into [0, 1]; note this divides by the maximum over the whole batch
    x = tf.math.divide(x, tf.math.reduce_max(x))

    return x


class PreprocessTFLayer(tf.keras.layers.Layer):
    def __init__(self, name="preprocess_tf", **kwargs):
        super(PreprocessTFLayer, self).__init__(name=name, **kwargs)
        self.preprocess = preprocess_tf

    def call(self, inputs):
        return self.preprocess(inputs)

    def get_config(self):
        config = super(PreprocessTFLayer, self).get_config()
        return config
    
    def get_prunable_weights(self):
        return [] 
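
One detail worth double-checking here: tf.math.reduce_max(x) reduces over the entire batch, so every image is divided by the batch-wide maximum rather than its own. That makes the normalization depend on batch composition, which can differ between training and evaluation. If per-image normalization was intended, a sketch of the fix would be:

# per-image normalization: take the max over each image's own pixels
x = tf.math.divide(x, tf.math.reduce_max(x, axis=[1, 2, 3], keepdims=True))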

The Keras layers for image augmentation:

from tensorflow.keras.layers.experimental.preprocessing import RandomZoom, RandomRotation, RandomTranslation

img_augmentation = tf.keras.Sequential(
    [
        # RandomErasing is a custom/third-party layer; its import is not shown in the question
        RandomErasing.RandomErasing(probability=0.4),
        
        # random data augmentation
        RandomZoom(height_factor=(-0.2, 1.0), width_factor=(-0.2, 1.0), 
            fill_mode='constant', interpolation='bilinear', fill_value=0.0
        ),
        RandomTranslation(0.2, 0.2, fill_mode="constant"),
        RandomRotation(factor=(-0.1, 0.1), fill_mode='constant', interpolation='bilinear'),
    ],
    name = "img_augmentation"
)
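
Note that these Keras preprocessing layers are active only when called with training=True (as during fit), so they should not run at validation or inference time. A quick, hypothetical way to sanity-check the augmentation strength is to push one batch through the pipeline and look at the results:

import matplotlib.pyplot as plt

# visualize a few augmented samples from one training batch
for images, labels in tr_dataset.take(1):
    augmented = img_augmentation(images, training=True)
    for i in range(4):
        plt.subplot(1, 4, i + 1)
        plt.imshow(tf.squeeze(augmented[i]), cmap="gray")
        plt.axis("off")
    plt.show()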


1 Answer

瀟灑尐姊 2025-01-20 06:26:03

There could be many reasons behind this phenomenon, and human error may be involved. The key is how to troubleshoot; manual inspection alone will sometimes not give you useful hints.

  1. Try your implementation on a simpler dataset, such as ImageNet or CIFAR-100, and see whether the same phenomenon reproduces (see the sketch after this list). This helps you make sure there is no bug in your evaluation code.
  2. Randomly shuffle and split your dataset into train, validation, and test sets, then train your model again and see whether the phenomenon reproduces. This helps you make sure the distributions of the train, validation, and test sets are close, and that the fluctuation is not caused by a test-set distribution mismatch.
  3. Turn off FP16 and reduce the batch size, then see whether the phenomenon reproduces. This helps you make sure FP16 is not causing numerical issues.
  4. Use a more reliable implementation, such as an official PyTorch or TensorFlow neural network implementation (possibly a different network, such as ResNet), for your task and see whether the phenomenon reproduces. This helps you make sure your EfficientNet implementation is not the problem.
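
For step 1, a minimal sanity-check harness might look like the sketch below. It is an assumption-laden illustration: the small CNN is a stand-in (the point is to exercise the same training/evaluation code path on a known dataset), and the hyperparameters are arbitrary.

import tensorflow as tf

# Step 1 sketch: train on CIFAR-100 and watch whether validation
# metrics still fluctuate; if they do, suspect the training/eval
# code rather than the original dataset.
(x_tr, y_tr), (x_te, y_te) = tf.keras.datasets.cifar100.load_data()
x_tr, x_te = x_tr / 255.0, x_te / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(100, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_tr, y_tr, validation_data=(x_te, y_te),
          batch_size=256, epochs=5)

For step 3, switching back to full precision is a one-liner before building the model: tf.keras.mixed_precision.set_global_policy("float32").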