Why is my validation loss/accuracy fluctuating although manual testing shows good results?

Posted 2025-01-13 06:26:03


I am training an EfficientNet Lite (from scratch) on a dataset of ~10,000,000 images (128x128x1) with ~6,500 classes. Both my training loss and my training accuracy are converging.
However, my validation loss/accuracy fluctuate. When I test the CNN manually on some inputs, it looks very good and recognizes (nearly) everything correctly.
Because my GPU has only 8 GB of memory, I am training with batch size 256 and fp16 computation.

Now my question is: why does the validation loss/accuracy fluctuate so much, and is there anything I can do to correct it?

[Plot: loss]

[Plot: accuracy]

Here are some (possibly) important details:

Loading the dataset:

tr_dataset = tf.keras.preprocessing.image_dataset_from_directory(
    DATA_PATH,
    labels="inferred",
    label_mode="categorical",
    interpolation="bilinear",
    color_mode="grayscale",
    batch_size=bs,
    image_size=img_size,
    shuffle=True,
    seed=123,
    validation_split=val_split,
    subset="training"
)
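
The call above builds only the training subset; the matching validation split is not shown in the question. Presumably it is created with the same arguments but subset="validation". Note that seed and validation_split must match between the two calls, otherwise the subsets are not complementary and validation metrics become unreliable. A sketch under that assumption (va_dataset is a hypothetical name):

va_dataset = tf.keras.preprocessing.image_dataset_from_directory(
    DATA_PATH,
    labels="inferred",
    label_mode="categorical",
    interpolation="bilinear",
    color_mode="grayscale",
    batch_size=bs,
    image_size=img_size,
    shuffle=True,
    seed=123,                    # must match the training call
    validation_split=val_split,  # must match the training call
    subset="validation"
)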

My model (using the official TF implementation):

def instantiate_char_cnn(include_augmentation=False, name=NAME):
    eff_net_lite = EfficientNetLiteB0(
        include_top=True,
        weights=None,
        input_shape=(img_size[0], img_size[1], 1),
        classes=len(ls),
        pooling="avg",
        classifier_activation="softmax",
    )

    if include_augmentation:
        model = tf.keras.Sequential([
            tf.keras.layers.InputLayer(input_shape=(None, None, 1)),
            PreprocessTFLayer(),
            img_augmentation,
            eff_net_lite,
        ],
        name=name)
    else:
        model = eff_net_lite

    return model
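
For context, here is a minimal sketch of how this model might be compiled and trained under the fp16 setup described in the question; the optimizer, epoch count, and va_dataset are assumptions, not from the original post:

# enable fp16 compute as described in the question (TF 2.4+ API)
tf.keras.mixed_precision.set_global_policy("mixed_float16")

model = instantiate_char_cnn(include_augmentation=True)
model.compile(
    optimizer=tf.keras.optimizers.Adam(),  # assumed optimizer
    loss="categorical_crossentropy",       # matches label_mode="categorical"
    metrics=["accuracy"],
)
history = model.fit(tr_dataset, validation_data=va_dataset, epochs=10)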

The custom layer for preprocessing:

@tf.function
def preprocess_tf(x):
    """
    Preprocessing for TF Lite.

    Args:
        x: a Tensor (batch_size, height, width, channels) of images to preprocess

    Returns:
        a normalized and resized Tensor of images
    """
    # resize images
    x = tf.image.resize(x, img_size, method=tf.image.ResizeMethod.BILINEAR)

    # scale into [0, 1]; note this divides by the maximum over the whole batch
    x = tf.math.divide(x, tf.math.reduce_max(x))

    return x


class PreprocessTFLayer(tf.keras.layers.Layer):
    def __init__(self, name="preprocess_tf", **kwargs):
        super(PreprocessTFLayer, self).__init__(name=name, **kwargs)
        self.preprocess = preprocess_tf

    def call(self, inputs):
        return self.preprocess(inputs)

    def get_config(self):
        config = super(PreprocessTFLayer, self).get_config()
        return config
    
    def get_prunable_weights(self):
        return [] 
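
One detail worth double-checking here: tf.math.reduce_max(x) reduces over the entire batch, so every image is divided by the batch-wide maximum rather than its own. That makes the normalization depend on batch composition, which can differ between training and evaluation. If per-image normalization was intended, a sketch of the fix would be:

# per-image normalization: take the max over each image's own pixels
x = tf.math.divide(x, tf.math.reduce_max(x, axis=[1, 2, 3], keepdims=True))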

The Keras layers for image augmentation:

from tensorflow.keras.layers.experimental.preprocessing import RandomZoom, RandomRotation, RandomTranslation

img_augmentation = tf.keras.Sequential(
    [
        # RandomErasing is a custom/third-party layer; its import is not shown in the question
        RandomErasing.RandomErasing(probability=0.4),
        
        # random data augmentation
        RandomZoom(height_factor=(-0.2, 1.0), width_factor=(-0.2, 1.0), 
            fill_mode='constant', interpolation='bilinear', fill_value=0.0
        ),
        RandomTranslation(0.2, 0.2, fill_mode="constant"),
        RandomRotation(factor=(-0.1, 0.1), fill_mode='constant', interpolation='bilinear'),
    ],
    name = "img_augmentation"
)
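
Note that these Keras preprocessing layers are active only when called with training=True (as during fit), so they should not run at validation or inference time. A quick, hypothetical way to sanity-check the augmentation strength is to push one batch through the pipeline and look at the results:

import matplotlib.pyplot as plt

# visualize a few augmented samples from one training batch
for images, labels in tr_dataset.take(1):
    augmented = img_augmentation(images, training=True)
    for i in range(4):
        plt.subplot(1, 4, i + 1)
        plt.imshow(tf.squeeze(augmented[i]), cmap="gray")
        plt.axis("off")
    plt.show()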


1 Answer

瀟灑尐姊 2025-01-20 06:26:03

There could be many reasons behind this phenomenon, and human error may be involved. The key is how to troubleshoot; manual inspection alone will sometimes not give you useful hints.

  1. Try your implementation on a simpler dataset, such as ImageNet or CIFAR-100, and see whether the same phenomenon reproduces (see the sketch after this list). This helps you make sure there is no bug in your evaluation code.
  2. Randomly shuffle and split your dataset into train, validation, and test sets, then train your model again and see whether the phenomenon reproduces. This helps you make sure the distributions of the train, validation, and test sets are close, and that the fluctuation is not caused by a test-set distribution mismatch.
  3. Turn off FP16 and reduce the batch size, then see whether the phenomenon reproduces. This helps you make sure FP16 is not causing numerical issues.
  4. Use a more reliable implementation, such as an official PyTorch or TensorFlow neural network implementation (possibly a different network, such as ResNet), for your task and see whether the phenomenon reproduces. This helps you make sure your EfficientNet implementation is not the problem.
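
For step 1, a minimal sanity-check harness might look like the sketch below. It is an assumption-laden illustration: the small CNN is a stand-in (the point is to exercise the same training/evaluation code path on a known dataset), and the hyperparameters are arbitrary.

import tensorflow as tf

# Step 1 sketch: train on CIFAR-100 and watch whether validation
# metrics still fluctuate; if they do, suspect the training/eval
# code rather than the original dataset.
(x_tr, y_tr), (x_te, y_te) = tf.keras.datasets.cifar100.load_data()
x_tr, x_te = x_tr / 255.0, x_te / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(100, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_tr, y_tr, validation_data=(x_te, y_te),
          batch_size=256, epochs=5)

For step 3, switching back to full precision is a one-liner before building the model: tf.keras.mixed_precision.set_global_policy("float32").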