Why is my validation loss/accuracy fluctuating even though manual testing shows good results?
I am training an EfficientNet Lite (from scratch) on a dataset of ~10,000,000 images (128x128x1) with ~6,500 classes. Both my training loss and my training accuracy are converging. However, my test loss/accuracy fluctuates. When I test the CNN manually on some inputs, it looks very good and recognizes (nearly) everything correctly. Because my GPU has only 8 GB of memory, I am training with batch size 256 and fp16 computation.
Now my question is: why is the test loss/accuracy fluctuating so much, and is there anything that can correct for it?
Here are some (maybe) important details:
Loading the dataset:
tr_dataset = tf.keras.preprocessing.image_dataset_from_directory(
    DATA_PATH,
    labels="inferred",
    label_mode="categorical",
    interpolation="bilinear",
    color_mode="grayscale",
    batch_size=bs,
    image_size=img_size,
    shuffle=True,
    seed=123,
    validation_split=val_split,
    subset="training"
)
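The snippet above only builds the "training" subset. For the validation metrics to be meaningful, the matching "validation" subset has to be created with the same seed and split fraction so the two subsets do not overlap; a minimal sketch under that assumption (val_dataset is a name introduced here, not from the question):

val_dataset = tf.keras.preprocessing.image_dataset_from_directory(
    DATA_PATH,
    labels="inferred",
    label_mode="categorical",
    interpolation="bilinear",
    color_mode="grayscale",
    batch_size=bs,
    image_size=img_size,
    shuffle=True,
    seed=123,               # same seed as the training subset, so the split is consistent
    validation_split=val_split,
    subset="validation"
)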
My model (the official TF implementation):
def instantiate_char_cnn(include_augmentation=False, name=NAME):
    eff_net_lite = EfficientNetLiteB0(
        include_top=True,
        weights=None,
        input_shape=(img_size[0], img_size[1], 1),
        classes=len(ls),
        pooling="avg",
        classifier_activation="softmax",
    )
    if include_augmentation:  # the original tested the global img_augmentation instead of this argument
        model = tf.keras.Sequential([
            tf.keras.layers.InputLayer(input_shape=(None, None, 1)),
            PreprocessTFLayer(),
            img_augmentation,
            eff_net_lite,
        ], name=name)
    else:
        model = eff_net_lite
    return model  # the original snippet never returned the model
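For reference, a minimal sketch of how this function might be compiled and trained; the optimizer, learning rate, and epoch count below are assumptions, not taken from the question:

model = instantiate_char_cnn(include_augmentation=True)
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-3),   # assumed optimizer and learning rate
    loss="categorical_crossentropy",            # matches label_mode="categorical" above
    metrics=["accuracy"],
)
model.fit(tr_dataset, validation_data=val_dataset, epochs=10)  # val_dataset as sketched above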
The custom layer for preprocessing:
@tf.function
def preprocess_tf(x):
    """
    Preprocessing for TF Lite.

    Args:
        x: a Tensor(batch_size, height, width, channels) of images to preprocess
    Return:
        normalized and resized Tensor of images
    """
    # resize images
    x = tf.image.resize(x, img_size, method=tf.image.ResizeMethod.BILINEAR)
    # normalize image between [0, 1]
    x = tf.math.divide(x, tf.math.reduce_max(x))
    return x


class PreprocessTFLayer(tf.keras.layers.Layer):
    def __init__(self, name="preprocess_tf", **kwargs):
        super(PreprocessTFLayer, self).__init__(name=name, **kwargs)
        self.preprocess = preprocess_tf

    def call(self, input):
        return self.preprocess(input)

    def get_config(self):
        return super(PreprocessTFLayer, self).get_config()

    def get_prunable_weights(self):
        return []
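One detail worth noting: tf.math.reduce_max(x) in preprocess_tf is computed over the whole batch, so each image's scaling depends on the other images it happens to be batched with, and that differs between training (batches of 256) and manual single-image testing. For comparison, a per-image variant would look like this (an illustrative alternative, not the original code):

@tf.function
def preprocess_tf_per_image(x):
    # Resize, then scale each image independently by its own maximum,
    # so the result does not depend on batch composition.
    x = tf.image.resize(x, img_size, method=tf.image.ResizeMethod.BILINEAR)
    per_image_max = tf.reduce_max(x, axis=[1, 2, 3], keepdims=True)
    return x / tf.maximum(per_image_max, 1e-8)  # guard against all-zero images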
The Keras layers for image augmentation:
from tensorflow.keras.layers.experimental.preprocessing import Resizing, Rescaling, RandomZoom, RandomRotation, RandomTranslation

img_augmentation = tf.keras.Sequential(
    [
        RandomErasing.RandomErasing(probability=0.4),
        # random data augmentation
        RandomZoom(height_factor=(-0.2, 1.0), width_factor=(-0.2, 1.0),
                   fill_mode='constant', interpolation='bilinear', fill_value=0.0),
        RandomTranslation(0.2, 0.2, fill_mode="constant"),
        RandomRotation(factor=(-0.1, 0.1), fill_mode='constant', interpolation='bilinear'),
    ],
    name="img_augmentation"
)
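Note that Keras preprocessing layers like these are only active in training mode; during the validation pass of model.fit (and in model.evaluate / model.predict) they pass inputs through unchanged, so the fluctuation cannot come from augmentation being applied at validation time. A quick check, assuming TF 2.x:

from tensorflow.keras.layers.experimental.preprocessing import RandomRotation

aug = RandomRotation(0.1)
x = tf.random.uniform((1, 128, 128, 1))
# With training=False the layer is an identity mapping.
print(tf.reduce_all(tf.equal(aug(x, training=False), x)).numpy())  # True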
1 Answer
There could be many reasons behind this phenomenon, and human error may be involved. The key is how you troubleshoot; manual inspection sometimes will not give you useful hints.
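For example, one systematic starting point is to score the model on the fixed validation split instead of hand-picked inputs, so successive measurements are comparable; a minimal sketch, assuming the val_dataset and compiled model sketched in the question above:

# Augmentation layers are inactive here because evaluate runs with training=False.
val_loss, val_acc = model.evaluate(val_dataset, verbose=1)
print(f"val_loss={val_loss:.4f}  val_acc={val_acc:.4f}")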