How can I verify that my training job is reading the Augmented Manifest file?

Posted 2025-01-13 22:26:02

Apologies for the long post.

Originally, I had my data in one location in an S3 bucket and trained deep learning image classification models on it using the typical 'File' mode, passing the S3 URI where the data is stored as the training input. To try to speed up training, I wanted to switch to using:

  1. Pipe mode, to stream the data rather than download it all at the start of training, so that training starts faster and disk space is saved.
  2. An Augmented Manifest File, coupled with 1., so that I don't have to place my data in a single location on S3 and can avoid moving data around when I train models.

I modeled my script on the one in this example. I printed the steps performed while parsing the data, and I noticed that the data might not actually be read, because the prints show the following:

step 1 Tensor("ParseSingleExample/ParseExample/ParseExampleV2:0", shape=(), dtype=string)
step 2 Tensor("DecodePng:0", shape=(None, None, 3), dtype=uint8)
step 3 Tensor("Cast:0", shape=(None, None, 3), dtype=float32)

I guess the image is not being read/found, since the shape is [None, None, 3] when it should be [224, 224, 3], so maybe the problem is with the Augmented Manifest file?

Here's an example of how my Augmented Manifest file is written:

{"image-ref": "s3://path/to/my/image/image1.png", "label": 1}
{"image-ref": "s3://path/to/my/image/image2.png", "label": 2}
{"image-ref": "s3://path/to/my/image/image3.png", "label": 3}

Some other details I should probably mention:

  1. When I create the TrainingInput I pass 'content_type': 'application/x-recordio' and 'record_wrapping': 'RecordIO', even though my data are in .png format; I assumed that as the augmented manifest file is read, the data get wrapped in the RecordIO format (see the sketch after this list).
  2. Following my first point, I pass PipeModeDataset(channel=channel, record_format='RecordIO'), so I'm also not sure about the RecordIO part.
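
For reference, here is roughly how I create that input. This is a minimal sketch using the sagemaker Python SDK; the manifest path and the estimator are placeholders, not my actual values:

# Sketch only: the manifest path and `estimator` are placeholders.
from sagemaker.inputs import TrainingInput

train_input = TrainingInput(
    s3_data='s3://my-bucket/manifests/train.manifest',  # placeholder path
    s3_data_type='AugmentedManifestFile',
    attribute_names=['image-ref', 'label'],  # the keys used in the manifest lines
    content_type='application/x-recordio',
    record_wrapping='RecordIO',
    input_mode='Pipe',
)

estimator.fit({'train': train_input})  # estimator is defined elsewhere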

No actual error is raised; when I start fitting the model, nothing happens. The job keeps running, but nothing actually executes, so I'm trying to find the issue.


EDIT: It now reads the shape correctly, but there is still the issue where it enters the .fit method and does nothing; it just keeps running without making progress. Part of the script is below.

# Imports assumed by this snippet (they were omitted from the excerpt):
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from sagemaker_tensorflow import PipeModeDataset

AUTOTUNE = tf.data.experimental.AUTOTUNE


def train_input_fn(train_channel):
    """Returns input function that feeds the model during training"""
    return _input_fn(train_channel)

def _input_fn(channel):
    """Returns a Dataset that reads from a SageMaker Pipe Mode channel."""
    
    features = {
        'image-ref': tf.io.FixedLenFeature([], tf.string),
        'label': tf.io.FixedLenFeature([3], tf.int64),
    }
 
    def combine(records):
        return records[0], records[1]
 
    def parse(record):
        # Parse one RecordIO-wrapped example into its feature dict.
        parsed = tf.io.parse_single_example(record, features)

        # Decode the PNG bytes and force the fixed shape the model expects.
        image = tf.io.decode_png(parsed["image-ref"], channels=3, dtype=tf.uint8)
        image = tf.reshape(image, [224, 224, 3])
        
        lbl = parsed['label']
        print(image, lbl)
        return (image, lbl)
 
    ds = PipeModeDataset(channel=channel, record_format='RecordIO')
    ds = ds.map(parse, num_parallel_calls=AUTOTUNE)
    ds = ds.prefetch(AUTOTUNE)
 
    return ds

def model(dataset):
    """Generate a simple model"""
    inputs = Input(shape=(224, 224, 3))
    prediction_layer = Dense(2, activation='softmax')

    x = inputs
    x = tf.keras.applications.mobilenet.MobileNet(include_top=False, input_shape=(224, 224, 3), weights='imagenet')(x)
    outputs = prediction_layer(x)
    rec_model = tf.keras.Model(inputs, outputs)
    
    rec_model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
        metrics=['accuracy']
    )

    rec_model.fit(dataset)

    return rec_model

def main(params):
    epochs = params['epochs']
    train_channel = params['train_channel']
    record_format = params['record_format']
    batch_size = params['batch_size']

    train_spec = train_input_fn(train_channel)
    model_classifier = model(train_spec)
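
A way to check whether any records actually arrive, per the title question, would be to iterate the dataset eagerly before calling .fit. This is a minimal diagnostic sketch, assuming TF 2.x eager execution and a channel named 'train':

# Diagnostic sketch: pull one raw record off the pipe before any parsing.
# The channel name 'train' is an assumption.
from sagemaker_tensorflow import PipeModeDataset

raw_ds = PipeModeDataset(channel='train', record_format='RecordIO')
for record in raw_ds.take(1):
    print('got a record of', len(record.numpy()), 'bytes')

If this loop also hangs without printing anything, the channel itself is delivering no records, which would point at the input configuration rather than the model.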


Comments (1)

清晨说晚安 2025-01-20 22:26:03


From here:

A PipeModeDataset can read TFRecord, RecordIO, or text line records.

While you're trying to read binary (PNG) files, I don't see a relevant record reader here that would help you do that.
You could build your own format pipe implementation, as shown here, but it takes considerably more effort.
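
A lighter-weight workaround, sketched below and not taken from the linked docs, is to pre-pack the PNG bytes into TFRecord files, which a PipeModeDataset can read with record_format='TFRecord'. The file names and labels here are placeholders:

# Sketch: wrap raw PNG bytes into a TFRecord file so PipeModeDataset
# (record_format='TFRecord') can stream them. Paths/labels are placeholders.
import tensorflow as tf

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

with tf.io.TFRecordWriter('train-00000.tfrecord') as writer:
    for path, label in [('image1.png', 1), ('image2.png', 2)]:
        png_bytes = open(path, 'rb').read()
        example = tf.train.Example(features=tf.train.Features(feature={
            'image': _bytes_feature(png_bytes),
            'label': _int64_feature(label),
        }))
        writer.write(example.SerializeToString())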

Alternatively, you mentioned your files are scattered across different folders; if your files' common path contains fewer than 2M files, you could use FastFile mode to stream the data. Currently, FastFile only supports an S3 prefix, so you won't be able to use a manifest.
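
For example, with the sagemaker SDK; a rough sketch where the prefix and estimator are placeholders:

# Sketch: FastFile mode streams objects under a plain S3 prefix on demand.
# The bucket prefix and `estimator` are placeholders.
from sagemaker.inputs import TrainingInput

train_input = TrainingInput(
    s3_data='s3://my-bucket/images/',  # common prefix, placeholder
    input_mode='FastFile',
)
estimator.fit({'train': train_input})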

Also see this general pros/cons discussion of the different storage and input types available in SageMaker.
