How can I verify that my training job is reading the Augmented Manifest file?

Posted 2025-01-13 22:26:02

Apologies for the long post.

Originally, I had my data in one location in an S3 bucket and trained deep learning image classification models on it using the typical 'File' mode, passing the S3 URI where the data is stored as the training input. To try to speed up training, I wanted to switch to using:

  1. Pipe mode, to stream the data rather than download it all at the start of training, so that training starts faster and disk space is saved.
  2. An Augmented Manifest File, coupled with 1., so that I don't have to place my data in a single location on S3 and can avoid moving data around when I train models.

I modeled my script on the one in this example. I printed the steps performed while parsing the data, and I noticed that the data might not actually be read, because the prints show the following:

step 1 Tensor("ParseSingleExample/ParseExample/ParseExampleV2:0", shape=(), dtype=string)
step 2 Tensor("DecodePng:0", shape=(None, None, 3), dtype=uint8)
step 3 Tensor("Cast:0", shape=(None, None, 3), dtype=float32)

I guess the image is not being read/found, since the shape is [None, None, 3] when it should be [224, 224, 3], so maybe the problem is with the Augmented Manifest file?

Here's an example of how my Augmented Manifest file is written:

{"image-ref": "s3://path/to/my/image/image1.png", "label": 1}
{"image-ref": "s3://path/to/my/image/image2.png", "label": 2}
{"image-ref": "s3://path/to/my/image/image3.png", "label": 3}

Some other details I should probably mention:

  1. When I create the TrainingInput I pass 'content_type': 'application/x-recordio' and 'record_wrapping': 'RecordIO', even though my data are in .png format; I assumed that as the augmented manifest file is read, the data get wrapped in the RecordIO format (see the sketch after this list).
  2. Following my first point, I pass PipeModeDataset(channel=channel, record_format='RecordIO'), so I'm also not sure about the RecordIO part.
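
For reference, here is roughly how I create that input. This is a minimal sketch using the sagemaker Python SDK; the manifest path and the estimator are placeholders, not my actual values:

# Sketch only: the manifest path and `estimator` are placeholders.
from sagemaker.inputs import TrainingInput

train_input = TrainingInput(
    s3_data='s3://my-bucket/manifests/train.manifest',  # placeholder path
    s3_data_type='AugmentedManifestFile',
    attribute_names=['image-ref', 'label'],  # the keys used in the manifest lines
    content_type='application/x-recordio',
    record_wrapping='RecordIO',
    input_mode='Pipe',
)

estimator.fit({'train': train_input})  # estimator is defined elsewhere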

No actual error is raised; when I start fitting the model, nothing happens. The job keeps running, but nothing actually executes, so I'm trying to find the issue.


EDIT: It now reads the shape correctly, but there is still the issue where it enters the .fit method and does nothing; it just keeps running without making progress. Part of the script is below.

# Imports assumed by this snippet (they were omitted from the excerpt):
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from sagemaker_tensorflow import PipeModeDataset

AUTOTUNE = tf.data.experimental.AUTOTUNE


def train_input_fn(train_channel):
    """Returns input function that feeds the model during training"""
    return _input_fn(train_channel)

def _input_fn(channel):
    """Returns a Dataset that reads from a SageMaker Pipe Mode channel."""
    
    features = {
        'image-ref': tf.io.FixedLenFeature([], tf.string),
        'label': tf.io.FixedLenFeature([3], tf.int64),
    }
 
    def combine(records):
        return records[0], records[1]
 
    def parse(record):
        # Parse one RecordIO-wrapped example into its feature dict.
        parsed = tf.io.parse_single_example(record, features)

        # Decode the PNG bytes and force the fixed shape the model expects.
        image = tf.io.decode_png(parsed["image-ref"], channels=3, dtype=tf.uint8)
        image = tf.reshape(image, [224, 224, 3])
        
        lbl = parsed['label']
        print(image, lbl)
        return (image, lbl)
 
    ds = PipeModeDataset(channel=channel, record_format='RecordIO')
    ds = ds.map(parse, num_parallel_calls=AUTOTUNE)
    ds = ds.prefetch(AUTOTUNE)
 
    return ds

def model(dataset):
    """Generate a simple model"""
    inputs = Input(shape=(224, 224, 3))
    prediction_layer = Dense(2, activation='softmax')

    x = inputs
    x = tf.keras.applications.mobilenet.MobileNet(include_top=False, input_shape=(224, 224, 3), weights='imagenet')(x)
    outputs = prediction_layer(x)
    rec_model = tf.keras.Model(inputs, outputs)
    
    rec_model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
        metrics=['accuracy']
    )

    rec_model.fit(dataset)

    return rec_model

def main(params):
    epochs = params['epochs']
    train_channel = params['train_channel']
    record_format = params['record_format']
    batch_size = params['batch_size']

    train_spec = train_input_fn(train_channel)
    model_classifier = model(train_spec)
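
A way to check whether any records actually arrive, per the title question, would be to iterate the dataset eagerly before calling .fit. This is a minimal diagnostic sketch, assuming TF 2.x eager execution and a channel named 'train':

# Diagnostic sketch: pull one raw record off the pipe before any parsing.
# The channel name 'train' is an assumption.
from sagemaker_tensorflow import PipeModeDataset

raw_ds = PipeModeDataset(channel='train', record_format='RecordIO')
for record in raw_ds.take(1):
    print('got a record of', len(record.numpy()), 'bytes')

If this loop also hangs without printing anything, the channel itself is delivering no records, which would point at the input configuration rather than the model.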


Comments (1)

清晨说晚安 2025-01-20 22:26:03


From here:

A PipeModeDataset can read TFRecord, RecordIO, or text line records.

While you're trying to read binary (PNG) files, I don't see a relevant record reader here that would help you do that.
You could build your own format pipe implementation, as shown here, but it takes considerably more effort.
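
A lighter-weight workaround, sketched below and not taken from the linked docs, is to pre-pack the PNG bytes into TFRecord files, which a PipeModeDataset can read with record_format='TFRecord'. The file names and labels here are placeholders:

# Sketch: wrap raw PNG bytes into a TFRecord file so PipeModeDataset
# (record_format='TFRecord') can stream them. Paths/labels are placeholders.
import tensorflow as tf

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

with tf.io.TFRecordWriter('train-00000.tfrecord') as writer:
    for path, label in [('image1.png', 1), ('image2.png', 2)]:
        png_bytes = open(path, 'rb').read()
        example = tf.train.Example(features=tf.train.Features(feature={
            'image': _bytes_feature(png_bytes),
            'label': _int64_feature(label),
        }))
        writer.write(example.SerializeToString())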

Alternatively, you mentioned your files are scattered across different folders; if your files' common path contains fewer than 2M files, you could use FastFile mode to stream the data. Currently, FastFile only supports an S3 prefix, so you won't be able to use a manifest.
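
For example, with the sagemaker SDK; a rough sketch where the prefix and estimator are placeholders:

# Sketch: FastFile mode streams objects under a plain S3 prefix on demand.
# The bucket prefix and `estimator` are placeholders.
from sagemaker.inputs import TrainingInput

train_input = TrainingInput(
    s3_data='s3://my-bucket/images/',  # common prefix, placeholder
    input_mode='FastFile',
)
estimator.fit({'train': train_input})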

Also see this general pros/cons discussion of the different storage and input types available in SageMaker.
