What does DataAccessor do in TFX?
I'm reading the TFX tutorials, which all use the DataAccessor to load data. The code looks something like this:
return data_accessor.tf_dataset_factory(
    file_pattern,
    tfxio.TensorFlowDatasetOptions(
        batch_size=batch_size, label_key=_LABEL_KEY),
    schema=schema).repeat()
This makes sense at a high level for the tutorials, but I can't find relevant documentation when I dig deeper.
The tf_dataset_factory function takes a tfxio.TensorFlowDatasetOptions argument, and from the args description I'm deducing that the class has an effect similar to:
dataset = tfds.load(...)                      # load data
dataset = dataset.batch(batch_size)           # batch data
dataset = dataset.shuffle(buffer_size)        # shuffle data
dataset = dataset.prefetch(tf.data.AUTOTUNE)  # prefetch data
But it's not clear to me in what order these are applied (if the order matters), or how I can manipulate the dataset in more detail. For example, I want to apply Dataset.cache(), but it's not clear to me whether applying cache() after tf_dataset_factory makes sense.
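Concretely, what I'd like to write is something like the sketch below. This is purely my guess, not taken from any tutorial; the variables are the same as in the snippet above:

import tensorflow as tf
from tfx_bsl.public import tfxio

def _input_fn(file_pattern, data_accessor, schema, batch_size):
    dataset = data_accessor.tf_dataset_factory(
        file_pattern,
        tfxio.TensorFlowDatasetOptions(
            batch_size=batch_size, label_key=_LABEL_KEY),
        schema=schema)
    # My guess: if this really is a tf.data.Dataset, I should be able to
    # chain cache() (and other Dataset methods) before repeating.
    return dataset.cache().repeat()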
Another example of what I don't understand is whether DataAccessor has predefined distributed training support, i.e. do I need to multiply batch_size by strategy.num_replicas_in_sync? The distributed training tutorials make a clear statement about doing that, but the TFX tutorials don't even mention it, so to me it's a 50/50 whether num_replicas_in_sync is already accounted for.
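For reference, the distributed training tutorials scale the batch size roughly like this (my paraphrase of the tf.distribute guide; per_replica_batch_size is just a name I made up):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

# The tf.distribute tutorials multiply the per-replica batch size by the
# number of replicas to get the global batch size.
per_replica_batch_size = 64
global_batch_size = per_replica_batch_size * strategy.num_replicas_in_sync

# Do I pass global_batch_size as batch_size to tf_dataset_factory,
# or does the DataAccessor already account for the replica count?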
Wondering if anyone else is in the same shoes or has better ideas?
Comments (1)
Looking through the documentation, DataAccessor seems to be a utility wrapper around a tf.data.Dataset factory. The object that is returned is an instance of tf.data.Dataset, so any subsequent methods that apply to a normal Dataset object are valid here too. I can't comment on the distributed training support, and as you rightly pointed out, it isn't documented anywhere.
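As a rough sketch (using the same variables as in your snippet, so treat it as illustrative rather than tested), you should be able to chain standard Dataset methods directly onto the factory's return value:

dataset = data_accessor.tf_dataset_factory(
    file_pattern,
    tfxio.TensorFlowDatasetOptions(
        batch_size=batch_size, label_key=_LABEL_KEY),
    schema=schema)

# The return value behaves like a regular tf.data.Dataset.
print(isinstance(dataset, tf.data.Dataset))

# So ordinary Dataset methods (cache, prefetch, take, ...) chain as usual.
dataset = dataset.cache().repeat().prefetch(tf.data.AUTOTUNE)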