What does the DataAccessor do in TFX?

Posted on 2025-01-19 01:53:33


I'm reading the TFX tutorials, which all use the DataAccessor to load data. The code looks something like this:

  return data_accessor.tf_dataset_factory(
      file_pattern,
      tfxio.TensorFlowDatasetOptions(
          batch_size=batch_size, label_key=_LABEL_KEY),
      schema=schema).repeat()

This makes sense at a high level for the tutorials, but I couldn't find relevant documentation when I dug deeper.
The tf_dataset_factory function takes a tfxio.TensorFlowDatasetOptions argument, so from the args description I'm deducing that the class has an effect something like:

dataset = tfds.load() # load data
dataset = dataset.batch(batch_size) # batch data
dataset = dataset.shuffle() # shuffle data
dataset = dataset.prefetch() # prefetch data

But it's not clear to me in what order these are applied (if it matters) or how I can manipulate the dataset in detail. For example, I want to apply Dataset.cache(), but it's not clear to me whether applying cache() after tf_dataset_factory makes sense.
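For concreteness, here is roughly what I have in mind. This is only a sketch modelled on the tutorial's _input_fn; _LABEL_KEY is a placeholder here, and the cache() placement is exactly the part I'm unsure about:

import tensorflow as tf
from tfx_bsl.public import tfxio

_LABEL_KEY = 'label'  # placeholder; the real tutorial defines this elsewhere

def _input_fn(file_pattern, data_accessor, schema, batch_size):
  """Same shape as the tutorial's _input_fn, with a guessed cache() placement."""
  dataset = data_accessor.tf_dataset_factory(
      file_pattern,
      tfxio.TensorFlowDatasetOptions(
          batch_size=batch_size, label_key=_LABEL_KEY),
      schema=schema)
  # Guess: cache the parsed, batched examples before repeat()/prefetch(),
  # so they are not re-read and re-decoded every epoch.
  return dataset.cache().repeat().prefetch(tf.data.AUTOTUNE)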

Another example of what I don't understand is whether DataAccessor has built-in distributed training support, i.e. do I need to multiply batch_size by strategy.num_replicas_in_sync? The distributed training tutorials make a clear statement about doing that, but the TFX tutorials don't even mention it, so to me it's 50/50 whether num_replicas_in_sync is already accounted for.
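For reference, this is the pattern the tf.distribute tutorials use for computing the global batch size (the per-replica value below is just illustrative); what I can't tell is whether the DataAccessor pipeline expects this global value or a per-replica one:

import tensorflow as tf

# tf.distribute convention: the input pipeline is given the *global*
# batch size, i.e. per-replica batch size times the number of replicas.
strategy = tf.distribute.MirroredStrategy()
per_replica_batch_size = 64  # illustrative value
global_batch_size = per_replica_batch_size * strategy.num_replicas_in_sync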

Wondering if anyone else is in the same boat or has better ideas?


Comments (1)

筑梦 2025-01-26 01:53:33


Looking through the documentation, DataAccessor seems to be a utility wrapper around a tf.data.Dataset factory. The object that is returned is an instance of tf.data.Dataset, so any subsequent methods that apply to a normal Dataset object are valid here too.
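For example, something like this should work (an untested sketch; the arguments mirror the ones in your snippet):

import tensorflow as tf
from tfx_bsl.public import tfxio

def _inspect_dataset(file_pattern, data_accessor, schema, label_key, batch_size=32):
  """Sketch: the factory returns a plain tf.data.Dataset, so the usual methods chain."""
  dataset = data_accessor.tf_dataset_factory(
      file_pattern,
      tfxio.TensorFlowDatasetOptions(batch_size=batch_size, label_key=label_key),
      schema=schema)
  assert isinstance(dataset, tf.data.Dataset)
  print(dataset.element_spec)  # (features dict, labels) per batch
  return dataset.take(5)       # any standard Dataset method works here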

I can't comment on the distributed training support, and as you rightly pointed out, it isn't documented anywhere.
