What does DataAccessor do in TFX?
I'm reading the TFX tutorials, which all use the DataAccessor to load data. The code looks something like this:
return data_accessor.tf_dataset_factory(
    file_pattern,
    tfxio.TensorFlowDatasetOptions(
        batch_size=batch_size, label_key=_LABEL_KEY),
    schema=schema).repeat()
This makes sense at a high level for the tutorials, but I can't find relevant documentation when I dig deeper.
The tf_dataset_factory function takes a tfxio.TensorFlowDatasetOptions argument, and from the args description I'm deducing that the class has an effect similar to:
dataset = tfds.load(...)                      # load data
dataset = dataset.batch(batch_size)           # batch data
dataset = dataset.shuffle(buffer_size)        # shuffle data
dataset = dataset.prefetch(tf.data.AUTOTUNE)  # prefetch data
But it's not clear to me in what order these are applied (if the order matters), or how I can manipulate the dataset in more detail. For example, I want to apply Dataset.cache(), but it's not clear to me whether applying cache() after tf_dataset_factory makes sense.
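Concretely, what I'd like to write is something like the sketch below. This is purely my guess, not taken from any tutorial; the variables are the same as in the snippet above:

import tensorflow as tf
from tfx_bsl.public import tfxio

def _input_fn(file_pattern, data_accessor, schema, batch_size):
    dataset = data_accessor.tf_dataset_factory(
        file_pattern,
        tfxio.TensorFlowDatasetOptions(
            batch_size=batch_size, label_key=_LABEL_KEY),
        schema=schema)
    # My guess: if this really is a tf.data.Dataset, I should be able to
    # chain cache() (and other Dataset methods) before repeating.
    return dataset.cache().repeat()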
Another example of what I don't understand is whether DataAccessor has predefined distributed training support, i.e. do I need to multiply batch_size by strategy.num_replicas_in_sync? The distributed training tutorials make a clear statement about doing that, but the TFX tutorials don't even mention it, so to me it's a 50/50 whether num_replicas_in_sync is already accounted for.
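For reference, the distributed training tutorials scale the batch size roughly like this (my paraphrase of the tf.distribute guide; per_replica_batch_size is just a name I made up):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

# The tf.distribute tutorials multiply the per-replica batch size by the
# number of replicas to get the global batch size.
per_replica_batch_size = 64
global_batch_size = per_replica_batch_size * strategy.num_replicas_in_sync

# Do I pass global_batch_size as batch_size to tf_dataset_factory,
# or does the DataAccessor already account for the replica count?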
Wondering if anyone else is in the same shoes or has better ideas?
Comments (1)
Looking through the documentation, DataAccessor seems to be a utility wrapper around a tf.data.Dataset factory. The object that is returned is an instance of tf.data.Dataset, so any subsequent methods that apply to a normal Dataset object are valid here too. I can't comment on the distributed training support, and as you rightly pointed out, it isn't documented anywhere.
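As a rough sketch (using the same variables as in your snippet, so treat it as illustrative rather than tested), you should be able to chain standard Dataset methods directly onto the factory's return value:

dataset = data_accessor.tf_dataset_factory(
    file_pattern,
    tfxio.TensorFlowDatasetOptions(
        batch_size=batch_size, label_key=_LABEL_KEY),
    schema=schema)

# The return value behaves like a regular tf.data.Dataset.
print(isinstance(dataset, tf.data.Dataset))

# So ordinary Dataset methods (cache, prefetch, take, ...) chain as usual.
dataset = dataset.cache().repeat().prefetch(tf.data.AUTOTUNE)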