TFX - How to inspect records from CsvExampleGen

Posted on 2025-01-24 10:40:20

Question

How to inspect the data loaded into TFX CsvExampleGen?

CSV

The top 3 rows of california_housing_train.csv look like this:

longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
-122.05,37.37,27,3885,661,1537,606,6.6085,344700
-118.3,34.26,43,1510,310,809,277,3.599,176500
-117.81,33.78,27,3589,507,1484,495,5.7934,270500

CsvExampleGen

The CSV is loaded into CsvExampleGen. As I understand it, the XXXExampleGen components generate TFRecord files of tf.train.Example records, so I wonder whether there is a way to iterate through the records produced by CsvExampleGen.

from tfx.components import (
    CsvExampleGen
)
housing = CsvExampleGen("sample_data/california_housing_train.csv")
housing
----------
CsvExampleGen(
    spec: <tfx.types.standard_component_specs.FileBasedExampleGenSpec object at 0x7fcd90435450>,
    executor_spec: <tfx.dsl.components.base.executor_spec.BeamExecutorSpec object at 0x7fcd90435850>,
    driver_class: <class 'tfx.components.example_gen.driver.FileBasedDriver'>,
    component_id: CsvExampleGen,
    inputs: {},
    outputs: {
        'examples': OutputChannel(artifact_type=Examples,
        producer_component_id=CsvExampleGen,
        output_key=examples,
        additional_properties={},
        additional_custom_properties={})
    }
)

Experiment

for record in housing.outputs['examples']:
    print(record)

TypeError                                 Traceback (most recent call last)
----> 1 for record in housing.outputs['examples']:
      2     print(record)

TypeError: 'OutputChannel' object is not iterable
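
The error is consistent with the component never having been executed: constructing CsvExampleGen only declares an output channel, and no records exist until the component runs. The following is a minimal sketch of running it in a notebook, not a definitive recipe. It assumes TFX's InteractiveContext is available, that `input_base` should point at the directory holding the CSV rather than at the file itself, and that this directory contains only CSV files sharing one header. The helper names `input_base_for` and `run_csv_example_gen` are hypothetical.

```python
import os


def input_base_for(csv_path: str) -> str:
    """Return the directory that contains csv_path (hypothetical helper)."""
    return os.path.dirname(csv_path)


def run_csv_example_gen(csv_path: str) -> str:
    """Run CsvExampleGen interactively and return the output artifact URI."""
    # Imported lazily so the helper above works without TFX installed.
    from tfx.components import CsvExampleGen
    from tfx.orchestration.experimental.interactive.interactive_context import (
        InteractiveContext,
    )

    context = InteractiveContext()  # uses a temporary pipeline root by default
    # Assumption: the directory holds only CSV files with the same header.
    example_gen = CsvExampleGen(input_base=input_base_for(csv_path))
    context.run(example_gen)  # executes the component and writes TFRecord files
    # After the run, the channel resolves to concrete artifacts.
    return example_gen.outputs["examples"].get()[0].uri
```

Calling `run_csv_example_gen("sample_data/california_housing_train.csv")` would then return a directory path whose `Split-train` and `Split-eval` subdirectories hold the generated records, assuming the default split configuration.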

Comments (1)

水波映月 2025-01-31 10:40:20

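Have you had a chance to look at the section of the TFX tutorials that explains how to display the artifacts of the ExampleGen component? The component has to be executed first (the tutorial runs it in an InteractiveContext); after that, the output channel resolves to an artifact whose URI is a directory of TFRecord files. You can adapt the code below (source: TFX tutorial) to read them.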

import os
import pprint

import tensorflow as tf

pp = pprint.PrettyPrinter()

# `example_gen` is the CsvExampleGen component after it has been run
# (it is called `housing` in the question).
# Get the URI of the output artifact representing the training examples, which is a directory
train_uri = os.path.join(example_gen.outputs['examples'].get()[0].uri, 'Split-train')

# Get the list of files in this directory (all compressed TFRecord files)
tfrecord_filenames = [os.path.join(train_uri, name)
                      for name in os.listdir(train_uri)]

# Create a `TFRecordDataset` to read these files
dataset = tf.data.TFRecordDataset(tfrecord_filenames, compression_type="GZIP")

# Iterate over the first 3 records and decode them.
for tfrecord in dataset.take(3):
  serialized_example = tfrecord.numpy()
  example = tf.train.Example()
  example.ParseFromString(serialized_example)
  pp.pprint(example)
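
If the printed protocol buffers are hard to scan, the same loop can be wrapped so that each record comes back as a plain dictionary keyed by column name. This is only a sketch under assumptions: the function names `list_split_files`, `feature_to_list`, and `iter_records` are hypothetical, the files are assumed to be GZIP-compressed TFRecords laid out as `Split-<name>` directories as in the code above, and TensorFlow is imported lazily inside the reader.

```python
import os
from typing import Dict, Iterator, List


def list_split_files(artifact_uri: str, split: str = "train") -> List[str]:
    """Return the sorted file paths under <artifact_uri>/Split-<split>."""
    split_dir = os.path.join(artifact_uri, f"Split-{split}")
    return sorted(os.path.join(split_dir, name) for name in os.listdir(split_dir))


def feature_to_list(feature) -> list:
    """Unpack a tf.train.Feature into a plain Python list."""
    # A Feature holds exactly one of bytes_list, float_list, int64_list,
    # or nothing at all when the value is missing.
    kind = feature.WhichOneof("kind")
    return list(getattr(feature, kind).value) if kind else []


def iter_records(
    artifact_uri: str, split: str = "train", limit: int = 3
) -> Iterator[Dict[str, list]]:
    """Yield up to `limit` records of one split as {column: values} dicts."""
    import tensorflow as tf  # lazy import: only needed when reading records

    dataset = tf.data.TFRecordDataset(
        list_split_files(artifact_uri, split), compression_type="GZIP"
    )
    for raw in dataset.take(limit):
        example = tf.train.Example()
        example.ParseFromString(raw.numpy())
        yield {
            name: feature_to_list(feature)
            for name, feature in example.features.feature.items()
        }
```

With the component already run, `for row in iter_records(example_gen.outputs['examples'].get()[0].uri): print(row)` would print one dictionary per record, with each column mapped to a list of values.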
