Python Apache Beam outputs an empty file with the Dataflow Runner but works with the Direct Runner; Dataflow raises no errors
I've been trying to run this Apache Beam script. It runs nightly through an Airflow DAG and works perfectly fine that way, so I'm (reasonably) confident the script itself is correct. I think the relevant part is summarized here:
import apache_beam as beam

def configure_pipeline(p, opt):
    """Specify PCollection and transformations in pipeline."""
    # Read newline-delimited input, skipping the header line.
    read_input_source = beam.io.ReadFromText(
        opt.input_path,
        skip_header_lines=1,
        strip_trailing_newlines=True)
    _ = (p
         | 'Read input' >> read_input_source
         | 'Get prediction' >> beam.ParDo(PredictionFromFeaturesDoFn())
         | 'Save to disk' >> beam.io.WriteToText(
               opt.output_path, file_name_suffix='.ndjson.gz'))
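(For context, PredictionFromFeaturesDoFn isn't shown in the question. A minimal, hypothetical sketch of such a DoFn — one JSON line in, one NDJSON line out, with run_model standing in for whatever scoring logic the real class uses — might look like this:)

import json
import apache_beam as beam

def run_model(features):
    # Stand-in for the real scoring logic, which isn't shown in the question.
    return 0.0

class PredictionFromFeaturesDoFn(beam.DoFn):
    """Hypothetical sketch: parse one JSON line and emit one NDJSON prediction."""
    def process(self, element):
        features = json.loads(element)
        yield json.dumps({'features': features, 'score': run_model(features)})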
And to execute the script, I run this:
python beam_process.py \
--project=my-project \
--region=us-central1 \
--temp_location=gs://staging/location \
--job_name=beam-process-test \
--max_num_workers=5 \
--runner=DataflowRunner \
--input_path="gs://path/to/file/input-000000000000.jsonl.gz" \
--output_path="gs://path/to/output"
The job in Dataflow runs with no errors, but the output file is completely empty apart from the file name. Running this with the direct runner against local directories, the process runs as expected and I get fully workable output. I've tried different inputs as well as different Cloud Storage buckets. The only thing I can think of is a permissions problem I'm unaware of. I can post the Dataflow job details (or at least what I'm able to see of them) if needed.
EDIT
For the few who end up seeing this: I eventually fixed it, though the reason is still unknown to me. Adding quotes around the entire input flag:
--runner=DataflowRunner \
'--input_path="gs://path/to/file/input-000000000000.jsonl.gz" \'
--output_path="gs://path/to/output"
allows Dataflow to read the input.
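(If you hit something similar, one way to see what the pipeline actually received is to log the resolved flag values before building the pipeline — that would reveal whether the caller, e.g. a shell wrapper or scheduler, mangled the flag or left literal quotes in the value. A small sketch, assuming opt comes from a parse_known_args() setup like the one above:)

import logging

logging.getLogger().setLevel(logging.INFO)
# %r shows stray quote characters or whitespace that %s would hide.
logging.info('input_path=%r, output_path=%r', opt.input_path, opt.output_path)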