Unzipping files with Apache Beam (Python), but all columns are written as rows when using WriteToText

Posted on 2025-01-18 19:29:37

I am very new to programming and Apache Beam, and I am trying to read a large number of zip files from a GCS bucket, unzip them, and save them back to GCS as CSV files.

import apache_beam as beam

with beam.Pipeline() as pipeline:
    readable_files = (
        pipeline
        | beam.io.fileio.MatchFiles('path/file/patter*.zip')
        | beam.io.fileio.ReadMatches()
        | beam.FlatMap(unzip)
        | beam.combiners.ToList())
    files_and_contents = (
        readable_files
        | beam.io.WriteToText('new', file_name_suffix='.csv'))

And I am unzipping the files with this function:

import zipfile

def unzip(readable_file):
    print(readable_file)
    input_zip = zipfile.ZipFile(readable_file.open())
    yield {name: input_zip.read(name) for name in input_zip.namelist()}

I have tested it with only two files, and all the lines were written as columns. Here is an example: the header ends up as one column, and every other line becomes a column as well.

[Image: CSV file saved]
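One likely explanation, sketched without Beam so it can be run locally: the function yields a single dict per archive, and ToList then folds everything into one element, so WriteToText has only one element to write and it lands on one row. The sketch below reproduces that shape with the standard library and contrasts it with a per-line variant. The names make_zip, unzip_whole, and unzip_lines are hypothetical helpers made up for this illustration, and the sample CSV data and the UTF-8 encoding are assumptions.

```python
import io
import zipfile


def make_zip(files):
    """Build an in-memory zip archive from a {name: text} mapping (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, text in files.items():
            archive.writestr(name, text)
    buffer.seek(0)
    return buffer


def unzip_whole(file_obj):
    """Same shape as the unzip function in the question: one dict per archive."""
    with zipfile.ZipFile(file_obj) as archive:
        yield {name: archive.read(name) for name in archive.namelist()}


def unzip_lines(file_obj):
    """Hypothetical variant: yield each CSV line as its own str element."""
    with zipfile.ZipFile(file_obj) as archive:
        for name in archive.namelist():
            # Assumption: the archived files are UTF-8 encoded text.
            for line in archive.read(name).decode("utf-8").splitlines():
                yield line


# Made-up sample data standing in for one zipped CSV file.
sample = {"a.csv": "id,name\n1,foo\n2,bar\n"}

# Roughly what reaches WriteToText after ToList: one element holding everything.
single_element = list(unzip_whole(make_zip(sample)))
as_written = str(single_element)

# What a per-line variant produces: one element per CSV row.
rows = list(unzip_lines(make_zip(sample)))

print(as_written)
print(rows)
```

In the string form of the single element, the newlines inside the bytes are escaped, so the whole file content stays on one line. In the pipeline, the per-line idea would probably translate to something like beam.FlatMap(lambda f: unzip_lines(f.open())) with the beam.combiners.ToList() step dropped, so that WriteToText receives one row per element. Note that this would mix the rows of all archives into the same output shards and repeat each header, so it is a starting point rather than a finished solution.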
