Looping through gzip-compressed files throws "ERROR [Errno 2] No such file or directory: 'part-r-00001.gz'" on the second iteration - python


I am looping through multiple files within an S3 bucket. The first iteration works perfectly fine, but as soon as the loop moves on to the next file I receive "ERROR [Errno 2] No such file or directory: 'part-r-00001.gz'". (part-r-00000.gz was accessed correctly.)

I am not sure why the file is not found, as it is available in the bucket.

This is the code:

import gzip
import logging
import os
import sys
from datetime import datetime, timedelta

import boto3

logger = logging.getLogger(__name__)

BUCKET = 'bucket'
PREFIX = 'path'

now = datetime.utcnow()
today = (now - timedelta(days=2)).strftime('%Y-%m-%d')
folder_of_the_day = PREFIX + today + '/'
logger.info("map folder: %s", folder_of_the_day)

client = boto3.client('s3')
response = client.list_objects_v2(Bucket=BUCKET, Prefix=folder_of_the_day)
for content in response.get('Contents', []):
    bucket_file = os.path.split(content["Key"])[-1]
    if bucket_file.endswith('.gz'):
        logger.info("----- starting with file: %s -----", bucket_file)
        try:
            with gzip.open(bucket_file, mode="rt") as file:
                for line in file:
                    pass  # do something

        except Exception as e:
            logger.error(e)
            logger.critical("Failed to open file!")
            sys.exit(4)

When the second iteration executes, this is the output:

2022-06-18 12:14:48,027 [root] INFO ----- starting with file: part-r-00001.gz -----
2022-06-18 12:14:48,028 [root] ERROR [Errno 2] No such file or directory: 'part-r-00001.gz'

Update

Based on the comment, I updated my code to a proper gzip method, but the error remains: once the first iteration is done, the second file is not found.

This is the updated code:

try:
    with gzip.GzipFile(bucket_file) as gzipfile:
        decompressed_content = gzipfile.read()
        for line in decompressed_content.splitlines():
            # do something
            break


1 Answer

我也只是我 2025-02-15 16:47:44


I think you cannot use gzip.open on the S3 path directly; with a filename argument it opens a local file, not an S3 object.

You may need a proper gzip method to read files from the S3 bucket:

Reading contents of a gzip file from a AWS S3 in Python
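A minimal sketch of that approach, assuming the BUCKET and content["Key"] values from the question's loop (an illustration, not the linked answer's exact code): fetch the object with boto3's get_object and decompress it in memory.

import gzip
import io

import boto3

client = boto3.client('s3')
# Download the object from S3 instead of opening a local path; note that
# get_object needs the full key, not the basename that os.path.split
# produced in the question's loop.
obj = client.get_object(Bucket=BUCKET, Key=content["Key"])
# obj["Body"] is a streaming body; wrap its bytes so gzip can read them
with gzip.open(io.BytesIO(obj["Body"].read()), mode="rt") as file:
    for line in file:
        pass  # do something

Because nothing is looked up on the local filesystem, the [Errno 2] error cannot occur for files that exist in the bucket.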
