How to select files from S3 by wildcard matching

Posted on 2025-02-12 20:21:34

How can I identify and select a specific file on S3 by wildcard-matching its filename?

I want to select a file on S3 by matching its filename against the wildcard pattern DWH_CUST_P665_*.xml, and return the full filename of the matched file. For example:

The file "DWH_CUST_P665_20220515_170922.xml" should be selected, because it matches the wildcard above,
but the file "DWH_PROD_P223_20220607_102314.xml" should be ignored.

I need to return the name of the file that matches the wildcard pattern above. I came up with the following code snippet, but I am struggling to make it work. Can anyone help me get the pattern matching right?

import boto3
import os
import re
import xml.etree.ElementTree as ET

class S3Pull:
    def __init__(self, bucket):
        self.bucket = bucket
        self.client = boto3.client('s3')

    def iterate_bucket(self, wildcard=".*"):
        client = boto3.client('s3')
        paginator = client.get_paginator('list_objects_v2')
        page_iterator = paginator.paginate(Bucket=self.bucket)

        regex = re.compile(wildcard)
        for page in page_iterator:
            if page['KeyCount'] > 0:
                for item in page['Contents']:
                    if re.match(regex, item["Key"]):
                        print(item['Key'])

        return str(item['Key'])

if __name__ == "__main__":
    TYPE = "CUST"
    ID = "P665"
    bucket = "dwh-landing-bucket"
    s3_pull = S3Pull(bucket)
    s3_object = s3_pull.iterate_bucket(f"DWH_{TYPE}_{ID}_*.xml")

    s3 = boto3.client('s3')
    obj = s3.get_object(Bucket=bucket, Key=s3_object)

    tree = ET.parse(obj['Body'])

I can't seem to get the wildcard pattern match to work correctly and return the matching filename.

Any help is appreciated.

Happy to provide more info.
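
For what it's worth, a likely cause is that DWH_CUST_P665_*.xml is a shell-style wildcard, whereas re.compile reads it as a regular expression, in which `_*` means "zero or more underscores" and `.` means "any single character". re.match therefore fails on "DWH_CUST_P665_20220515_170922.xml". Below is a minimal sketch of one way around this, using the standard-library fnmatch module instead of re. The helper names filter_keys and find_matching_keys, the prefix parameter, and the choice to match only the last path component of each key are assumptions made for illustration and are not part of the original snippet; client is expected to be an object such as boto3.client('s3').

```python
import fnmatch
import posixpath


def filter_keys(keys, pattern):
    """Return keys whose file name matches a shell-style wildcard (hypothetical helper)."""
    # Assumption: only the last path component is compared, so a key stored
    # under a folder-like prefix is still found. fnmatchcase is case-sensitive
    # on every platform, unlike fnmatch.fnmatch.
    return [k for k in keys if fnmatch.fnmatchcase(posixpath.basename(k), pattern)]


def find_matching_keys(client, bucket, pattern, prefix=""):
    """List a bucket page by page and collect matching keys (hypothetical helper)."""
    paginator = client.get_paginator("list_objects_v2")
    matches = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page that holds no objects.
        keys = [item["Key"] for item in page.get("Contents", [])]
        matches.extend(filter_keys(keys, pattern))
    return matches


if __name__ == "__main__":
    sample = [
        "DWH_CUST_P665_20220515_170922.xml",
        "DWH_PROD_P223_20220607_102314.xml",
    ]
    # Prints ['DWH_CUST_P665_20220515_170922.xml']
    print(filter_keys(sample, "DWH_CUST_P665_*.xml"))
```

With a real client this could be called as find_matching_keys(boto3.client('s3'), "dwh-landing-bucket", f"DWH_{TYPE}_{ID}_*.xml"), which returns a list that may hold zero, one, or several keys. S3 listing filters by prefix only, not by wildcard, so passing a literal prefix such as "DWH_CUST_" would narrow the listing on the server side before the wildcard is applied locally.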
