Python: process 1 file at a time using boto3

I have a Python script that needs to read and process 100 files. How can my script run while only processing 1 file at a time? My print(len(session_results)) should have 2 separate lengths.
def fetch_endpoint(endpoint: str, date: str) -> str:
    """Read files from s3 bucket and aggregate them via generator function"""
    client = aws_s3_client_auth()
    resource = aws_s3_resource_auth()
    print(f"{endpoint.upper()}/{date}/")
    paginator = client.get_paginator('list_objects_v2')
    operation_parameters: dict[str] = {
        "Bucket": RAW_BUCKET,
        "Prefix": f"{endpoint.upper()}/{date}/"
    }
    keys_to_process: list[str] = []
    page_iterator = paginator.paginate(**operation_parameters, PaginationConfig={'MaxItems': 100})
    for page in page_iterator:
        if page.get("KeyCount") == 0:
            return print(f'No data was found for {endpoint} on {date}')
        else:
            for content in page.get("Contents"):
                # print(content)
                keys_to_process.append(content.get("Key"))
    print(keys_to_process)
    index = 0
    while index < len(keys_to_process):
        print(keys_to_process[index])
        client_events = resource.Object(RAW_BUCKET, key=keys_to_process[index])
        file_content = client_events.get()['Body'].read().decode('utf-8')
        for line in file_content.splitlines():
            data = json.loads(line)
        index += 1
        yield data


def session(endpoint: str, date: str) -> None:
    """Make a request to s3 to read session in the RAW_BUCKET"""
    session_results = [line for line in fetch_endpoint(endpoint, date)]
    print(len(session_results))
    return None
Comments (1)
session_results is a list where each element is something that was yielded by fetch_endpoint(). When you print(len(session_results)), you are telling Python to tell you how many elements are in session_results. I'm not really sure what you mean by "should have 2 separate lengths." It is a list, and it can only have one length at a time.
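If the goal is a separate length for each file, one option is to have the generator yield one file's parsed records as a list rather than yielding individual lines, so the caller can print a length per file. Here is a minimal sketch of that idea; the fetch_endpoint_per_file name is made up for illustration, and aws_s3_client_auth(), aws_s3_resource_auth() and RAW_BUCKET are assumed to be the same helpers and constant used in the question.

import json

def fetch_endpoint_per_file(endpoint: str, date: str):
    """Yield one list of parsed JSON records per S3 object."""
    client = aws_s3_client_auth()      # assumed helper from the question
    resource = aws_s3_resource_auth()  # assumed helper from the question
    paginator = client.get_paginator('list_objects_v2')
    pages = paginator.paginate(
        Bucket=RAW_BUCKET,
        Prefix=f"{endpoint.upper()}/{date}/",
        PaginationConfig={'MaxItems': 100},
    )
    for page in pages:
        for content in page.get("Contents", []):
            key = content["Key"]
            body = resource.Object(RAW_BUCKET, key).get()['Body'].read().decode('utf-8')
            # The whole file is parsed before the next key is touched,
            # so only one file is in flight at a time.
            yield [json.loads(line) for line in body.splitlines()]

def session(endpoint: str, date: str) -> None:
    """Print one length per processed file instead of one combined total."""
    for file_records in fetch_endpoint_per_file(endpoint, date):
        print(len(file_records))  # e.g. two files produce two separate prints

Because each yielded list corresponds to exactly one S3 key, len() now reports per-file counts, whereas collecting every yielded line into a single list (as in the original session()) can only ever produce one total length.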