Reading a huge file in variable-size chunks based on a condition, using Python

Posted on 2025-01-30 12:10:25, 1150 characters, 2 views, 0 comments


I need to read a huge pipe-separated file from S3 with the following content:

Q|A|1|X
78|WQ|
123|ABC
Q|V|5|Y
LK|HJ|
BG|78

I want to read the file in such a way that my data looks like this:

1|Q|A|1|X|
1|78|WQ|
1|123|ABC|
2|Q|V|5|Y|
2|LK|HJ|
2|BG|78|

(Notice the added first column. Each section that starts with 'Q' should get its own ID.)

So far, I am using pandas:

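import io

import boto3
import pandas as pd

s3 = boto3.client('s3')  # bucket and f are assumed to be defined elsewhere
previous_ID = 0
# Note: .read() pulls the whole compressed object into memory before parsing.
# sep=';' keeps each line in a single column (assumes ';' never occurs in the data).
for chunk in pd.read_csv(io.BytesIO(s3.get_object(Bucket=bucket, Key=f)['Body'].read()), sep=';', header=None, compression='gzip', chunksize=1000):
    chunk = chunk.reset_index(drop=True)
    # Split off the first field so section headers ('Q') can be detected.
    chunk[['CLM1', 'data']] = chunk[0].str.split("|", n=1, expand=True)
    chunk = chunk.drop([0], axis=1)
    h_count = chunk[chunk['CLM1'] == 'Q'].shape[0]
    # Give each header row the next ID, continuing from the previous chunk.
    chunk.loc[chunk['CLM1'] == 'Q', 'ID'] = range(previous_ID + 1, previous_ID + h_count + 1)
    # A chunk that starts mid-section inherits the last ID of the previous chunk.
    if pd.isnull(chunk.loc[0, 'ID']) and previous_ID != 0:
        chunk.at[0, 'ID'] = previous_ID
    # Propagate each header's ID down to the rows of its section.
    chunk['ID'] = chunk['ID'].ffill().fillna(0).astype(int)
    previous_ID = int(chunk['ID'].iloc[-1])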

This works fine, but I would like to know from the community whether there is a better and faster way. I don't want to read the whole file into memory, and I am open to solutions that use something other than pandas.
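
For illustration, here is a minimal sketch of the kind of non-pandas, line-by-line approach I have in mind. It is not a tested solution against S3: the helper names `number_sections` and `iter_gzip_lines` are hypothetical, UTF-8 encoding is an assumption, and an in-memory gzip payload stands in for the S3 object body.

```python
import gzip
import io
from typing import Iterable, Iterator


def number_sections(lines: Iterable[str], marker: str = "Q") -> Iterator[str]:
    """Yield each line prefixed with a section ID.

    The ID increments whenever the first pipe-separated field equals `marker`.
    (Hypothetical helper, not part of any library.)
    """
    section_id = 0
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        if line.split("|", 1)[0] == marker:
            section_id += 1
        # Ensure one trailing pipe, matching the desired output format.
        if not line.endswith("|"):
            line += "|"
        yield f"{section_id}|{line}"


def iter_gzip_lines(fileobj, encoding: str = "utf-8") -> Iterator[str]:
    """Lazily decode text lines from a gzip-compressed binary file-like object."""
    with io.TextIOWrapper(gzip.GzipFile(fileobj=fileobj), encoding=encoding) as text:
        yield from text


# Demo: an in-memory gzip payload stands in for the S3 object body.
sample = "Q|A|1|X\n78|WQ|\n123|ABC\nQ|V|5|Y\nLK|HJ|\nBG|78\n"
payload = io.BytesIO(gzip.compress(sample.encode("utf-8")))
result = list(number_sections(iter_gzip_lines(payload)))
print("\n".join(result))
```

Because both helpers are generators, only one line is held at a time and no chunk-boundary bookkeeping is needed. With S3, the `Body` returned by `get_object` is a file-like object with a `read()` method, so passing it to `iter_gzip_lines` in place of `payload` should stream the object rather than load it whole, though I have not verified that here.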
