Reading a huge file in variable-size chunks based on a condition, using Python
I need to read a huge pipe-separated file from S3 with content like this:
Q|A|1|X
78|WQ|
123|ABC
Q|V|5|Y
LK|HJ|
BG|78
I want to read the file in such a way that my data looks like this:
1|Q|A|1|X|
1|78|WQ|
1|123|ABC|
2|Q|V|5|Y|
2|LK|HJ|
2|BG|78|
(Notice the first column that was added. Each section that starts with 'Q' should get its own ID.)
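In other words, the ID is just a counter that increments every time the first pipe-delimited field of a line is 'Q'. As a minimal illustration (lines stands in for any iterable of raw rows):

section_id = 0
for line in lines:
    # a 'Q' in the first field opens a new section
    if line.split('|', 1)[0] == 'Q':
        section_id += 1
    print(f'{section_id}|{line}|')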
So far, I am using pandas:
import io

import boto3
import pandas as pd

s3 = boto3.client('s3')  # assumes credentials are configured; bucket and f are defined elsewhere
previous_ID = 0
# NOTE: ['Body'].read() still pulls the whole object into memory before pandas chunks it.
for chunk in pd.read_csv(io.BytesIO(s3.get_object(Bucket=bucket, Key=f)['Body'].read()),
                         sep=';',  # ';' never occurs in the data, so each line lands in one column
                         header=None, compression='gzip', chunksize=1000):
    chunk = chunk.reset_index(drop=True)
    # Split off the first pipe-delimited field; 'Q' marks the start of a new section.
    chunk[['CLM1', 'data']] = chunk[0].str.split('|', n=1, expand=True)
    chunk = chunk.drop([0], axis=1)
    # Number the 'Q' rows sequentially, continuing from the previous chunk.
    h_count = chunk[chunk['CLM1'] == 'Q'].shape[0]
    chunk.loc[chunk['CLM1'] == 'Q', 'ID'] = range(previous_ID + 1, previous_ID + h_count + 1)
    # If the chunk starts mid-section, carry the ID over from the previous chunk.
    if pd.isnull(chunk.loc[0, 'ID']) and previous_ID != 0:
        chunk.at[0, 'ID'] = previous_ID
    chunk['ID'] = chunk['ID'].ffill().fillna(0).astype(int)
    previous_ID = chunk.iloc[-1]['ID']
This works fine, but I would like to hear from the community whether there is a better and faster way. I don't want to read the whole file into memory, and I am open to solutions that use something other than pandas.
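For example, one pandas-free direction I have been considering is to stream the gzipped object line by line, so only one line sits in memory at a time. This is a rough sketch, assuming boto3 and the same bucket and f as above; process() is a hypothetical placeholder for whatever happens downstream:

import gzip

import boto3

s3 = boto3.client('s3')
body = s3.get_object(Bucket=bucket, Key=f)['Body']  # StreamingBody: nothing is read yet

section_id = 0
# gzip.GzipFile decompresses the stream incrementally as lines are requested.
with gzip.GzipFile(fileobj=body) as gz:
    for raw in gz:
        line = raw.decode('utf-8').rstrip('\r\n')
        if line.split('|', 1)[0] == 'Q':
            section_id += 1
        process(f'{section_id}|{line}|')  # hypothetical downstream handler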