Python DataFrame manipulation: how to quickly extract a set of columns
I need to access and extract information from a DataFrame that is used by other colleagues in my research group.
The DataFrame structure is:
zee.loc[zee['layer']=='EMB2'].loc[zee['roi']==0]
                  e          et     eta        phi   deta      dphi  samp        hash  det  layer  roi  eventNumber
   2249   20.677443   20.675829  0.0125  -1.067651  0.025  0.024544     3  2030015444    2   EMB2    0            2
   2250   21.635288   21.633598  0.0125  -1.043107  0.025  0.024544     3  2030015445    2   EMB2    0            2
   2251  -29.408310  -29.406013  0.0125  -1.018563  0.025  0.024544     3  2030015446    2   EMB2    0            2
   2252   43.127533   43.124165  0.0125  -0.994020  0.025  0.024544     3  2030015447    2   EMB2    0            2
   2253   -3.025344   -3.025108  0.0125  -0.969476  0.025  0.024544     3  2030015448    2   EMB2    0            2
    ...         ...         ...     ...        ...    ...       ...   ...         ...  ...    ...  ...          ...
4968988   -5.825550   -5.309279  0.4375  -0.454058  0.025  0.024544     3  2030019821    2   EMB2    0         3955
4968989   39.750645   36.227871  0.4375  -0.429515  0.025  0.024544     3  2030019822    2   EMB2    0         3955
4968990   80.568573   73.428436  0.4375  -0.404971  0.025  0.024544     3  2030019823    2   EMB2    0         3955
4968991  -28.921751  -26.358652  0.4375  -0.380427  0.025  0.024544     3  2030019824    2   EMB2    0         3955
4968992   55.599472   50.672146  0.4375  -0.355884  0.025  0.024544     3  2030019825    2   EMB2    0         3955
So I only need to work with the layer EMB2 and the columns et, eta, and phi. To pick out these columns, I'm using the following code:
EtEtaPhi, EventLens = [], []
events = set(zee.loc[zee['layer']=='EMB2']['eventNumber'].to_numpy())
roi = set(zee.loc[zee['layer']=='EMB2']['roi'].to_numpy())
for ee in events:
    for rr in roi:
        if len(zee.loc[zee['layer']=='EMB2'].loc[zee['eventNumber']==ee].loc[zee['roi']==rr])==0: break
        EtEtaPhi.append(zee[['et','eta','phi']].loc[zee['layer']=='EMB2'].loc[zee['eventNumber']==ee].loc[zee['roi']==rr].to_numpy())
        EventLens.append(len(EtEtaPhi[-1]))
But reading 4000 events takes a very long time, almost one second per event. That adds up to almost one hour just to extract those columns!
Is there a more efficient way to extract columns from a DataFrame?
Comments (2)
The code which you already have somewhere in there should do what you asked for. The rest is not needed.
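The snippet this comment pointed at was not preserved on this page. As a rough sketch only, and assuming it meant the column selection plus layer filter that already appears inside the question's loop, the idea could look like this. The small `zee` frame below is a made-up stand-in for the real data.

```python
import pandas as pd

# Hypothetical miniature stand-in for the `zee` frame from the question.
zee = pd.DataFrame({
    "et": [20.675829, 21.633598, 5.0, -29.406013],
    "eta": [0.0125, 0.0125, 0.1, 0.0125],
    "phi": [-1.067651, -1.043107, 0.3, -1.018563],
    "layer": ["EMB2", "EMB2", "EMB1", "EMB2"],
    "roi": [0, 0, 0, 1],
    "eventNumber": [2, 2, 2, 3],
})

# One boolean mask, evaluated once over the whole frame,
# instead of re-filtering the frame for every event and roi.
emb2 = zee[["et", "eta", "phi"]].loc[zee["layer"] == "EMB2"]

# All EMB2 rows as a single (n_rows, 3) NumPy array.
et_eta_phi = emb2.to_numpy()
```

The filter runs once here, whereas the loop in the question rebuilds the same masks for every event and roi combination, which is the likely source of the one-second-per-event cost.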
Just use .loc:
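The code that followed this comment is missing from the page, so what comes next is only a guess at what a `.loc`-based version might look like, not the commenter's actual code. It passes a row mask and a column list to a single `.loc` call. The `groupby` step is an addition of mine, under the assumption that one array per (eventNumber, roi) pair is still wanted, as in the question's `EtEtaPhi` and `EventLens` lists. The `zee` frame is again a made-up stand-in.

```python
import pandas as pd

# Hypothetical miniature stand-in for the `zee` frame from the question.
zee = pd.DataFrame({
    "et": [20.675829, 21.633598, 5.0, -29.406013],
    "eta": [0.0125, 0.0125, 0.1, 0.0125],
    "phi": [-1.067651, -1.043107, 0.3, -1.018563],
    "layer": ["EMB2", "EMB2", "EMB1", "EMB2"],
    "roi": [0, 0, 0, 1],
    "eventNumber": [2, 2, 2, 3],
})

cols = ["et", "eta", "phi"]

# Row mask and column list in one .loc call; the grouping keys are kept
# alongside the three columns of interest.
emb2 = zee.loc[zee["layer"] == "EMB2", ["eventNumber", "roi"] + cols]

# Assumption: one array per (eventNumber, roi) pair is wanted.
# groupby only yields combinations that actually occur in the data.
EtEtaPhi = [
    group[cols].to_numpy()
    for _, group in emb2.groupby(["eventNumber", "roi"])
]
EventLens = [len(arr) for arr in EtEtaPhi]
```

Note that this is not a strict drop-in replacement: the original loop stops at the first empty roi for an event (`break`), while `groupby` simply skips combinations with no rows.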