Can I use pd.concat to add new columns computed from other columns in a DataFrame?
I am new to Python and am converting SQL to Python and want to learn the most efficient way to process a large dataset (rows > 1 million and columns > 100). I need to create multiple new columns based on other columns in the DataFrame. I have recently learned how to use pd.concat for new boolean columns, but I also have some non-boolean columns that rely on the values of other columns.
In SQL I would use a single case statement (case when age > 1000 then sample_id else 0 end as custom1, etc...). In Python I can achieve the same result in 2 steps (pd.concat + loc find & replace) as shown below. I have seen references in other posts to using the apply method but have also read in other posts that the apply method can be inefficient.
My question is then, for the code shown below, is there a more efficient way to do this? Can I do it all in one step within the pd.concat (so far I haven't been able to get that to work)? I am okay doing it in 2 steps if necessary. I need to be able to handle large integers (100 billion) in my custom1 element and have decimals in my custom2 element.
And finally, I tried using multiple separate np.where statements but received a warning that my DataFrame was fragmented and that I should try to use concat. So I am not sure which approach overall is most efficient or recommended.
Update - after receiving a comment and an answer pointing me towards use of np.where, I decided to test the approaches. Using a data set with 2.7 million rows and 80 columns, I added 25 new columns. First approach was to use the concat + df.loc replace as shown in this post. Second approach was to use np.where. I ran the test 10 times and np.where was faster in all 10 trials. As noted above, I think repeated use of np.where in this way can cause fragmentation, so I suppose now my decision comes down to faster np.where with potential fragmentation vs. slower use of concat without risk of fragmentation. Any further insight on this final update is appreciated.
import pandas as pd

df = pd.DataFrame({'age': [120, 4000],
                   'weight': [505.31, 29.01],
                   'sample_id': [999999999999, 555555555555]},
                  index=['rock1', 'rock2'])
# step 1: efficiently create starting custom columns using concat
df = pd.concat(
[
df,
(df["age"] > 1000).rename("custom1").astype(int),
(df["weight"] < 100).rename("custom2").astype(float),
],
axis=1,
)
# step 2: assign final values to custom columns based on other column values
df.loc[df.custom1 == 1, 'custom1'] = df['sample_id']
df.loc[df.custom2 == 1, 'custom2'] = df['weight'] / 2
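Regarding the fragmentation trade-off described in the update: one option that keeps np.where's speed but still attaches everything in a single operation is to compute all the new columns up front and add them with one concat. This is a sketch, not a benchmarked recommendation, using just the two example columns from this post:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [120, 4000],
                   'weight': [505.31, 29.01],
                   'sample_id': [999999999999, 555555555555]},
                  index=['rock1', 'rock2'])

# Compute every new column first, then add them all at once.
# A single concat avoids the PerformanceWarning about fragmentation
# that many separate df['new'] = np.where(...) assignments can trigger.
new_cols = pd.DataFrame({
    'custom1': np.where(df['age'] > 1000, df['sample_id'], 0),
    'custom2': np.where(df['weight'] < 100, df['weight'] / 2, 0.0),
}, index=df.index)
df = pd.concat([df, new_cols], axis=1)
```

With 25 custom columns you would build all 25 arrays in the dict before the one concat call.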
Thanks for any feedback you can provide...I appreciate your time helping me.
The standard way to do this is using numpy where:
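A minimal sketch of that approach, reproducing the two CASE-style columns from the question in one step each:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [120, 4000],
                   'weight': [505.31, 29.01],
                   'sample_id': [999999999999, 555555555555]},
                  index=['rock1', 'rock2'])

# Each np.where call mirrors one SQL CASE expression:
# case when age > 1000 then sample_id else 0 end as custom1
df['custom1'] = np.where(df['age'] > 1000, df['sample_id'], 0)
# case when weight < 100 then weight / 2 else 0 end as custom2
df['custom2'] = np.where(df['weight'] < 100, df['weight'] / 2, 0.0)
```

np.where evaluates the condition as a vectorized operation over the whole column, so it handles the large integers in custom1 and the decimals in custom2 without row-by-row work.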