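How do I create a new column whose values are selected based on an existing column?
How do I add a color column to the following dataframe, so that color is set when a condition on an existing column holds and color = 'red' otherwise?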
   Type Set
1     A   Z
2     B   Z
3     B   X
4     C   Y
If you only have two choices to select from, use np.where: it takes a boolean condition, the value to use where the condition holds, and the value to use elsewhere.
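A minimal sketch of the two-choice case. The question's frame is rebuilt by hand, and the rule (green where Set is 'Z', red elsewhere) is an assumption made for illustration:

```python
import numpy as np
import pandas as pd

# Sample frame from the question (index 1..4).
df = pd.DataFrame({"Type": list("ABBC"), "Set": list("ZZXY")}, index=[1, 2, 3, 4])

# Assumed rule: 'green' where Set is 'Z', 'red' everywhere else.
df["color"] = np.where(df["Set"] == "Z", "green", "red")
print(df)
```

np.where evaluates the condition for the whole column at once, so no Python-level loop is involved.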
If you have more than two conditions, use np.select. For example, if you want color to be
yellow when (df['Set'] == 'Z') & (df['Type'] == 'A'),
otherwise blue when (df['Set'] == 'Z') & (df['Type'] == 'B'),
otherwise purple when (df['Type'] == 'B'),
otherwise black,
then pass the conditions and the matching choices to np.select, with black as the default.
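The conditions listed above can be sketched with np.select as follows. The frame is rebuilt by hand from the question; the conditions and colours are the ones named in the answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Type": list("ABBC"), "Set": list("ZZXY")}, index=[1, 2, 3, 4])

# Conditions are checked in order; the first one that is True wins for each row.
conditions = [
    (df["Set"] == "Z") & (df["Type"] == "A"),
    (df["Set"] == "Z") & (df["Type"] == "B"),
    df["Type"] == "B",
]
choices = ["yellow", "blue", "purple"]

# Rows matching none of the conditions receive the default.
df["color"] = np.select(conditions, choices, default="black")
print(df)
```

Because the order matters, the row with Type 'B' and Set 'Z' becomes blue rather than purple.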
A list comprehension is another way to create a column conditionally. If you are working with object dtypes in columns, as in your example, list comprehensions typically outperform most other methods; the original answer backed this with %timeit tests.
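A short sketch of the list-comprehension variant, assuming the same illustrative rule (green where Set is 'Z', red elsewhere). No timings are reproduced here:

```python
import pandas as pd

df = pd.DataFrame({"Type": list("ABBC"), "Set": list("ZZXY")}, index=[1, 2, 3, 4])

# Build the new column as a plain Python list, one value per row of "Set".
df["color"] = ["green" if s == "Z" else "red" for s in df["Set"]]
print(df)
```

The list has one entry per row, so pandas assigns it positionally.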
Another way in which this could be achieved is
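One candidate, assumed here rather than taken from the answer, is Series.map with a small function. The rule (green where Set is 'Z', red elsewhere) is again only illustrative:

```python
import pandas as pd

df = pd.DataFrame({"Type": list("ABBC"), "Set": list("ZZXY")}, index=[1, 2, 3, 4])

# Series.map calls the function once per element of the "Set" column.
df["color"] = df["Set"].map(lambda s: "green" if s == "Z" else "red")
print(df)
```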
The following approach, which combines assign with apply, is slower than the approaches timed above, but it lets us compute the extra column from the contents of more than one column and produce more than two distinct values. It covers both a simple case that uses just the "Set" column and a case with more colours and more columns taken into account.
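A sketch of both cases, using assign together with a row-wise apply. The two rule functions and their colours are hypothetical and chosen only to show the pattern:

```python
import pandas as pd

df = pd.DataFrame({"Type": list("ABBC"), "Set": list("ZZXY")}, index=[1, 2, 3, 4])

def color_from_set(row):
    # Assumed rule that looks at the "Set" column only.
    return "green" if row["Set"] == "Z" else "red"

def color_from_set_and_type(row):
    # Hypothetical rule that looks at two columns and returns three colours.
    if row["Set"] == "Z":
        return "green"
    if row["Type"] == "C":
        return "blue"
    return "red"

# axis=1 passes each row to the function; assign returns a new frame.
simple = df.assign(color=df.apply(color_from_set, axis=1))
detailed = df.assign(color=df.apply(color_from_set_and_type, axis=1))
print(simple)
print(detailed)
```

The row-wise apply runs Python code per row, which is why this route trades speed for flexibility.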
Edit (21/06/2019): Using plydata
It is also possible to use plydata for this kind of operation (though this seems even slower than using assign and apply), with either a simple if_else or a nested if_else.
You can simply use the powerful .loc method with one condition or several, depending on your need (tested with pandas=1.0.5).
Explanation:
First, add a 'color' column and set all values to "red".
Then apply your single condition, or multiple conditions if you want, to overwrite the rows that match.
You can read more about pandas logical operators and conditional selection here:
Logical operators for boolean indexing in Pandas
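The two steps can be sketched like this. The default "red" comes from the answer; the conditions and the other colours are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({"Type": list("ABBC"), "Set": list("ZZXY")}, index=[1, 2, 3, 4])

# Step 1: create the column with the fallback value everywhere.
df["color"] = "red"

# Step 2a: a single condition overwrites the matching rows.
df.loc[df["Set"] == "Z", "color"] = "green"

# Step 2b: several conditions combined with & (each wrapped in parentheses).
df.loc[(df["Set"] == "Z") & (df["Type"] == "B"), "color"] = "purple"
print(df)
```

Later assignments overwrite earlier ones, so the most specific condition goes last.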
Here's yet another way to skin this cat, using a dictionary to map new values onto the keys in the list.
This approach can be very powerful when you have many ifelse-type statements to make (i.e. many unique values to replace).
And of course there is an alternative way to do the same thing, but on my machine that approach is more than three times as slow as the apply approach from above.
You could also do the lookup using dict.get.
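A sketch of the dictionary lookup, once through apply and once through dict.get. The mapping color_by_set and the helper map_values are hypothetical names and values, applied to the question's frame:

```python
import pandas as pd

df = pd.DataFrame({"Type": list("ABBC"), "Set": list("ZZXY")}, index=[1, 2, 3, 4])

# Hypothetical mapping from each "Set" value to a colour.
color_by_set = {"Z": "green", "X": "red", "Y": "red"}

def map_values(key, values_dict):
    # Plain dictionary lookup; raises KeyError for an unmapped key.
    return values_dict[key]

# Variant 1: apply the lookup to every element, passing the dict via args.
df["color"] = df["Set"].apply(map_values, args=(color_by_set,))

# Variant 2: dict.get inside a list comprehension, with a fallback colour.
df["color_get"] = [color_by_set.get(s, "red") for s in df["Set"]]
print(df)
```

The dict.get variant tolerates values missing from the mapping, while the plain lookup fails loudly on them.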
You can use the pandas methods where and mask.
Alternatively, you can use the method transform with a lambda function.
A performance comparison was contributed by @chai.
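A sketch of the three variants side by side, assuming the illustrative rule green where Set is 'Z' and red elsewhere. The column names are made up for the comparison, and no timings are reproduced:

```python
import pandas as pd

df = pd.DataFrame({"Type": list("ABBC"), "Set": list("ZZXY")}, index=[1, 2, 3, 4])
is_z = df["Set"] == "Z"

# where keeps the existing value where the condition is True, else uses `other`.
df["color_where"] = pd.Series("green", index=df.index).where(is_z, other="red")

# mask is the inverse: it replaces the value where the condition is True.
df["color_mask"] = pd.Series("red", index=df.index).mask(is_z, other="green")

# transform with a lambda computes the colour element by element.
df["color_transform"] = df["Set"].transform(lambda s: "green" if s == "Z" else "red")
print(df)
```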
If you have only 2 choices, use np.where().
If you have more than 2 choices, apply() could work. Suppose arr is the input data frame and you want column E to be:
if arr.A == 'a' then arr.B elif arr.A == 'b' then arr.C elif arr.A == 'c' then arr.D else something_else
Applying a row-wise function with that logic yields the new column.
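The rule for column E can be sketched with a row-wise apply. The contents of arr and the fallback SOMETHING_ELSE are assumptions, since the original input is not shown:

```python
import pandas as pd

# Hypothetical input: A selects which of B, C, D supplies the value of E.
arr = pd.DataFrame(
    {"A": list("abc"), "B": range(3), "C": range(3, 6), "D": range(6, 9)}
)
SOMETHING_ELSE = -1  # hypothetical fallback value

def pick_value(row):
    # Mirrors the if / elif / else chain described in the answer.
    if row["A"] == "a":
        return row["B"]
    elif row["A"] == "b":
        return row["C"]
    elif row["A"] == "c":
        return row["D"]
    return SOMETHING_ELSE

arr["E"] = arr.apply(pick_value, axis=1)
print(arr)
```

A named function keeps the chain readable; the same logic can be written as a nested conditional lambda if a single expression is preferred.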
A one-liner with the .apply() method also works; after it runs, the df data frame carries the new column.
The case_when function from pyjanitor is a wrapper around pd.Series.mask and offers a chainable, convenient form for a single condition as well as for multiple conditions.
More examples can be found here
This can be done easily using case_when if you have pandas v2.2.0 (Jan 2024) or later:
https://github.com/pandas-dev/pandas/issues/39154
https://pandas.pydata.org/pandas-docs/version/2.2.0/whatsnew/v2.2.0.html#create-a-pandas-series-based-on-one-or-more-conditions
The release notes include an example.
Here is an easy one-liner with np.select that you can use when you have one or several conditions.
Easy and good to go!
See more here: https://numpy.org/doc/stable/reference/generated/numpy.select.html
If you're working with massive data, a memoized approach would be best. This approach will be fastest when you have many repeated values. My general rule of thumb is to memoize when data_size > 10**4 and n_distinct < data_size/4, e.g. a case with 10,000 rows and 2,500 or fewer distinct values.
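One way to sketch the memoized idea, not necessarily the author's exact code: evaluate the rule once per distinct value, cache the results in a dictionary, and broadcast them with map. The function color_for and its rule are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"Type": list("ABBC"), "Set": list("ZZXY")}, index=[1, 2, 3, 4])

def color_for(set_value):
    # Hypothetical rule; imagine something expensive here.
    return "green" if set_value == "Z" else "red"

# Compute the rule once per distinct value instead of once per row.
lookup = {value: color_for(value) for value in df["Set"].unique()}

# Broadcast the cached results back onto every row.
df["color"] = df["Set"].map(lookup)
print(df)
```

With 10,000 rows and 2,500 distinct values, the rule would run 2,500 times rather than 10,000.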
A less verbose approach using np.select, a small modification of acharuva's answer.