The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

Posted on 2025-01-21 11:15:03

I want to filter my dataframe with an or condition to keep rows with a particular column's values that are outside the range [-0.25, 0.25]. I tried:

df = df[(df['col'] < -0.25) or (df['col'] > 0.25)]

But I get the error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Comments (15)

断肠人 2025-01-28 11:15:04

If you have more than one value:

df['col'].all()

If it’s only a single value:

df['col'].item()
何时共饮酒 2025-01-28 11:15:04

I was getting an error in this command:

if df != '':
    pass

But it worked when I changed it to this:

if df is not '':
    pass
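
A caveat on the snippet above: `is not` compares object identity, so `df is not ''` is true for every DataFrame. It sidesteps the error rather than testing the data, and recent Python versions also warn about using `is not` with a literal. A minimal sketch of an explicit check, assuming the intent was to test whether the DataFrame holds any rows (the sample frames and the helper name `has_rows` are made up for illustration):

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df_full = pd.DataFrame({"col": [1, 2, 3]})
df_empty = pd.DataFrame({"col": []})


def has_rows(frame):
    """Return True when the DataFrame contains at least one row."""
    return not frame.empty


print(has_rows(df_full))
print(has_rows(df_empty))
```

Using `.empty` states the intent directly and returns a plain Python bool, so it is safe inside an if statement.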
沐歌 2025-01-28 11:15:04

You need to use the bitwise operators | instead of or and & instead of and in pandas. You can't simply use Python's Boolean keywords.

For more complex filtering, create a mask and apply it to the DataFrame.
Put all of your conditions in the mask and apply it. Suppose:

mask = (df["col1"]>=df["col2"]) & (df["col1"]<=df["col2"])
df_new = df[mask]
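
As a runnable sketch of the mask approach, assuming a small made-up DataFrame (the column names col1 and col2 come from the snippet above; the values and the second condition are invented):

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 5, 3, 8], "col2": [2, 4, 3, 9]})

# Build each condition as its own Boolean Series.
at_least_col2 = df["col1"] >= df["col2"]
is_small = df["col1"] < 4

# Combine with & (and) and ~ (not); | (or) works the same way.
mask = at_least_col2 & ~is_small
df_new = df[mask]
print(df_new)
```

Naming the intermediate masks keeps long filters readable and avoids nested parentheses.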
寻找一个思念的角度 2025-01-28 11:15:04

I faced the same issue while working with a pandas DataFrame.

I used numpy.logical_and.

Here I am trying to select the rows whose person_id matches 41d7853 and whose degree_type is not Certification.

Like below:

display(df_degrees.loc[np.logical_and(df_degrees['person_id'] == '41d7853' , df_degrees['degree_type'] !='Certification')])

If I try to write code like the below:

display(df_degrees.loc[df_degrees['person_id'] == '41d7853' and df_degrees['degree_type'] !='Certification'])

We will get the error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Using numpy.logical_and worked for me.

ㄖ落Θ余辉 2025-01-28 11:15:04

I'll try to give a benchmark of the three most common approaches (also mentioned above):

from timeit import repeat

setup = """
import numpy as np;
import random;
x = np.linspace(0,100);
lb, ub = np.sort([random.random() * 100, random.random() * 100]).tolist()
"""
stmts = 'x[(x > lb) * (x <= ub)]', 'x[(x > lb) & (x <= ub)]', 'x[np.logical_and(x > lb, x <= ub)]'

for _ in range(3):
    for stmt in stmts:
        t = min(repeat(stmt, setup, number=100_000))
        print('%.4f' % t, stmt)
    print()

Result:

0.4808 x[(x > lb) * (x <= ub)]
0.4726 x[(x > lb) & (x <= ub)]
0.4904 x[np.logical_and(x > lb, x <= ub)]

0.4725 x[(x > lb) * (x <= ub)]
0.4806 x[(x > lb) & (x <= ub)]
0.5002 x[np.logical_and(x > lb, x <= ub)]

0.4781 x[(x > lb) * (x <= ub)]
0.4336 x[(x > lb) & (x <= ub)]
0.4974 x[np.logical_and(x > lb, x <= ub)]

However, * is not supported on a pandas Series, and a NumPy array is much faster than a pandas DataFrame (pandas is around 1000 times slower; compare the numbers):

from timeit import repeat

setup = """
import numpy as np;
import random;
import pandas as pd;
x = pd.DataFrame(np.linspace(0,100));
lb, ub = np.sort([random.random() * 100, random.random() * 100]).tolist()
"""
stmts = 'x[(x > lb) & (x <= ub)]', 'x[np.logical_and(x > lb, x <= ub)]'

for _ in range(3):
    for stmt in stmts:
        t = min(repeat(stmt, setup, number=100))
        print('%.4f' % t, stmt)
    print()

Result:

0.1964 x[(x > lb) & (x <= ub)]
0.1992 x[np.logical_and(x > lb, x <= ub)]

0.2018 x[(x > lb) & (x <= ub)]
0.1838 x[np.logical_and(x > lb, x <= ub)]

0.1871 x[(x > lb) & (x <= ub)]
0.1883 x[np.logical_and(x > lb, x <= ub)]

Note: adding one line of code x = x.to_numpy() will need about 20 µs.

For those who prefer %timeit:

import numpy as np
import pandas as pd
import random
lb, ub = np.sort([random.random() * 100, random.random() * 100]).tolist()
lb, ub
x = pd.DataFrame(np.linspace(0,100))

def asterik(x):
    x = x.to_numpy()
    return x[(x > lb) * (x <= ub)]

def and_symbol(x):
    x = x.to_numpy()
    return x[(x > lb) & (x <= ub)]

def numpy_logical(x):
    x = x.to_numpy()
    return x[np.logical_and(x > lb, x <= ub)]

for i in range(3):
    %timeit asterik(x)
    %timeit and_symbol(x)
    %timeit numpy_logical(x)
    print('\n')

Result:

23 µs ± 3.62 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
35.6 µs ± 9.53 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
31.3 µs ± 8.9 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)


21.4 µs ± 3.35 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
21.9 µs ± 1.02 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
21.7 µs ± 500 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


25.1 µs ± 3.71 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
36.8 µs ± 18.3 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
28.2 µs ± 5.97 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
鯉魚旗 2025-01-28 11:15:04

I encountered the same error and was stuck on a PySpark DataFrame for a few days. I was able to resolve it by filling NA values with 0, since I was comparing integer values from two fields.

因为看清所以看轻 2025-01-28 11:15:04

One minor thing, which wasted my time.

Put the conditions (if comparing using "==" or "!=") in parentheses. Failing to do so also raises this exception.

This will work:

df[(some condition) conditional operator (some conditions)]

This will not:

df[some condition conditional-operator some condition]
薔薇婲 2025-01-28 11:15:04

In my case, the error was raised because of a type mismatch. Make sure the comparison operator is given operands of the same data type.

枯寂 2025-01-28 11:15:04

Another situation where this error may show up is when a pandas cell contains numpy ndarrays and you want to perform comparisons such as >, == etc.

df = pd.DataFrame({'A': [np.array([1, 2]), np.array([3, 1])]})
df['A'] > 2              # <--- ValueError: The truth value of ...

A solution is to convert it into a proper ndarray before performing the same job.

np.stack(df['A']) > 2    # <--- OK

or access values using .str accessor:

df['A'].str[0] > 2       # <--- OK
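
To go from the stacked comparison to a row filter, the 2-D Boolean array still has to be reduced to one value per row. A minimal sketch, assuming every cell holds an array of the same length (np.stack requires this) and that a row should be kept when any of its elements exceeds 2:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [np.array([1, 2]), np.array([3, 1])]})

# Stack the per-cell arrays into one (n_rows, n_items) array.
stacked = np.stack(df["A"].to_list())

# Element-wise comparison, then reduce along each row.
row_mask = (stacked > 2).any(axis=1)
print(df[row_mask])
```

Swapping `.any(axis=1)` for `.all(axis=1)` would keep only rows where every element passes the test.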
美羊羊 2025-01-28 11:15:03

The or and and Python statements require truth-values. For pandas, these are considered ambiguous, so you should use "bitwise" | (or) or & (and) operations:

df = df[(df['col'] < -0.25) | (df['col'] > 0.25)]

These are overloaded for these kinds of data structures to yield the element-wise or or and.
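
A small sketch of what the element-wise operator produces, using made-up values for col:

```python
import pandas as pd

df = pd.DataFrame({"col": [-0.5, -0.1, 0.0, 0.2, 0.3]})

below = df["col"] < -0.25   # Boolean Series, one entry per row
above = df["col"] > 0.25

# | combines the two Series entry by entry instead of asking for one truth value.
outside = below | above
print(outside.tolist())
print(df[outside])
```

The result of `|` is itself a Boolean Series aligned with the DataFrame, which is what indexing with `df[...]` expects.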


Just to add some more explanation to this statement:

The exception is thrown when you want to get the bool of a pandas.Series:

>>> import pandas as pd
>>> x = pd.Series([1])
>>> bool(x)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

You hit a place where the operator implicitly converted the operands to bool (you used or but it also happens for and, if and while):

>>> x or x
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> x and x
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> if x:
...     print('fun')
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
>>> while x:
...     print('fun')
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Besides these four statements, there are several Python functions that hide some bool calls (like any, all, filter, ...). These are normally not problematic with pandas.Series, but for completeness I wanted to mention these.


In your case, the exception isn't really helpful, because it doesn't mention the right alternatives. For and and or, if you want element-wise comparisons, you can use:

  • numpy.logical_or:

    >>> import numpy as np
    >>> np.logical_or(x, y)
    

    or simply the | operator:

    >>> x | y
    
  • numpy.logical_and:

    >>> np.logical_and(x, y)
    

    or simply the & operator:

    >>> x & y
    

If you're using the operators, then be sure to set your parentheses correctly because of operator precedence.

There are several logical NumPy functions which should work on pandas.Series.


The alternatives mentioned in the exception are more suitable if you encountered it when doing if or while. I'll briefly explain each of these:

  • If you want to check if your Series is empty:

    >>> x = pd.Series([])
    >>> x.empty
    True
    >>> x = pd.Series([1])
    >>> x.empty
    False
    

    Python normally interprets the length of containers (like list, tuple, ...) as truth-value if it has no explicit Boolean interpretation. So if you want the Python-like check, you could do: if x.size or if not x.empty instead of if x.

  • If your Series contains one and only one Boolean value:

    >>> x = pd.Series([100])
    >>> (x > 50).bool()
    True
    >>> (x < 50).bool()
    False
    
  • If you want to check the first and only item of your Series (like .bool(), but it works even for non-Boolean contents):

    >>> x = pd.Series([100])
    >>> x.item()
    100
    
  • If you want to check if all or any item is not-zero, not-empty or not-False:

    >>> x = pd.Series([0, 1, 2])
    >>> x.all()   # Because one element is zero
    False
    >>> x.any()   # because one (or more) elements are non-zero
    True
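
Putting these together, here is a hedged sketch of reducing a Series to a single value before an if statement. The helper name `describe` and the sample values are illustrative only; `.bool()` is left out because it has been deprecated in recent pandas releases:

```python
import pandas as pd


def describe(series):
    """Pick an explicit reduction instead of relying on bool(series)."""
    if series.empty:
        return "empty"
    if len(series) == 1:
        # item() returns the single element as a Python scalar.
        return f"single value: {series.item()}"
    if series.all():
        return "all truthy"
    if series.any():
        return "some truthy"
    return "none truthy"


print(describe(pd.Series([], dtype="float64")))
print(describe(pd.Series([100])))
print(describe(pd.Series([0, 1, 2])))
```

Each branch names the question being asked of the Series, which is exactly the information that a bare `if x:` leaves ambiguous.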
    
走野 2025-01-28 11:15:03

pandas uses the bitwise operators & and |. Also, each condition should be wrapped in parentheses ( ).

This works:

data_query = data[(data['year'] >= 2005) & (data['year'] <= 2010)]

But the same query without parentheses does not:

data_query = data[(data['year'] >= 2005 & data['year'] <= 2010)]
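
The reason is operator precedence: & binds more tightly than the comparisons, so the unparenthesized form is read as data['year'] >= (2005 & data['year']) <= 2010. Python then expands that chained comparison with an implicit and, which asks for the truth value of a Series. A sketch with made-up years that shows both outcomes:

```python
import pandas as pd

data = pd.DataFrame({"year": [2003, 2006, 2009, 2012]})

# Parenthesized: each comparison is evaluated first, then combined element-wise.
ok = data[(data["year"] >= 2005) & (data["year"] <= 2010)]
print(ok["year"].tolist())

# Unparenthesized: parsed as a chained comparison, which needs bool(Series).
try:
    data[data["year"] >= 2005 & data["year"] <= 2010]
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```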
三五鸿雁 2025-01-28 11:15:03

For Boolean logic, use & and |.

np.random.seed(0)
df = pd.DataFrame(np.random.randn(5,3), columns=list('ABC'))

>>> df

          A         B         C
0  1.764052  0.400157  0.978738
1  2.240893  1.867558 -0.977278
2  0.950088 -0.151357 -0.103219
3  0.410599  0.144044  1.454274
4  0.761038  0.121675  0.443863

>>> df.loc[(df.C > 0.25) | (df.C < -0.25)]

          A         B         C
0  1.764052  0.400157  0.978738
1  2.240893  1.867558 -0.977278
3  0.410599  0.144044  1.454274
4  0.761038  0.121675  0.443863

To see what is happening, you get a column of Booleans for each comparison, e.g.,

df.C > 0.25

0     True
1    False
2    False
3     True
4     True
Name: C, dtype: bool

When you have multiple criteria, you get multiple Boolean columns back. This is why the join logic is ambiguous. Using and or or treats each column as a whole, so you first need to reduce each column to a single Boolean value, for example by checking whether any value or all values in each column are True.

# Any value in either column is True?
(df.C > 0.25).any() or (df.C < -0.25).any()

True

# All values in either column is True?
(df.C > 0.25).all() or (df.C < -0.25).all()

False

One convoluted way to achieve the same thing is to zip all of these columns together, and perform the appropriate logic.

>>> df[[any([a, b]) for a, b in zip(df.C > 0.25, df.C < -0.25)]]

          A         B         C
0  1.764052  0.400157  0.978738
1  2.240893  1.867558 -0.977278
3  0.410599  0.144044  1.454274
4  0.761038  0.121675  0.443863

For more details, refer to Boolean Indexing in the documentation.

韵柒 2025-01-28 11:15:03

This is quite a common question for beginners when combining multiple conditions in pandas. Generally speaking, there are two possible causes of this error:

Condition 1: Python Operator Precedence

A paragraph in the "Boolean indexing" section of the pandas documentation ("Indexing and selecting data") explains this:

Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses.

By default Python will evaluate an expression such as df['A'] > 2 & df['B'] < 3 as df['A'] > (2 & df['B']) < 3, while the desired evaluation order is (df['A'] > 2) & (df['B'] < 3).

# Wrong
df['col'] < -0.25 | df['col'] > 0.25

# Right
(df['col'] < -0.25) | (df['col'] > 0.25)

There are some possible ways to get rid of the parentheses, which I will cover later.


Condition 2: Improper Operator/Statement

As explained in the quotation above, you need to use | for or, & for and, and ~ for not.

# Wrong
(df['col'] < -0.25) or (df['col'] > 0.25)

# Right
(df['col'] < -0.25) | (df['col'] > 0.25)

Another possible situation is that you are using a Boolean Series in an if statement.

# Wrong
if pd.Series([True, False]):
    pass

The Python if statement expects a single Boolean-like value rather than a pandas Series. You should use pandas.Series.any or one of the other methods listed in the error message to reduce the Series to a single value, according to your needs.

For example:

# Right
if df['col'].eq(0).all():
    # If you want all column values equal to zero
    print('do something')

# Right
if df['col'].eq(0).any():
    # If you want at least one column value equal to zero
    print('do something')

Let's talk about ways to avoid the parentheses in the first situation.

  1. Use Pandas mathematical functions

    pandas defines many methods for binary operations, including the comparison methods lt(), gt(), le(), ge(), ne() and eq().

    As a result, you can use

    df = df[(df['col'] < -0.25) | (df['col'] > 0.25)]
    
    # is equal to
    
    df = df[df['col'].lt(-0.25) | df['col'].gt(0.25)]
    
  2. Use pandas.Series.between()

    If you want to select rows in between two values, you can use pandas.Series.between:

    • df['col'].between(left, right) is equal to
      (left <= df['col']) & (df['col'] <= right);
    • df['col'].between(left, right, inclusive='left') is equal to
      (left <= df['col']) & (df['col'] < right);
    • df['col'].between(left, right, inclusive='right') is equal to
      (left < df['col']) & (df['col'] <= right);
    • df['col'].between(left, right, inclusive='neither') is equal to
      (left < df['col']) & (df['col'] < right).

    df = df[(df['col'] > -0.25) & (df['col'] < 0.25)]
    
    # is equal to
    
    df = df[df['col'].between(-0.25, 0.25, inclusive='neither')]
    
  3. Use pandas.DataFrame.query()

    The pandas documentation referenced earlier has a section, The query() Method, that explains this well.

    pandas.DataFrame.query() can help you select a DataFrame with a condition string. Within the query string, you can use both bitwise operators (& and |) and their boolean cousins (and and or). Moreover, you can omit the parentheses, but I don't recommend it for readability reasons.

    df = df[(df['col'] < -0.25) | (df['col'] > 0.25)]
    
    # is equal to
    
    df = df.query('col < -0.25 or col > 0.25')
    
  4. Use pandas.DataFrame.eval()

    pandas.DataFrame.eval() evaluates a string describing operations on DataFrame columns. Thus, we can use this method to build our multiple conditions. The syntax is the same as for pandas.DataFrame.query().

    df = df[(df['col'] < -0.25) | (df['col'] > 0.25)]
    
    # is equal to
    
    df = df[df.eval('col < -0.25 or col > 0.25')]
    

    pandas.DataFrame.query() and pandas.DataFrame.eval() can do more than I describe here. I recommend reading their documentation and experimenting with them.
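
For the original question (keep rows outside [-0.25, 0.25]), the four approaches can be compared side by side. This is a sketch on made-up data; negating between() with ~ is one way to express "outside the range" and is an addition here, not part of the list above:

```python
import pandas as pd

df = pd.DataFrame({"col": [-0.5, -0.25, 0.0, 0.25, 0.4]})

# Reference: bitwise operator with parentheses.
reference = df[(df["col"] < -0.25) | (df["col"] > 0.25)]

# 1. Comparison methods: method calls need no extra parentheses.
via_methods = df[df["col"].lt(-0.25) | df["col"].gt(0.25)]

# 2. between() is inclusive on both ends by default, so negating it
#    keeps only values strictly outside the closed interval.
via_between = df[~df["col"].between(-0.25, 0.25)]

# 3. query() accepts the keyword `or` inside the expression string.
via_query = df.query("col < -0.25 or col > 0.25")

# 4. eval() returns the Boolean mask, which is then used for indexing.
via_eval = df[df.eval("col < -0.25 or col > 0.25")]

for result in (via_methods, via_between, via_query, via_eval):
    print(result.equals(reference))
```

All four select the same rows; which one reads best depends mostly on how many conditions are being combined.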

彩虹直至黑白 2025-01-28 11:15:03

Or, alternatively, you could use the operator module. More detailed information is in the Python documentation:

import operator
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randn(5,3), columns=list('ABC'))
df.loc[operator.or_(df.C > 0.25, df.C < -0.25)]

          A         B         C
0  1.764052  0.400157  0.978738
1  2.240893  1.867558 -0.977278
3  0.410599  0.144044  1.454274
4  0.761038  0.121675  0.443863
蓝梦月影 2025-01-28 11:15:03

This excellent answer explains very well what is happening and provides a solution. I would like to add another solution that might be suitable in similar cases: using the query method:

df = df.query("(col > 0.25) or (col < -0.25)")

See also Indexing and selecting data.

(Some tests with a dataframe I'm currently working with suggest that this method is a bit slower than using the bitwise operators on series of Booleans: 2 ms vs. 870 µs)

A word of warning: one situation where this is not straightforward is when column names happen to look like Python expressions. I had columns named WT_38hph_IP_2, WT_38hph_input_2 and log2(WT_38hph_IP_2/WT_38hph_input_2) and wanted to perform the following query: "(log2(WT_38hph_IP_2/WT_38hph_input_2) > 1) and (WT_38hph_IP_2 > 20)"

I obtained the following exception cascade:

  • KeyError: 'log2'
  • UndefinedVariableError: name 'log2' is not defined
  • ValueError: "log2" is not a supported function

I guess this happened because the query parser was trying to make something from the first two columns instead of identifying the expression with the name of the third column.

A possible workaround is proposed here.
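
One way to sidestep the parser entirely, offered as a sketch rather than as the workaround linked above: index the awkwardly named column with ordinary bracket syntax and build the mask with bitwise operators. The values below are invented; the column names come from the example above.

```python
import numpy as np
import pandas as pd

ratio_col = "log2(WT_38hph_IP_2/WT_38hph_input_2)"

df = pd.DataFrame({
    "WT_38hph_IP_2": [10.0, 40.0, 80.0],
    "WT_38hph_input_2": [10.0, 10.0, 80.0],
})
df[ratio_col] = np.log2(df["WT_38hph_IP_2"] / df["WT_38hph_input_2"])

# Bracket indexing treats the name as a plain string, so nothing is parsed.
selected = df[(df[ratio_col] > 1) & (df["WT_38hph_IP_2"] > 20)]
print(selected)
```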
