Deleting DataFrame rows in pandas based on a column value

Posted on 2025-02-05 17:16:17

I have the following DataFrame:

             daysago  line_race rating        rw    wrating
 line_date                                                 
2007-03-31       62         11     56  1.000000  56.000000
2007-03-10       83         11     67  1.000000  67.000000
2007-02-10      111          9     66  1.000000  66.000000
2007-01-13      139         10     83  0.880678  73.096278
2006-12-23      160         10     88  0.793033  69.786942
2006-11-09      204          9     52  0.636655  33.106077
2006-10-22      222          8     66  0.581946  38.408408
2006-09-29      245          9     70  0.518825  36.317752
2006-09-16      258         11     68  0.486226  33.063381
2006-08-30      275          8     72  0.446667  32.160051
2006-02-11      475          5     65  0.164591  10.698423
2006-01-13      504          0     70  0.142409   9.968634
2006-01-02      515          0     64  0.134800   8.627219
2005-12-06      542          0     70  0.117803   8.246238
2005-11-29      549          0     70  0.113758   7.963072
2005-11-22      556          0     -1  0.109852  -0.109852
2005-11-01      577          0     -1  0.098919  -0.098919
2005-10-20      589          0     -1  0.093168  -0.093168
2005-09-27      612          0     -1  0.083063  -0.083063
2005-09-07      632          0     -1  0.075171  -0.075171
2005-06-12      719          0     69  0.048690   3.359623
2005-05-29      733          0     -1  0.045404  -0.045404
2005-05-02      760          0     -1  0.039679  -0.039679
2005-04-02      790          0     -1  0.034160  -0.034160
2005-03-13      810          0     -1  0.030915  -0.030915
2004-11-09      934          0     -1  0.016647  -0.016647

I need to remove the rows where line_race is equal to 0. What's the most efficient way to do this?

Comments (20)

烟酉 2025-02-12 17:16:17

If I'm understanding correctly, it should be as simple as:

df = df[df.line_race != 0]
江心雾 2025-02-12 17:16:17

For any future passers-by, it is worth mentioning that df = df[df.line_race != 0] does not remove anything when you are trying to filter out None/missing values.

Does work:

df = df[df.line_race != 0]

Doesn't do anything:

df = df[df.line_race != None]

Does work:

df = df[df.line_race.notnull()]
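
To see the difference on a small frame, a sketch along these lines may help. The values are made up for illustration (not the OP's data), and it relies on pandas treating an elementwise `!=` comparison against None as True everywhere, while notnull() actually checks for missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one missing value and one zero in line_race
df = pd.DataFrame({"line_race": [11.0, np.nan, 0.0]})

kept_ne_none = df[df.line_race != None]    # True for every row, so nothing is removed
kept_notnull = df[df.line_race.notnull()]  # removes only the missing row
kept_nonzero = df[df.line_race != 0]       # removes only the zero row (NaN != 0 is True)

print(len(kept_ne_none), len(kept_notnull), len(kept_nonzero))
```

In other words, zeros and missing values need separate conditions if both should go.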
独自←快乐 2025-02-12 17:16:17

Just to add another solution, particularly useful if you are using the new pandas accessors: the other solutions replace the original DataFrame object and lose the accessors.

df.drop(df.loc[df['line_race']==0].index, inplace=True)
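
As a rough illustration of the "same object" point, here is a minimal sketch on a made-up frame. It only checks object identity and does not exercise any custom accessor; the names alias and filtered are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"line_race": [11, 0, 9]})
alias = df  # a second reference to the same DataFrame object

# In-place drop: the existing object is modified
df.drop(df.loc[df["line_race"] == 0].index, inplace=True)
same_object = alias is df   # still the very same object
rows_left = len(alias)      # the alias sees the removed row as well

# Boolean filtering, by contrast, builds a new DataFrame
filtered = df[df["line_race"] != 9]
is_new_object = filtered is not df
```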
假扮的天使 2025-02-12 17:16:17

In case of multiple values and str dtype

I used the following to filter out given values in a col:

def filter_rows_by_values(df, col, values):
    return df[~df[col].isin(values)]

Example:

In a DataFrame I want to remove rows which have values "b" and "c" in column "str"

df = pd.DataFrame({"str": ["a","a","a","a","b","b","c"], "other": [1,2,3,4,5,6,7]})
df
   str  other
0   a   1
1   a   2
2   a   3
3   a   4
4   b   5
5   b   6
6   c   7

filter_rows_by_values(df, "str", ["b","c"])

   str  other
0   a   1
1   a   2
2   a   3
3   a   4
迟到的我 2025-02-12 17:16:17

If you want to delete rows based on multiple values of the column, you could use:

df[(df.line_race != 0) & (df.line_race != 10)]

To drop all rows with values 0 and 10 for line_race.

北斗星光 2025-02-12 17:16:17

Though the previous answers are similar to what I am going to do, indexing df.index directly does not require another indexing method such as .loc. It can be done in a similar but more concise manner:

df.drop(df.index[df['line_race'] == 0], inplace = True)
浮华 2025-02-12 17:16:17

The best way to do this is with boolean masking:

In [56]: df
Out[56]:
     line_date  daysago  line_race  rating    raw  wrating
0   2007-03-31       62         11      56  1.000   56.000
1   2007-03-10       83         11      67  1.000   67.000
2   2007-02-10      111          9      66  1.000   66.000
3   2007-01-13      139         10      83  0.881   73.096
4   2006-12-23      160         10      88  0.793   69.787
5   2006-11-09      204          9      52  0.637   33.106
6   2006-10-22      222          8      66  0.582   38.408
7   2006-09-29      245          9      70  0.519   36.318
8   2006-09-16      258         11      68  0.486   33.063
9   2006-08-30      275          8      72  0.447   32.160
10  2006-02-11      475          5      65  0.165   10.698
11  2006-01-13      504          0      70  0.142    9.969
12  2006-01-02      515          0      64  0.135    8.627
13  2005-12-06      542          0      70  0.118    8.246
14  2005-11-29      549          0      70  0.114    7.963
15  2005-11-22      556          0      -1  0.110   -0.110
16  2005-11-01      577          0      -1  0.099   -0.099
17  2005-10-20      589          0      -1  0.093   -0.093
18  2005-09-27      612          0      -1  0.083   -0.083
19  2005-09-07      632          0      -1  0.075   -0.075
20  2005-06-12      719          0      69  0.049    3.360
21  2005-05-29      733          0      -1  0.045   -0.045
22  2005-05-02      760          0      -1  0.040   -0.040
23  2005-04-02      790          0      -1  0.034   -0.034
24  2005-03-13      810          0      -1  0.031   -0.031
25  2004-11-09      934          0      -1  0.017   -0.017

In [57]: df[df.line_race != 0]
Out[57]:
     line_date  daysago  line_race  rating    raw  wrating
0   2007-03-31       62         11      56  1.000   56.000
1   2007-03-10       83         11      67  1.000   67.000
2   2007-02-10      111          9      66  1.000   66.000
3   2007-01-13      139         10      83  0.881   73.096
4   2006-12-23      160         10      88  0.793   69.787
5   2006-11-09      204          9      52  0.637   33.106
6   2006-10-22      222          8      66  0.582   38.408
7   2006-09-29      245          9      70  0.519   36.318
8   2006-09-16      258         11      68  0.486   33.063
9   2006-08-30      275          8      72  0.447   32.160
10  2006-02-11      475          5      65  0.165   10.698

UPDATE: Now that pandas 0.13 is out, another way to do this is df.query('line_race != 0').
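
A quick way to convince yourself that the two spellings select the same rows is a sketch like the following. It uses three rows taken from the table above; the variable names masked and queried are just for illustration:

```python
import pandas as pd

df = pd.DataFrame({"daysago": [62, 475, 504], "line_race": [11, 5, 0]})

masked = df[df.line_race != 0]        # boolean masking
queried = df.query("line_race != 0")  # same condition as a query string

print(masked.equals(queried))
```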

π浅易 2025-02-12 17:16:17

The given answer is correct. Nonetheless, as someone above said, you can use df.query('line_race != 0'), which, depending on your problem, can be much faster. Highly recommended.

别念他 2025-02-12 17:16:17

There are various ways to achieve that. Below are several options one can use, depending on the specifics of one's use case.

Assume that OP's dataframe is stored in the variable df.


Option 1

For OP's case, considering that the only column with values 0 is the line_race, the following will do the work

 df_new = df[df != 0].dropna()
 
[Out]:
     line_date  daysago  line_race  rating        rw    wrating
0   2007-03-31       62       11.0      56  1.000000  56.000000
1   2007-03-10       83       11.0      67  1.000000  67.000000
2   2007-02-10      111        9.0      66  1.000000  66.000000
3   2007-01-13      139       10.0      83  0.880678  73.096278
4   2006-12-23      160       10.0      88  0.793033  69.786942
5   2006-11-09      204        9.0      52  0.636655  33.106077
6   2006-10-22      222        8.0      66  0.581946  38.408408
7   2006-09-29      245        9.0      70  0.518825  36.317752
8   2006-09-16      258       11.0      68  0.486226  33.063381
9   2006-08-30      275        8.0      72  0.446667  32.160051
10  2006-02-11      475        5.0      65  0.164591  10.698423

However, as that is not always the case, I would recommend checking the following options, where one specifies the column name.


Option 2

tshauck's approach ends up being better than Option 1, because one is able to specify the column. There are, however, additional variations depending on how one wants to refer to the column:

For example, using the position in the dataframe

df_new = df[df[df.columns[2]] != 0]

Or by explicitly indicating the column as follows

df_new = df[df['line_race'] != 0]

One can also follow the same logic but using a custom lambda function, such as

df_new = df[df.apply(lambda x: x['line_race'] != 0, axis=1)]

[Out]:
     line_date  daysago  line_race  rating        rw    wrating
0   2007-03-31       62       11.0      56  1.000000  56.000000
1   2007-03-10       83       11.0      67  1.000000  67.000000
2   2007-02-10      111        9.0      66  1.000000  66.000000
3   2007-01-13      139       10.0      83  0.880678  73.096278
4   2006-12-23      160       10.0      88  0.793033  69.786942
5   2006-11-09      204        9.0      52  0.636655  33.106077
6   2006-10-22      222        8.0      66  0.581946  38.408408
7   2006-09-29      245        9.0      70  0.518825  36.317752
8   2006-09-16      258       11.0      68  0.486226  33.063381
9   2006-08-30      275        8.0      72  0.446667  32.160051
10  2006-02-11      475        5.0      65  0.164591  10.698423

Option 3

Using pandas.Series.map and a custom lambda function

df_new = df[df['line_race'].map(lambda x: x != 0)]

[Out]:
     line_date  daysago  line_race  rating        rw    wrating
0   2007-03-31       62       11.0      56  1.000000  56.000000
1   2007-03-10       83       11.0      67  1.000000  67.000000
2   2007-02-10      111        9.0      66  1.000000  66.000000
3   2007-01-13      139       10.0      83  0.880678  73.096278
4   2006-12-23      160       10.0      88  0.793033  69.786942
5   2006-11-09      204        9.0      52  0.636655  33.106077
6   2006-10-22      222        8.0      66  0.581946  38.408408
7   2006-09-29      245        9.0      70  0.518825  36.317752
8   2006-09-16      258       11.0      68  0.486226  33.063381
9   2006-08-30      275        8.0      72  0.446667  32.160051
10  2006-02-11      475        5.0      65  0.164591  10.698423

Option 4

Using pandas.DataFrame.drop as follows

df_new = df.drop(df[df['line_race'] == 0].index)

[Out]:
     line_date  daysago  line_race  rating        rw    wrating
0   2007-03-31       62       11.0      56  1.000000  56.000000
1   2007-03-10       83       11.0      67  1.000000  67.000000
2   2007-02-10      111        9.0      66  1.000000  66.000000
3   2007-01-13      139       10.0      83  0.880678  73.096278
4   2006-12-23      160       10.0      88  0.793033  69.786942
5   2006-11-09      204        9.0      52  0.636655  33.106077
6   2006-10-22      222        8.0      66  0.581946  38.408408
7   2006-09-29      245        9.0      70  0.518825  36.317752
8   2006-09-16      258       11.0      68  0.486226  33.063381
9   2006-08-30      275        8.0      72  0.446667  32.160051
10  2006-02-11      475        5.0      65  0.164591  10.698423

Option 5

Using pandas.DataFrame.query as follows

df_new = df.query('line_race != 0')

[Out]:
     line_date  daysago  line_race  rating        rw    wrating
0   2007-03-31       62       11.0      56  1.000000  56.000000
1   2007-03-10       83       11.0      67  1.000000  67.000000
2   2007-02-10      111        9.0      66  1.000000  66.000000
3   2007-01-13      139       10.0      83  0.880678  73.096278
4   2006-12-23      160       10.0      88  0.793033  69.786942
5   2006-11-09      204        9.0      52  0.636655  33.106077
6   2006-10-22      222        8.0      66  0.581946  38.408408
7   2006-09-29      245        9.0      70  0.518825  36.317752
8   2006-09-16      258       11.0      68  0.486226  33.063381
9   2006-08-30      275        8.0      72  0.446667  32.160051
10  2006-02-11      475        5.0      65  0.164591  10.698423

Option 6

Using pandas.DataFrame.drop and pandas.DataFrame.query as follows

df_new = df.drop(df.query('line_race == 0').index)

[Out]:
     line_date  daysago  line_race  rating        rw    wrating
0   2007-03-31       62       11.0      56  1.000000  56.000000
1   2007-03-10       83       11.0      67  1.000000  67.000000
2   2007-02-10      111        9.0      66  1.000000  66.000000
3   2007-01-13      139       10.0      83  0.880678  73.096278
4   2006-12-23      160       10.0      88  0.793033  69.786942
5   2006-11-09      204        9.0      52  0.636655  33.106077
6   2006-10-22      222        8.0      66  0.581946  38.408408
7   2006-09-29      245        9.0      70  0.518825  36.317752
8   2006-09-16      258       11.0      68  0.486226  33.063381
9   2006-08-30      275        8.0      72  0.446667  32.160051
10  2006-02-11      475        5.0      65  0.164591  10.698423

Option 7

If one doesn't have strong opinions on the output, one can use a vectorized approach with numpy.select

df_new = np.select([df != 0], [df], default=np.nan)

[Out]:
[['2007-03-31' 62 11.0 56 1.0 56.0]
 ['2007-03-10' 83 11.0 67 1.0 67.0]
 ['2007-02-10' 111 9.0 66 1.0 66.0]
 ['2007-01-13' 139 10.0 83 0.880678 73.096278]
 ['2006-12-23' 160 10.0 88 0.793033 69.786942]
 ['2006-11-09' 204 9.0 52 0.636655 33.106077]
 ['2006-10-22' 222 8.0 66 0.581946 38.408408]
 ['2006-09-29' 245 9.0 70 0.518825 36.317752]
 ['2006-09-16' 258 11.0 68 0.486226 33.063381]
 ['2006-08-30' 275 8.0 72 0.446667 32.160051]
 ['2006-02-11' 475 5.0 65 0.164591 10.698423]]

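np.select only replaces the zeros with NaN, so the rows containing NaN still have to be dropped. This can be done while converting back to a dataframe with

df_new = pd.DataFrame(df_new, columns=df.columns).dropna()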

[Out]:
     line_date daysago line_race rating        rw    wrating
0   2007-03-31      62      11.0     56       1.0       56.0
1   2007-03-10      83      11.0     67       1.0       67.0
2   2007-02-10     111       9.0     66       1.0       66.0
3   2007-01-13     139      10.0     83  0.880678  73.096278
4   2006-12-23     160      10.0     88  0.793033  69.786942
5   2006-11-09     204       9.0     52  0.636655  33.106077
6   2006-10-22     222       8.0     66  0.581946  38.408408
7   2006-09-29     245       9.0     70  0.518825  36.317752
8   2006-09-16     258      11.0     68  0.486226  33.063381
9   2006-08-30     275       8.0     72  0.446667  32.160051
10  2006-02-11     475       5.0     65  0.164591  10.698423

As for the most efficient solution, that depends on how one wants to measure efficiency. Assuming one wants to measure execution time, one way to do it is with time.perf_counter().

If one measures the time of execution for all the options above, one gets the following

       method                   time
0    Option 1 0.00000110000837594271
1  Option 2.1 0.00000139995245262980
2  Option 2.2 0.00000369996996596456
3  Option 2.3 0.00000160001218318939
4    Option 3 0.00000110000837594271
5    Option 4 0.00000120000913739204
6    Option 5 0.00000140001066029072
7    Option 6 0.00000159995397552848
8    Option 7 0.00000150001142174006

However, this might change depending on the dataframe one uses, on the requirements (such as hardware), and more.
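
For anyone who wants to repeat such a measurement, a minimal timing sketch could look like the following. The frame size, the repeat count, and the helper name time_it are all assumptions made for illustration, and the numbers it prints will differ from the table above:

```python
import time
import pandas as pd

# Made-up frame; size and values are assumptions for illustration only
df = pd.DataFrame({"line_race": [0, 5, 11, 0, 9] * 20_000})

def time_it(fn, repeats=5):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

candidates = {
    "boolean mask": lambda: df[df["line_race"] != 0],
    "drop by index": lambda: df.drop(df[df["line_race"] == 0].index),
    "query": lambda: df.query("line_race != 0"),
}

timings = {name: time_it(fn) for name, fn in candidates.items()}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.6f} s")
```

Taking the best of several runs reduces noise from one-off effects such as the first call to query() setting up its expression engine.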


Notes:

我的影子我的梦 2025-02-12 17:16:17

One efficient and idiomatic pandas way is to use the eq() method:

df[~df.line_race.eq(0)]
请恋爱 2025-02-12 17:16:17

Another way of doing it. It may not be the most efficient way, as the code looks a bit more complex than the code in other answers, but it is still an alternative way of doing the same thing.

  df = df.drop(df[df['line_race']==0].index)
疯狂的代价 2025-02-12 17:16:17

I ran this code and it works. You can try it yourself.

data = pd.read_excel('file.xlsx')

If the column name contains a special character or a space, access it with a quoted string in brackets, as in the given code:

data = data[data['expire/t'].notnull()]
print(data)

If the column name is a single word without any spaces or special
characters, you can access it directly as an attribute.

data = data[data.expire != 0]
print(data)
瀟灑尐姊 2025-02-12 17:16:17

So many options have been provided (or maybe I didn't pay enough attention, sorry if that's the case), but no one mentioned this:
we can use the ~ operator in pandas, which gives us the inverse of a condition. Because ~ binds more tightly than ==, the condition has to be wrapped in parentheses:

df = df[~(df["line_race"] == 0)]
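
A note on operator precedence when using ~: in Python, unary ~ is applied before ==, so the comparison has to be parenthesized before it is inverted. A minimal sketch with made-up values shows what happens otherwise:

```python
import pandas as pd

s = pd.Series([0, 5, -1], name="line_race")

unparenthesized = ~s == 0   # parsed as (~s) == 0: bitwise NOT of the integers first
parenthesized = ~(s == 0)   # inverts the boolean condition, as intended

print(unparenthesized.tolist())
print(parenthesized.tolist())
```

Without the parentheses the mask is only True where the value is -1 (since ~(-1) is 0), which is not the "not equal to 0" filter the question asks for.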
野生奥特曼 2025-02-12 17:16:17

Just adding another way, extended over all columns of the DataFrame:

for column in df.columns:
   df = df[df[column]!=0]

Example:

def z_score(data,count):
   threshold=3
   for column in data.columns:
       mean = np.mean(data[column])
       std = np.std(data[column])
       for i in data[column]:
           zscore = (i-mean)/std
           if(np.abs(zscore)>threshold):
               count=count+1
               data = data[data[column]!=i]
   return data,count
情栀口红 2025-02-12 17:16:17

Just in case you need to delete a row when the value can appear in any of several columns.
In my case I was working with percentages, so I wanted to delete the rows that have a value of 1 in any column, since that means 100%.

for x in df:
    df.drop(df.loc[df[x]==1].index, inplace=True)

This is not optimal if your df has many columns.
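
If the per-column loop becomes a bottleneck, one possible alternative is to build a single row mask across all columns. This is only a sketch on made-up percentages, and the names has_one and result are hypothetical:

```python
import pandas as pd

# Hypothetical percentages; rows 1 and 2 each contain a 1 in some column
df = pd.DataFrame({"a": [0.2, 1.0, 0.5],
                   "b": [0.3, 0.4, 1.0],
                   "c": [0.1, 0.9, 0.7]})

has_one = df.eq(1).any(axis=1)  # True for rows with a 1 in any column
result = df[~has_one]           # keep only the rows without a 1

print(result)
```

Unlike the loop, this does not modify df in place, so assign the result back if that is what you need.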

2025-02-12 17:16:17

You can try using this:

df.drop(df[df.line_race == 0].index, inplace = True)

已下线请稍等 2025-02-12 17:16:17

If you need to remove rows based on index values, the boolean indexing in the top answer may be adapted as well. For example, in the following code, rows where the index is between 3 and 7 are removed.

df = pd.DataFrame({'A': range(10), 'B': range(50,60)})

x = df[(df.index < 3) | (df.index > 7)]
# or equivalently
y = df[~((df.index >= 3) & (df.index <= 7))]

# or using query
z = df.query("~(3 <= index <= 7)")


# if the index has a name (as in the OP), use the name
# to select rows in 2007:
df.query("line_date.dt.year == 2007")

As others have mentioned, query() is a very readable function that is perfect for this task. In fact, for large dataframes, it is the fastest method for this task (see this answer for benchmark results).

Some common questions with query():

  1. For column names with a space, use backticks.
    df = pd.DataFrame({'col A': [0, 1, 2, 0], 'col B': ['a', 'b', 'cd', 'e']})
    
    # wrap a column name with space by backticks
    x = df.query('`col A` != 0')
    
  2. To refer to variables in the local environment, prefix it with an @.
    to_exclude = [0, 2]
    y = df.query('`col A` != @to_exclude')
    
  3. Can call Series methods as well.
    # remove rows where the length of the string in column B is not 1
    z = df.query("`col B`.str.len() == 1")
    
不甘平庸 2025-02-12 17:16:17

There are several answers in this thread involving the index, and most of those answers will not work if the index has duplicates. And yes, that has been pointed out in at least one of the comments above, and it has also been pointed out that re-indexing is a way around this issue. Here is an example with a repeated index to illustrate the issue.

df = pd.DataFrame(data=[(1,'A'), (0,'B'), (1,'C')], index=[1,2,2],
                  columns=['line_race','C2'])
print("Original with a duplicate index entry:")
print(df)

df = pd.DataFrame(data=[(1,'A'), (0,'B'), (1,'C')], index=[1,2,2],
                  columns=['line_race','C2'])
df.drop(df[df.line_race == 0].index, inplace = True)
print("\nIncorrect rows removed:")
print(df)

df = pd.DataFrame(data=[(1,'A'), (0,'B'), (1,'C')], index=[1,2,2],
                  columns=['line_race','C2'])
df.reset_index(drop=False, inplace=True)
df.drop(df[df.line_race == 0].index, inplace = True)
df.set_index('index', drop=True, inplace=True)
df.index.name = None
print("\nCorrect row removed:")
print(df)

This is the output:

Original with a duplicate index entry:
   line_race C2
1          1  A
2          0  B
2          1  C

Incorrect rows removed:
   line_race C2
1          1  A

Correct row removed:
   line_race C2
1          1  A
2          1  C
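
For comparison, boolean masking selects rows by position rather than by index label, so it should not be affected by duplicate index entries. A short sketch reusing the same example frame:

```python
import pandas as pd

df = pd.DataFrame(data=[(1, 'A'), (0, 'B'), (1, 'C')], index=[1, 2, 2],
                  columns=['line_race', 'C2'])

# The mask is aligned row by row, so only the row with line_race == 0 goes
kept = df[df.line_race != 0]
print(kept)
```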
や莫失莫忘 2025-02-12 17:16:17

Using .loc instead of .drop, you could use:

df = df.loc[df['line_race']!=0]
千年*琉璃梦 2025-02-12 17:16:17

It doesn't make much difference for a simple example like this, but for complicated logic I prefer to use drop() when deleting rows, because it is more straightforward than inverting the logic. For example, delete rows where A=1 AND (B=2 OR C=3).

Here's a scalable syntax that is easy to understand and can handle complicated logic:

df.drop( df.query(" `line_race` == 0 ").index)
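
The A/B/C case mentioned above could be sketched as follows. The frame and its values are made up, the index is assumed to be unique (drop works on labels), and the inverse-logic mask is included only to show that both give the same rows:

```python
import pandas as pd

# Hypothetical frame: rows 0 and 1 satisfy A == 1 and (B == 2 or C == 3)
df = pd.DataFrame({"A": [1, 1, 1, 2],
                   "B": [2, 0, 0, 2],
                   "C": [0, 3, 0, 3]})

# State the rows to delete directly, then drop them by index label
to_delete = df.query("A == 1 and (B == 2 or C == 3)").index
dropped = df.drop(to_delete)

# Equivalent boolean mask, which needs the condition inverted
inverse = df[~((df.A == 1) & ((df.B == 2) | (df.C == 3)))]

print(dropped)
```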