How can I pivot a dataframe?

Posted on 2025-02-08 03:57:28

  • What is pivot?
  • How do I pivot?
  • Long format to wide format?

I've seen a lot of questions that ask about pivot tables, even if they don't know it. It is virtually impossible to write a canonical question and answer that encompasses all aspects of pivoting... But I'm going to give it a go.


The problem with existing questions and answers is that often the question is focused on a nuance that the OP has trouble generalizing in order to use a number of the existing good answers. However, none of the answers attempt to give a comprehensive explanation (because it's a daunting task). Look at a few examples from my Google search:

  1. How to pivot a dataframe in Pandas? - Good question and answer. But the answer only answers the specific question with little explanation.
  2. pandas pivot table to data frame - OP is concerned with the output of the pivot, namely how the columns look. OP wanted it to look like R. This isn't very helpful for pandas users.
  3. pandas pivoting a dataframe, duplicate rows - Another decent question but the answer focuses on one method, namely pd.DataFrame.pivot

Setup

I conspicuously named my columns and relevant column values to correspond with how I'm going to pivot in the answers below.

import numpy as np
import pandas as pd
from numpy.core.defchararray import add

np.random.seed([3,1415])
n = 20

cols = np.array(['key', 'row', 'item', 'col'])
arr1 = (np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str)

df = pd.DataFrame(
    add(cols, arr1), columns=cols
).join(
    pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val')
)
print(df)
     key   row   item   col  val0  val1
0   key0  row3  item1  col3  0.81  0.04
1   key1  row2  item1  col2  0.44  0.07
2   key1  row0  item1  col0  0.77  0.01
3   key0  row4  item0  col2  0.15  0.59
4   key1  row0  item2  col1  0.81  0.64
5   key1  row2  item2  col4  0.13  0.88
6   key2  row4  item1  col3  0.88  0.39
7   key1  row4  item1  col1  0.10  0.07
8   key1  row0  item2  col4  0.65  0.02
9   key1  row2  item0  col2  0.35  0.61
10  key2  row0  item2  col1  0.40  0.85
11  key2  row4  item1  col2  0.64  0.25
12  key0  row2  item2  col3  0.50  0.44
13  key0  row4  item1  col4  0.24  0.46
14  key1  row3  item2  col3  0.28  0.11
15  key0  row3  item1  col1  0.31  0.23
16  key0  row0  item2  col3  0.86  0.01
17  key0  row4  item0  col3  0.64  0.21
18  key2  row2  item2  col0  0.13  0.45
19  key0  row2  item0  col4  0.37  0.70

Questions

  1. Why do I get ValueError: Index contains duplicate entries, cannot reshape?

  2. How do I pivot df such that the col values are columns, row values are the index, and mean of val0 are the values?

    col   col0   col1   col2   col3  col4
    row
    row0  0.77  0.605    NaN  0.860  0.65
    row2  0.13    NaN  0.395  0.500  0.25
    row3   NaN  0.310    NaN  0.545   NaN
    row4   NaN  0.100  0.395  0.760  0.24
    
  3. How do I make it so that missing values are 0?

    col   col0   col1   col2   col3  col4
    row
    row0  0.77  0.605  0.000  0.860  0.65
    row2  0.13  0.000  0.395  0.500  0.25
    row3  0.00  0.310  0.000  0.545  0.00
    row4  0.00  0.100  0.395  0.760  0.24
    
  4. Can I get something other than mean, like maybe sum?

    col   col0  col1  col2  col3  col4
    row
    row0  0.77  1.21  0.00  0.86  0.65
    row2  0.13  0.00  0.79  0.50  0.50
    row3  0.00  0.31  0.00  1.09  0.00
    row4  0.00  0.10  0.79  1.52  0.24
    
  5. Can I do more than one aggregation at a time?

           sum                          mean
    col   col0  col1  col2  col3  col4  col0   col1   col2   col3  col4
    row
    row0  0.77  1.21  0.00  0.86  0.65  0.77  0.605  0.000  0.860  0.65
    row2  0.13  0.00  0.79  0.50  0.50  0.13  0.000  0.395  0.500  0.25
    row3  0.00  0.31  0.00  1.09  0.00  0.00  0.310  0.000  0.545  0.00
    row4  0.00  0.10  0.79  1.52  0.24  0.00  0.100  0.395  0.760  0.24
    
  6. Can I aggregate over multiple value columns?

          val0                             val1
    col   col0   col1   col2   col3  col4  col0   col1  col2   col3  col4
    row
    row0  0.77  0.605  0.000  0.860  0.65  0.01  0.745  0.00  0.010  0.02
    row2  0.13  0.000  0.395  0.500  0.25  0.45  0.000  0.34  0.440  0.79
    row3  0.00  0.310  0.000  0.545  0.00  0.00  0.230  0.00  0.075  0.00
    row4  0.00  0.100  0.395  0.760  0.24  0.00  0.070  0.42  0.300  0.46
    
  7. Can I subdivide by multiple columns?

    item item0             item1                         item2
    col   col2  col3  col4  col0  col1  col2  col3  col4  col0   col1  col3  col4
    row
    row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.605  0.86  0.65
    row2  0.35  0.00  0.37  0.00  0.00  0.44  0.00  0.00  0.13  0.000  0.50  0.13
    row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.000  0.28  0.00
    row4  0.15  0.64  0.00  0.00  0.10  0.64  0.88  0.24  0.00  0.000  0.00  0.00
    
  8. Or

    item      item0             item1                         item2
    col        col2  col3  col4  col0  col1  col2  col3  col4  col0  col1  col3  col4
    key  row
    key0 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.86  0.00
         row2  0.00  0.00  0.37  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.50  0.00
         row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.00  0.00  0.00
         row4  0.15  0.64  0.00  0.00  0.00  0.00  0.00  0.24  0.00  0.00  0.00  0.00
    key1 row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.81  0.00  0.65
         row2  0.35  0.00  0.00  0.00  0.00  0.44  0.00  0.00  0.00  0.00  0.00  0.13
         row3  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.28  0.00
         row4  0.00  0.00  0.00  0.00  0.10  0.00  0.00  0.00  0.00  0.00  0.00  0.00
    key2 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.40  0.00  0.00
         row2  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.13  0.00  0.00  0.00
         row4  0.00  0.00  0.00  0.00  0.00  0.64  0.88  0.00  0.00  0.00  0.00  0.00
    
  9. Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?

    col   col0  col1  col2  col3  col4
    row
    row0     1     2     0     1     1
    row2     1     0     2     1     2
    row3     0     1     0     2     0
    row4     0     1     2     2     1
    
  10. How do I convert a DataFrame from long to wide by pivoting on ONLY two columns? Given,

    np.random.seed([3, 1415])
    df2 = pd.DataFrame({'A': list('aaaabbbc'), 'B': np.random.choice(15, 8)})
    df2
       A   B
    0  a   0
    1  a  11
    2  a   2
    3  a  11
    4  b  10
    5  b  10
    6  b  14
    7  c   7
    

    The expected should look something like

          a     b    c
    0   0.0  10.0  7.0
    1  11.0  10.0  NaN
    2   2.0  14.0  NaN
    3  11.0   NaN  NaN
    
  11. How do I flatten the multiple index to single index after pivot?

    From

       1  2
       1  1  2
    a  2  1  1
    b  2  1  0
    c  1  0  0
    

    To

       1|1  2|1  2|2
    a    2    1    1
    b    2    1    0
    c    1    0    0
    


浅忆 2025-02-15 03:57:28

Here is a list of idioms we can use to pivot

  1. pd.DataFrame.pivot_table

    • A glorified version of groupby with more intuitive API. For many people, this is the preferred approach. And it is the intended approach by the developers.
    • Specify row level, column levels, values to be aggregated, and function(s) to perform aggregations.
  2. pd.DataFrame.groupby + pd.DataFrame.unstack

    • Good general approach for doing just about any type of pivot
    • You specify all columns that will constitute the pivoted row levels and column levels in one group by. You follow that by selecting the remaining columns you want to aggregate and the function(s) you want to perform the aggregation. Finally, you unstack the levels that you want to be in the column index.
  3. pd.DataFrame.set_index + pd.DataFrame.unstack

    • Convenient and intuitive for some (myself included). Cannot handle duplicate grouped keys.
    • Similar to the groupby paradigm, we specify all columns that will eventually be either row or column levels and set those to be the index. We then unstack the levels we want in the columns. If either the remaining index levels or column levels are not unique, this method will fail.
  4. pd.DataFrame.pivot

    • Very similar to set_index in that it shares the duplicate key limitation. The API is very limited as well. It only takes scalar values for index, columns, values.
    • Similar to the pivot_table method in that we select rows, columns, and values on which to pivot. However, we cannot aggregate and if either rows or columns are not unique, this method will fail.
  5. pd.crosstab

    • This is a specialized version of pivot_table and in its purest form is the most intuitive way to perform several tasks.
  6. pd.factorize + np.bincount

    • This is a highly advanced technique that is very obscure but is very fast. It cannot be used in all circumstances, but when it can be used and you are comfortable using it, you will reap the performance rewards.
  7. pd.get_dummies + pd.DataFrame.dot

    • I use this for cleverly performing cross tabulation.
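
To make the relationship between the first few idioms concrete, here is a minimal sketch on a tiny hypothetical frame (`toy` and its values are made up for illustration and are not the `df` from the question). It assumes a reasonably recent pandas and shows pivot_table, groupby + unstack, and crosstab producing the same table of means.

```python
import pandas as pd

# Hypothetical toy data, chosen so that ('r1', 'c0') occurs twice
toy = pd.DataFrame({
    'row': ['r0', 'r0', 'r1', 'r1', 'r1'],
    'col': ['c0', 'c1', 'c0', 'c0', 'c1'],
    'val': [1.0, 2.0, 3.0, 5.0, 7.0],
})

# Idiom 1: pivot_table
via_pivot_table = toy.pivot_table(
    values='val', index='row', columns='col', aggfunc='mean')

# Idiom 2: groupby on all row/column keys, aggregate, then unstack 'col'
via_groupby = toy.groupby(['row', 'col'])['val'].mean().unstack()

# Idiom 5: crosstab with explicit values and aggfunc
via_crosstab = pd.crosstab(
    index=toy['row'], columns=toy['col'],
    values=toy['val'], aggfunc='mean')

print(via_pivot_table)
```

All three frames hold the same numbers; which one to reach for is mostly a matter of which API reads best for the task at hand.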


Question 1

Why do I get ValueError: Index contains duplicate entries, cannot reshape?

This occurs because pandas is attempting to reindex either a columns or index object with duplicate entries. Several methods can perform a pivot, but some of them are not well suited to data that contains duplicates of the keys being pivoted on. For example, consider pd.DataFrame.pivot. I know there are duplicate entries that share the row and col values:

df.duplicated(['row', 'col']).any()

True

So when I pivot using

df.pivot(index='row', columns='col', values='val0')

I get the error mentioned above. In fact, I get the same error when I try to perform the same task with:

df.set_index(['row', 'col'])['val0'].unstack()
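
As a rough illustration of the failure and two possible ways around it, the sketch below uses a small hypothetical frame (`toy` is not the question's `df`). Which remedy is appropriate depends on what the duplicates mean in your data; aggregating and dropping are only two of the options.

```python
import pandas as pd

# Hypothetical toy data: the key pair ('r0', 'c0') appears twice
toy = pd.DataFrame({
    'row': ['r0', 'r0', 'r1'],
    'col': ['c0', 'c0', 'c1'],
    'val': [1.0, 3.0, 5.0],
})

# Detect the problem before pivoting
has_dupes = toy.duplicated(['row', 'col']).any()

# pivot cannot decide which of the two values belongs in cell (r0, c0)
try:
    toy.pivot(index='row', columns='col', values='val')
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print(err)

# Option A: aggregate the duplicates (here with the mean)
aggregated = toy.pivot_table(
    index='row', columns='col', values='val', aggfunc='mean')

# Option B: keep only the first occurrence of each key pair, then pivot
deduped = toy.drop_duplicates(['row', 'col']).pivot(
    index='row', columns='col', values='val')
```

Option A keeps information from every row, while Option B silently discards the later duplicates, so it only makes sense when the extra rows really are redundant.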

Examples

What I'm going to do for each subsequent question is to answer it using pd.DataFrame.pivot_table. Then I'll provide alternatives to perform the same task.

Questions 2 and 3

How do I pivot df such that the col values are columns, row values are the index, and mean of val0 are the values?

  • pd.DataFrame.pivot_table

    df.pivot_table(
        values='val0', index='row', columns='col',
        aggfunc='mean')
    
    col   col0   col1   col2   col3  col4
    row                                  
    row0  0.77  0.605    NaN  0.860  0.65
    row2  0.13    NaN  0.395  0.500  0.25
    row3   NaN  0.310    NaN  0.545   NaN
    row4   NaN  0.100  0.395  0.760  0.24
    
    • aggfunc='mean' is the default and I didn't have to set it. I included it to be explicit.

How do I make it so that missing values are 0?

  • pd.DataFrame.pivot_table

    • fill_value is not set by default. I tend to set it appropriately. In this case I set it to 0.
    df.pivot_table(
        values='val0', index='row', columns='col',
        fill_value=0, aggfunc='mean')
    
    col   col0   col1   col2   col3  col4
    row
    row0  0.77  0.605  0.000  0.860  0.65
    row2  0.13  0.000  0.395  0.500  0.25
    row3  0.00  0.310  0.000  0.545  0.00
    row4  0.00  0.100  0.395  0.760  0.24
    
  • pd.DataFrame.groupby

    df.groupby(['row', 'col'])['val0'].mean().unstack(fill_value=0)
    
  • pd.crosstab

    pd.crosstab(
        index=df['row'], columns=df['col'],
        values=df['val0'], aggfunc='mean').fillna(0)
    

Question 4

Can I get something other than mean, like maybe sum?

  • pd.DataFrame.pivot_table

    df.pivot_table(
        values='val0', index='row', columns='col',
        fill_value=0, aggfunc='sum')
    
    col   col0  col1  col2  col3  col4
    row
    row0  0.77  1.21  0.00  0.86  0.65
    row2  0.13  0.00  0.79  0.50  0.50
    row3  0.00  0.31  0.00  1.09  0.00
    row4  0.00  0.10  0.79  1.52  0.24
    
  • pd.DataFrame.groupby

    df.groupby(['row', 'col'])['val0'].sum().unstack(fill_value=0)
    
  • pd.crosstab

    pd.crosstab(
        index=df['row'], columns=df['col'],
        values=df['val0'], aggfunc='sum').fillna(0)
    

Question 5

Can I do more that one aggregation at a time?

Notice that for pivot_table and crosstab I passed a list of callables. groupby.agg, on the other hand, accepts strings for a limited number of special functions. It would also have accepted the same callables we passed to the others, but the string names are often more efficient because they dispatch to optimized implementations.

  • pd.DataFrame.pivot_table

    df.pivot_table(
        values='val0', index='row', columns='col',
        fill_value=0, aggfunc=[np.size, np.mean])
    
         size                      mean
    col  col0 col1 col2 col3 col4  col0   col1   col2   col3  col4
    row
    row0    1    2    0    1    1  0.77  0.605  0.000  0.860  0.65
    row2    1    0    2    1    2  0.13  0.000  0.395  0.500  0.25
    row3    0    1    0    2    0  0.00  0.310  0.000  0.545  0.00
    row4    0    1    2    2    1  0.00  0.100  0.395  0.760  0.24
    
  • pd.DataFrame.groupby

    df.groupby(['row', 'col'])['val0'].agg(['size', 'mean']).unstack(fill_value=0)
    
  • pd.crosstab

    pd.crosstab(
        index=df['row'], columns=df['col'],
        values=df['val0'], aggfunc=[np.size, np.mean]).fillna(0, downcast='infer')
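
In the pandas versions I am aware of, pivot_table also accepts a list of string names, not only callables. Treat that as an assumption to check against your installed version. The sketch below uses a hypothetical `toy` frame rather than the question's `df`.

```python
import pandas as pd

# Hypothetical toy data: ('r0', 'c0') occurs twice, ('r0', 'c1') never
toy = pd.DataFrame({
    'row': ['r0', 'r0', 'r1'],
    'col': ['c0', 'c0', 'c1'],
    'val': [1.0, 3.0, 5.0],
})

# Two aggregations at once, requested by name; the function names
# become the outer level of the resulting column MultiIndex
both = toy.pivot_table(
    values='val', index='row', columns='col',
    fill_value=0, aggfunc=['sum', 'mean'])

print(both)
```

The result has one block of columns per aggregation, in the order the names were given.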
    

Question 6

Can I aggregate over multiple value columns?

  • pd.DataFrame.pivot_table we pass values=['val0', 'val1'] but we could've left that off completely

    df.pivot_table(
        values=['val0', 'val1'], index='row', columns='col',
        fill_value=0, aggfunc='mean')
    
          val0                             val1
    col   col0   col1   col2   col3  col4  col0   col1  col2   col3  col4
    row
    row0  0.77  0.605  0.000  0.860  0.65  0.01  0.745  0.00  0.010  0.02
    row2  0.13  0.000  0.395  0.500  0.25  0.45  0.000  0.34  0.440  0.79
    row3  0.00  0.310  0.000  0.545  0.00  0.00  0.230  0.00  0.075  0.00
    row4  0.00  0.100  0.395  0.760  0.24  0.00  0.070  0.42  0.300  0.46
    
  • pd.DataFrame.groupby

    # select several value columns with a list; tuple-style selection is rejected by pandas 2.0+
    df.groupby(['row', 'col'])[['val0', 'val1']].mean().unstack(fill_value=0)
    

Question 7

Can I subdivide by multiple columns?

  • pd.DataFrame.pivot_table

    df.pivot_table(
        values='val0', index='row', columns=['item', 'col'],
        fill_value=0, aggfunc='mean')
    
    item item0             item1                         item2
    col   col2  col3  col4  col0  col1  col2  col3  col4  col0   col1  col3  col4
    row
    row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.605  0.86  0.65
    row2  0.35  0.00  0.37  0.00  0.00  0.44  0.00  0.00  0.13  0.000  0.50  0.13
    row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.000  0.28  0.00
    row4  0.15  0.64  0.00  0.00  0.10  0.64  0.88  0.24  0.00  0.000  0.00  0.00
    
  • pd.DataFrame.groupby

    df.groupby(
        ['row', 'item', 'col']
    )['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(axis=1)
    

Question 8

Can I subdivide by multiple columns?

  • pd.DataFrame.pivot_table

    df.pivot_table(
        values='val0', index=['key', 'row'], columns=['item', 'col'],
        fill_value=0, aggfunc='mean')
    
    item      item0             item1                         item2
    col        col2  col3  col4  col0  col1  col2  col3  col4  col0  col1  col3  col4
    key  row
    key0 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.86  0.00
         row2  0.00  0.00  0.37  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.50  0.00
         row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.00  0.00  0.00
         row4  0.15  0.64  0.00  0.00  0.00  0.00  0.00  0.24  0.00  0.00  0.00  0.00
    key1 row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.81  0.00  0.65
         row2  0.35  0.00  0.00  0.00  0.00  0.44  0.00  0.00  0.00  0.00  0.00  0.13
         row3  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.28  0.00
         row4  0.00  0.00  0.00  0.00  0.10  0.00  0.00  0.00  0.00  0.00  0.00  0.00
    key2 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.40  0.00  0.00
         row2  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.13  0.00  0.00  0.00
         row4  0.00  0.00  0.00  0.00  0.00  0.64  0.88  0.00  0.00  0.00  0.00  0.00
    
  • pd.DataFrame.groupby

    df.groupby(
        ['key', 'row', 'item', 'col']
    )['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(axis=1)
    
  • pd.DataFrame.set_index, because the set of keys is unique for both rows and columns

    df.set_index(
        ['key', 'row', 'item', 'col']
    )['val0'].unstack(['item', 'col']).fillna(0).sort_index(axis=1)
    

Question 9

Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?

  • pd.DataFrame.pivot_table

    df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')
    
    col   col0  col1  col2  col3  col4
    row
    row0     1     2     0     1     1
    row2     1     0     2     1     2
    row3     0     1     0     2     0
    row4     0     1     2     2     1
    
  • pd.DataFrame.groupby

    df.groupby(['row', 'col'])['val0'].size().unstack(fill_value=0)
    
  • pd.crosstab

    pd.crosstab(df['row'], df['col'])
    
  • pd.factorize + np.bincount

    # get integer factorization `i` and unique values `r`
    # for column `'row'`
    i, r = pd.factorize(df['row'].values)
    # get integer factorization `j` and unique values `c`
    # for column `'col'`
    j, c = pd.factorize(df['col'].values)
    # `n` will be the number of rows
    # `m` will be the number of columns
    n, m = r.size, c.size
    # `i * m + j` is a clever way of counting the
    # factorization bins assuming a flat array of length
    # `n * m`.  Which is why we subsequently reshape as `(n, m)`
    b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
    # BTW, whenever I read this, I think 'Bean, Rice, and Cheese'
    pd.DataFrame(b, r, c)
    
          col3  col2  col0  col1  col4
    row3     2     0     0     1     0
    row2     1     2     1     0     2
    row0     1     0     1     2     1
    row4     2     2     0     1     1
    
  • pd.get_dummies

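    # dtype=int keeps the dot product a count; get_dummies returns booleans by default in pandas 2.x
    pd.get_dummies(df['row'], dtype=int).T.dot(pd.get_dummies(df['col'], dtype=int))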
    
          col0  col1  col2  col3  col4
    row0     1     2     0     1     1
    row2     1     0     2     1     2
    row3     0     1     0     2     0
    row4     0     1     2     2     1
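
As a sanity check, the approaches above can be compared side by side. The sketch below uses a hypothetical six-row frame named `toy` (an assumption for illustration, not the `df` from the question) and builds the same cross tabulation four ways. `dtype=int` is passed to `pd.get_dummies` because recent pandas versions return boolean dummies by default, and a boolean dot product would not yield counts.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the `df` used in the question
toy = pd.DataFrame({
    'row': ['r0', 'r0', 'r1', 'r1', 'r1', 'r2'],
    'col': ['c0', 'c1', 'c0', 'c0', 'c1', 'c1'],
})

by_crosstab = pd.crosstab(toy['row'], toy['col'])
by_groupby = toy.groupby(['row', 'col']).size().unstack(fill_value=0)
by_dummies = pd.get_dummies(toy['row'], dtype=int).T.dot(
    pd.get_dummies(toy['col'], dtype=int)
)

# factorize + bincount: labels come back in order of first appearance,
# so sort both axes before comparing with the other results
i, r = pd.factorize(toy['row'])
j, c = pd.factorize(toy['col'])
n, m = r.size, c.size
counts = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
by_bincount = pd.DataFrame(counts, r, c).sort_index().sort_index(axis=1)

print(by_crosstab)
```

All four results hold the same counts. Only the factorize-based version needs explicit sorting, since it keeps labels in order of first appearance.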
    

Question 10

How do I convert a DataFrame from long to wide by pivoting on ONLY two
columns?

  • DataFrame.pivot

    The first step is to assign a number to each row - this number will be the row index of that value in the pivoted result. This is done using GroupBy.cumcount:

    df2.insert(0, 'count', df2.groupby('A').cumcount())
    df2
    
       count  A   B
    0      0  a   0
    1      1  a  11
    2      2  a   2
    3      3  a  11
    4      0  b  10
    5      1  b  10
    6      2  b  14
    7      0  c   7
    

    The second step is to use the newly created column as the index to call DataFrame.pivot.

    df2.pivot(index='count', columns='A', values='B')
    # In pandas < 2.0 the shorthand df2.pivot(*df2) also worked; since 2.0 the
    # arguments of pivot are keyword-only
    
    A         a     b    c
    count
    0       0.0  10.0  7.0
    1      11.0  10.0  NaN
    2       2.0  14.0  NaN
    3      11.0   NaN  NaN
    
  • DataFrame.pivot_table

    Whereas DataFrame.pivot only accepts columns, DataFrame.pivot_table also accepts arrays, so the GroupBy.cumcount can be passed directly as the index without creating an explicit column.

    df2.pivot_table(index=df2.groupby('A').cumcount(), columns='A', values='B')
    
    A         a     b    c
    0       0.0  10.0  7.0
    1      11.0  10.0  NaN
    2       2.0  14.0  NaN
    3      11.0   NaN  NaN
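
For readers who want to run the two steps end to end, here is a minimal self-contained sketch. `df2` is rebuilt from the output printed above (an assumption, since its definition is not shown in this part of the answer), and `assign` is used instead of `insert` so the original frame is left unmodified.

```python
import pandas as pd

# df2 rebuilt from the printed output above (assumed to match the original)
df2 = pd.DataFrame({
    'A': list('aaaabbbc'),
    'B': [0, 11, 2, 11, 10, 10, 14, 7],
})

# Number each row within its 'A' group: 0, 1, 2, ... restarting per group
position = df2.groupby('A').cumcount()

wide = (
    df2.assign(count=position)
       .pivot(index='count', columns='A', values='B')
)
print(wide)
```

Groups shorter than the longest one are padded with NaN, which is why the values come back as floats.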
    

Question 11

How do I flatten a MultiIndex to a single index after pivot?

If the column labels are all strings, use join:

df.columns = df.columns.map('|'.join)

Otherwise, use format:

df.columns = df.columns.map('{0[0]}|{0[1]}'.format)
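
A small sketch of the second case, assuming a hypothetical `sales` frame whose pivoted columns mix a string level with an integer level. With such labels `'|'.join` would raise a TypeError, so each tuple is formatted instead.

```python
import pandas as pd

# Hypothetical data for illustration
sales = pd.DataFrame({
    'shop': ['s1', 's1', 's2', 's2'],
    'item': ['x', 'y', 'x', 'y'],
    'year': [2020, 2020, 2020, 2020],
    'qty': [1, 2, 3, 4],
})

wide = sales.pivot_table(index='shop', columns=['item', 'year'], values='qty')

# Column labels are tuples such as ('x', 2020); format each tuple into
# a single string label
wide.columns = wide.columns.map('{0[0]}|{0[1]}'.format)
print(wide.columns.tolist())
```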
つ低調成傷 2025-02-15 03:57:28

To extend @piRSquared's answer, here is another version of Question 10.

Question 10.1

DataFrame:

d = data = {'A': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 5},
 'B': {0: 'a', 1: 'b', 2: 'c', 3: 'a', 4: 'b', 5: 'a', 6: 'c'}}
df = pd.DataFrame(d)

   A  B
0  1  a
1  1  b
2  1  c
3  2  a
4  2  b
5  3  a
6  5  c

Output:

   0     1     2
A
1  a     b     c
2  a     b  None
3  a  None  None
5  c  None  None

Using df.groupby and pd.Series.tolist

t = df.groupby('A')['B'].apply(list)
out = pd.DataFrame(t.tolist(),index=t.index)
out
   0     1     2
A
1  a     b     c
2  a     b  None
3  a  None  None
5  c  None  None

Or, a much better alternative, using pd.pivot_table with df.squeeze:

t = df.pivot_table(index='A',values='B',aggfunc=list).squeeze()
out = pd.DataFrame(t.tolist(),index=t.index)
祁梦 2025-02-15 03:57:28

To better understand how the function pivot works, you can look at the example from the pandas documentation. However, pivot will fail if you have repeating index-columns (foo-bar) combinations (like df in the second example):

[Image: pivot]

In contrast to pivot, the function pivot_table supports data aggregation, using the mean function by default. Here is an example with the sum aggregation function:

[Image: pivot_table]
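
Since the images are not reproduced here, the following sketch illustrates both points in code. The frame `dup` is hypothetical and loosely modeled on the foo/bar naming used above. It contains one repeated (foo, bar) pair, so `pivot` raises a ValueError while `pivot_table` aggregates the duplicates.

```python
import pandas as pd

# Hypothetical data with a repeated (foo, bar) pair: ('one', 'A') occurs twice
dup = pd.DataFrame({
    'foo': ['one', 'one', 'two', 'two'],
    'bar': ['A', 'A', 'B', 'C'],
    'baz': [1, 2, 3, 4],
})

try:
    dup.pivot(index='foo', columns='bar', values='baz')
    pivot_failed = False
except ValueError as err:
    # pandas reports that the index contains duplicate entries
    pivot_failed = True
    print(err)

# pivot_table aggregates the duplicates instead of failing
summed = dup.pivot_table(index='foo', columns='bar', values='baz', aggfunc='sum')
print(summed)
```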

冷血 2025-02-15 03:57:28

Call reset_index() (along with add_prefix())

Oftentimes, reset_index() is needed after you call pivot_table or pivot. For example, to make the following transformation (where the values of one column become column names)

[Image: res]

you can use the following code: after pivot, you add a prefix to the newly created column names, convert the index (in this case "movie") back into a column, and remove the name of the columns axis:

df.pivot(index='movie', columns='week', values='sales').add_prefix('week_').reset_index().rename_axis(columns=None)
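
To see each step of that chain in isolation, here is a minimal runnable sketch. The frame `long_df` and its values are made up for illustration; only the column names follow the snippet above.

```python
import pandas as pd

# Hypothetical long-format data; column names follow the snippet above
long_df = pd.DataFrame({
    'movie': ['m1', 'm1', 'm2', 'm2'],
    'week': [1, 2, 1, 2],
    'sales': [10, 20, 30, 40],
})

wide = (
    long_df.pivot(index='movie', columns='week', values='sales')
           .add_prefix('week_')          # 1, 2 -> week_1, week_2
           .reset_index()                # 'movie' becomes a regular column
           .rename_axis(columns=None)    # drop the leftover 'week' axis name
)
print(wide)
```

The result is a flat frame with the columns movie, week_1 and week_2, which is usually what downstream code (merges, exports) expects.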

As the other answers mentioned, "pivot" may refer to 2 different operations:

  1. Unstacked aggregation (i.e. make the results of groupby.agg wider.)
  2. Reshaping (similar to pivot in Excel, reshape in numpy or pivot_wider in R)

1. Aggregation

pivot_table or crosstab are simply unstacked results of groupby.agg operation. In fact, the source code shows that, under the hood, the following are true:

  • pivot_table = groupby + unstack (read here for more info.)
  • crosstab = pivot_table

N.B. You can use a list of column names for each of the index, columns and values arguments.

df.groupby(rows+cols)[vals].agg(aggfuncs).unstack(cols)
# equivalently,
df.pivot_table(values=vals, index=rows, columns=cols, aggfunc=aggfuncs)

1.1. crosstab is a special case of pivot_table, and thus of groupby + unstack

The following are equivalent:

  • pd.crosstab(df['colA'], df['colB'])
  • df.pivot_table(index='colA', columns='colB', aggfunc='size', fill_value=0)
  • df.groupby(['colA', 'colB']).size().unstack(fill_value=0)

Note that pd.crosstab has a significantly larger overhead, so it's significantly slower than both pivot_table and groupby + unstack. In fact, as noted here, pivot_table is slower than groupby + unstack as well.

2. Reshaping

pivot is a more limited version of pivot_table whose purpose is to reshape a long dataframe into a wide one.

df.set_index(rows+cols)[vals].unstack(cols)
# equivalently, 
df.pivot(index=rows, columns=cols, values=vals)
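
The equivalence can be tried out on a small frame. The sketch below assumes a hypothetical `long_df` in which every (A, B, C) combination is unique, which is the precondition for both forms.

```python
import pandas as pd

# Hypothetical long data with unique (A, B, C) combinations
long_df = pd.DataFrame({
    'A': [1, 1, 2, 2],
    'B': ['x', 'x', 'y', 'y'],
    'C': ['c', 'd', 'c', 'd'],
    'E': [100, 200, 300, 400],
})
rows, cols, vals = ['A', 'B'], ['C'], 'E'

via_unstack = long_df.set_index(rows + cols)[vals].unstack(cols)
via_pivot = long_df.pivot(index=rows, columns=cols, values=vals)

print(via_pivot)
```

Both frames have (A, B) as the row index and the values of C as column labels.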

2.1. Augment rows/columns as in Question 10

You can also apply the insight from Question 10 to multi-column pivot operations. There are two cases:

  • "long-to-long": reshape by augmenting the indices

    [Image: case1]

    Code:

    df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2], 'B': [*'xxyyzz'], 
                       'C': [*'CCDCDD'], 'E': [100, 200, 300, 400, 500, 600]})
    rows, cols, vals = ['A', 'B'], ['C'], 'E'
    
    # using pivot syntax
    df1 = (
        df.assign(ix=df.groupby(rows+cols).cumcount())
        .pivot(index=[*rows, 'ix'], columns=cols, values=vals)
        .fillna(0, downcast='infer')
        .droplevel(-1).reset_index().rename_axis(columns=None)
    )
    
    # equivalently, using set_index + unstack syntax
    df1 = (
        df
        .set_index([*rows, df.groupby(rows+cols).cumcount(), *cols])[vals]
        .unstack(fill_value=0)
        .droplevel(-1).reset_index().rename_axis(columns=None)
    )
    
  • "long-to-wide": reshape by augmenting the columns

    [Image: case2]

    Code:

    df1 = (
        df.assign(ix=df.groupby(rows+cols).cumcount())
        .pivot(index=rows, columns=[*cols, 'ix'])[vals]
        .fillna(0, downcast='infer')
    )
    df1 = df1.set_axis([f"{c[0]}_{c[1]}" for c in df1], axis=1).reset_index()
    
    # equivalently, using the set_index + unstack syntax
    df1 = (
        df
        .set_index([*rows, df.groupby(rows+cols).cumcount(), *cols])[vals]
        .unstack([-1, *range(-2, -len(cols)-2, -1)], fill_value=0)
    )
    df1 = df1.set_axis([f"{c[0]}_{c[1]}" for c in df1], axis=1).reset_index()
    
  • minimal case using the set_index + unstack syntax:

    [Image: case3]

    Code:

    df1 = df.set_index(['A', df.groupby('A').cumcount()])['E'].unstack(fill_value=0).add_prefix('Col').reset_index()
    

1 pivot_table() aggregates the values and unstacks the result. Specifically, it creates a single flat list out of index and columns, calls groupby() with this list as the grouper, and aggregates using the passed aggregator methods (the default is mean). After aggregation, it calls unstack() with the list of columns. So internally, pivot_table = groupby + unstack. Moreover, if fill_value is passed, fillna() is called.

In other words, the method that produces pv_1 is the same as the method that produces gb_1 in the example below.

pv_1 = df.pivot_table(index=rows, columns=cols, values=vals, aggfunc=aggfuncs, fill_value=0)
# internal operation of `pivot_table()`
gb_1 = df.groupby(rows+cols)[vals].agg(aggfuncs).unstack(cols).fillna(0, downcast="infer")
pv_1.equals(gb_1) # True
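
A concrete way to try this out is sketched below. It reuses the six-row frame from section 2.1 under the name `demo`, picks 'sum' as the aggregation (an arbitrary choice for illustration), and leaves out fill_value so that no fillna step is involved. The comparison is done on labels and values rather than with equals, since exact dtypes may vary between pandas versions.

```python
import pandas as pd

# Same data as the frame used in section 2.1
demo = pd.DataFrame({
    'A': [1, 1, 1, 2, 2, 2],
    'B': [*'xxyyzz'],
    'C': [*'CCDCDD'],
    'E': [100, 200, 300, 400, 500, 600],
})
rows, cols, vals = ['A', 'B'], ['C'], 'E'

pv = demo.pivot_table(index=rows, columns=cols, values=vals, aggfunc='sum')
gb = demo.groupby(rows + cols)[vals].agg('sum').unstack(cols)

same_labels = pv.index.equals(gb.index) and pv.columns.equals(gb.columns)
same_values = pv.fillna(0).values.tolist() == gb.fillna(0).values.tolist()
print(same_labels, same_values)
```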

2 crosstab() calls pivot_table(), i.e., crosstab = pivot_table. Specifically, it builds a DataFrame out of the passed arrays of values, filters it by the common indices and calls pivot_table(). It's more limited than pivot_table() because it only allows a one-dimensional array-like as values, unlike pivot_table() that can have multiple columns as values.

半葬歌 2025-02-15 03:57:28

The pivot function in pandas has the same functionality as the pivot operation in Excel. We can transform a dataset from a long format to a wide format.

Let's look at an example.

We want to convert the dataset into a form such that each country becomes a column, with the new confirmed cases as the values corresponding to each country. We can perform this data manipulation using the pivot function.

Pivot the dataset

pivot_df = pd.pivot(df, index=['Date'], columns='Country', values=['NewConfirmed'])
# values was passed as a list, so the columns are a two-level MultiIndex;
# replace them with the sorted country names
pivot_df.columns = df['Country'].sort_values().unique()

We can bring the new columns to the same level as the index column Date by resetting the index.

Reset the index to modify the column levels:

pivot_df = pivot_df.reset_index()
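
Because the original data was only shown as an image, here is a self-contained variant on made-up numbers. The frame `covid` is hypothetical; only the column names follow the answer. As a design choice, values is passed as a string rather than a list, so the result already has single-level columns and the manual renaming step is not needed.

```python
import pandas as pd

# Hypothetical sample; the original data was only shown as an image
covid = pd.DataFrame({
    'Date': ['2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02'],
    'Country': ['India', 'Brazil', 'India', 'Brazil'],
    'NewConfirmed': [10, 20, 30, 40],
})

pivot_df = (
    covid.pivot(index='Date', columns='Country', values='NewConfirmed')
         .rename_axis(columns=None)   # drop the 'Country' axis name
         .reset_index()               # 'Date' becomes a regular column
)
print(pivot_df)
```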
