从熊猫中的宽形式组成的小组

发布于 2025-02-07 19:17:15 字数 1849 浏览 1 评论 0原文

我有这样的数据框架:

customer_type   age    satisfaction    design  food    wifi    service    distance
        Loyal    28   Not Satisfied         0    1        2          2        13.5
        Loyal    55       Satisfied         5    3        5          4        34.2
     Disloyal    36   Not Satisfied         2    0        2          4        55.8
     Disloyal    28   Not Satisfied         3    1        2          2        13.5
     Disloyal    33   Not Satisfied         2    1        2          2        13.5
     Disloyal    35   Not Satisfied         2    1        2          2        13.5
     Disloyal    39   Not Satisfied         1    1        2          2        13.5
     Disloyal    31   Not Satisfied         2    1        2          2        13.5
        Loyal    28   Not Satisfied         0    1        2          2        13.5
     Disloyal    31   Not Satisfied         2    1        2          2        13.5
     Disloyal    40   Not Satisfied         2    1        2          2        13.5
     Disloyal    35   Not Satisfied         2    1        2          2        13.5
     Disloyal    35   Not Satisfied         2    2        2          2        13.5

我想找出不忠不满足的客户的特征他们的评分:

 service   ratings_count    age   age_count   population_pct
  design               8    40    1           7.69
                            36    1           7.69
                            35    3           23.07
                            33    1           7.69
                            31    2           15.38
    food               1    35    1           7.69

我怀疑我必须使用代码> ,但我无法从那里弄清楚如何groupby

I have a DataFrame like this one:

customer_type   age    satisfaction    design  food    wifi    service    distance
        Loyal    28   Not Satisfied         0    1        2          2        13.5
        Loyal    55       Satisfied         5    3        5          4        34.2
     Disloyal    36   Not Satisfied         2    0        2          4        55.8
     Disloyal    28   Not Satisfied         3    1        2          2        13.5
     Disloyal    33   Not Satisfied         2    1        2          2        13.5
     Disloyal    35   Not Satisfied         2    1        2          2        13.5
     Disloyal    39   Not Satisfied         1    1        2          2        13.5
     Disloyal    31   Not Satisfied         2    1        2          2        13.5
        Loyal    28   Not Satisfied         0    1        2          2        13.5
     Disloyal    31   Not Satisfied         2    1        2          2        13.5
     Disloyal    40   Not Satisfied         2    1        2          2        13.5
     Disloyal    35   Not Satisfied         2    1        2          2        13.5
     Disloyal    35   Not Satisfied         2    2        2          2        13.5

I want to find out the characteristics of the Disloyal and Not Satisfied customers that are between 30 and 40 years old, grouping them by the service they have rated:

 service   ratings_count    age   age_count   population_pct
  design               8    40    1           7.69
                            36    1           7.69
                            35    3           23.07
                            33    1           7.69
                            31    2           15.38
    food               1    35    1           7.69

I suspect I have to use melt but I can't figure out how to groupby from there.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

年少掌心 2025-02-14 19:17:15

使用以下玩具数据框,受您的启发,但有点异质

import pandas as pd

df = pd.DataFrame(
    {
        "customer_type": [
            "Loyal",
            "Loyal",
            "Disloyal",
            "Disloyal",
            "Loyal",
            "Disloyal",
            "Disloyal",
            "Disloyal",
            "Loyal",
            "Disloyal",
            "Disloyal",
            "Loyal",
            "Disloyal",
        ],
        "age": [28, 55, 27, 31, 42, 35, 39, 31, 28, 51, 40, 35, 35],
        "satisfaction": [
            "Not Satisfied",
            "Satisfied",
            "Not Satisfied",
            "Not Satisfied",
            "Not Satisfied",
            "Not Satisfied",
            "Not Satisfied",
            "Not Satisfied",
            "Not Satisfied",
            "Satisfied",
            "Not Satisfied",
            "Not Satisfied",
            "Satisfied",
        ],
        "design": [0, 5, 2, 4, 2, 2, 1, 0, 1, 2, 0, 2, 1],
        "food": [1, 3, 0, 0, 1, 0, 1, 0, 1, 0, 1, 3, 2],
        "wifi": [2, 5, 2, 1, 2, 2, 0, 0, 1, 2, 2, 0, 2],
        "service": [2, 4, 1, 2, 2, 2, 0, 2, 3, 0, 1, 2, 2],
        "distance": [
            13.5,
            34.2,
            55.8,
            21.2,
            23.5,
            13.5,
            9.0,
            13.5,
            13.5,
            22.7,
            13.5,
            13.5,
            18.4,
        ],
    }
)

这是一个较少的子尾部,尽管更容易做到这一点:

# Setup
amenities = ["design", "food", "wifi", "service"]
dfs = []
new_df = df[
    df["age"].between(30, 40)
    & (df["customer_type"] == "Disloyal")
    & (df["satisfaction"] == "Not Satisfied")
]

# Iterate on each amenity, groupby age and add other columns/values
for amenity in amenities:
    temp_df = (
        new_df[new_df[amenity] > 0]
        .groupby("age")
        .agg({amenity: "count", "distance": "mean"})
        .reset_index(drop=False)
        .rename(columns={amenity: "age_count", "distance": "distance_mean"})
        .pipe(
            lambda df_: df_.assign(
                population_pct=(100 * df_["age_count"] / df_["age_count"].sum()).round(
                    2
                )
            )
        )
    )
    temp_df.loc[0, "amenities"] = amenity
    temp_df.loc[0, "ratings_count"] = temp_df["age_count"].sum()
    temp_df = pd.concat(
        [
            temp_df,
            new_df[(new_df[amenity] != 0)]
            .sort_values(by=["age"])
            .groupby("age")
            .agg({amenity: list})
            .reset_index(drop=True),
        ],
        axis=1,
    )
    temp_df = temp_df.rename(columns={amenity: "ratings"})
    temp_df["ratings"] = temp_df["ratings"].apply(lambda x: x[0] if len(x) == 1 else x)
    dfs.append(temp_df)

# Get final dataframe and cleanup
new_df = (
    pd.concat(dfs)
    .fillna(method="ffill")
    .reindex(
        columns=[
            "amenities",
            "ratings_count",
            "age",
            "age_count",
            "ratings",
            "distance_mean",
            "population_pct",
        ]
    )
    .astype({"ratings_count": "int"})
    .set_index(["amenities", "ratings_count"])
)

因此:

print(new_df)
# Output
                         age  age_count ratings  distance_mean  population_pct
amenities ratings_count                                                       
design    3               31          1       4          21.20           33.33
          3               35          1       2          13.50           33.33
          3               39          1       1           9.00           33.33
food      2               39          1       1           9.00           50.00
          2               40          1       1          13.50           50.00
wifi      3               31          1       1          21.20           33.33
          3               35          1       2          13.50           33.33
          3               40          1       2          13.50           33.33
service   4               31          2  [2, 2]          17.35           50.00
          4               35          1       2          13.50           25.00
          4               40          1       1          13.50           25.00

从这里,您可以过滤ratings_count&lt ; = 2这样:

print(new_df[new_df.index.get_level_values(1) <= 2])
# Output
                         age  age_count ratings  distance_mean  population_pct
amenities ratings_count                                                       
food      2               39          1       1            9.0            50.0
          2               40          1       1           13.5            50.0

With the following toy dataframe, inspired by yours but a bit more heterogeneous:

import pandas as pd

df = pd.DataFrame(
    {
        "customer_type": [
            "Loyal",
            "Loyal",
            "Disloyal",
            "Disloyal",
            "Loyal",
            "Disloyal",
            "Disloyal",
            "Disloyal",
            "Loyal",
            "Disloyal",
            "Disloyal",
            "Loyal",
            "Disloyal",
        ],
        "age": [28, 55, 27, 31, 42, 35, 39, 31, 28, 51, 40, 35, 35],
        "satisfaction": [
            "Not Satisfied",
            "Satisfied",
            "Not Satisfied",
            "Not Satisfied",
            "Not Satisfied",
            "Not Satisfied",
            "Not Satisfied",
            "Not Satisfied",
            "Not Satisfied",
            "Satisfied",
            "Not Satisfied",
            "Not Satisfied",
            "Satisfied",
        ],
        "design": [0, 5, 2, 4, 2, 2, 1, 0, 1, 2, 0, 2, 1],
        "food": [1, 3, 0, 0, 1, 0, 1, 0, 1, 0, 1, 3, 2],
        "wifi": [2, 5, 2, 1, 2, 2, 0, 0, 1, 2, 2, 0, 2],
        "service": [2, 4, 1, 2, 2, 2, 0, 2, 3, 0, 1, 2, 2],
        "distance": [
            13.5,
            34.2,
            55.8,
            21.2,
            23.5,
            13.5,
            9.0,
            13.5,
            13.5,
            22.7,
            13.5,
            13.5,
            18.4,
        ],
    }
)

Here is one less subtel, although easier, way to do it:

# Setup
amenities = ["design", "food", "wifi", "service"]
dfs = []
new_df = df[
    df["age"].between(30, 40)
    & (df["customer_type"] == "Disloyal")
    & (df["satisfaction"] == "Not Satisfied")
]

# Iterate on each amenity, groupby age and add other columns/values
for amenity in amenities:
    temp_df = (
        new_df[new_df[amenity] > 0]
        .groupby("age")
        .agg({amenity: "count", "distance": "mean"})
        .reset_index(drop=False)
        .rename(columns={amenity: "age_count", "distance": "distance_mean"})
        .pipe(
            lambda df_: df_.assign(
                population_pct=(100 * df_["age_count"] / df_["age_count"].sum()).round(
                    2
                )
            )
        )
    )
    temp_df.loc[0, "amenities"] = amenity
    temp_df.loc[0, "ratings_count"] = temp_df["age_count"].sum()
    temp_df = pd.concat(
        [
            temp_df,
            new_df[(new_df[amenity] != 0)]
            .sort_values(by=["age"])
            .groupby("age")
            .agg({amenity: list})
            .reset_index(drop=True),
        ],
        axis=1,
    )
    temp_df = temp_df.rename(columns={amenity: "ratings"})
    temp_df["ratings"] = temp_df["ratings"].apply(lambda x: x[0] if len(x) == 1 else x)
    dfs.append(temp_df)

# Get final dataframe and cleanup
new_df = (
    pd.concat(dfs)
    .fillna(method="ffill")
    .reindex(
        columns=[
            "amenities",
            "ratings_count",
            "age",
            "age_count",
            "ratings",
            "distance_mean",
            "population_pct",
        ]
    )
    .astype({"ratings_count": "int"})
    .set_index(["amenities", "ratings_count"])
)

So that:

print(new_df)
# Output
                         age  age_count ratings  distance_mean  population_pct
amenities ratings_count                                                       
design    3               31          1       4          21.20           33.33
          3               35          1       2          13.50           33.33
          3               39          1       1           9.00           33.33
food      2               39          1       1           9.00           50.00
          2               40          1       1          13.50           50.00
wifi      3               31          1       1          21.20           33.33
          3               35          1       2          13.50           33.33
          3               40          1       2          13.50           33.33
service   4               31          2  [2, 2]          17.35           50.00
          4               35          1       2          13.50           25.00
          4               40          1       1          13.50           25.00

From here, you can filter ratings_count <=2 like this:

print(new_df[new_df.index.get_level_values(1) <= 2])
# Output
                         age  age_count ratings  distance_mean  population_pct
amenities ratings_count                                                       
food      2               39          1       1            9.0            50.0
          2               40          1       1           13.5            50.0
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文