Getting counts with groupby and a where clause in pandas

Published 01-21 00:22


I have a data table with a date field, a created (user account) date, and a user_id.

date        created     customer_id
2022-01-01  2021-05-07  user1
2022-01-02  2022-01-02  user2
2022-01-03  2021-02-02  user3
2022-01-04  2022-01-05  user4
2022-01-05  2022-01-05  user5
2022-01-06  2022-01-08  user6

I want to get a count of new users (based on the created field) grouped by the date field:

date        created     customer_id  new_users (based on the date column)
2022-01-01  2021-05-07  user1        0
2022-01-02  2022-01-02  user2        1
2022-01-03  2021-02-02  user3        0
2022-01-04  2022-01-05  user4        0
2022-01-05  2022-01-05  user5        2
2022-01-06  2022-01-08  user6        0

I tried using groupby, but I was not able to express the condition date == created to get the count of new users for a particular date.


Comments (1)

小清晰的声音 2025-01-28 00:22:03


First of all, I think it is better to split your data into two different tables. In the first table, you have only creation_dates and customer_ids. This is your actual input. It looks like this:

created_table = pd.DataFrame(
    dict(
        created=pd.Series(
            [
                "2022-01-07",
                "2022-01-02",
                "2022-01-05",
                "2022-01-02",
                "2022-01-05",
                "2022-01-05",
            ],
            dtype='datetime64[ns]'
        ),
        customer_id = ['user1', 'user2', 'user5', 'user4', 'user5', 'user6']
    )
)

    created     customer_id
0   2022-01-07  user1
1   2022-01-02  user2
2   2022-01-05  user5
3   2022-01-02  user4
4   2022-01-05  user5
5   2022-01-05  user6

I changed it a little bit to make it more illustrative.

Now as far as I understood, you want to count how many unique customer_ids exist for each date. This can be done with groupby and nunique.

customers_created = created_table.groupby('created')['customer_id'].nunique()

created     customer_id
2022-01-02  2
2022-01-05  2
2022-01-07  1

Now you probably want to join this result with a series of consecutive dates. First, let's create an index with such dates:

dates = pd.date_range(start="2022-01-01", end="2022-01-10", name="date")

Now let's reindex our series customers_created with this new index:

(
    customers_created.reindex(dates, fill_value=0)
    .to_frame()
    .reset_index()
    .rename(columns={"customer_id": "new_users"})
)

    date        new_users
0   2022-01-01  0
1   2022-01-02  2
2   2022-01-03  0
3   2022-01-04  0
4   2022-01-05  2
5   2022-01-06  0
6   2022-01-07  1
7   2022-01-08  0
8   2022-01-09  0
9   2022-01-10  0


Depending on whether you need a Series or a DataFrame, you may drop the last part (i.e. .to_frame(), etc.).

This is probably what you were looking for. In your question, this table was merged with the original table, but I don't think that makes much sense, as there is no relation between the rows of the initial table and the new table.
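That said, if you do want the per-row new_users column shown in the question's expected output, a left merge of the counts back onto the original table (matching date against created) reproduces it. This is a sketch using the question's own example rows:

```python
import pandas as pd

# The original table from the question; customer_id values are taken from its example.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2022-01-01", "2022-01-02", "2022-01-03",
        "2022-01-04", "2022-01-05", "2022-01-06",
    ]),
    "created": pd.to_datetime([
        "2021-05-07", "2022-01-02", "2021-02-02",
        "2022-01-05", "2022-01-05", "2022-01-08",
    ]),
    "customer_id": ["user1", "user2", "user3", "user4", "user5", "user6"],
})

# Count distinct accounts created per creation date.
new_users = df.groupby("created")["customer_id"].nunique().rename("new_users")

# Attach the count to each row by matching `date` against the `created` index;
# dates with no matching creations get 0.
result = (
    df.merge(new_users, left_on="date", right_index=True, how="left")
      .fillna({"new_users": 0})
      .astype({"new_users": int})
)
print(result["new_users"].tolist())  # [0, 1, 0, 0, 2, 0]
```

This matches the expected column from the question (0, 1, 0, 0, 2, 0), since only 2022-01-02 (one creation) and 2022-01-05 (two creations) appear in both columns.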

It is also possible that all values in customer_id are guaranteed to be unique. Then you can replace created_table.groupby('created')['customer_id'].nunique() with created_table['created'].value_counts().
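As a quick illustration (a sketch using this answer's own example data, in which user5 deliberately appears twice), the two calls agree only when the IDs are unique per date:

```python
import pandas as pd

# Same example data as above; user5 repeats on 2022-01-05,
# so customer_id is NOT unique here.
table = pd.DataFrame({
    "created": pd.to_datetime([
        "2022-01-07", "2022-01-02", "2022-01-05",
        "2022-01-02", "2022-01-05", "2022-01-05",
    ]),
    "customer_id": ["user1", "user2", "user5", "user4", "user5", "user6"],
})

per_unique = table.groupby("created")["customer_id"].nunique()
per_row = table["created"].value_counts().sort_index()

# nunique counts user5 once on 2022-01-05, value_counts counts every row:
print(per_unique.tolist())  # [2, 2, 1]
print(per_row.tolist())     # [2, 3, 1]
```

So value_counts() is the cheaper choice only when each customer_id can occur at most once.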
