Getting counts with groupby and a where clause in pandas

Published 01-21 00:22


I have a data table with a date field, a created (user account) date, and a user_id.

date        created     customer_id
2022-01-01  2021-05-07  user1
2022-01-02  2022-01-02  user2
2022-01-03  2021-02-02  user3
2022-01-04  2022-01-05  user4
2022-01-05  2022-01-05  user5
2022-01-06  2022-01-08  user6

I want to get a count of new users (based on the created field) grouped by the date field:

date        created     customer_id  new_users (based on the date column)
2022-01-01  2021-05-07  user1        0
2022-01-02  2022-01-02  user2        1
2022-01-03  2021-02-02  user3        0
2022-01-04  2022-01-05  user4        0
2022-01-05  2022-01-05  user5        2
2022-01-06  2022-01-08  user6        0

I tried using groupby, but I was not able to express the condition date == created to get the count of new users for a particular date.


Comments (1)

小清晰的声音 2025-01-28 00:22:03


First of all, I think it is better to split your data into two different tables. In the first table, you have only creation_dates and customer_ids. This is your actual input. It looks like this:

created_table = pd.DataFrame(
    dict(
        created=pd.Series(
            [
                "2022-01-07",
                "2022-01-02",
                "2022-01-05",
                "2022-01-02",
                "2022-01-05",
                "2022-01-05",
            ],
            dtype='datetime64[ns]'
        ),
        customer_id = ['user1', 'user2', 'user5', 'user4', 'user5', 'user6']
    )
)

    created     customer_id
0   2022-01-07  user1
1   2022-01-02  user2
2   2022-01-05  user5
3   2022-01-02  user4
4   2022-01-05  user5
5   2022-01-05  user6

I changed it a little bit to make it more illustrative.

Now as far as I understood, you want to count how many unique customer_ids exist for each date. This can be done with groupby and nunique.

customers_created = created_table.groupby('created')['customer_id'].nunique()

created     customer_id
2022-01-02  2
2022-01-05  2
2022-01-07  1

Now you probably want to join this result with a series of consecutive dates. First, let's create an index with such dates:

dates = pd.date_range(start="2022-01-01", end="2022-01-10", name="date")

Now let's reindex our series customers_created with this new index:

(
    customers_created.reindex(dates, fill_value=0)
    .to_frame()
    .reset_index()
    .rename(columns={"customer_id": "new_users"})
)

    date        new_users
0   2022-01-01  0
1   2022-01-02  2
2   2022-01-03  0
3   2022-01-04  0
4   2022-01-05  2
5   2022-01-06  0
6   2022-01-07  1
7   2022-01-08  0
8   2022-01-09  0
9   2022-01-10  0


Depending on whether you need a Series or a DataFrame, you may drop the last part (i.e. .to_frame(), etc.).

This is probably what you were looking for. In your question, this table was merged with the original table, but I don't think that makes much sense, as there is no relation between the rows of the initial table and the new table.
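That said, if you do want the per-row new_users column shown in the question's expected output, a left merge of the counts back onto the original table (matching date against created) reproduces it. This is a sketch using the question's own example rows:

```python
import pandas as pd

# The original table from the question; customer_id values are taken from its example.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2022-01-01", "2022-01-02", "2022-01-03",
        "2022-01-04", "2022-01-05", "2022-01-06",
    ]),
    "created": pd.to_datetime([
        "2021-05-07", "2022-01-02", "2021-02-02",
        "2022-01-05", "2022-01-05", "2022-01-08",
    ]),
    "customer_id": ["user1", "user2", "user3", "user4", "user5", "user6"],
})

# Count distinct accounts created per creation date.
new_users = df.groupby("created")["customer_id"].nunique().rename("new_users")

# Attach the count to each row by matching `date` against the `created` index;
# dates with no matching creations get 0.
result = (
    df.merge(new_users, left_on="date", right_index=True, how="left")
      .fillna({"new_users": 0})
      .astype({"new_users": int})
)
print(result["new_users"].tolist())  # [0, 1, 0, 0, 2, 0]
```

This matches the expected column from the question (0, 1, 0, 0, 2, 0), since only 2022-01-02 (one creation) and 2022-01-05 (two creations) appear in both columns.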

It is also possible that all values in customer_id are guaranteed to be unique. Then you can replace created_table.groupby('created')['customer_id'].nunique() with created_table['created'].value_counts().
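As a quick illustration (a sketch using this answer's own example data, in which user5 deliberately appears twice), the two calls agree only when the IDs are unique per date:

```python
import pandas as pd

# Same example data as above; user5 repeats on 2022-01-05,
# so customer_id is NOT unique here.
table = pd.DataFrame({
    "created": pd.to_datetime([
        "2022-01-07", "2022-01-02", "2022-01-05",
        "2022-01-02", "2022-01-05", "2022-01-05",
    ]),
    "customer_id": ["user1", "user2", "user5", "user4", "user5", "user6"],
})

per_unique = table.groupby("created")["customer_id"].nunique()
per_row = table["created"].value_counts().sort_index()

# nunique counts user5 once on 2022-01-05, value_counts counts every row:
print(per_unique.tolist())  # [2, 2, 1]
print(per_row.tolist())     # [2, 3, 1]
```

So value_counts() is the cheaper choice only when each customer_id can occur at most once.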
