Python Pandas: Resample based on only one of the columns

Posted on 2025-01-20 06:45:48

I have the following data, which I'm resampling to find out how many bikes arrive at each station every 15 minutes. However, my code is aggregating my stations too, and I only want to aggregate the variable "dtm_end_trip".

Sample data:

id_trip	dtm_start_trip	dtm_end_trip	start_station	end_station
1	2018-10-01 10:15:00	2018-10-01 10:17:00	A	B
2	2018-10-01 10:17:00	2018-10-01 10:18:00	B	A
...	...	...	...	...
999999	2021-12-31 23:58:00	2022-01-01 00:22:00	C	A
1000000	2021-12-31 23:59:00	2022-01-01 00:29:00	A	D

Trial code:

df2 = df.groupby(['end_station', 'dtm_end_trip']).size().to_frame(name='count').reset_index()
df2 = df2.sort_values(by='count', ascending=False)

df2 = df2.set_index('dtm_end_trip')

df2 = df2.resample('15T').count()

Output I get:

dtm_end_trip	end_station	count
2018-10-01 00:15:00	2	2
2018-10-01 00:30:00	0	0
2018-10-01 00:45:00	1	1
2018-10-01 01:00:00	2	2
2018-10-01 01:15:00	1	1

Desired output:

dtm_end_trip	end_station	count
2018-10-01 00:15:00	A	2
2018-10-01 00:15:00	B	0
2018-10-01 00:15:00	C	1
2018-10-01 00:15:00	D	2
2018-10-01 00:30:00	A	3
2018-10-01 00:30:00	B	2

The count values in the table above are random numbers, used only to illustrate the structure of the desired output.

自控 2025-01-27 06:45:48

You can use pd.Grouper like this:

out = df.groupby([
    pd.Grouper(freq='15min', key='dtm_end_trip'),
    'end_station',
]).size()

>>> out
dtm_end_trip         end_station
2018-10-01 10:15:00  A              1
                     B              1
2022-01-01 00:15:00  A              1
                     D              1
dtype: int64
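
For anyone who wants to run this end to end, here is a minimal sketch that rebuilds only the four rows visible in the sample data. It assumes the timestamps may arrive as strings, so it converts dtm_end_trip with pd.to_datetime first, because pd.Grouper with freq needs a datetime-like column.

```python
import pandas as pd

# Only the four rows visible in the question's sample data.
df = pd.DataFrame({
    "id_trip": [1, 2, 999999, 1000000],
    "dtm_start_trip": ["2018-10-01 10:15:00", "2018-10-01 10:17:00",
                       "2021-12-31 23:58:00", "2021-12-31 23:59:00"],
    "dtm_end_trip": ["2018-10-01 10:17:00", "2018-10-01 10:18:00",
                     "2022-01-01 00:22:00", "2022-01-01 00:29:00"],
    "start_station": ["A", "B", "C", "A"],
    "end_station": ["B", "A", "A", "D"],
})

# Assumption: the column may have been read as strings.
# pd.Grouper(freq=...) requires a datetime64 column, so convert it.
df["dtm_end_trip"] = pd.to_datetime(df["dtm_end_trip"])

# Bin the arrival time into 15-minute intervals and keep end_station
# as a separate grouping key, so stations are not collapsed together.
out = df.groupby([
    pd.Grouper(freq="15min", key="dtm_end_trip"),
    "end_station",
]).size()

print(out)
```

The key difference from the trial code is that end_station stays a grouping key next to the 15-minute bins, instead of being counted as an ordinary column by resample.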

The result is a Series, but you can easily convert it to a DataFrame with the same headings as per your desired output:

>>> out.to_frame('count').reset_index()
         dtm_end_trip end_station  count
0 2018-10-01 10:15:00           A      1
1 2018-10-01 10:15:00           B      1
2 2022-01-01 00:15:00           A      1
3 2022-01-01 00:15:00           D      1

Note: this is the result from the four rows in your sample input data.
