Python pandas panel data: filling missing values with information from other periods

Posted on 2025-02-13 16:44:45

I am working with a panel data set.
That is, I have observations of some units over many time periods.

For example:

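import numpy as np
import pandas as pd

# Month-end dates; pandas 2.2+ deprecates the "M" alias in favour of "ME"
dates = 3 * list(pd.date_range(start='1/31/2018', end='3/31/2018', freq="M"))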
unit_id = ["id_1", "id_1", "id_1", "id_2", "id_2", "id_2", "id_3", "id_3", "id_3"]
locations = ["loc_1", "loc_1", np.nan, "loc_2", "loc_2", np.nan, "loc_3", "loc_3", np.nan]
var_1 = ["x1_t1", "x1_t2", "x1_t3", "x2_t1", "x2_t2", "x2_t3", "x3_t1", "x3_t2", "x3_t3"]
var_2 = ["z1_t1", "z1_t2", "z1_t3", "z2_t1", "z2_t2", "z2_t3", "z3_t1", "z3_t2", "z3_t3"]

_ = pd.DataFrame({"date": dates, "id": unit_id, "location": locations, "var_1": var_1, "var_2": var_2})

This gives me something like this:

	date	id	location	var_1	var_2
0	2018-01-31	id_1	loc_1	x1_t1	z1_t1
1	2018-02-28	id_1	loc_1	x1_t2	z1_t2
2	2018-03-31	id_1	NaN	x1_t3	z1_t3
3	2018-01-31	id_2	loc_2	x2_t1	z2_t1
4	2018-02-28	id_2	loc_2	x2_t2	z2_t2
5	2018-03-31	id_2	NaN	x2_t3	z2_t3
6	2018-01-31	id_3	loc_3	x3_t1	z3_t1
7	2018-02-28	id_3	loc_3	x3_t2	z3_t2
8	2018-03-31	id_3	NaN	x3_t3	z3_t3

My dataframe is not ordered like the example. It is ordered by time.

Also, the panel is unbalanced, meaning not all units show up in every period.

What I want to do is fill the location values that are NaN with the location of the same unit (i.e. the id matches) from other periods, provided the unit appears in some other period and has location information there. The other variables should be left untouched.

Any tips?

梦中楼上月下 2025-02-20 16:44:45

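This continues from your code, once the dataframe has been created (I call it df in the code below):

# One row per id that has a known location
id_loc_df = df[['id', 'location']].dropna().drop_duplicates().rename(columns={'location': 'location_fill'})

# how='left' keeps units that never report a location
df_updated = pd.merge(df, id_loc_df, on='id', how='left')

# Fill only the missing locations; existing values stay as they are
df_updated['location'] = np.where(df_updated['location'].isna(), df_updated['location_fill'], df_updated['location'])

# drop() returns a new frame, so assign the result
df_updated = df_updated.drop('location_fill', axis=1)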

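Output (df_updated, for the example data in the question):

	date	id	location	var_1	var_2
0	2018-01-31	id_1	loc_1	x1_t1	z1_t1
1	2018-02-28	id_1	loc_1	x1_t2	z1_t2
2	2018-03-31	id_1	loc_1	x1_t3	z1_t3
3	2018-01-31	id_2	loc_2	x2_t1	z2_t1
4	2018-02-28	id_2	loc_2	x2_t2	z2_t2
5	2018-03-31	id_2	loc_2	x2_t3	z2_t3
6	2018-01-31	id_3	loc_3	x3_t1	z3_t1
7	2018-02-28	id_3	loc_3	x3_t2	z3_t2
8	2018-03-31	id_3	loc_3	x3_t3	z3_t3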
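
If each unit has at most one location, a variant without the merge may also be worth trying. The sketch below is only an illustration under that assumption, not the approach from the answer above: it takes the first non-missing location per id with groupby and transform, then uses it to fill the gaps. The column names come from the question; the data, including the unit id_4 that never reports a location, is made up to mimic a time-ordered, unbalanced panel.

```python
import numpy as np
import pandas as pd

# Hypothetical unbalanced panel, ordered by time rather than by unit.
# id_4 never has a location, so its gap cannot be filled.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2018-01-31", "2018-01-31", "2018-02-28", "2018-02-28",
             "2018-03-31", "2018-03-31", "2018-03-31"]
        ),
        "id": ["id_1", "id_2", "id_1", "id_3", "id_1", "id_2", "id_4"],
        "location": ["loc_1", np.nan, "loc_1", "loc_3", np.nan, "loc_2", np.nan],
        "var_1": ["x1_t1", "x2_t1", "x1_t2", "x3_t2", "x1_t3", "x2_t3", "x4_t3"],
    }
)

# First non-missing location seen for each id, repeated on every row of that id.
# Assumption: a unit never changes location; if it did, this would silently
# pick whichever value appears first.
known = df.groupby("id")["location"].transform("first")

# Fill only the missing cells; row order and the other columns are untouched.
df["location"] = df["location"].fillna(known)
```

Because nothing is merged or sorted, the frame keeps its original row count and time ordering, and a unit without any location information simply stays NaN.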