Python Pandas panel data: fill missing values with information from other periods
I am working with a data set of panel data.
That is, I have observations of some units over many time periods.
For example:
import numpy as np
import pandas as pd
dates = 3 * list(pd.date_range(start='1/31/2018', end='3/31/2018', freq="M"))
unit_id = ["id_1", "id_1", "id_1", "id_2", "id_2", "id_2", "id_3", "id_3", "id_3"]
locations = ["loc_1", "loc_1", np.nan, "loc_2", "loc_2", np.nan, "loc_3", "loc_3", np.nan]
var_1 = ["x1_t1", "x1_t2", "x1_t3", "x2_t1", "x2_t2", "x2_t3", "x3_t1", "x3_t2", "x3_t3"]
var_2 = ["z1_t1", "z1_t2", "z1_t3", "z2_t1", "z2_t2", "z2_t3", "z3_t1", "z3_t2", "z3_t3"]
_ = pd.DataFrame({"date": dates, "id": unit_id, "location": locations, "var_1": var_1, "var_2": var_2})
This gives me something like this:
| | date | id | location | var_1 | var_2 |
|---|---|---|---|---|---|
| 0 | 2018-01-31 | id_1 | loc_1 | x1_t1 | z1_t1 |
| 1 | 2018-02-28 | id_1 | loc_1 | x1_t2 | z1_t2 |
| 2 | 2018-03-31 | id_1 | NaN | x1_t3 | z1_t3 |
| 3 | 2018-01-31 | id_2 | loc_2 | x2_t1 | z2_t1 |
| 4 | 2018-02-28 | id_2 | loc_2 | x2_t2 | z2_t2 |
| 5 | 2018-03-31 | id_2 | NaN | x2_t3 | z2_t3 |
| 6 | 2018-01-31 | id_3 | loc_3 | x3_t1 | z3_t1 |
| 7 | 2018-02-28 | id_3 | loc_3 | x3_t2 | z3_t2 |
| 8 | 2018-03-31 | id_3 | NaN | x3_t3 | z3_t3 |
My dataframe is not ordered like the example. It is ordered by time.
Also, the panel is unbalanced, meaning not all units show up in every period.
What I want to do is fill the NaN location values with the location recorded for the same unit (i.e. the id matches) in other periods, provided the unit appears in some other period with location information, and without changing the other variables.
Any tips?
Comments (1)
This continues your code after the DataFrame is created (it is called df in the code below).
Output:
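One possible way to do this fill, as a minimal sketch and not necessarily the code originally posted in this answer: take the first non-null location within each id group and use it only where location is NaN. The sample frame below is hypothetical (an unbalanced panel sorted by date, with a made-up id_4 that never has a location), and the names known and filled are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical unbalanced panel, ordered by date as described in the question:
# id_3 is absent in February, and id_4 never reports a location.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2018-01-31", "2018-01-31", "2018-01-31",
        "2018-02-28", "2018-02-28", "2018-02-28",
        "2018-03-31", "2018-03-31", "2018-03-31",
    ]),
    "id": ["id_1", "id_2", "id_3", "id_1", "id_2", "id_4",
           "id_1", "id_2", "id_3"],
    "location": ["loc_1", np.nan, np.nan, "loc_1", "loc_2", np.nan,
                 np.nan, "loc_2", "loc_3"],
    "var_1": ["a", "b", "c", "d", "e", "f", "g", "h", "i"],
})

# For every row, the first non-null location seen for that row's id.
# GroupBy "first" skips NaN; transform keeps the original index and row order.
known = df.groupby("id")["location"].transform("first")

# Fill only the missing locations; rows that already have a value and all
# other columns are left untouched. Units with no known location stay NaN.
filled = df.assign(location=df["location"].fillna(known))

print(filled)
```

Because fillna only touches rows where location is missing, existing values and the row order are preserved. If a unit could legitimately change location over time, "first" would copy its earliest recorded location by row order, so that case would need a different rule (for example a per-group ffill followed by bfill).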