Pivot pandas dataframe from a single row to one row per id_prop
I recently applied a transformation to unnest a nested JSON so that I would have a flat dataset to work with. The transformation works, but the final format is not the one I am looking for: it compressed all the data into a single row and added suffixes to the column names, instead of producing a separate row for each id_prop.
My dataset in JSON format to replicate with Pandas:
import pandas as pd
json = {"id_prop.0":{"0":1},"id_prop.1":{"0":2},"id_prop.2":{"0":3},"prop_number.0":{"0":123},"prop_number.1":{"0":325},"prop_number.2":{"0":754},"prop_value.0":{"0":1},"prop_value.1":{"0":1},"prop_value.2":{"0":1}}
df = pd.DataFrame.from_dict(json, orient='columns')
My result:
| | id_prop.0 | id_prop.1 | id_prop.2 | prop_number.0 | prop_number.1 | prop_number.2 | prop_value.0 | prop_value.1 | prop_value.2 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 123 | 325 | 754 | 1 | 1 | 1 |
The result I expect:
| | id_prop | prop_number | prop_value |
|---|---|---|---|
| 0 | 1 | 123 | 1 |
| 1 | 2 | 325 | 1 |
| 2 | 3 | 754 | 1 |
Is there any way to pivot the dataframe into the format I need, where each row represents the values of a single id_prop?
Attempts
I have already extracted the names of the columns I need without suffixes:
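def extract_cols(columns):
    myset = set()
    myset_add = myset.add
    # Keep the first occurrence of each name, preserving the original order.
    return [x for x in columns if not (x in myset or myset_add(x))]

# Strip the numeric suffix (".0", ".1", ...) from every column name.
cols = extract_cols(df.columns.str.replace(r"\.[0-9]+$", "", regex=True))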
I have also "verticalized" the results I need using stack():
df_stacked = df.stack().reset_index(level=1, drop=True)
But I haven't figured out how to combine that info yet. Any help would be highly appreciated.
Extra:
If there is also a way to do this using PySpark, even better!
Comments (2)
Here's a way:
Explanation:
- split the column labels on the `.` character
- use `unstack` to pivot a level of the MultiIndex (the desired labels) to be column headings

UPDATE: To handle labels where the result number is the second `.`-separated token, with additional tokens to its right (as described in OP's comment), we can do this:
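The code that accompanied this answer is missing here, so what follows is only a minimal sketch of the approach it describes, not the answerer's original code. It assumes every column name has the form `<label>.<number>`, optionally followed by more `.`-separated tokens. The helper `split_label` and the example label `a.0.b` are hypothetical names introduced for illustration; `data` is the question's dictionary under a name that does not shadow the `json` module.

```python
import pandas as pd

# Data from the question: a single row whose columns are named "<label>.<n>".
data = {
    "id_prop.0": {"0": 1}, "id_prop.1": {"0": 2}, "id_prop.2": {"0": 3},
    "prop_number.0": {"0": 123}, "prop_number.1": {"0": 325}, "prop_number.2": {"0": 754},
    "prop_value.0": {"0": 1}, "prop_value.1": {"0": 1}, "prop_value.2": {"0": 1},
}
df = pd.DataFrame.from_dict(data, orient="columns")


def split_label(col):
    """Return (row number, label) for a column name such as 'id_prop.0'.

    Assumption: the second '.'-separated token is the row number. Any tokens
    to its right are re-attached to the label, e.g. 'a.0.b' -> (0, 'a.b').
    """
    first, number, *rest = col.split(".")
    return int(number), ".".join([first, *rest])


# Re-index the single row by (row number, label) ...
series = pd.Series(
    df.iloc[0].to_numpy(),
    index=pd.MultiIndex.from_tuples([split_label(c) for c in df.columns]),
)
# ... and pivot the innermost level (the labels) into column headings.
result = series.unstack()
print(result)
```

With the question's data, `result` should have one row per number (0, 1, 2) and the columns `id_prop`, `prop_number` and `prop_value`. Note that `unstack` sorts the resulting column labels, so the column order may differ from the input order for other data.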
You can split the columns into a MultiIndex and then stack it:
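The code for this answer is not preserved here either. As a hedged sketch of what "split the columns and stack" could look like (an illustration under the assumption that every column name contains exactly one `.`, not the answerer's original code), the column names can be split with `str.split(..., expand=True)` and the suffix level moved into the rows with `stack`. The names `data` and `wide` are introduced for this example only.

```python
import pandas as pd

# Data from the question: a single row whose columns are named "<label>.<n>".
data = {
    "id_prop.0": {"0": 1}, "id_prop.1": {"0": 2}, "id_prop.2": {"0": 3},
    "prop_number.0": {"0": 123}, "prop_number.1": {"0": 325}, "prop_number.2": {"0": 754},
    "prop_value.0": {"0": 1}, "prop_value.1": {"0": 1}, "prop_value.2": {"0": 1},
}
df = pd.DataFrame.from_dict(data, orient="columns")

wide = df.copy()
# "id_prop.0" -> ("id_prop", "0"): level 0 holds the label, level 1 the suffix.
wide.columns = wide.columns.str.split(".", expand=True)

# Move the suffix level from the columns into the row index, then drop the
# original single-row index level so that only the suffix remains.
result = wide.stack(level=1).reset_index(level=0, drop=True)

# The suffixes are strings after the split; convert them to integers.
result.index = result.index.astype(int)
print(result)
```

Depending on the installed pandas version, `stack` may emit a `FutureWarning` about its upcoming default implementation; for this data both implementations should produce the same frame.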