Pivot a pandas DataFrame from a single row to one row per id_prop

Posted on 2025-02-01 22:30:56

I recently applied a transformation to unnest a nested JSON so that I would have a flat dataset to work with. The transformation works, but the final format is not the one I am looking for: it compressed all the data into a single row and added numeric suffixes to the column names, instead of producing one row per id_prop.

My dataset in JSON format to replicate with Pandas:

import pandas as pd
json = {"id_prop.0":{"0":1},"id_prop.1":{"0":2},"id_prop.2":{"0":3},"prop_number.0":{"0":123},"prop_number.1":{"0":325},"prop_number.2":{"0":754},"prop_value.0":{"0":1},"prop_value.1":{"0":1},"prop_value.2":{"0":1}}
df = pd.DataFrame.from_dict(json, orient='columns')

My result:

   id_prop.0  id_prop.1  id_prop.2  prop_number.0  prop_number.1  prop_number.2  prop_value.0  prop_value.1  prop_value.2
0          1          2          3            123            325            754             1             1             1

The result I expect:

  id_prop prop_number prop_value
0       1         123          1
1       2         325          1
2       3         754          1

Is there any way to pivot the dataframe into the format I need, where each row represents the values of a single id_prop?

Attempts

I have already extracted the names of the columns I need without suffixes:

def extract_cols(columns):
    myset = set()
    myset_add = myset.add
    return [x for x in columns if not (x in myset or myset_add(x))]

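# raw string avoids an invalid-escape warning; "+$" also strips multi-digit suffixes such as ".10"
cols = extract_cols(df.columns.str.replace(r"\.[0-9]+$", "", regex=True))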

And also "verticalized" the results I need using stack():

df_stacked = df.stack().reset_index(level=1, drop=True)

But I haven't figured out how to combine that info yet. Any help would be highly appreciated.

Extra:

If there is also a way to do this using PySpark, even better!

Comments (2)

就是爱搞怪 2025-02-08 22:30:56

Here's a way:

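df = df.T
df.index = pd.MultiIndex.from_arrays([[x[x.find('.')+1:] for x in df.index], [x[:x.find('.')] for x in df.index]])
df = df.unstack()
# drop the leftover top column level (the original row label '0') so only the property names remain
df.columns = df.columns.get_level_values(1)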

Input:

   id_prop.0  id_prop.1  id_prop.2  prop_number.0  prop_number.1  prop_number.2  prop_value.0  prop_value.1  prop_value.2
0          1          2          3            123            325            754             1             1             1

Output:

  id_prop prop_number prop_value
0       1         123          1
1       2         325          1
2       3         754          1

Explanation:

  • transpose so we can work with the index instead of the columns
  • parse each label into the desired label (prefix) and the result number (suffix) split by the . character
  • update the df's index to be a MultiIndex with two levels: a list of result numbers and a list of desired labels
  • call unstack to pivot a level of the MultiIndex (the desired labels) to be column headings
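
The four steps above can be sketched as a small function. This is only an illustration, not the answer's exact code: the name `pivot_suffixed` is hypothetical, it splits at the last dot with `str.rpartition` (assuming the number is always the final token), and it assumes the input has a single row.

```python
import pandas as pd


def pivot_suffixed(df: pd.DataFrame) -> pd.DataFrame:
    """Turn columns like 'id_prop.0' into one row per numeric suffix.

    Hypothetical helper; assumes a single-row frame whose labels all
    end in '.<number>'.
    """
    # Step 1: transpose so the old column labels become the index.
    t = df.T
    # Step 2: split each label at its last '.' into (prefix, '.', suffix).
    parts = [label.rpartition(".") for label in t.index]
    prefixes = [p[0] for p in parts]
    suffixes = [int(p[2]) for p in parts]
    # Step 3: two-level index, result number first, desired label second.
    t.index = pd.MultiIndex.from_arrays([suffixes, prefixes])
    # Step 4: move the label level into the columns, then drop the
    # leftover top column level (the original row label).
    out = t.unstack()
    out.columns = out.columns.get_level_values(1)
    return out


data = {
    "id_prop.0": {"0": 1}, "id_prop.1": {"0": 2}, "id_prop.2": {"0": 3},
    "prop_number.0": {"0": 123}, "prop_number.1": {"0": 325}, "prop_number.2": {"0": 754},
    "prop_value.0": {"0": 1}, "prop_value.1": {"0": 1}, "prop_value.2": {"0": 1},
}
df = pd.DataFrame.from_dict(data, orient="columns")
result = pivot_suffixed(df)
print(result)
```

Converting the suffix to `int` keeps the rows in numeric order once there are ten or more items; string suffixes would sort "10" before "2".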

UPDATE: To handle labels where the result number is the second dot-separated token, with additional tokens to its right (as described in OP's comment), we can do this:

import pandas as pd
json = {
    "building.0.description.bedrooms":{"0":"qrs"}, 
    "building.1.description.bedrooms":{"0":"tuv"}, 
    "building.2.description.bedrooms":{"0":"xyz"}, 

    "id_prop.0":{"0":1},"id_prop.1":{"0":2},"id_prop.2":{"0":3},
    "prop_number.0":{"0":123},"prop_number.1":{"0":325},"prop_number.2":{"0":754},
    "prop_value.0":{"0":1},"prop_value.1":{"0":1},"prop_value.2":{"0":1}}
df = pd.DataFrame.from_dict(json, orient='columns')
print(df.to_string())

df = df.T
df.index = pd.MultiIndex.from_arrays([[x.split('.')[1] for x in df.index], ['.'.join(x.split('.')[0:1] + x.split('.')[2:]) for x in df.index]])
df = df.unstack()
df.columns = df.columns.get_level_values(1)
print(df)

Input:

  building.0.description.bedrooms building.1.description.bedrooms building.2.description.bedrooms  id_prop.0  id_prop.1  id_prop.2  prop_number.0  prop_number.1  prop_number.2  prop_value.0  prop_value.1  prop_value.2
0                             qrs                             tuv                             xyz          1          2          3            123            325            754             1             1             1

Output:

  building.description.bedrooms id_prop prop_number prop_value
0                           qrs       1         123          1
1                           tuv       2         325          1
2                           xyz       3         754          1
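
A possible generalization, sketched under assumptions: rather than hard-coding the number as the second token, the hypothetical helper `split_label` below looks for the first purely numeric token wherever it sits in the label. It assumes every label contains at least one such token. Because transposing a frame with mixed string and integer columns yields object dtype, the sketch also calls `infer_objects()` at the end to recover the integer columns.

```python
import pandas as pd


def split_label(label: str) -> tuple:
    """Return (result number, label with that number removed).

    Hypothetical helper; assumes the label has at least one purely
    numeric dot-separated token and treats the first one as the number.
    """
    tokens = label.split(".")
    pos = next(i for i, tok in enumerate(tokens) if tok.isdigit())
    return int(tokens[pos]), ".".join(tokens[:pos] + tokens[pos + 1:])


data = {
    "building.0.description.bedrooms": {"0": "qrs"},
    "building.1.description.bedrooms": {"0": "tuv"},
    "building.2.description.bedrooms": {"0": "xyz"},
    "id_prop.0": {"0": 1}, "id_prop.1": {"0": 2}, "id_prop.2": {"0": 3},
    "prop_number.0": {"0": 123}, "prop_number.1": {"0": 325}, "prop_number.2": {"0": 754},
    "prop_value.0": {"0": 1}, "prop_value.1": {"0": 1}, "prop_value.2": {"0": 1},
}
df = pd.DataFrame.from_dict(data, orient="columns")

t = df.T
# Index level 0 is the result number, level 1 the cleaned label.
t.index = pd.MultiIndex.from_tuples([split_label(x) for x in t.index])
out = t.unstack()
out.columns = out.columns.get_level_values(1)
# The transpose left every column as object dtype; restore numeric dtypes.
out = out.infer_objects()
print(out)
```

Skipping `infer_objects()` still gives the right values, but `id_prop`, `prop_number` and `prop_value` would remain object columns, which can slow down or break later numeric operations.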

叹倦 2025-02-08 22:30:56

You can split the column labels into a MultiIndex and then stack it:

df.set_axis(
    pd.MultiIndex.from_tuples([tuple(i) for i in df.columns.str.split(".")]), axis=1
).stack().droplevel(0)
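
As a rough sketch of how this expression could be used on the question's data: the version below assigns the result to a variable and converts the index from the strings '0', '1', '2' to integers. The names `wide` and `tall` are illustrative, and the approach assumes every column label contains exactly one dot, so that all tuples have the same length.

```python
import pandas as pd

data = {
    "id_prop.0": {"0": 1}, "id_prop.1": {"0": 2}, "id_prop.2": {"0": 3},
    "prop_number.0": {"0": 123}, "prop_number.1": {"0": 325}, "prop_number.2": {"0": 754},
    "prop_value.0": {"0": 1}, "prop_value.1": {"0": 1}, "prop_value.2": {"0": 1},
}
df = pd.DataFrame.from_dict(data, orient="columns")

# 'id_prop.0' -> ('id_prop', '0'): the columns become a two-level MultiIndex.
wide = df.set_axis(
    pd.MultiIndex.from_tuples([tuple(c.split(".")) for c in df.columns]), axis=1
)
# stack() moves the innermost column level (the number) into the row index;
# droplevel(0) then discards the original row label.
tall = wide.stack().droplevel(0)
# The numbers came from label text, so they are strings; make them integers.
tall.index = tall.index.astype(int)
tall = tall.sort_index()
print(tall)
```

Depending on the installed pandas version, `stack()` may emit a FutureWarning about its new implementation; for this data both implementations should give the same frame.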