Break out a DataFrame column into multiple columns using apply

Posted on 2025-01-29 22:46:17 · Characters: 1580 · Views: 3 · Comments: 0

I am attempting to break out a DataFrame column into several varying ones based upon a function that parses the original column contents. They contain something that my function can turn into a dataframe with varying column names. All columns need to be added to the end of the existing dataframe. The columns should not duplicate in name. The below is a simplified version of what I'm trying to do. It errors out.

EDIT: One point of clarification, please disregard the fact that I have used a dict to form sub_transaction. The sub_transaction column in actuality has a lengthy XML in it that is turned into a DataFrame by parse_subtransaction. The simpler dict was just for example purposes. The important point is that a function must be used to parse it and that function returns a DataFrame.

original dataframe

transaction_id               sub_transaction
          abc1  {'id': 'abc1x', 'total': 10}
          abc2  {'id': 'abc2x', 'total': 20}
          abc3  {'id': 'abc3x', 'total': 30}
          abc4                            {}
          abc5               {'id': 'abc5x'}

desired dataframe outcome

transaction_id  sub_transaction_id  total
abc1                         abc1x     10
abc2                         abc2x     20
abc3                         abc3x     30

import pandas as pd

def parse_subtransaction(sub_transaction):
    return pd.DataFrame({
        'sub_transaction_id': [sub_transaction.get('id')],
        'total': [sub_transaction.get('total')]})

def main():
    df = pd.DataFrame({
        'transaction_id': ['abc1', 'abc2', 'abc3','abc4','abc5'],
        'sub_transaction': [
            {'id': 'abc1x', 'total': 10},
            {'id': 'abc2x', 'total': 20},
            {'id': 'abc3x', 'total': 30},
            {},
            {'id':'abc5x'}]
        })

    applied_df = df.apply(
        lambda row: parse_subtransaction(row['sub_transaction']),
        axis='columns',
        result_type='expand')

# ERROR: ValueError: If using all scalar values, you must pass an index

if (__name__ == "__main__"):
    main()

Comments (5)

摘星┃星的人 2025-02-05 22:46:17

You could accomplish the same using:

df.join(pd.DataFrame(df.sub_transaction.tolist()))
 
  transaction_id               sub_transaction     id  total
0           abc1  {'id': 'abc1x', 'total': 10}  abc1x   10.0
1           abc2  {'id': 'abc2x', 'total': 20}  abc2x   20.0
2           abc3  {'id': 'abc3x', 'total': 30}  abc3x   30.0
3           abc4                            {}    NaN    NaN
4           abc5               {'id': 'abc5x'}  abc5x    NaN
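
The joined output above still carries the original sub_transaction column and names the new column id, while the desired outcome in the question uses sub_transaction_id. A minimal sketch of closing that gap, assuming the column really holds plain dicts as in the simplified example (the names expanded and result are illustrative, not from the original answer):

```python
import pandas as pd

df = pd.DataFrame({
    'transaction_id': ['abc1', 'abc2', 'abc3', 'abc4', 'abc5'],
    'sub_transaction': [
        {'id': 'abc1x', 'total': 10},
        {'id': 'abc2x', 'total': 20},
        {'id': 'abc3x', 'total': 30},
        {},
        {'id': 'abc5x'}],
})

# Build the new columns from the list of dicts, reusing df's index so the
# join lines up even if df does not have a default 0..n-1 index.
expanded = pd.DataFrame(df['sub_transaction'].tolist(), index=df.index)

# Rename to the column name asked for in the question.
expanded = expanded.rename(columns={'id': 'sub_transaction_id'})

# Drop the raw column and append the parsed columns at the end.
result = df.drop(columns='sub_transaction').join(expanded)
print(result)
```

Rows whose dict lacks a key end up with a missing value in that column, which is why total comes out as a float column.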

祁梦 2025-02-05 22:46:17

One option is with pandas string get:

df.assign(sub_transaction_id = df.sub_transaction.str.get('id'), 
          total = df.sub_transaction.str.get('total'))

  transaction_id               sub_transaction sub_transaction_id  total
0           abc1  {'id': 'abc1x', 'total': 10}              abc1x   10.0
1           abc2  {'id': 'abc2x', 'total': 20}              abc2x   20.0
2           abc3  {'id': 'abc3x', 'total': 30}              abc3x   30.0
3           abc4                            {}               None    NaN
4           abc5               {'id': 'abc5x'}              abc5x    NaN

The apply returns your function's result separately for each row, which I suspect is not what you want; you probably want a single DataFrame containing your extracts.

毁虫ゝ 2025-02-05 22:46:17

To use your own style:

import pandas as pd

def parse_subtransaction(sub_transaction):
    return ({'sub_transaction_id': sub_transaction.get('id'), 'total': sub_transaction.get('total')})
   

def main():
    df = pd.DataFrame({'transaction_id': ['abc1', 'abc2', 'abc3','abc4','abc5'],
                       'sub_transaction': [{'id': 'abc1x', 'total': 10}, {'id': 'abc2x', 'total': 20},
                                           {'id': 'abc3x', 'total': 30},{},{'id':'abc5x'}]})

    applied_df = df.apply(lambda row: parse_subtransaction(row['sub_transaction']), axis='columns', result_type='expand')
    final_df = pd.concat([df.iloc[: , :-1], applied_df], axis=1)
    print(final_df)
    
main()

橙幽之幻 2025-02-05 22:46:17

parse_subtransaction should return a dict or Series, not a DataFrame.*

def parse_subtransaction(sub_transaction):
    return {
        'sub_transaction_id': sub_transaction.get('id'),
        'total': sub_transaction.get('total')}

Then to rejoin, we can use a variation of joris's solution:

pd.concat([df.drop(columns='sub_transaction'), applied_df], axis=1)

* Although I'm not sure why exactly. I looked at the docs and type annotation, but couldn't find anything that specified the func parameter's return type precisely.
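
One plausible reading, offered as an assumption rather than something this thread confirms: the pandas documentation describes result_type='expand' as turning list-like results into columns, and a dict or Series is a flat, one-dimensional record per row, whereas a DataFrame is two-dimensional. The sketch below uses the Series variant with the question's sample data; result is an illustrative name.

```python
import pandas as pd

def parse_subtransaction(sub_transaction):
    # Return one flat record per row (a Series), not a DataFrame.
    return pd.Series({
        'sub_transaction_id': sub_transaction.get('id'),
        'total': sub_transaction.get('total')})

df = pd.DataFrame({
    'transaction_id': ['abc1', 'abc2', 'abc3', 'abc4', 'abc5'],
    'sub_transaction': [
        {'id': 'abc1x', 'total': 10},
        {'id': 'abc2x', 'total': 20},
        {'id': 'abc3x', 'total': 30},
        {},
        {'id': 'abc5x'}],
})

# Each row yields a Series; its labels become the new column names.
applied_df = df.apply(
    lambda row: parse_subtransaction(row['sub_transaction']),
    axis='columns',
    result_type='expand')

# Replace the raw column with the parsed columns.
result = pd.concat([df.drop(columns='sub_transaction'), applied_df], axis=1)
print(result)
```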

随心而道 2025-02-05 22:46:17

I got answers that catered to the dict scenario, but I really just needed people to assume that the function in the apply always returned a DataFrame as the starting point. In reality, I'm parsing an XML. Here is the solution that ultimately worked:

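import pandas as pd
import xmltodict  # third-party package

def parse_xml(xml):
    # Parse one XML string into a DataFrame with flattened column names.
    xml_dict = xmltodict.parse(xml)
    df = pd.json_normalize(xml_dict)
    # Strip namespace markers and punctuation from the generated column names.
    df.columns = df.columns.str.replace("ns0", "", regex=False)
    df.columns = df.columns.str.replace("@xmlns", "", regex=False)
    df.columns = df.columns.str.replace(":", "", regex=False)
    df.columns = df.columns.str.replace(".", "_", regex=False)
    df.columns = df.columns.str.rstrip('_')
    return df

def parse_xmls(df, col='h_xml'):
    # Parse the XML in column `col` of every row and merge the parsed columns onto df.
    print("Parsing XML's")
    right_df_list = []
    for index, row in df.iterrows():
        xml_df = parse_xml(row[col])  # use the `col` argument, not a hard-coded name
        xml_dict = xml_df.to_dict()   # shape: {column: {0: value}}
        right_df_list.append(xml_dict)

    right_df = pd.DataFrame.from_dict(right_df_list, orient='columns')
    # Unwrap the {0: value} dicts produced by to_dict().
    # Note: applymap is deprecated since pandas 2.1 in favor of DataFrame.map.
    right_df = right_df.applymap(lambda cell: cell[0] if isinstance(cell, dict) else cell)
    # right_df has a default 0..n-1 index, so this assumes df does too.
    df = pd.merge(df, right_df, left_index=True, right_index=True)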
    return df
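
A more compact variant of the same idea can be sketched without the XML dependency. It keeps the constraint from the question (the parsing function returns a DataFrame) and assumes each returned DataFrame has exactly one row; parse_subtransaction here is the question's dict-based stand-in for the real XML parser, and records, right_df and result are illustrative names.

```python
import pandas as pd

def parse_subtransaction(sub_transaction):
    # Stand-in parser from the question: returns a one-row DataFrame.
    return pd.DataFrame({
        'sub_transaction_id': [sub_transaction.get('id')],
        'total': [sub_transaction.get('total')]})

df = pd.DataFrame({
    'transaction_id': ['abc1', 'abc2', 'abc3', 'abc4', 'abc5'],
    'sub_transaction': [
        {'id': 'abc1x', 'total': 10},
        {'id': 'abc2x', 'total': 20},
        {'id': 'abc3x', 'total': 30},
        {},
        {'id': 'abc5x'}],
})

# Turn each one-row DataFrame into a plain {column: value} record.
records = [parse_subtransaction(value).iloc[0].to_dict()
           for value in df['sub_transaction']]

# One DataFrame for all rows; columns are the union of the record keys,
# and the index is copied from df so the join aligns row by row.
right_df = pd.DataFrame(records, index=df.index)

result = df.drop(columns='sub_transaction').join(right_df)
print(result)
```

Because the records are plain dicts, parsed frames with different column sets are aligned by name and the gaps are filled with missing values.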