Break out a DataFrame column into multiple columns using apply
I am attempting to break out a DataFrame column into several varying ones based upon a function that parses the original column contents. They contain something that my function can turn into a dataframe with varying column names. All columns need to be added to the end of the existing dataframe. The columns should not duplicate in name. The below is a simplified version of what I'm trying to do. It errors out.
EDIT: One point of clarification, please disregard the fact that I have used a dict to form sub_transaction. The sub_transaction column in actuality has a lengthy XML in it that is turned into a DataFrame by parse_subtransaction. The simpler dict was just for example purposes. The important point is that a function must be used to parse it and that function returns a DataFrame.
original dataframe
transaction_id  sub_transaction
abc1            {'id': 'abc1x', 'total': 10}
abc2            {'id': 'abc2x', 'total': 20}
abc3            {'id': 'abc3x', 'total': 30}
abc4            {}
abc5            {'id': 'abc5x'}
desired dataframe outcome
transaction_id  sub_transaction_id  total
abc1            abc1x               10
abc2            abc2x               20
abc3            abc3x               30
import pandas as pd


def parse_subtransaction(sub_transaction):
    return pd.DataFrame({
        'sub_transaction_id': [sub_transaction.get('id')],
        'total': [sub_transaction.get('total')]})


def main():
    df = pd.DataFrame({
        'transaction_id': ['abc1', 'abc2', 'abc3', 'abc4', 'abc5'],
        'sub_transaction': [
            {'id': 'abc1x', 'total': 10},
            {'id': 'abc2x', 'total': 20},
            {'id': 'abc3x', 'total': 30},
            {},
            {'id': 'abc5x'}]
    })

    applied_df = df.apply(
        lambda row: parse_subtransaction(row['sub_transaction']),
        axis='columns',
        result_type='expand')
    # ERROR: ValueError: If using all scalar values, you must pass an index


if __name__ == "__main__":
    main()
Comments (5)
You could accomplish the same using:
One option is the pandas string accessor's get (Series.str.get).
The apply returns your function's result per row, which I suspect is not what you want; you probably want a single DataFrame containing your extracts.
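The snippet for this answer did not survive on this page, so here is a minimal sketch of what a Series.str.get approach could look like. It assumes the cells are plain dicts as in the simplified example (it would not apply to the real XML case), and the final dropna() is my own addition to match the desired outcome shown in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'transaction_id': ['abc1', 'abc2', 'abc3', 'abc4', 'abc5'],
    'sub_transaction': [
        {'id': 'abc1x', 'total': 10},
        {'id': 'abc2x', 'total': 20},
        {'id': 'abc3x', 'total': 30},
        {},
        {'id': 'abc5x'}],
})

# Series.str.get looks up a key in each dict; missing keys come back as NA.
result = df[['transaction_id']].assign(
    sub_transaction_id=df['sub_transaction'].str.get('id'),
    total=df['sub_transaction'].str.get('total'),
)

# Assumption: rows with any missing value are unwanted (abc4 and abc5 here).
result = result.dropna()
print(result)
```

This builds one DataFrame directly instead of one small DataFrame per row, which is the point the answer makes about apply.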
To use your own style:
parse_subtransaction should return a dict or Series, not a DataFrame.* Then to rejoin, we can use a variation of joris's solution:
* Although I'm not sure exactly why. I looked at the docs and the type annotation, but couldn't find anything that specified the func parameter's return type precisely.
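The code for this answer is missing from this page. As a rough sketch of the idea, assuming parse_subtransaction is rewritten to return a Series and the rejoin is done with pd.concat (my guess at the kind of variation meant, not joris's actual code):

```python
import pandas as pd


def parse_subtransaction(sub_transaction):
    # Return a Series (one row's worth of values) instead of a DataFrame.
    return pd.Series({
        'sub_transaction_id': sub_transaction.get('id'),
        'total': sub_transaction.get('total'),
    })


df = pd.DataFrame({
    'transaction_id': ['abc1', 'abc2', 'abc3', 'abc4', 'abc5'],
    'sub_transaction': [
        {'id': 'abc1x', 'total': 10},
        {'id': 'abc2x', 'total': 20},
        {'id': 'abc3x', 'total': 30},
        {},
        {'id': 'abc5x'}],
})

# With a Series returned per row, apply expands the Series index into columns.
parsed = df.apply(
    lambda row: parse_subtransaction(row['sub_transaction']),
    axis='columns',
    result_type='expand')

# Rejoin: drop the raw column and put the parsed columns at the end.
result = pd.concat([df.drop(columns='sub_transaction'), parsed], axis=1)
print(result)
```

Rows abc4 and abc5 are kept here with missing values; add .dropna() if they should be excluded as in the desired outcome.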
I got answers that catered to the dict scenario, but I really just needed people to assume that the function in the apply always returned a DataFrame as the starting point. In reality, I'm parsing an XML. Here is the solution that ultimately worked:
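The poster's final code is not preserved on this page. The following is only a sketch of one approach that fits the stated constraint (the parser always returns a DataFrame), not necessarily what the poster used. It assumes each call returns exactly one row; the dict-based parse_subtransaction is the stand-in from the question, not the real XML parser.

```python
import pandas as pd


def parse_subtransaction(sub_transaction):
    # Stand-in for the real XML parser: always returns a one-row DataFrame.
    return pd.DataFrame({
        'sub_transaction_id': [sub_transaction.get('id')],
        'total': [sub_transaction.get('total')]})


df = pd.DataFrame({
    'transaction_id': ['abc1', 'abc2', 'abc3', 'abc4', 'abc5'],
    'sub_transaction': [
        {'id': 'abc1x', 'total': 10},
        {'id': 'abc2x', 'total': 20},
        {'id': 'abc3x', 'total': 30},
        {},
        {'id': 'abc5x'}],
})

# Parse each cell outside of apply, then stack the one-row frames.
frames = [parse_subtransaction(cell) for cell in df['sub_transaction']]
parsed = pd.concat(frames, ignore_index=True)

# Assumption: one parsed row per input row, so df's index can be reused.
parsed.index = df.index

# Append the parsed columns and drop rows with missing values.
result = df.drop(columns='sub_transaction').join(parsed).dropna()
print(result)
```

Some pandas versions may emit a FutureWarning from concat because the frame parsed from the empty dict is all-NA. If a parser can return several rows per transaction, passing keys=df.index to pd.concat and joining on that level would be one way to adapt this.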