Group by ID and create a column based on priority in PySpark

Posted on 2025-01-21 08:33:51


Can someone help me with the problem below?
I have an input DataFrame:

ID	process_type	STP_stagewise
1	loan_creation	Manual
1	loan_creation	NSTP
1	reimbursement	STP
2	loan_creation	STP
2	reimbursement	NSTP
3	loan_creation	Manual
3	loan_creation	STP
4	loan_creation	Manual
4	loan_creation	NSTP

Output dataframe required:

ID	process_type	STP_stagewise	STP_type
1	loan_creation	Manual	Manual
1	loan_creation	NSTP	Manual
1	reimbursement	STP	STP
2	loan_creation	STP	STP
2	reimbursement	NSTP	NSTP
3	loan_creation	Manual	Manual
3	loan_creation	STP	Manual
4	loan_creation	NSTP	NSTP
4	loan_creation	NSTP	NSTP

I need to group by ID and process_type and create a new column, STP_type, that holds the highest-priority STP_stagewise value in each group, where the priority order is Manual >> NSTP >> STP.

Can someone suggest an approach to solve this? Thanks in advance.

Edit: a slight change. Along with ID, the group by should also be done on process_type.


渔村楼浪 2025-01-28 08:33:51

One way to solve this is to aggregate by ID, collect the distinct STP_stagewise values of each group into a list, sort that list with a custom_sort_map so that the highest-priority value comes first, take the first element, and finally join the result back to your main DataFrame.

Data Preparation

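from io import StringIO

import pandas as pd
from pyspark.sql import SparkSession

# `sql` is the Spark entry point used below; a SparkSession is assumed here.
sql = SparkSession.builder.getOrCreate()

s = StringIO("""
ID  STP_stagewise
1   Manual
1   NSTP
1   STP
2   STP
2   NSTP
3   Manual
3   STP
4   Manual
4   NSTP
""")

# The columns above are separated by runs of whitespace, so split on whitespace.
df = pd.read_csv(s, sep=r'\s+')

sparkDF = sql.createDataFrame(df)

sparkDF.show()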

+---+-------------+
| ID|STP_stagewise|
+---+-------------+
|  1|       Manual|
|  1|         NSTP|
|  1|          STP|
|  2|          STP|
|  2|         NSTP|
|  3|       Manual|
|  3|          STP|
|  4|       Manual|
|  4|         NSTP|
+---+-------------+

Aggregation - Collect Set & Sort

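from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

# Lower number = higher priority (Manual >> NSTP >> STP).
custom_sort_map = {'Manual': 0, 'NSTP': 1, 'STP': 2}

# Sort each collected set by priority; a value missing from the map raises KeyError.
udf_custom_sort = F.udf(lambda x: sorted(x, key=lambda v: custom_sort_map[v]), ArrayType(StringType()))

stpAgg = (
    sparkDF.groupBy(F.col('ID'))
    .agg(F.collect_set(F.col('STP_stagewise')).alias('STP_stagewise_set'))
    .withColumn('sorted_STP_stagewise_set', udf_custom_sort('STP_stagewise_set'))
    .withColumn('STP_type', F.col('sorted_STP_stagewise_set').getItem(0))
)

stpAgg.show()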

+---+-------------------+------------------------+--------+
| ID|  STP_stagewise_set|sorted_STP_stagewise_set|STP_type|
+---+-------------------+------------------------+--------+
|  1|[STP, NSTP, Manual]|     [Manual, NSTP, STP]|  Manual|
|  3|      [STP, Manual]|           [Manual, STP]|  Manual|
|  2|        [STP, NSTP]|             [NSTP, STP]|    NSTP|
|  4|     [NSTP, Manual]|          [Manual, NSTP]|  Manual|
+---+-------------------+------------------------+--------+
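
To see what the UDF does for a single group, the same sort can be run in plain Python. This is only an illustrative sketch: `highest_priority` is a hypothetical helper name that does not appear in the answer, and it assumes every value is a key of custom_sort_map.

```python
# Lower number = higher priority (Manual >> NSTP >> STP), as in the answer.
custom_sort_map = {"Manual": 0, "NSTP": 1, "STP": 2}


def highest_priority(values):
    """Hypothetical helper: sort by priority rank and return the first element."""
    return sorted(values, key=lambda v: custom_sort_map[v])[0]


print(highest_priority(["STP", "NSTP", "Manual"]))  # Manual
print(highest_priority(["STP", "NSTP"]))            # NSTP
```

A value that is not in custom_sort_map raises a KeyError here, and the UDF would fail in the same way, so unexpected values may need to be cleaned or mapped beforehand.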

Join

# Attach the per-ID STP_type to every original row.
sparkDF = sparkDF.join(
    stpAgg,
    sparkDF['ID'] == stpAgg['ID'],
    'inner',
).select(sparkDF['*'], stpAgg['STP_type'])

sparkDF.show()

+---+-------------+--------+
| ID|STP_stagewise|STP_type|
+---+-------------+--------+
|  1|       Manual|  Manual|
|  1|         NSTP|  Manual|
|  1|          STP|  Manual|
|  3|       Manual|  Manual|
|  3|          STP|  Manual|
|  2|          STP|    NSTP|
|  2|         NSTP|    NSTP|
|  4|       Manual|  Manual|
|  4|         NSTP|  Manual|
+---+-------------+--------+
