How to identify unmatched row values between two DataFrames in PySpark

Posted on 2025-01-12 10:26:56


I have the two DataFrames below, and I am trying to identify the unmatched row values in the second DataFrame. This is part of a migration where I want to see the differences after the source data has been migrated/moved to a different destination.

source_df
+---+-----+-----+
|key|val11|val12|
+---+-----+-----+
|abc|  1.1|  1.2|
|def|  3.0|  3.4|
+---+-----+-----+

dest_df
+---+-----+-----+
|key|val11|val12|
+---+-----+-----+
|abc|  2.1|  2.2|
|def|  3.0|  3.4|
+---+-----+-----+

I want to see output something like the following:

key: abc,

col:          val11                  val12

difference:  [src:1.1,dst:2.1]       [src:1.2,dst:2.2]

Any solution for this?


Comments (1)

灵芸 2025-01-19 10:26:56
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

source_df = spark.createDataFrame(
    [
        ('abc', '1.1', '1.2'),
        ('def', '3.0', '3.4'),
    ], ['key', 'val11', 'val12']
)

dest_df = spark.createDataFrame(
    [
        ('abc', '2.1', '2.2'),
        ('def', '3.0', '3.4'),
    ], ['key', 'val11', 'val12']
)

# Full outer join on the key, keep only the rows where at least one value
# differs, and format each difference as [src:<source>,dst:<destination>].
report = source_df \
    .join(dest_df, 'key', 'full') \
    .filter((source_df.val11 != dest_df.val11) | (source_df.val12 != dest_df.val12)) \
    .withColumn('difference_val11', F.concat(F.lit('[src:'), source_df.val11, F.lit(',dst:'), dest_df.val11, F.lit(']'))) \
    .withColumn('difference_val12', F.concat(F.lit('[src:'), source_df.val12, F.lit(',dst:'), dest_df.val12, F.lit(']'))) \
    .select('key', 'difference_val11', 'difference_val12')

report.show()

+---+-----------------+-----------------+
|key| difference_val11| difference_val12|
+---+-----------------+-----------------+
|abc|[src:1.1,dst:2.1]|[src:1.2,dst:2.2]|
+---+-----------------+-----------------+
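
One caveat with this approach: after a full outer join, a key that exists on only one side produces NULLs, and a plain != comparison against NULL evaluates to NULL, so the filter silently drops those rows. Below is a minimal null-safe sketch using Column.eqNullSafe (a standard PySpark method); the report_null_safe name and the 'NULL' placeholder text are introduced here for illustration:

# Null-safe variant: ~eqNullSafe counts NULL vs. non-NULL as a difference,
# so keys present on only one side of the full join are also reported, and
# coalesce keeps concat from collapsing to NULL for those rows.
report_null_safe = source_df \
    .join(dest_df, 'key', 'full') \
    .filter(
        ~source_df.val11.eqNullSafe(dest_df.val11) |
        ~source_df.val12.eqNullSafe(dest_df.val12)
    ) \
    .withColumn(
        'difference_val11',
        F.concat(F.lit('[src:'), F.coalesce(source_df.val11, F.lit('NULL')),
                 F.lit(',dst:'), F.coalesce(dest_df.val11, F.lit('NULL')), F.lit(']'))
    ) \
    .withColumn(
        'difference_val12',
        F.concat(F.lit('[src:'), F.coalesce(source_df.val12, F.lit('NULL')),
                 F.lit(',dst:'), F.coalesce(dest_df.val12, F.lit('NULL')), F.lit(']'))
    ) \
    .select('key', 'difference_val11', 'difference_val12')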

Or, if you want exactly that format:

# Collect the mismatch rows to the driver and print them in the requested layout.
for x in report.collect():
    print("key: " + str(x[0]) + ",\n\n" +
          "col:          val11                 val12\n\n" +
          "difference:   " + str(x[1]) + "     " + str(x[2]))

Output:

key: abc,

col:          val11                 val12

difference:   [src:1.1,dst:2.1]     [src:1.2,dst:2.2]
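
If the real tables have many value columns, hard-coding each comparison gets tedious. Here is a sketch that builds the same kind of report dynamically for every non-key column, assuming both DataFrames share the same schema; value_cols, src, dst, mismatch, and report_all are names introduced here for illustration:

from functools import reduce

# Alias the two sides so columns can be referenced unambiguously after the join.
src = source_df.alias('src')
dst = dest_df.alias('dst')

value_cols = [c for c in source_df.columns if c != 'key']

# A row is a mismatch if any value column differs (null-safe).
mismatch = reduce(
    lambda a, b: a | b,
    [~F.col(f'src.{c}').eqNullSafe(F.col(f'dst.{c}')) for c in value_cols]
)

report_all = src.join(dst, 'key', 'full') \
    .filter(mismatch) \
    .select(
        'key',
        *[F.concat(F.lit('[src:'), F.col(f'src.{c}'),
                   F.lit(',dst:'), F.col(f'dst.{c}'), F.lit(']')).alias(f'difference_{c}')
          for c in value_cols]
    )

report_all.show(truncate=False)

As in the null-safe sketch above, F.concat returns NULL if either side of a comparison is missing, so for one-sided keys you may want to wrap each value in F.coalesce.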