Better/efficient ways to filter out Spark dataframe rows with multiple conditions

Posted on 2025-01-26 18:36:37

I have a dataframe that looks like this:

  id          pub_date   version         unique_id     c_id    p_id    type      source
lni001        20220301      1           64WP-UI-POLI    002     P02    org      internet
lni001        20220301      1           64WP-UI-POLI    002     P02    org      internet
lni001        20220301      1           64WP-UI-POLI    002     P02    org      internet
lni001        20220301      2           64WP-UI-CFGT    012     K21   location  internet
lni001        20220301      2           64WP-UI-CFGT    012     K21   location  internet
lni001        20220301      3           64WP-UI-CFGT    012     K21   location  internet
lni001        20220301      3           64WP-UI-POLI    002     P02    org      internet
lni001        20220301      85          64WP-UI-POLI    002     P02    org      internet
lni001        20220301      85          64WP-UI-POLI    002     P02    org      internet
lni001        20220301      5           64WP-UI-CFGT    012     K21   location  internet
lni002        20220301      1           64WP-UI-CFGT    012     K21   location  internet
 ::
 ::

I want to group by the id column and keep only the rows with the highest value in the version column. The catch is that I also need to take the type column into account (it has only two values, org or location).
The final dataframe should look like this:

  id          pub_date   version         unique_id     c_id    p_id    type      source
lni001        20220301      85          64WP-UI-POLI    002     P02    org      internet
lni001        20220301      85          64WP-UI-POLI    002     P02    org      internet
lni001        20220301      5           64WP-UI-CFGT    012     K21   location  internet
lni002        20220301      14          64WP-UI-CFGT    012     K21   location  internet
 ::
 ::

My current approach is to split the dataframe in two by the type column, one for org and one for location. I then apply groupby and withColumn to each, but my dataframe is huge. Is there a more efficient way to do this, perhaps in a single expression, rather than splitting into two dataframes and merging them back together?

Thanks!

白衬杉格子梦 2025-02-02 18:36:37

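dense_rank() can be used to rank versions within each id and type group. Filtering on rank 1 then keeps only the top-ranked rows in each group, ties included.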

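import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.dense_rank
// Assumes spark.implicits._ is in scope for the $"col" syntax.
input.withColumn("rank", dense_rank() over (Window.partitionBy($"id",$"type").orderBy($"version".desc)))
  .filter($"rank" === 1)
  .drop($"rank")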

Output:

+------+--------+-------+------------+---+----+--------+--------+
|id    |pub_date|version|unique_id   |_id|p_id|type    |source  |
+------+--------+-------+------------+---+----+--------+--------+
|lni001|20220301|5      |64WP-UI-CFGT|012|K21 |location|internet|
|lni001|20220301|85     |64WP-UI-POLI|002|P02 |org     |internet|
|lni001|20220301|85     |64WP-UI-POLI|002|P02 |org     |internet|
|lni002|20220301|1      |64WP-UI-CFGT|012|K21 |location|internet|
+------+--------+-------+------------+---+----+--------+--------+
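
As a possible variant of the dense_rank() filter above, the same rows can be selected by computing the maximum version per (id, type) with a window and keeping the rows that equal it. The sketch below is illustrative only: the local SparkSession, the names `byIdAndType` and `max_version`, and the trimmed set of columns are assumptions, and version is assumed to be numeric. If version is stored as a string it would need a cast first, since "5" sorts after "85" lexicographically. Whether this is faster than dense_rank() on a large dataframe is not established here and should be measured.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max}

// Assumption: a local session for illustration; in spark-shell,
// getOrCreate() simply returns the session that already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("top-version-per-group")
  .getOrCreate()
import spark.implicits._

// Sample rows copied from the question, reduced to the columns that matter here.
val input = Seq(
  ("lni001", 1, "64WP-UI-POLI", "org"),
  ("lni001", 1, "64WP-UI-POLI", "org"),
  ("lni001", 1, "64WP-UI-POLI", "org"),
  ("lni001", 2, "64WP-UI-CFGT", "location"),
  ("lni001", 2, "64WP-UI-CFGT", "location"),
  ("lni001", 3, "64WP-UI-CFGT", "location"),
  ("lni001", 3, "64WP-UI-POLI", "org"),
  ("lni001", 85, "64WP-UI-POLI", "org"),
  ("lni001", 85, "64WP-UI-POLI", "org"),
  ("lni001", 5, "64WP-UI-CFGT", "location"),
  ("lni002", 1, "64WP-UI-CFGT", "location")
).toDF("id", "version", "unique_id", "type")

// One window per (id, type) group; no ordering is needed to compute a max.
val byIdAndType = Window.partitionBy(col("id"), col("type"))

val result = input
  .withColumn("max_version", max(col("version")).over(byIdAndType)) // group max on every row
  .filter(col("version") === col("max_version"))                    // keep rows at the max, ties included
  .drop("max_version")

result.show(false)
```

On the sample rows this keeps both version 85 org rows and the version 5 location row for lni001, plus the single lni002 row, without splitting the dataframe by type.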
