Create a new Pyspark dataframe from substrings of a column in an existing dataframe

Posted 2025-01-12 20:20:17


I have a Pyspark dataframe as below and need to create a new dataframe with only one column, made up of all the 7-digit numbers in the original dataframe. The values are all strings. Column1 should be ignored. Ignoring non-numbers and single 7-digit numbers in Column2 is fairly straightforward, but for the values that have two separate 7-digit numbers, I'm having difficulty pulling them out individually. This needs to be automated and able to run on other similar dataframes. The numbers are always 7 digits and always begin with a '1'. Any tips?

+-----------+--------------------+
|    COLUMN1|             COLUMN2|
+-----------+--------------------+
|     Value1|           Something|
|     Value2|     1057873 1057887|
|     Value3| Something Something|
|     Value4|                null|
|     Value5|             1312039|
|     Value6|     1463451 1463485|
|     Value7|     Not In Database|
|     Value8|     1617275 1617288|
+-----------+--------------------+

The resulting dataframe should be as below:

+-------+
|Column1|
+-------+
|1057873|
|1057887|
|1312039|
|1463451|
|1463485|
|1617275|
|1617288|
+-------+
  • UPDATE:

The responses are great, but unfortunately I'm using an older version of Spark that doesn't agree. I used the below to solve the problem; it's a bit clunky, but it works.

from pyspark.sql import functions as F

# Split COLUMN2 on spaces, explode each token onto its own row,
# then keep only tokens that are exactly 7 digits starting with '1'.
new_df = df.select(df.COLUMN2)
new_df = new_df.withColumn('splits', F.split(new_df.COLUMN2, ' '))
new_df = new_df.select(F.explode(new_df.splits).alias('column1'))
new_df = new_df.filter(new_df.column1.rlike(r'^1\d{6}$'))  # anchored: rlike alone matches substrings
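One caveat with the last step: rlike, like Python's re.search, is an unanchored substring match, so a bare \d{7} would also accept tokens with longer digit runs. A quick pure-Python sketch (the sample tokens here are made up for illustration) shows the difference between the loose and anchored patterns:

```python
import re

tokens = ["Something", "1057873", "1463451234", "Not", "1617275"]

# Unanchored search: any 7 consecutive digits match, so the
# 10-digit token slips through.
loose = [t for t in tokens if re.search(r"\d{7}", t)]

# Anchored check: the whole token must be '1' followed by 6 digits.
strict = [t for t in tokens if re.fullmatch(r"1\d{6}", t)]

print(loose)   # the 10-digit token is included
print(strict)  # only the exact 7-digit values survive
```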

Comments (3)

黑凤梨 2025-01-19 20:20:18


Here is an approach with higher-order lambda functions for Spark 2.4+, wherein we split the column by space, filter for words that start with 0-9 and have length n (7), then explode:

n = 7
df.selectExpr(f"""explode(filter(split(COLUMN2,' '),x-> 
            x rlike '^[0-9]+' and length(x)={n})) as COLUMN1""").show(truncate=False)

+-------+
|COLUMN1|
+-------+
|1057873|
|1057887|
|1312039|
|1463451|
|1463485|
|1617275|
|1617288|
+-------+
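To make the SQL lambda above easier to follow, here is a pure-Python analogue of the same split-filter-explode pipeline, run over sample rows copied from the question (the helper name keep is made up for this sketch):

```python
import re

def keep(token, n=7):
    # Same predicate as the SQL lambda: starts with a digit, length is n.
    return re.match(r"[0-9]", token) is not None and len(token) == n

rows = ["Something", "1057873 1057887", "Something Something",
        None, "1312039"]

# split -> filter -> explode, expressed over plain Python lists;
# 'if row' skips the null row, as Spark's split would never see it.
result = [tok for row in rows if row
          for tok in row.split(" ") if keep(tok)]

print(result)
```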
知足的幸福 2025-01-19 20:20:18


I like @nky's answer and voted for it. An alternative can also use PySpark's exists higher-order function, available in 3.0+:

from pyspark.sql import functions as F

new = df.selectExpr("explode(split(COLUMN2,' ')) as COLUMN1").where(F.expr("exists(array(COLUMN1), element -> element rlike '([0-9]{7})')"))

new.show()

+-------+
|COLUMN1|
+-------+
|1057873|
|1057887|
|1312039|
|1463451|
|1463485|
|1617275|
|1617288|
+-------+
葬花如无物 2025-01-19 20:20:18


IIUC, you could use a regex and pandas' str.extractall (this assumes a pandas DataFrame rather than a PySpark one):

df2 = (df['COLUMN2'].str.extractall(r'(\b\d{7}\b)')[0]
      .reset_index(drop=True).to_frame(name='COLUMN1')
      )

output:

   COLUMN1
0  1057873
1  1057887
2  1312039
3  1463451
4  1463485
5  1617275
6  1617288

regex:

(      start capturing
\b     word boundary
\d{7}  7 digits       # or 1\d{6} for "1" + 6 digits
\b     word boundary
)      end capture
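The word-boundary pattern can be checked standalone with Python's re module, using cells copied from the question; note how findall pulls both numbers out of a single two-number cell:

```python
import re

cells = ["Something", "1057873 1057887", "1312039", "Not In Database"]

# \b\d{7}\b matches every standalone 7-digit run, including
# two matches from the same cell.
matches = [m for cell in cells for m in re.findall(r"\b\d{7}\b", cell)]
print(matches)
```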