Create a new Pyspark dataframe from substrings of a column in an existing dataframe
I have a Pyspark dataframe as below and need to create a new dataframe with only one column, made up of all the 7-digit numbers in the original dataframe. The values are all strings. Column1 should be ignored. Ignoring non-numbers and standalone 7-digit numbers in Column2 is fairly straightforward, but for the values that have two separate 7-digit numbers, I'm having difficulty pulling them out individually. This needs to be automated and able to run on other similar dataframes. The numbers are always 7 digits and always begin with a '1'. Any tips?
+-----------+--------------------+
| COLUMN1| COLUMN2|
+-----------+--------------------+
| Value1| Something|
| Value2| 1057873 1057887|
| Value3| Something Something|
| Value4| null|
| Value5| 1312039|
| Value6| 1463451 1463485|
| Value7| Not In Database|
| Value8| 1617275 1617288|
+-----------+--------------------+
The resulting dataframe should be as below:
+-------+
|Column1|
+-------+
|1057873|
|1057887|
|1312039|
|1463451|
|1463485|
|1617275|
|1617288|
+-------+
UPDATE:
The responses are great, but unfortunately I'm using an older version of Spark that doesn't agree. I used the below to solve the problem, though it's a bit clunky... it works.
from pyspark.sql import functions as F

# Keep only the column of interest, split it on spaces, and explode the
# resulting array so each token becomes its own row.
new_df = df.select(df.COLUMN2)
new_df = new_df.withColumn('splits', F.split(new_df.COLUMN2, ' '))
new_df = new_df.select(F.explode(new_df.splits).alias('column1'))
# Keep tokens containing a run of 7 digits; the raw string avoids the
# invalid-escape warning that '\d' triggers in newer Python versions.
new_df = new_df.filter(new_df.column1.rlike(r'\d{7}'))
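Since the numbers always have 7 digits and start with '1', a stricter variant of the final filter (a small sketch, assuming the same new_df as above) anchors the pattern, so that longer digit runs or mixed tokens like 'abc1234567' are excluded when this runs on other dataframes:

# Anchored variant: only whole tokens of '1' followed by six digits survive.
new_df = new_df.filter(new_df.column1.rlike(r'^1\d{6}$'))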
Here is an approach with higher-order lambda functions for Spark 2.4+, wherein we split the column by space and then filter the words which start with 0-9 and have length n (7), then explode:
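The answer's code block was not preserved on this page; a minimal sketch of that Spark 2.4+ approach, assuming the column names from the question, could look like this:

from pyspark.sql import functions as F

# filter() with a lambda is a Spark SQL higher-order function (2.4+):
# split COLUMN2 on spaces, keep only tokens of exactly 7 digits, then
# explode so each surviving token becomes its own row.
result = df.select(
    F.explode(
        F.expr("filter(split(COLUMN2, ' '), x -> x rlike '^[0-9]{7}$')")
    ).alias("Column1")
)
result.show()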
I like @nky's answer and voted for it. As an alternative, you can also use PySpark's exists higher-order function in 3.0+:
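Again, the original snippet was lost in the capture; a sketch of the exists idea (F.exists is the Python-side wrapper for the SQL higher-order function; the lambda API is only in newer PySpark releases) might be:

from pyspark.sql import functions as F

pattern = r'^1\d{6}$'  # 7 digits with a leading '1', per the question
tokens = F.split(F.col('COLUMN2'), ' ')

# F.exists keeps only rows whose token array contains at least one match;
# explode + rlike then pulls each matching token into its own row.
result = (
    df.where(F.exists(tokens, lambda t: t.rlike(pattern)))
      .select(F.explode(tokens).alias('Column1'))
      .where(F.col('Column1').rlike(pattern))
)
result.show()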
IIUC, you could use a regex and str.extractall:
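The code and its output were stripped by the page capture; a sketch of what the str.extractall approach could look like, assuming the data is small enough to collect into pandas, is below. The regex \b(1\d{6})\b captures a word-bounded literal '1' followed by six more digits, and extractall emits one row per match, so cells holding two numbers yield two rows.

import pandas as pd

# Assumption: the frame fits in driver memory, so toPandas() is safe.
pdf = df.select('COLUMN2').toPandas()

out = (
    pdf['COLUMN2']
      .str.extractall(r'\b(1\d{6})\b')[0]   # one row per regex match
      .rename('Column1')
      .reset_index(drop=True)
      .to_frame()
)
print(out)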