PySpark: fill an empty string with '0' if the data type is BIGINT/DOUBLE/Integer

Posted on 2025-01-24 06:54:49

I am trying to fill empty strings with '0' in a DataFrame using PySpark, but only in columns whose data type is BIGINT/DOUBLE/Integer.

data = [("James","","Smith","36","M",3000,"1.2"),
    ("Michael","Rose"," ","40","M",4000,"2.0"),
    ("Robert","","Williams","42","M",4000,"5.0"),
    ("Maria","Anne"," ","39","F", ," "),
    ("Jen","Mary","Brown"," ","F",-1,"")
  ]

schema = StructType([ 
    StructField("firstname",StringType(),True), 
    StructField("middlename",StringType(),True), 
    StructField("lastname",StringType(),True), 
    StructField("age", StringType(), True), 
    StructField("gender", StringType(), True), 
    StructField("salary", IntegerType(), True),
    StructField("amount", DoubleType(), True) 
  ])
 
df = spark.createDataFrame(data=data,schema=schema)
df.printSchema()

This is what I am trying:

df.select( *[ F.when(F.dtype in ('integertype','doubletype') and F.col(column).ishaving(" "),'0').otherwise(F.col(column)).alias(column) for column in df.columns]).show()

Expected output:

+---------+----------+--------+---+------+------+------+                        
|firstname|middlename|lastname|age|gender|salary|amount|
+---------+----------+--------+---+------+------+------+
|    James|          |   Smith| 36|     M|  3000|   1.2|
|  Michael|      Rose|        | 40|     M|  4000|   2.0|
|   Robert|          |Williams| 42|     M|  4000|   5.0|
|    Maria|      Anne|        | 39|     F|     0|     0|
|      Jen|      Mary|   Brown|   |     F|    -1|     0|
+---------+----------+--------+---+------+------+------+

绅士风度i 2025-01-31 06:54:49

You can use reduce to accomplish this; it makes the code cleaner and easier to understand.

Additionally, create a to_fill list of the columns that match your condition; the list can be adjusted to fit your scenario.

Data Preparation

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

spark = SparkSession.builder.getOrCreate()

# Missing numeric values are given as None: an IntegerType or DoubleType
# column cannot hold an empty string.
data = [("James","","Smith","36","M",3000,1.2),
    ("Michael","Rose"," ","40","M",4000,2.0),
    ("Robert","","Williams","42","M",4000,5.0),
    ("Maria","Anne"," ","39","F",None,None),
    ("Jen","Mary","Brown"," ","F",-1,None)
  ]

schema = StructType([
    StructField("firstname",StringType(),True),
    StructField("middlename",StringType(),True),
    StructField("lastname",StringType(),True),
    StructField("age", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True),
    StructField("amount", DoubleType(), True)
  ])

sparkDF = spark.createDataFrame(data=data,schema=schema)
sparkDF.show()

+---------+----------+--------+---+------+------+------+
|firstname|middlename|lastname|age|gender|salary|amount|
+---------+----------+--------+---+------+------+------+
|    James|          |   Smith| 36|     M|  3000|   1.2|
|  Michael|      Rose|        | 40|     M|  4000|   2.0|
|   Robert|          |Williams| 42|     M|  4000|   5.0|
|    Maria|      Anne|        | 39|     F|  null|  null|
|      Jen|      Mary|   Brown|   |     F|    -1|  null|
+---------+----------+--------+---+------+------+------+

Reduce

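from functools import reduce
from pyspark.sql import functions as F

# Select the columns whose type is numeric
to_fill = [c for c, d in sparkDF.dtypes if d in ['int', 'bigint', 'double']]

# to_fill --> ['salary','amount']

# Replace nulls with 0 in each selected column, one column at a time
sparkDF = reduce(
    lambda df, x: df.withColumn(x, F.when(F.col(x).isNull(), 0).otherwise(F.col(x))),
    to_fill,
    sparkDF,
)

sparkDF.show()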

+---------+----------+--------+---+------+------+------+
|firstname|middlename|lastname|age|gender|salary|amount|
+---------+----------+--------+---+------+------+------+
|    James|          |   Smith| 36|     M|  3000|   1.2|
|  Michael|      Rose|        | 40|     M|  4000|   2.0|
|   Robert|          |Williams| 42|     M|  4000|   5.0|
|    Maria|      Anne|        | 39|     F|     0|   0.0|
|      Jen|      Mary|   Brown|   |     F|    -1|   0.0|
+---------+----------+--------+---+------+------+------+
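
As an alternative to the reduce loop above, the same fill could probably be done in one call with DataFrame.fillna and its subset argument. The sketch below is only an illustration, not the answer's method: the helper names numeric_columns and fill_numeric_nulls and the hand-written example_dtypes list are hypothetical, and it assumes the missing values are real nulls, as in the data above.

```python
# Type strings as they appear in DataFrame.dtypes for the numeric columns of interest.
NUMERIC_TYPES = {"int", "bigint", "double"}


def numeric_columns(dtypes):
    """Pick numeric column names from (name, type) pairs such as DataFrame.dtypes."""
    return [name for name, dtype in dtypes if dtype in NUMERIC_TYPES]


def fill_numeric_nulls(df, value=0):
    """Replace nulls with `value` in the numeric columns of a Spark DataFrame.

    `df` is assumed to be a pyspark.sql.DataFrame; only its `dtypes`
    attribute and `fillna` method are used.
    """
    return df.fillna(value, subset=numeric_columns(df.dtypes))


# The dtypes of the example DataFrame, written out by hand for illustration.
example_dtypes = [
    ("firstname", "string"), ("middlename", "string"), ("lastname", "string"),
    ("age", "string"), ("gender", "string"), ("salary", "int"), ("amount", "double"),
]
print(numeric_columns(example_dtypes))  # ['salary', 'amount']
```

With the DataFrame from the answer, the call would be sparkDF = fill_numeric_nulls(sparkDF). Keeping the column selection in a plain function makes the type condition easy to change without touching the fill step.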

记忆消瘦 2025-01-31 06:54:49

You can try this:

from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark.sql.types import *

spark = SparkSession.builder.master("local").appName("test").getOrCreate()
data = [("James", "", "Smith", "36", "", 3000, 1.2),
        ("Michael", "Rose", "", "40", "M", 4000, 2.0),
        ("Robert", "", "Williams", "42", "M", 4000, 5.0),
        ("Maria", "Anne", " ", "39", "F", None, None),
        ("Jen", "Mary", "Brown", " ", "F", -1, None)
        ]

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("age", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True),
    StructField("amount", DoubleType(), True)
])

dfa = spark.createDataFrame(data=data, schema=schema)
dfa.show()


def removenull(dfa):
    # Trim surrounding whitespace in every column
    dfa = dfa.select([trim(col(c)).alias(c) for c in dfa.columns])
    # Replace the strings that are now empty with null
    for i in dfa.columns:
        dfa = dfa.withColumn(i, when(col(i) == "", None).otherwise(col(i)))
    return dfa


removenull(dfa).show()

Output:

+---------+----------+--------+----+------+------+------+
|firstname|middlename|lastname| age|gender|salary|amount|
+---------+----------+--------+----+------+------+------+
|    James|      null|   Smith|  36|  null|  3000|   1.2|
|  Michael|      Rose|    null|  40|     M|  4000|   2.0|
|   Robert|      null|Williams|  42|     M|  4000|   5.0|
|    Maria|      Anne|    null|  39|     F|  null|  null|
|      Jen|      Mary|   Brown|null|     F|    -1|  null|
+---------+----------+--------+----+------+------+------+
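
One caveat about the function above: trim returns a string, so after the select the salary and amount columns no longer have numeric types. A possible variation, sketched below under that assumption, applies the trim and the blank-to-null step only to string columns and passes the others through unchanged. The helper names string_columns and blank_strings_to_null and the hand-written example_dtypes list are hypothetical, and pyspark is imported inside the function only so the column-selection helper can be tried without a Spark installation.

```python
def string_columns(dtypes):
    """Pick string column names from (name, type) pairs such as DataFrame.dtypes."""
    return [name for name, dtype in dtypes if dtype == "string"]


def blank_strings_to_null(df):
    """Trim string columns and turn values that end up empty into nulls.

    `df` is assumed to be a pyspark.sql.DataFrame. Columns of other
    types are selected as they are, so they keep their original types.
    """
    from pyspark.sql import functions as F  # requires pyspark at call time

    targets = set(string_columns(df.dtypes))
    return df.select([
        F.when(F.trim(F.col(c)) == "", None).otherwise(F.trim(F.col(c))).alias(c)
        if c in targets else F.col(c)
        for c in df.columns
    ])


# The dtypes of the example DataFrame, written out by hand for illustration.
example_dtypes = [
    ("firstname", "string"), ("middlename", "string"), ("lastname", "string"),
    ("age", "string"), ("gender", "string"), ("salary", "int"), ("amount", "double"),
]
print(string_columns(example_dtypes))
# ['firstname', 'middlename', 'lastname', 'age', 'gender']
```

With the DataFrame from the answer, the call would be blank_strings_to_null(dfa).show(). Because salary and amount stay numeric, a null fill on those columns can still follow this step.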
