How to use the Option class to handle zero divisors in a Spark RDD

Posted on 2025-01-10 10:48:07

I'm trying to use Option to handle zero denominators while calculating percentages in Scala Spark. The RDD is built as follows:

val counties = Array("New+York", "Bronx","Kings","Queens","Richmond")
val base_url = "https://health.data.ny.gov/resource/xdss-u53e.json?County="
val urls = counties.map(a => base_url+a)
val results = urls.map(u => scala.io.Source.fromURL(u).mkString)
val data_rdd = spark.read.json(sc.parallelize(results)).rdd.map(r => (r(4).toString.slice(0,10), r(0).toString,r(3).toString.toInt,r(5).toString.toInt))

What I want to do is return a tuple (date, state, percent), where percent is calculated by dividing the third element by the fourth element (i.e. the first Int divided by the second Int). However, since some divisors are zero, I need to use Option to handle these cases, and I'm stuck on how to do so in Scala Spark.

The following is what I've tried:

data_rdd.map{ case (a,b,c,d) => (a,b,c/d)
      case _ => (a,b,0)}

This code gives me the following error:

<console>:28: error: not found: value a
             case _ => (a,b,0)}

Can anyone help me figure out a way to handle the zero divisors using Option? Thank you so much!

Comments (1)

苍风燃霜 2025-01-17 10:48:07

You can use scala.util.Try for that. Basically, you wrap an expression that might throw in Try, and then call toOption to turn the result into an Option. A simplified example looks like this:

import org.apache.spark.sql._
import spark.implicits._
import scala.util.Try

val columnNames = Seq("String", "Int1", "Int2")
val df = Seq(
  ("Alex", 3, 4),
  ("John", 1, 2),
  ("Alice", 7, 0),
  ("Mark", 5, -3)
).toDF(columnNames: _*)

val output = df.map{
  row => {
    // Dividing int1 by int2
    val division = Try(row.getInt(1) / row.getInt(2)).toOption

    // Creating a new row with an extra element: division
    (row.getString(0), row.getInt(1), row.getInt(2), division)
  }
}.toDF(columnNames :+ "division": _*)

output.show
+------+----+----+--------+
|String|Int1|Int2|division|
+------+----+----+--------+
|  Alex|   3|   4|       0|
|  John|   1|   2|       0|
| Alice|   7|   0|    null|
|  Mark|   5|  -3|      -1|
+------+----+----+--------+

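With this approach the division by zero no longer fails the job: Try catches the ArithmeticException, toOption turns it into None, and Spark shows that as a null entry in the row. Note that the other results are truncated because both operands are Ints.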
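To see what `Try(...).toOption` produces outside of Spark, here is a minimal sketch in plain Scala. The object name `TryDivisionSketch` and the helper `safeDiv` are made up for illustration. One caveat worth keeping in mind: this only works because Int division by zero throws. Double division by zero returns Infinity (or NaN) instead of throwing, so Try has nothing to catch in that case.

```scala
import scala.util.Try

object TryDivisionSketch {
  // Hypothetical helper: Int division wrapped in Try, then converted to Option.
  def safeDiv(numerator: Int, denominator: Int): Option[Int] =
    Try(numerator / denominator).toOption

  // Int division truncates, so 7 / 2 gives Some(3).
  val ok: Option[Int] = safeDiv(7, 2)

  // Int division by zero throws ArithmeticException; Try catches it, so this is None.
  val failed: Option[Int] = safeDiv(7, 0)

  // Double division by zero does not throw, so the result is Some(Infinity), not None.
  val notCaught: Option[Double] = Try(7.0 / 0).toOption
}
```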

I used DataFrames for this since it's my preferred API, but you can do the same with RDDs.
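For the tuple shape in the question, (date, state, Int, Int), the RDD version could look roughly like the sketch below. It is only a sketch under some assumptions: a plain Seq stands in for data_rdd so it runs without a cluster, the sample rows are made up, and the names `PercentSketch` and `toPercent` are hypothetical. It converts to Double before dividing so the percentage is not truncated, and it checks for a zero divisor explicitly, since Double division would not throw for Try to catch.

```scala
object PercentSketch {
  // Hypothetical helper: percentage of the third element over the fourth,
  // or None when the divisor is zero.
  def toPercent(row: (String, String, Int, Int)): (String, String, Option[Double]) =
    row match {
      case (date, state, _, 0) => (date, state, None)
      case (date, state, c, d) => (date, state, Some(c.toDouble / d * 100))
    }

  // Made-up rows standing in for data_rdd.
  val sample: Seq[(String, String, Int, Int)] = Seq(
    ("2020-03-02", "Bronx", 1, 4),
    ("2020-03-01", "Kings", 0, 0)
  )

  // The same function could be passed to an RDD's map.
  val percents: Seq[(String, String, Option[Double])] = sample.map(toPercent)
}
```

On the real RDD this would presumably be `data_rdd.map(PercentSketch.toPercent)`. Each case binds its own names, which is also why the original attempt failed to compile: `a` and `b` exist only inside the first case, not inside `case _`.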

Hope this helps!
