I have the following code, which is used to (SHA) hash columns in a Spark DataFrame:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, sha2}

object hashing {
  def process(hashFieldNames: List[String])(df: DataFrame): DataFrame =
    hashFieldNames.foldLeft(df) { case (acc, hashField) =>
      acc.withColumn(hashField, sha2(col(hashField), 256))
    }
}
Now, in a separate file, I am testing my hashing.process using an AnyWordSpec test, as follows:
"The hashing.process" should {
  // some cases here that complete successfully
  "fail to hash a spark dataframe due to type mismatch" in {
    val goodColumns = Seq("language", "usersCount", "ID", "personalData")
    val badDataSample =
      Seq(
        ("Java", "20000", 2, "happy"),
        ("Python", "100000", 3, "happy"),
        ("Scala", "3000", 1, "jolly")
      )
    import spark.implicits._ // needed for toDF
    val badDf =
      spark.sparkContext.parallelize(badDataSample).toDF(goodColumns: _*)
    val hashFieldNames = List("ID") // "ID" is an int column, so sha2 should fail
    val thrown = intercept[org.apache.spark.sql.AnalysisException] {
      hashing.process(hashFieldNames)(badDf)
    }
    assert(thrown.getMessage === "...") // some lengthy error message that I do not want to copy-paste in its entirety
  }
}
Usually, as I understand it, one would hard-code the whole error message to ensure that it is exactly what we expect. However, the message is very lengthy, and I am wondering whether there is a better approach.
Basically, I have two questions:
a.) Is it considered good practice to match only the beginning of the error message and then follow up with a regex? I am thinking of something like this: thrown.getMessage matching "cannot resolve 'sha2(ID, 256)' due to data type mismatch: argument 1 requires binary type, however, 'ID' is of int type.;" followed by a regex pattern such as ;(.*) for the rest.
b.) If a.) is considered a hacky approach, do you have any working suggestion on how to do it properly?
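For illustration, here is a minimal, self-contained sketch of what I mean in a.). It uses a plain RuntimeException as a stand-in for Spark's AnalysisException, and a message that merely mimics Spark's, so it runs without a SparkSession:

```scala
// Sketch of question a.): assert only on the stable prefix of a lengthy
// exception message. A plain RuntimeException stands in for Spark's
// AnalysisException so the example runs without Spark; the message text
// only mimics Spark's real error.
object PrefixMatchSketch extends App {
  def failingHash(): Unit =
    throw new RuntimeException(
      "cannot resolve 'sha2(ID, 256)' due to data type mismatch: " +
        "argument 1 requires binary type, however, 'ID' is of int type.; " +
        "<followed by a very long logical plan dump>")

  val thrown: RuntimeException =
    try {
      failingHash()
      throw new AssertionError("expected an exception")
    } catch { case e: RuntimeException => e }

  // Only the leading, human-written part of the message is checked;
  // the lengthy plan dump at the end is ignored.
  assert(thrown.getMessage.startsWith(
    "cannot resolve 'sha2(ID, 256)' due to data type mismatch"))
  println("prefix assertion passed")
}
```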
Note: small errors are possible in the code above; I adapted it for this SO post. But you should get the idea.
Ok, answering my own question. I have now solved it like this:
Leaving this post open for potential suggestions / improvements. According to https://github.com/databricks/scala-style-guide#intercepting-exceptions the solution is still not ideal.
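(The actual snippet is not reproduced above; one plausible shape of the fix, assuming it asserts only that the message contains a meaningful fragment, is sketched below. The `interceptLike` helper is a hypothetical, hand-rolled stand-in for ScalaTest's `intercept`, so the sketch runs without any test framework.)

```scala
// Hypothetical reconstruction: instead of comparing the whole message,
// assert that it contains a meaningful fragment. `interceptLike` is a tiny
// stand-in for ScalaTest's `intercept`.
object ContainsAssertSketch extends App {
  def interceptLike[T <: Throwable](body: => Unit)(
      implicit ct: scala.reflect.ClassTag[T]): T =
    try {
      body
      throw new AssertionError(s"expected ${ct.runtimeClass.getName}")
    } catch {
      case e: Throwable if ct.runtimeClass.isInstance(e) => e.asInstanceOf[T]
    }

  val thrown = interceptLike[IllegalArgumentException] {
    throw new IllegalArgumentException(
      "cannot resolve 'sha2(ID, 256)' due to data type mismatch: " +
        "argument 1 requires binary type, however, 'ID' is of int type.; <long plan>")
  }

  // Assert on a fragment that pins down the failure without the long tail.
  assert(thrown.getMessage.contains("due to data type mismatch"))
  println("contains assertion passed")
}
```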
You should not be asserting on exception messages (unless they are surfaced to the user, or something downstream relies on them).
If throwing an exception is part of the contract, then you should be throwing one of a specific type with a given error code, and the test should be asserting on that. And if it isn't, then who cares what the message says?
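A minimal sketch of this typed-exception approach, with illustrative names (`HashingException`, `ErrorCode` are not from the post), so the test pins down the contract rather than free-form text:

```scala
// Sketch: the contract is a dedicated exception type carrying an error
// code, and the test asserts on the code rather than the message text.
// All names here are illustrative.
object TypedExceptionSketch extends App {
  sealed trait ErrorCode
  case object NonBinaryColumn extends ErrorCode

  final case class HashingException(code: ErrorCode, message: String)
      extends RuntimeException(message)

  def hashColumn(columnType: String): Unit =
    if (columnType != "binary" && columnType != "string")
      throw HashingException(NonBinaryColumn, s"cannot hash column of type $columnType")

  val thrown: HashingException =
    try {
      hashColumn("int")
      throw new AssertionError("expected HashingException")
    } catch { case e: HashingException => e }

  // The assertion targets the stable error code, not the wording.
  assert(thrown.code == NonBinaryColumn)
  println("typed assertion passed")
}
```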