Basic statistics: Summary statistics (colStats)
The colStats function in Statistics computes column-wise summary statistics for an RDD[Vector].
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
val observations = sc.parallelize(
  Seq(
    Vectors.dense(1.0, 10.0, 100.0),
    Vectors.dense(2.0, 20.0, 200.0),
    Vectors.dense(3.0, 30.0, 300.0)
  )
)
// Compute column summary statistics.
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
println(summary.mean) // a dense vector containing the mean value for each column
println(summary.variance) // column-wise variance
println(summary.numNonzeros) // number of nonzeros in each column
println(summary.min) // min value for each column
Output:
[2.0,20.0,200.0]
[1.0,100.0,10000.0]
[3.0,3.0,3.0]
[1.0,10.0,100.0]
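Besides mean, variance, numNonzeros, and min, MultivariateStatisticalSummary also exposes count, max, normL1, and normL2; a small extension of the example above:
println(summary.count)  // number of rows (3)
println(summary.max)    // max value for each column
println(summary.normL1) // column-wise L1 norm
println(summary.normL2) // column-wise Euclidean (L2) norm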
Correlations
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD
val seriesX: RDD[Double] = sc.parallelize(Array(1, 2, 3, 3, 5)) // a series
// must have the same number of partitions and cardinality as seriesX
val seriesY: RDD[Double] = sc.parallelize(Array(11, 22, 33, 33, 555))
// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a
// method is not specified, Pearson's method will be used by default.
val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")
println(s"Correlation is: $correlation")
val data: RDD[Vector] = sc.parallelize(
  Seq(
    Vectors.dense(1.0, 10.0, 100.0),
    Vectors.dense(2.0, 20.0, 200.0),
    Vectors.dense(5.0, 33.0, 366.0)
  )
) // note that each Vector is a row and not a column
// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method
// If a method is not specified, Pearson's method will be used by default.
val correlMatrix: Matrix = Statistics.corr(data, "pearson")
println(correlMatrix.toString)
Output:
Correlation is: 0.8500286768773001
1.0 0.978883465889473 0.9903895695275671
0.978883465889473 1.0 0.9977483233986101
0.9903895695275671 0.9977483233986101 1.0
Stratified sampling
// an RDD[(K, V)] of any key value pairs
val data = sc.parallelize(
  Seq((1, 'a'), (1, 'b'), (2, 'c'), (2, 'd'), (2, 'e'), (3, 'f')))
// specify the exact fraction desired from each key
val fractions = Map(1 -> 0.1, 2 -> 0.6, 3 -> 0.3)
// Get an approximate sample from each stratum
val approxSample = data.sampleByKey(withReplacement = false, fractions = fractions)
approxSample.collect().foreach(println) // collect to the driver before printing
// Get an exact sample from each stratum
val exactSample = data.sampleByKeyExact(withReplacement = false, fractions = fractions)
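To compare the two samples, one quick illustrative check (not part of the original example) is to count how many items each kept per stratum with the standard countByKey action:
// number of sampled items per key; sampleByKeyExact keeps exactly ceil(fraction * count) items per key
println(s"approx sample sizes per key: ${approxSample.countByKey()}")
println(s"exact sample sizes per key:  ${exactSample.countByKey()}")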
Hypothesis testing
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.stat.test.ChiSqTestResult
import org.apache.spark.rdd.RDD
// a vector composed of the frequencies of events
val vec: Vector = Vectors.dense(0.1, 0.15, 0.2, 0.3, 0.25)
// compute the goodness of fit. If a second vector to test against is not supplied
// as a parameter, the test runs against a uniform distribution.
val goodnessOfFitTestResult = Statistics.chiSqTest(vec)
// summary of the test including the p-value, degrees of freedom, test statistic, the method
// used, and the null hypothesis.
println(s"$goodnessOfFitTestResult\n")
// a contingency matrix. Create a dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
val mat: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
// conduct Pearson's independence test on the input contingency matrix
val independenceTestResult = Statistics.chiSqTest(mat)
// summary of the test including the p-value, degrees of freedom
println(s"$independenceTestResult\n")
val obs: RDD[LabeledPoint] =
  sc.parallelize(
    Seq(
      LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0)),
      LabeledPoint(1.0, Vectors.dense(1.0, 2.0, 0.0)),
      LabeledPoint(-1.0, Vectors.dense(-1.0, 0.0, -0.5))
    )
  ) // (label, feature) pairs
// The contingency table is constructed from the raw (feature, label) pairs and used to conduct
// the independence test. Returns an array containing the ChiSquaredTestResult for every feature
// against the label.
val featureTestResults: Array[ChiSqTestResult] = Statistics.chiSqTest(obs)
featureTestResults.zipWithIndex.foreach { case (k, v) =>
  println("Column " + (v + 1).toString + ":")
  println(k)
} // summary of the test
Output:
Chi squared test summary:
method: pearson
degrees of freedom = 4
statistic = 0.12499999999999999
pValue = 0.998126379239318
No presumption against null hypothesis: observed follows the same distribution as expected..
Chi squared test summary:
method: pearson
degrees of freedom = 2
statistic = 0.14141414141414144
pValue = 0.931734784568187
No presumption against null hypothesis: the occurrence of the outcomes is statistically independent..
Column 1:
Chi squared test summary:
method: pearson
degrees of freedom = 1
statistic = 3.0000000000000004
pValue = 0.08326451666354884
Low presumption against null hypothesis: the occurrence of the outcomes is statistically independent..
Column 2:
Chi squared test summary:
method: pearson
degrees of freedom = 1
statistic = 0.75
pValue = 0.3864762307712326
No presumption against null hypothesis: the occurrence of the outcomes is statistically independent..
Column 3:
Chi squared test summary:
method: pearson
degrees of freedom = 2
statistic = 3.0
pValue = 0.22313016014843035
No presumption against null hypothesis: the occurrence of the outcomes is statistically independent..
Kolmogorov-Smirnov test
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD
val data: RDD[Double] = sc.parallelize(Seq(0.1, 0.15, 0.2, 0.3, 0.25)) // an RDD of sample data
// run a KS test for the sample versus a standard normal distribution
val testResult = Statistics.kolmogorovSmirnovTest(data, "norm", 0, 1)
// summary of the test including the p-value, test statistic, and null hypothesis if our p-value
// indicates significance, we can reject the null hypothesis.
println(testResult)
println()
// perform a KS test using a cumulative distribution function of our making
val myCDF = Map(0.1 -> 0.2, 0.15 -> 0.6, 0.2 -> 0.05, 0.3 -> 0.05, 0.25 -> 0.1)
val testResult2 = Statistics.kolmogorovSmirnovTest(data, myCDF)
println(testResult2)
Output:
Kolmogorov-Smirnov test summary:
degrees of freedom = 0
statistic = 0.539827837277029
pValue = 0.06821463111921133
Low presumption against null hypothesis: Sample follows theoretical distribution.
Kolmogorov-Smirnov test summary:
degrees of freedom = 0
statistic = 0.95
pValue = 6.249999999763389E-7
Very strong presumption against null hypothesis: Sample follows theoretical distribution.
Streaming Significance Testing
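MLlib also supports online (streaming) significance testing for A/B-style experiments via StreamingTest, which consumes a DStream of BinarySample(isExperimentGroup, value) records. A minimal sketch, assuming an existing StreamingContext ssc and an input directory dataDir whose files contain lines such as "true,0.5":
import org.apache.spark.mllib.stat.test.{BinarySample, StreamingTest}

// parse each "isExperimentGroup,value" line into a BinarySample
val samples = ssc.textFileStream(dataDir).map { line =>
  val Array(label, value) = line.split(",")
  BinarySample(label.toBoolean, value.toDouble)
}

val streamingTest = new StreamingTest()
  .setPeacePeriod(0)      // number of initial batches to ignore
  .setWindowSize(0)       // 0 means test over all batches received so far
  .setTestMethod("welch") // Welch's t-test; "student" is also available

// produces a stream of significance test results, one per batch
val out = streamingTest.registerStream(samples)
out.print()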
Random data generation
import org.apache.spark.mllib.random.RandomRDDs._
// Generate a random double RDD that contains 1 million i.i.d. values drawn from the
// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
val u = normalRDD(sc, 1000000L, 10)
// Apply a transform to get a random double RDD following `N(1, 4)`.
val v = u.map(x => 1.0 + 2.0 * x)
// `stats()` returns a StatCounter with the count, mean, and standard deviation
println(u.stats()) // mean should be close to 0.0, stdev close to 1.0
println(v.stats()) // mean should be close to 1.0, stdev close to 2.0
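RandomRDDs also provides generators for other distributions and for vector-valued data; a short sketch using poissonRDD and normalVectorRDD (the latter can be fed straight into colStats from the first section):
import org.apache.spark.mllib.stat.Statistics

// 100,000 i.i.d. samples from a Poisson distribution with mean 4.0, in 5 partitions
val p = poissonRDD(sc, 4.0, 100000L, 5)

// a 100,000 x 3 RDD[Vector] of i.i.d. standard normal values
val m = normalVectorRDD(sc, 100000L, 3)
println(Statistics.colStats(m).mean) // each column mean should be close to 0.0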
Kernel density estimation
import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD
// an RDD of sample data
val data: RDD[Double] = sc.parallelize(Seq(1, 1, 1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9))
// Construct the density estimator with the sample data and a standard deviation
// for the Gaussian kernels
val kd = new KernelDensity().setSample(data).setBandwidth(3.0)
// Find density estimates for the given values
val densities = kd.estimate(Array(-1.0, 2.0, 5.0))
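The returned array lines up with the evaluation points; a small print added here for illustration:
// densities(i) is the estimated density at the i-th evaluation point
Array(-1.0, 2.0, 5.0).zip(densities).foreach { case (x, d) =>
  println(s"estimated density at $x = $d")
}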