Posted by desehawk on 2015-5-11 20:50:16

Spark MLlib Statistics

Guiding questions
1. How does Spark compute column summary statistics?
2. Which MLlib statistics features are covered in this article?





The code structure of the Spark MLlib statistics module (mllib.stat) is organized as follows:

[figure: structure of the mllib.stat module]

1.1 Column summary statistics

Computes each column's maximum, minimum, mean, variance, L1 norm, and L2 norm.

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.stat.Statistics

    // Read the data and convert it to an RDD of dense vectors
    val data_path = "/home/jb-huangmeiling/sample_stat.txt"
    val data = sc.textFile(data_path).map(_.split("\t")).map(f => f.map(f => f.toDouble))
    val data1 = data.map(f => Vectors.dense(f))

    // Compute each column's max, min, mean, variance, L1 norm, and L2 norm
    val stat1 = Statistics.colStats(data1)
    stat1.max
    stat1.min
    stat1.mean
    stat1.variance
    stat1.normL1
    stat1.normL2
Execution result. The input file contains four rows with five tab-separated columns:

    1	2	3	4	5
    6	7	1	5	9
    3	5	6	3	1
    3	1	1	5	6

    scala> data1.collect
    res19: Array[org.apache.spark.mllib.linalg.Vector] = Array([1.0,2.0,3.0,4.0,5.0], [6.0,7.0,1.0,5.0,9.0], [3.0,5.0,6.0,3.0,1.0], [3.0,1.0,1.0,5.0,6.0])

    scala> stat1.max
    res20: org.apache.spark.mllib.linalg.Vector = [6.0,7.0,6.0,5.0,9.0]

    scala> stat1.min
    res21: org.apache.spark.mllib.linalg.Vector = [1.0,1.0,1.0,3.0,1.0]

    scala> stat1.mean
    res22: org.apache.spark.mllib.linalg.Vector = [3.25,3.75,2.75,4.25,5.25]

    scala> stat1.variance
    res23: org.apache.spark.mllib.linalg.Vector = [4.25,7.583333333333333,5.583333333333333,0.9166666666666666,10.916666666666666]

    scala> stat1.normL1
    res24: org.apache.spark.mllib.linalg.Vector = [13.0,15.0,11.0,17.0,21.0]

    scala> stat1.normL2
    res25: org.apache.spark.mllib.linalg.Vector = [7.416198487095663,8.888194417315589,6.855654600401044,8.660254037844387,11.958260743101398]
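As a cross-check, the same column summaries can be recomputed in plain Scala with no Spark dependency, assuming the sample file's four rows are 1 2 3 4 5 / 6 7 1 5 9 / 3 5 6 3 1 / 3 1 1 5 6 (tab-separated). Note that MLlib's column variance is the unbiased sample variance, i.e. it divides by n - 1:

```scala
// Plain-Scala recomputation of the colStats summaries (illustrative sketch,
// assuming the sample data shown above).
val rows = Array(
  Array(1.0, 2.0, 3.0, 4.0, 5.0),
  Array(6.0, 7.0, 1.0, 5.0, 9.0),
  Array(3.0, 5.0, 6.0, 3.0, 1.0),
  Array(3.0, 1.0, 1.0, 5.0, 6.0)
)
val n = rows.length
// Transpose: one Array per column
val cols = rows.head.indices.map(j => rows.map(_(j)))

val max  = cols.map(_.max)
val min  = cols.map(_.min)
val mean = cols.map(c => c.sum / n)
val variance = cols.map { c =>
  val m = c.sum / n
  c.map(x => (x - m) * (x - m)).sum / (n - 1) // unbiased sample variance, as in MLlib
}
val normL1 = cols.map(_.map(math.abs).sum)                   // sum of absolute values
val normL2 = cols.map(c => math.sqrt(c.map(x => x * x).sum)) // Euclidean norm

println(mean) // Vector(3.25, 3.75, 2.75, 4.25, 5.25)
```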
1.2 Correlation coefficients

Pearson's correlation coefficient measures the linear relationship between two numeric variables and is generally appropriate for normally distributed data. It takes values in [-1, 1]: 0 indicates no correlation, values in [-1, 0) indicate negative correlation, and values in (0, 1] indicate positive correlation.
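To make the definition concrete, here is a minimal plain-Scala Pearson implementation (a sketch for intuition, not MLlib's actual code), applied to the same x1/y1 pair used in the corr3 example later in this section:

```scala
// Pearson correlation: covariance divided by the product of the standard
// deviations (illustrative sketch, not MLlib's implementation).
def pearson(x: Array[Double], y: Array[Double]): Double = {
  require(x.length == y.length && x.nonEmpty)
  val mx = x.sum / x.length
  val my = y.sum / y.length
  val cov  = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
  val varX = x.map(a => (a - mx) * (a - mx)).sum
  val varY = y.map(b => (b - my) * (b - my)).sum
  cov / math.sqrt(varX * varY)
}

val r = pearson(Array(1.0, 2.0, 3.0, 4.0), Array(5.0, 6.0, 6.0, 6.0))
println(r) // ≈ 0.7746, the corr3 value shown later
```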


Spearman's rank correlation coefficient also measures the association between two variables, but it places far weaker requirements on their distributions than Pearson's coefficient does, and it is better suited to measuring monotonic (rank-order) relationships. When there are no tied ranks, it can be computed as:

    rho = 1 - (6 * Σ d_i^2) / (n * (n^2 - 1))

where d_i is the difference between the ranks of the i-th pair of observations and n is the number of observations.
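One way to see the relationship between the two coefficients: Spearman's coefficient is simply Pearson's coefficient computed on the ranks of the data, with tied values assigned the average of their positions. A plain-Scala sketch (helper names are my own; MLlib computes this in a distributed way):

```scala
// Assign 1-based ranks; tied values receive the average rank of their group.
def ranks(v: Array[Double]): Array[Double] = {
  val sorted = v.zipWithIndex.sortBy(_._1)
  val r = new Array[Double](v.length)
  var i = 0
  while (i < sorted.length) {
    var j = i
    while (j + 1 < sorted.length && sorted(j + 1)._1 == sorted(i)._1) j += 1
    val avgRank = (i + j + 2) / 2.0 // average of 1-based ranks (i+1) .. (j+1)
    (i to j).foreach(k => r(sorted(k)._2) = avgRank)
    i = j + 1
  }
  r
}

def pearson(x: Array[Double], y: Array[Double]): Double = {
  val mx = x.sum / x.length
  val my = y.sum / y.length
  val cov  = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
  val varX = x.map(a => (a - mx) * (a - mx)).sum
  val varY = y.map(b => (b - my) * (b - my)).sum
  cov / math.sqrt(varX * varY)
}

// Spearman = Pearson on the ranks
def spearman(x: Array[Double], y: Array[Double]): Double =
  pearson(ranks(x), ranks(y))

println(ranks(Array(5.0, 6.0, 6.0, 6.0)).toVector) // Vector(1.0, 3.0, 3.0, 3.0)
```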


    // Compute the Pearson and Spearman correlation coefficients
    val corr1 = Statistics.corr(data1, "pearson")
    val corr2 = Statistics.corr(data1, "spearman")
    val x1 = sc.parallelize(Array(1.0, 2.0, 3.0, 4.0))
    val y1 = sc.parallelize(Array(5.0, 6.0, 6.0, 6.0))
    val corr3 = Statistics.corr(x1, y1, "pearson")

Execution result:

    scala> corr1
    res6: org.apache.spark.mllib.linalg.Matrix =
    1.0                    0.7779829610026362     -0.39346431156047523  ... (5 total)
    0.7779829610026362     1.0                    0.14087521363240252   ...
    -0.39346431156047523   0.14087521363240252    1.0                   ...
    0.4644203640128242     -0.09482093118615205   -0.9945577827230707   ...
    0.5750122832421579     0.19233705001984078    -0.9286374704669208   ...

    scala> corr2
    res7: org.apache.spark.mllib.linalg.Matrix =
    1.0                    0.632455532033675      -0.5000000000000001   ... (5 total)
    0.632455532033675      1.0                    0.10540925533894883   ...
    -0.5000000000000001    0.10540925533894883    1.0                   ...
    0.5000000000000001     -0.10540925533894883   -1.0000000000000002   ...
    0.6324555320336723     0.20000000000000429    -0.9486832980505085   ...

    scala> corr3
    res8: Double = 0.7745966692414775

1.3 Hypothesis testing

MLlib currently supports Pearson's chi-squared (χ²) test for goodness of fit and for independence. The input type determines which test is performed: a goodness-of-fit test takes a Vector, while an independence test takes a Matrix.
    // Chi-squared test
    val v1 = Vectors.dense(43.0, 9.0)
    val v2 = Vectors.dense(44.0, 4.0)
    val c1 = Statistics.chiSqTest(v1, v2)
Execution result:

    c1: org.apache.spark.mllib.stat.test.ChiSqTestResult =
    Chi squared test summary:
    method: pearson
    degrees of freedom = 1
    statistic = 5.482517482517483
    pValue = 0.01920757707591003
    Strong presumption against null hypothesis: observed follows the same distribution as expected.

The test returns: method: pearson, degrees of freedom: 1, statistic: 5.48, p-value: 0.019.
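The statistic above can be reproduced by hand. The observed counts sum to 52 while the expected counts sum to 48, so the expected vector is first rescaled to the observed total (MLlib does this when the two sums differ), and then (observed - expected)² / expected is summed. A plain-Scala sketch:

```scala
// Pearson chi-squared goodness-of-fit statistic, computed by hand
// (illustrative sketch of what chiSqTest(v1, v2) computes).
val observed = Array(43.0, 9.0)
val expected = Array(44.0, 4.0)

// Rescale expected counts so they sum to the observed total (52/48 here)
val scale  = observed.sum / expected.sum
val scaled = expected.map(_ * scale)

// Sum of (o - e)^2 / e over the categories
val chi2 = observed.zip(scaled).map { case (o, e) => (o - e) * (o - e) / e }.sum
println(chi2) // ≈ 5.4825, matching the statistic above
```

With 2 categories the test has 1 degree of freedom, which is why the summary reports "degrees of freedom = 1".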
