Running MLlib's Binary Classification Algorithms on a Distributed Spark 0.9 Cluster

xioaxu790, posted 2014-12-24 21:35:02 in 实操演练 (Hands-on Practice)

Guiding questions
1. What is MLlib?
2. Which method applies L2 regularization by default?
3. What is binary classification?






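MLlib is Spark's implementation of common machine learning (ML) functionality, together with the associated tests and data generators. MLlib currently supports four common types of machine learning problem settings, namely binary classification, regression, clustering and collaborative filtering, as well as an underlying gradient descent optimization primitive. This guide outlines the functionality supported in MLlib and gives some examples of how to invoke it.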


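Dependencies
MLlib uses the jblas linear algebra library, which itself depends on native Fortran routines. If the gfortran runtime library is not already installed on your nodes, you may need to install it. MLlib throws a linking error if it cannot detect these libraries automatically.
To use MLlib from Python, you need NumPy 1.7 or newer and Python 2.7.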


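What binary classification means
Binary classification is a supervised learning problem in which we want to assign each entity to one of two distinct categories or labels, for example predicting whether an email is spam. The problem involves running a learning algorithm on a set of labeled examples, that is, a set of entities represented by (numerical) features together with their category labels. The algorithm returns a trained model that can predict the label of new entities whose label is unknown.
MLlib currently supports two standard model families for binary classification, namely linear support vector machines (SVMs) and logistic regression, along with L1 and L2 regularized variants of each. All of the training algorithms rely on the same underlying gradient descent primitive, and take as input a regularization parameter (regParam) along with several parameters related to gradient descent (stepSize, numIterations, miniBatchFraction).
Available binary classification algorithms:
SVMWithSGD
LogisticRegressionWithSGD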
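To make the difference between the two model families concrete, here is a minimal plain-Scala sketch of the per-example losses they are usually associated with: hinge loss for linear SVMs and logistic loss for logistic regression. It is an illustration only, not MLlib's internal code. The object name LossSketch and its helpers are hypothetical, and the sketch assumes labels encoded as -1.0 / +1.0, which may differ from the label encoding in your data file.

```scala
// Illustrative sketch only; names are hypothetical and this is not MLlib code.
object LossSketch {
  // Dot product of two equally sized vectors.
  def dot(w: Array[Double], x: Array[Double]): Double =
    w.zip(x).map { case (a, b) => a * b }.sum

  // Hinge loss, the loss typically minimized by a linear SVM.
  // Assumes y is -1.0 or +1.0.
  def hingeLoss(w: Array[Double], x: Array[Double], y: Double): Double =
    math.max(0.0, 1.0 - y * dot(w, x))

  // Logistic loss, the loss typically minimized by logistic regression.
  // Assumes y is -1.0 or +1.0.
  def logisticLoss(w: Array[Double], x: Array[Double], y: Double): Double =
    math.log1p(math.exp(-y * dot(w, x)))
}
```

Both losses depend on the data only through the margin y * (w . x): the hinge loss is exactly zero once the margin reaches 1, while the logistic loss keeps decreasing smoothly.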


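Setting up the Scala project template
For sbt to work correctly, SimpleApp.scala and simple.sbt must be laid out according to the typical directory structure shown below. Once the layout is in place, we can compile the project into a JAR package containing the application's code, and then use sbt/sbt run to execute the program.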
    find .
    ./scala/sample/scala/lib
    ./scala/sample/scala/lib/spark-assembly_2.10-0.9.0-incubating-hadoop2.2.0.jar
    ./scala/sbt
    ./scala/sbt/sbt-launch-1.jar
    ./scala/sbt/sbt
    ./scala/src
    ./scala/src/main
    ./scala/src/main/scala
    ./scala/src/main/scala/SimpleApp.scala
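The post does not show an sbt build definition. As a rough sketch, a minimal one placed in the project root might look like the following. The values are assumptions inferred from the JAR name scala_2.10-0.1-SNAPSHOT.jar that the application code refers to, and the exact Scala patch version is a guess for a Spark 0.9 setup. Because the Spark assembly JAR sits in lib/, sbt treats it as an unmanaged dependency, so no libraryDependencies entry is assumed here.

```scala
// simple.sbt: hypothetical minimal build definition (not shown in the original post).
// The name and version are chosen so that the packaged artifact is called
// scala_2.10-0.1-SNAPSHOT.jar.
name := "scala"

version := "0.1-SNAPSHOT"

// Assumed Scala 2.10.x release; adjust to the one your Spark build uses.
scalaVersion := "2.10.3"
```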


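Binary classification code
The following code snippet shows how to load a sample dataset, run a training algorithm on this training data by calling a static method on the algorithm object, and then use the resulting model to make predictions and compute the training error, that is, the fraction of examples whose predicted label differs from the actual label.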
    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.regression.LabeledPoint

    object SimpleApp {
      def main(args: Array[String]) {
        // Connect to the standalone master and ship the application JAR to the executors
        val sc = new SparkContext("spark://192.168.159.129:7077", "Simple App", "/root/spark-0.9",
          List("target/scala-2.10/scala_2.10-0.1-SNAPSHOT.jar"))

        // Load the data from HDFS; each line is "<label> <feature> <feature> ..."
        val data = sc.textFile("hdfs://master:9000/mllib/sample_svm_data.txt")
        val parsedData = data.map { line =>
          val parts = line.split(' ')
          LabeledPoint(parts(0).toDouble, parts.tail.map(x => x.toDouble).toArray)
        }

        // Run training algorithm to build the model
        val numIterations = 20
        val model = SVMWithSGD.train(parsedData, numIterations)

        // Evaluate model on training examples and compute training error
        val labelAndPreds = parsedData.map { point =>
          val prediction = model.predict(point.features)
          (point.label, prediction)
        }
        val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / parsedData.count
        println("Training Error = " + trainErr)
      }
    }
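The parsing and error computation in the listing can be tried out without a cluster. The sketch below mirrors those two steps on ordinary Scala collections. The object name TrainErrSketch and both helper names are hypothetical, and the input format (a label followed by space-separated features) is taken from the parsing code in the listing.

```scala
// Illustrative local sketch; names are hypothetical and no Spark is involved.
object TrainErrSketch {
  // Parse "label f1 f2 ..." into (label, features), as the map step does.
  def parseLine(line: String): (Double, Array[Double]) = {
    val parts = line.trim.split(' ')
    (parts.head.toDouble, parts.tail.map(_.toDouble))
  }

  // Fraction of (label, prediction) pairs that disagree.
  def errorRate(labelAndPreds: Seq[(Double, Double)]): Double =
    labelAndPreds.count { case (label, pred) => label != pred }.toDouble /
      labelAndPreds.size
}
```

For instance, TrainErrSketch.errorRate(Seq((1.0, 1.0), (0.0, 1.0))) evaluates to 0.5, since one of the two predictions is wrong.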



Execution results
  1. [root@master spark-0.9]# cd /root/sample/scala
  2. [root@master scala]# sbt/sbt package run
  3. [info] Set current project to scala (in build file:/root/sample/scala/)
  4. [info] Compiling 1 Scala source to /root/sample/scala/target/scala-2.10/classes...
  5. [info] Packaging /root/sample/scala/target/scala-2.10/scala_2.10-0.1-SNAPSHOT.jar ...
  6. [info] Done packaging.
  7. [success] Total time: 15 s, completed Feb 10, 2014 11:27:51 PM
  8. [info] Running SimpleApp
  9. log4j:WARN No appenders could be found for logger (akka.event.slf4j.Slf4jLogger).
  10. log4j:WARN Please initialize the log4j system properly.
  11. log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
  12. 14/02/10 23:27:54 INFO SparkEnv: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
  13. 14/02/10 23:27:54 INFO SparkEnv: Registering BlockManagerMaster
  14. 14/02/10 23:27:54 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140210232754-e3bb
  15. 14/02/10 23:27:54 INFO MemoryStore: MemoryStore started with capacity 580.0 MB.
  16. 14/02/10 23:27:54 INFO ConnectionManager: Bound socket to port 48916 with id = ConnectionManagerId(master,48916)
  17. 14/02/10 23:27:54 INFO BlockManagerMaster: Trying to register BlockManager
  18. 14/02/10 23:27:54 INFO BlockManagerMasterActor$BlockManagerInfo: Registering block manager master:48916 with 580.0 MB RAM
  19. 14/02/10 23:27:54 INFO BlockManagerMaster: Registered BlockManager
  20. 14/02/10 23:27:54 INFO HttpServer: Starting HTTP Server
  21. 14/02/10 23:27:55 INFO HttpBroadcast: Broadcast server started at http://192.168.159.129:49765
  22. 14/02/10 23:27:55 INFO SparkEnv: Registering MapOutputTracker
  23. 14/02/10 23:27:55 INFO HttpFileServer: HTTP File server directory is /tmp/spark-b309992e-6b24-4823-9ce7-68ff0ee6ec1a
  24. 14/02/10 23:27:55 INFO HttpServer: Starting HTTP Server
  25. 14/02/10 23:27:56 INFO SparkUI: Started Spark Web UI at http://master:4040
  26. 14/02/10 23:27:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  27. 14/02/10 23:27:56 INFO SparkContext: Added JAR target/scala-2.10/scala_2.10-0.1-SNAPSHOT.jar at http://192.168.159.129:35769/jars/scala_2.10-0.1-SNAPSHOT.jar with timestamp 1392046076889
  28. 14/02/10 23:27:56 INFO AppClient$ClientActor: Connecting to master spark://192.168.159.129:7077...
  29. 14/02/10 23:27:57 WARN SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
  30. 14/02/10 23:27:57 INFO MemoryStore: ensureFreeSpace(132636) called with curMem=0, maxMem=608187187
  31. 14/02/10 23:27:57 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 129.5 KB, free 579.9 MB)
  32. 14/02/10 23:27:58 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20140210232758-0007
  33. 14/02/10 23:27:58 INFO AppClient$ClientActor: Executor added: app-20140210232758-0007/0 on worker-20140210205103-slaver01-37106 (slaver01:37106) with 1 cores
  34. 14/02/10 23:27:58 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140210232758-0007/0 on hostPort slaver01:37106 with 1 cores, 512.0 MB RAM
  35. 14/02/10 23:27:58 INFO AppClient$ClientActor: Executor added: app-20140210232758-0007/1 on worker-20140210205049-slaver02-48689 (slaver02:48689) with 1 cores
  36. ... 14/02/10 23:29:16 INFO DAGScheduler: Final stage: Stage 15 (reduce at GradientDescent.scala:150)
  37. 14/02/10 23:29:16 INFO DAGScheduler: Parents of final stage: List()
  38. 14/02/10 23:29:16 INFO DAGScheduler: Missing parents: List()
  39. 14/02/10 23:29:16 INFO DAGScheduler: Submitting Stage 15 (MappedRDD[30] at map at GradientDescent.scala:145), which has no missing parents
  40. 14/02/10 23:29:16 INFO DAGScheduler: Submitting 2 missing tasks from Stage 15 (MappedRDD[30] at map at GradientDescent.scala:145)
  41. 14/02/10 23:29:16 INFO TaskSchedulerImpl: Adding task set 15.0 with 2 tasks
  42. 14/02/10 23:29:16 INFO TaskSetManager: Starting task 15.0:0 as TID 28 on executor 1: slaver02 (NODE_LOCAL)
  43. 14/02/10 23:29:16 INFO TaskSetManager: Serialized task 15.0:0 as 2501 bytes in 0 ms
  44. 14/02/10 23:29:16 INFO TaskSetManager: Starting task 15.0:1 as TID 29 on executor 0: slaver01 (NODE_LOCAL)
  45. 14/02/10 23:29:16 INFO TaskSetManager: Serialized task 15.0:1 as 2501 bytes in 0 ms
  46. 14/02/10 23:29:16 INFO TaskSetManager: Finished TID 29 in 59 ms on slaver01 (progress: 0/2)
  47. 14/02/10 23:29:16 INFO DAGScheduler: Completed ResultTask(15, 1)
  48. 14/02/10 23:29:16 INFO TaskSetManager: Finished TID 28 in 64 ms on slaver02 (progress: 1/2)
  49. 14/02/10 23:29:16 INFO TaskSchedulerImpl: Remove TaskSet 15.0 from pool
  50. 14/02/10 23:29:16 INFO DAGScheduler: Completed ResultTask(15, 0)
  51. 14/02/10 23:29:16 INFO DAGScheduler: Stage 15 (reduce at GradientDescent.scala:150) finished in 0.062 s
  52. 14/02/10 23:29:16 INFO SparkContext: Job finished: reduce at GradientDescent.scala:150, took 0.079776485 s
  53. 14/02/10 23:29:16 INFO SparkContext: Starting job: reduce at GradientDescent.scala:150
  54. 14/02/10 23:29:16 INFO DAGScheduler: Got job 16 (reduce at GradientDescent.scala:150) with 2 output partitions (allowLocal=false)
  55. 14/02/10 23:29:16 INFO DAGScheduler: Final stage: Stage 16 (reduce at GradientDescent.scala:150)
  56. 14/02/10 23:29:16 INFO DAGScheduler: Parents of final stage: List()
  57. 14/02/10 23:29:16 INFO DAGScheduler: Missing parents: List()
  58. 14/02/10 23:29:16 INFO DAGScheduler: Submitting Stage 16 (MappedRDD[32] at map at GradientDescent.scala:145), which has no missing parents
  59. 14/02/10 23:29:16 INFO DAGScheduler: Submitting 2 missing tasks from Stage 16 (MappedRDD[32] at map at GradientDescent.scala:145)
  60. 14/02/10 23:29:16 INFO TaskSchedulerImpl: Adding task set 16.0 with 2 tasks
  61. 14/02/10 23:29:16 INFO TaskSetManager: Starting task 16.0:0 as TID 30 on executor 1: slaver02 (NODE_LOCAL)
  62. 14/02/10 23:29:16 INFO TaskSetManager: Serialized task 16.0:0 as 2504 bytes in 0 ms
  63. 14/02/10 23:29:16 INFO TaskSetManager: Starting task 16.0:1 as TID 31 on executor 0: slaver01 (NODE_LOCAL)
  64. 14/02/10 23:29:16 INFO TaskSetManager: Serialized task 16.0:1 as 2504 bytes in 0 ms
  65. 14/02/10 23:29:16 INFO TaskSetManager: Finished TID 31 in 32 ms on slaver01 (progress: 0/2)
  66. 14/02/10 23:29:16 INFO DAGScheduler: Completed ResultTask(16, 1)
  67. 14/02/10 23:29:16 INFO TaskSetManager: Finished TID 30 in 65 ms on slaver02 (progress: 1/2)
  68. 14/02/10 23:29:16 INFO DAGScheduler: Completed ResultTask(16, 0)
  69. 14/02/10 23:29:16 INFO DAGScheduler: Stage 16 (reduce at GradientDescent.scala:150) finished in 0.068 s
  70. 14/02/10 23:29:16 INFO SparkContext: Job finished: reduce at GradientDescent.scala:150, took 0.084612863 s
  71. 14/02/10 23:29:16 INFO TaskSchedulerImpl: Remove TaskSet 16.0 from pool
  72. 14/02/10 23:29:16 INFO SparkContext: Starting job: reduce at GradientDescent.scala:150
  73. 14/02/10 23:29:16 INFO DAGScheduler: Got job 17 (reduce at GradientDescent.scala:150) with 2 output partitions (allowLocal=false)
  74. 14/02/10 23:29:16 INFO DAGScheduler: Final stage: Stage 17 (reduce at GradientDescent.scala:150)
  75. 14/02/10 23:29:16 INFO DAGScheduler: Parents of final stage: List()
  76. 14/02/10 23:29:16 INFO DAGScheduler: Missing parents: List()
  77. 14/02/10 23:29:16 INFO DAGScheduler: Submitting Stage 17 (MappedRDD[34] at map at GradientDescent.scala:145), which has no missing parents
  78. 14/02/10 23:29:16 INFO DAGScheduler: Submitting 2 missing tasks from Stage 17 (MappedRDD[34] at map at GradientDescent.scala:145)
  79. 14/02/10 23:29:16 INFO TaskSchedulerImpl: Adding task set 17.0 with 2 tasks
  80. 14/02/10 23:29:16 INFO TaskSetManager: Starting task 17.0:0 as TID 32 on executor 1: slaver02 (NODE_LOCAL)
  81. 14/02/10 23:29:16 INFO TaskSetManager: Serialized task 17.0:0 as 2500 bytes in 0 ms
  82. 14/02/10 23:29:16 INFO TaskSetManager: Starting task 17.0:1 as TID 33 on executor 0: slaver01 (NODE_LOCAL)
  83. 14/02/10 23:29:16 INFO TaskSetManager: Serialized task 17.0:1 as 2500 bytes in 0 ms
  84. 14/02/10 23:29:16 INFO TaskSetManager: Finished TID 32 in 47 ms on slaver02 (progress: 0/2)
  85. 14/02/10 23:29:16 INFO DAGScheduler: Completed ResultTask(17, 0)
  86. 14/02/10 23:29:16 INFO TaskSetManager: Finished TID 33 in 75 ms on slaver01 (progress: 1/2)
  87. 14/02/10 23:29:16 INFO TaskSchedulerImpl: Remove TaskSet 17.0 from pool
  88. 14/02/10 23:29:16 INFO DAGScheduler: Completed ResultTask(17, 1)
  89. 14/02/10 23:29:16 INFO DAGScheduler: Stage 17 (reduce at GradientDescent.scala:150) finished in 0.070 s
  90. 14/02/10 23:29:16 INFO SparkContext: Job finished: reduce at GradientDescent.scala:150, took 0.084426168 s
  91. 14/02/10 23:29:16 INFO SparkContext: Starting job: reduce at GradientDescent.scala:150
  92. 14/02/10 23:29:16 INFO DAGScheduler: Got job 18 (reduce at GradientDescent.scala:150) with 2 output partitions (allowLocal=false)
  93. 14/02/10 23:29:16 INFO DAGScheduler: Final stage: Stage 18 (reduce at GradientDescent.scala:150)
  94. 14/02/10 23:29:16 INFO DAGScheduler: Parents of final stage: List()
  95. 14/02/10 23:29:16 INFO DAGScheduler: Missing parents: List()
  96. 14/02/10 23:29:16 INFO DAGScheduler: Submitting Stage 18 (MappedRDD[36] at map at GradientDescent.scala:145), which has no missing parents
  97. 14/02/10 23:29:16 INFO DAGScheduler: Submitting 2 missing tasks from Stage 18 (MappedRDD[36] at map at GradientDescent.scala:145)
  98. 14/02/10 23:29:16 INFO TaskSchedulerImpl: Adding task set 18.0 with 2 tasks
  99. 14/02/10 23:29:16 INFO TaskSetManager: Starting task 18.0:0 as TID 34 on executor 1: slaver02 (NODE_LOCAL)
  100. 14/02/10 23:29:16 INFO TaskSetManager: Serialized task 18.0:0 as 2504 bytes in 0 ms
  101. 14/02/10 23:29:16 INFO TaskSetManager: Starting task 18.0:1 as TID 35 on executor 0: slaver01 (NODE_LOCAL)
  102. 14/02/10 23:29:16 INFO TaskSetManager: Serialized task 18.0:1 as 2504 bytes in 0 ms
  103. 14/02/10 23:29:16 INFO TaskSetManager: Finished TID 34 in 40 ms on slaver02 (progress: 0/2)
  104. 14/02/10 23:29:16 INFO DAGScheduler: Completed ResultTask(18, 0)
  105. 14/02/10 23:29:16 INFO TaskSetManager: Finished TID 35 in 81 ms on slaver01 (progress: 1/2)
  106. 14/02/10 23:29:16 INFO TaskSchedulerImpl: Remove TaskSet 18.0 from pool
  107. 14/02/10 23:29:16 INFO DAGScheduler: Completed ResultTask(18, 1)
  108. 14/02/10 23:29:16 INFO DAGScheduler: Stage 18 (reduce at GradientDescent.scala:150) finished in 0.079 s
  109. 14/02/10 23:29:16 INFO SparkContext: Job finished: reduce at GradientDescent.scala:150, took 0.09669554 s
  110. 14/02/10 23:29:16 INFO SparkContext: Starting job: reduce at GradientDescent.scala:150
  111. 14/02/10 23:29:16 INFO DAGScheduler: Got job 19 (reduce at GradientDescent.scala:150) with 2 output partitions (allowLocal=false)
  112. 14/02/10 23:29:16 INFO DAGScheduler: Final stage: Stage 19 (reduce at GradientDescent.scala:150)
  113. 14/02/10 23:29:16 INFO DAGScheduler: Parents of final stage: List()
  114. 14/02/10 23:29:16 INFO DAGScheduler: Missing parents: List()
  115. 14/02/10 23:29:16 INFO DAGScheduler: Submitting Stage 19 (MappedRDD[38] at map at GradientDescent.scala:145), which has no missing parents
  116. 14/02/10 23:29:16 INFO DAGScheduler: Submitting 2 missing tasks from Stage 19 (MappedRDD[38] at map at GradientDescent.scala:145)
  117. 14/02/10 23:29:16 INFO TaskSchedulerImpl: Adding task set 19.0 with 2 tasks
  118. 14/02/10 23:29:16 INFO TaskSetManager: Starting task 19.0:0 as TID 36 on executor 1: slaver02 (NODE_LOCAL)
  119. 14/02/10 23:29:16 INFO TaskSetManager: Serialized task 19.0:0 as 2502 bytes in 0 ms
  120. 14/02/10 23:29:16 INFO TaskSetManager: Starting task 19.0:1 as TID 37 on executor 0: slaver01 (NODE_LOCAL)
  121. 14/02/10 23:29:16 INFO TaskSetManager: Serialized task 19.0:1 as 2502 bytes in 0 ms
  122. 14/02/10 23:29:16 INFO TaskSetManager: Finished TID 36 in 63 ms on slaver02 (progress: 0/2)
  123. 14/02/10 23:29:16 INFO DAGScheduler: Completed ResultTask(19, 0)
  124. 14/02/10 23:29:16 INFO TaskSetManager: Finished TID 37 in 80 ms on slaver01 (progress: 1/2)
  125. 14/02/10 23:29:16 INFO TaskSchedulerImpl: Remove TaskSet 19.0 from pool
  126. 14/02/10 23:29:16 INFO DAGScheduler: Completed ResultTask(19, 1)
  127. 14/02/10 23:29:16 INFO DAGScheduler: Stage 19 (reduce at GradientDescent.scala:150) finished in 0.076 s
  128. 14/02/10 23:29:16 INFO SparkContext: Job finished: reduce at GradientDescent.scala:150, took 0.090877223 s
  129. 14/02/10 23:29:16 INFO SparkContext: Starting job: reduce at GradientDescent.scala:150
  130. 14/02/10 23:29:16 INFO DAGScheduler: Got job 20 (reduce at GradientDescent.scala:150) with 2 output partitions (allowLocal=false)
  131. 14/02/10 23:29:16 INFO DAGScheduler: Final stage: Stage 20 (reduce at GradientDescent.scala:150)
  132. 14/02/10 23:29:16 INFO DAGScheduler: Parents of final stage: List()
  133. 14/02/10 23:29:16 INFO DAGScheduler: Missing parents: List()
  134. 14/02/10 23:29:16 INFO DAGScheduler: Submitting Stage 20 (MappedRDD[40] at map at GradientDescent.scala:145), which has no missing parents
  135. 14/02/10 23:29:16 INFO DAGScheduler: Submitting 2 missing tasks from Stage 20 (MappedRDD[40] at map at GradientDescent.scala:145)
  136. 14/02/10 23:29:16 INFO TaskSchedulerImpl: Adding task set 20.0 with 2 tasks
  137. 14/02/10 23:29:16 INFO TaskSetManager: Starting task 20.0:0 as TID 38 on executor 1: slaver02 (NODE_LOCAL)
  138. 14/02/10 23:29:16 INFO TaskSetManager: Serialized task 20.0:0 as 2499 bytes in 0 ms
  139. 14/02/10 23:29:16 INFO TaskSetManager: Starting task 20.0:1 as TID 39 on executor 0: slaver01 (NODE_LOCAL)
  140. 14/02/10 23:29:16 INFO TaskSetManager: Serialized task 20.0:1 as 2499 bytes in 0 ms
  141. 14/02/10 23:29:16 INFO TaskSetManager: Finished TID 39 in 57 ms on slaver01 (progress: 0/2)
  142. 14/02/10 23:29:16 INFO DAGScheduler: Completed ResultTask(20, 1)
  143. 14/02/10 23:29:16 INFO TaskSetManager: Finished TID 38 in 64 ms on slaver02 (progress: 1/2)
  144. 14/02/10 23:29:16 INFO TaskSchedulerImpl: Remove TaskSet 20.0 from pool
  145. 14/02/10 23:29:16 INFO DAGScheduler: Completed ResultTask(20, 0)
  146. 14/02/10 23:29:16 INFO DAGScheduler: Stage 20 (reduce at GradientDescent.scala:150) finished in 0.061 s
  147. 14/02/10 23:29:16 INFO SparkContext: Job finished: reduce at GradientDescent.scala:150, took 0.071109426 s
  148. 14/02/10 23:29:16 INFO SparkContext: Starting job: reduce at GradientDescent.scala:150
  149. 14/02/10 23:29:16 INFO DAGScheduler: Got job 21 (reduce at GradientDescent.scala:150) with 2 output partitions (allowLocal=false)
  150. 14/02/10 23:29:16 INFO DAGScheduler: Final stage: Stage 21 (reduce at GradientDescent.scala:150)
  151. 14/02/10 23:29:16 INFO DAGScheduler: Parents of final stage: List()
  152. 14/02/10 23:29:16 INFO DAGScheduler: Missing parents: List()
  153. 14/02/10 23:29:16 INFO DAGScheduler: Submitting Stage 21 (MappedRDD[42] at map at GradientDescent.scala:145), which has no missing parents
  154. 14/02/10 23:29:16 INFO DAGScheduler: Submitting 2 missing tasks from Stage 21 (MappedRDD[42] at map at GradientDescent.scala:145)
  155. 14/02/10 23:29:16 INFO TaskSchedulerImpl: Adding task set 21.0 with 2 tasks
  156. 14/02/10 23:29:16 INFO TaskSetManager: Starting task 21.0:0 as TID 40 on executor 1: slaver02 (NODE_LOCAL)
  157. 14/02/10 23:29:16 INFO TaskSetManager: Serialized task 21.0:0 as 2500 bytes in 0 ms
  158. 14/02/10 23:29:16 INFO TaskSetManager: Starting task 21.0:1 as TID 41 on executor 0: slaver01 (NODE_LOCAL)
  159. 14/02/10 23:29:16 INFO TaskSetManager: Serialized task 21.0:1 as 2500 bytes in 0 ms
  160. 14/02/10 23:29:16 INFO TaskSetManager: Finished TID 41 in 43 ms on slaver01 (progress: 0/2)
  161. 14/02/10 23:29:16 INFO DAGScheduler: Completed ResultTask(21, 1)
  162. 14/02/10 23:29:16 INFO TaskSetManager: Finished TID 40 in 55 ms on slaver02 (progress: 1/2)
  163. 14/02/10 23:29:16 INFO TaskSchedulerImpl: Remove TaskSet 21.0 from pool
  164. 14/02/10 23:29:16 INFO DAGScheduler: Completed ResultTask(21, 0)
  165. 14/02/10 23:29:16 INFO DAGScheduler: Stage 21 (reduce at GradientDescent.scala:150) finished in 0.052 s
  166. 14/02/10 23:29:16 INFO SparkContext: Job finished: reduce at GradientDescent.scala:150, took 0.0626958 s
  167. 14/02/10 23:29:16 INFO SparkContext: Starting job: reduce at GradientDescent.scala:150
  168. 14/02/10 23:29:16 INFO DAGScheduler: Got job 22 (reduce at GradientDescent.scala:150) with 2 output partitions (allowLocal=false)
  169. 14/02/10 23:29:16 INFO DAGScheduler: Final stage: Stage 22 (reduce at GradientDescent.scala:150)
  170. 14/02/10 23:29:16 INFO DAGScheduler: Parents of final stage: List()
  171. 14/02/10 23:29:16 INFO DAGScheduler: Missing parents: List()
  172. 14/02/10 23:29:16 INFO DAGScheduler: Submitting Stage 22 (MappedRDD[44] at map at GradientDescent.scala:145), which has no missing parents
  173. 14/02/10 23:29:16 INFO DAGScheduler: Submitting 2 missing tasks from Stage 22 (MappedRDD[44] at map at GradientDescent.scala:145)
  174. 14/02/10 23:29:16 INFO TaskSchedulerImpl: Adding task set 22.0 with 2 tasks
  175. 14/02/10 23:29:16 INFO TaskSetManager: Starting task 22.0:0 as TID 42 on executor 1: slaver02 (NODE_LOCAL)
  176. 14/02/10 23:29:16 INFO TaskSetManager: Serialized task 22.0:0 as 2503 bytes in 0 ms
  177. 14/02/10 23:29:16 INFO TaskSetManager: Starting task 22.0:1 as TID 43 on executor 0: slaver01 (NODE_LOCAL)
  178. 14/02/10 23:29:16 INFO TaskSetManager: Serialized task 22.0:1 as 2503 bytes in 0 ms
  179. 14/02/10 23:29:16 INFO TaskSetManager: Finished TID 43 in 44 ms on slaver01 (progress: 0/2)
  180. 14/02/10 23:29:16 INFO DAGScheduler: Completed ResultTask(22, 1)
  181. 14/02/10 23:29:16 INFO TaskSetManager: Finished TID 42 in 54 ms on slaver02 (progress: 1/2)
  182. 14/02/10 23:29:16 INFO TaskSchedulerImpl: Remove TaskSet 22.0 from pool
  183. 14/02/10 23:29:16 INFO DAGScheduler: Completed ResultTask(22, 0)
  184. 14/02/10 23:29:16 INFO DAGScheduler: Stage 22 (reduce at GradientDescent.scala:150) finished in 0.051 s
  185. 14/02/10 23:29:16 INFO SparkContext: Job finished: reduce at GradientDescent.scala:150, took 0.060071497 s
  186. 14/02/10 23:29:16 INFO GradientDescent: GradientDescent finished. Last 10 stochastic losses 1.973918565153662, 1.8255523040966746, 1.7699816024631967, 1.6469799583886178, 1.625661917005991, 1.5283113889552784, 1.5173506129512995, 1.422277398167446, 1.4154959896484256, 1.3621279370271806
  187. 14/02/10 23:29:16 INFO SVMWithSGD: Final model weights 0.14951408585149972,0.03831072711197627,0.037730161810440484,0.18505277569820583,-2.563032483490213E-4,0.07950273502493031,0.0946837869570233,0.007664328764458717,0.12219548598644159,0.12219548598644195,0.034482086651882085,0.035443622005655644,0.02700659703930399,-0.002137650963695721,0.007242361663251616,0.020208016800350677
  188. 14/02/10 23:29:16 INFO SVMWithSGD: Final model intercept 0.06977506975495361
  189. 14/02/10 23:29:16 INFO SparkContext: Starting job: count at SimpleApp.scala:24
  190. 14/02/10 23:29:16 INFO DAGScheduler: Got job 23 (count at SimpleApp.scala:24) with 2 output partitions (allowLocal=false)
  191. 14/02/10 23:29:16 INFO DAGScheduler: Final stage: Stage 23 (count at SimpleApp.scala:24)
  192. 14/02/10 23:29:16 INFO DAGScheduler: Parents of final stage: List()
  193. 14/02/10 23:29:16 INFO DAGScheduler: Missing parents: List()
  194. 14/02/10 23:29:16 INFO DAGScheduler: Submitting Stage 23 (FilteredRDD[46] at filter at SimpleApp.scala:24), which has no missing parents
  195. 14/02/10 23:29:16 INFO DAGScheduler: Submitting 2 missing tasks from Stage 23 (FilteredRDD[46] at filter at SimpleApp.scala:24)
  196. 14/02/10 23:29:16 INFO TaskSchedulerImpl: Adding task set 23.0 with 2 tasks
  197. 14/02/10 23:29:16 INFO TaskSetManager: Starting task 23.0:0 as TID 44 on executor 1: slaver02 (NODE_LOCAL)
  198. 14/02/10 23:29:16 INFO TaskSetManager: Serialized task 23.0:0 as 2164 bytes in 0 ms
  199. 14/02/10 23:29:16 INFO TaskSetManager: Starting task 23.0:1 as TID 45 on executor 0: slaver01 (NODE_LOCAL)
  200. 14/02/10 23:29:16 INFO TaskSetManager: Serialized task 23.0:1 as 2164 bytes in 0 ms
  201. 14/02/10 23:29:16 INFO TaskSetManager: Finished TID 45 in 54 ms on slaver01 (progress: 0/2)
  202. 14/02/10 23:29:16 INFO DAGScheduler: Completed ResultTask(23, 1)
  203. 14/02/10 23:29:16 INFO TaskSetManager: Finished TID 44 in 73 ms on slaver02 (progress: 1/2)
  204. 14/02/10 23:29:16 INFO TaskSchedulerImpl: Remove TaskSet 23.0 from pool
  205. 14/02/10 23:29:16 INFO DAGScheduler: Completed ResultTask(23, 0)
  206. 14/02/10 23:29:16 INFO DAGScheduler: Stage 23 (count at SimpleApp.scala:24) finished in 0.069 s
  207. 14/02/10 23:29:16 INFO SparkContext: Job finished: count at SimpleApp.scala:24, took 0.089120924 s
  208. 14/02/10 23:29:16 INFO SparkContext: Starting job: count at SimpleApp.scala:24
  209. 14/02/10 23:29:16 INFO DAGScheduler: Got job 24 (count at SimpleApp.scala:24) with 2 output partitions (allowLocal=false)
  210. 14/02/10 23:29:16 INFO DAGScheduler: Final stage: Stage 24 (count at SimpleApp.scala:24)
  211. 14/02/10 23:29:16 INFO DAGScheduler: Parents of final stage: List()
  212. 14/02/10 23:29:16 INFO DAGScheduler: Missing parents: List()
  213. 14/02/10 23:29:16 INFO DAGScheduler: Submitting Stage 24 (MappedRDD[2] at map at SimpleApp.scala:10), which has no missing parents
  214. 14/02/10 23:29:16 INFO DAGScheduler: Submitting 2 missing tasks from Stage 24 (MappedRDD[2] at map at SimpleApp.scala:10)
  215. 14/02/10 23:29:16 INFO TaskSchedulerImpl: Adding task set 24.0 with 2 tasks
  216. 14/02/10 23:29:16 INFO TaskSetManager: Starting task 24.0:0 as TID 46 on executor 1: slaver02 (NODE_LOCAL)
  217. 14/02/10 23:29:16 INFO TaskSetManager: Serialized task 24.0:0 as 1726 bytes in 0 ms
  218. 14/02/10 23:29:16 INFO TaskSetManager: Starting task 24.0:1 as TID 47 on executor 0: slaver01 (NODE_LOCAL)
  219. 14/02/10 23:29:16 INFO TaskSetManager: Serialized task 24.0:1 as 1726 bytes in 0 ms
  220. 14/02/10 23:29:16 INFO TaskSetManager: Finished TID 47 in 35 ms on slaver01 (progress: 0/2)
  221. 14/02/10 23:29:16 INFO DAGScheduler: Completed ResultTask(24, 1)
  222. 14/02/10 23:29:17 INFO TaskSetManager: Finished TID 46 in 141 ms on slaver02 (progress: 1/2)
  223. 14/02/10 23:29:17 INFO TaskSchedulerImpl: Remove TaskSet 24.0 from pool
  224. 14/02/10 23:29:17 INFO DAGScheduler: Completed ResultTask(24, 0)
  225. 14/02/10 23:29:17 INFO DAGScheduler: Stage 24 (count at SimpleApp.scala:24) finished in 0.144 s
  226. 14/02/10 23:29:17 INFO SparkContext: Job finished: count at SimpleApp.scala:24, took 0.165520408 s
  227. Training Error = 0.4968944099378882
  228. 14/02/10 23:29:17 INFO ConnectionManager: Selector thread was interrupted!
  229. [success] Total time: 85 s, completed Feb 10, 2014 11:29:17 PM
  230. [root@master scala]#



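By default, the SVMWithSGD.train() method performs L2 regularization with the regularization parameter set to 1.0. To configure the algorithm further, we can create a new SVMWithSGD object directly and call its setter methods. All other MLlib algorithms can be customized in the same way. For example, the following code produces an L1 regularized variant of SVMs with the regularization parameter set to 0.1, and runs the training algorithm for 200 iterations.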
    import org.apache.spark.mllib.optimization.L1Updater

    // Continues from the previous listing: SVMWithSGD is imported and parsedData is defined there
    val svmAlg = new SVMWithSGD()
    svmAlg.optimizer.setNumIterations(200)
      .setRegParam(0.1)
      .setUpdater(new L1Updater)
    val modelL1 = svmAlg.run(parsedData)
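To give some intuition for what switching from L2 to L1 regularization changes, the plain-Scala sketch below contrasts two common textbook update rules: an L2 step, which shrinks every weight proportionally, and an L1 step based on soft thresholding, which can set small weights exactly to zero. This is a simplified illustration under my own assumptions, not a copy of MLlib's updater classes. The object and method names are hypothetical, and details such as step-size decay are omitted.

```scala
// Illustrative sketch only; names are hypothetical and this is not MLlib code.
object UpdaterSketch {
  // L2-style step: follow the loss gradient plus regParam * w,
  // which pulls every weight toward zero in proportion to its size.
  def l2Step(w: Array[Double], grad: Array[Double],
             stepSize: Double, regParam: Double): Array[Double] =
    w.zip(grad).map { case (wi, gi) => wi - stepSize * (gi + regParam * wi) }

  // L1-style step: plain gradient step followed by soft thresholding,
  // which subtracts a fixed amount from each weight's magnitude and
  // clips at zero, so small weights become exactly 0.0.
  def l1Step(w: Array[Double], grad: Array[Double],
             stepSize: Double, regParam: Double): Array[Double] = {
    val shrink = stepSize * regParam
    w.zip(grad).map { case (wi, gi) =>
      val v = wi - stepSize * gi
      math.signum(v) * math.max(0.0, math.abs(v) - shrink)
    }
  }
}
```

With a zero gradient, stepSize 1.0 and regParam 0.1, the L1 step maps a weight of 0.05 to exactly 0.0, whereas the L2 step only scales it down to about 0.045. This tendency to produce sparse weight vectors is the usual reason for choosing L1 regularization.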



