
Mahout Distributed Program Development: K-means Clustering

Reading guide:
1. What is cluster analysis?
2. What is the default field delimiter of Mahout's k-means implementation?
3. What are the characteristics of the results the k-means algorithm produces?




1. The k-means clustering algorithm
  Cluster analysis is one of the central problems in data mining and machine learning, with wide applications in data mining, pattern recognition, decision support, machine learning, and image segmentation; it is one of the most important data analysis methods. Clustering searches a given dataset for subsets of similar records: each subset forms a cluster, and records within the same cluster are more similar to one another than to records in other clusters. Clustering algorithms fall roughly into partition-based, hierarchical, density-based, grid-based, and model-based methods.
  k-means is the most widely used partition-based clustering algorithm. It divides n objects into k clusters so that similarity within each cluster is high, where similarity is measured against the mean of the objects in a cluster. It is closely related to the expectation-maximization algorithm for mixtures of Gaussians, in that both try to find the centers of the natural clusters in the data.
  The algorithm first picks k objects at random, each initially representing the mean (center) of a cluster. Every remaining object is assigned to the nearest cluster according to its distance to each cluster center, and then each cluster's mean is recomputed. This process repeats until the criterion function converges.
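  To make this procedure concrete, here is a minimal single-machine sketch of the k-means loop in plain Java. It is an illustration of the textbook algorithm only, not the article's code; the article itself runs Mahout's distributed implementation shown below.

import java.util.Random;

public class KmeansSketch {
    // Clusters n-dimensional points into k groups and returns the final centers.
    public static double[][] kmeans(double[][] points, int k, int maxIter) {
        Random rnd = new Random();

        // Step 1: randomly pick k objects; each initially represents a cluster center.
        double[][] centers = new double[k][];
        for (int i = 0; i < k; i++) {
            centers[i] = points[rnd.nextInt(points.length)].clone();
        }

        int[] assign = new int[points.length];
        for (int iter = 0; iter < maxIter; iter++) {
            // Step 2: assign every point to its nearest center (squared Euclidean distance).
            boolean changed = false;
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = 0;
                    for (int j = 0; j < points[p].length; j++) {
                        double diff = points[p][j] - centers[c][j];
                        d += diff * diff;
                    }
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                if (assign[p] != best) { assign[p] = best; changed = true; }
            }
            if (!changed && iter > 0) break; // converged: no point changed its cluster

            // Step 3: recompute every center as the mean of the points assigned to it.
            for (int c = 0; c < k; c++) {
                double[] sum = new double[points[0].length];
                int n = 0;
                for (int p = 0; p < points.length; p++) {
                    if (assign[p] != c) continue;
                    for (int j = 0; j < sum.length; j++) sum[j] += points[p][j];
                    n++;
                }
                if (n > 0) {
                    for (int j = 0; j < sum.length; j++) centers[c][j] = sum[j] / n;
                }
            }
        }
        return centers;
    }
}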

2. The Mahout development environment

  This continues the previous article, "Mahout Distributed Program Development: Item-Based Collaborative Filtering (ItemCF)". All environment variables and system configuration are the same as in that article.

3. Implementing the k-means clustering algorithm with Mahout
Implementation steps:
  • Prepare the data file: randomData.csv
  • Java program: KmeansHadoop.java
  • Run the program
  • Interpret the clustering results
  • Directories produced on HDFS

1). Prepare the data file: randomData.csv
  The data file randomData.csv was generated in R with its random normal distribution function; for the single-machine, in-memory experiment, see the article "Building a Mahout Project with Maven". Only part of the original data file is shown here (a Java equivalent of the generator is sketched after the listing):
~ vi datafile/randomData.csv

-0.883033363823402 -3.31967192630249
-2.39312626419456 3.34726861118871
2.66976353341256 1.85144276077058
-1.09922906899594 -6.06261735207489
-4.36361936997216 1.90509905380532
-0.00351835125495037 -0.610105996559153
-2.9962958796338 -3.60959839525735
-3.27529418132066 0.0230099799641799
2.17665594420569 6.77290756817957
-2.47862038335637 2.53431833167278
5.53654901906814 2.65089785582474
5.66257474538338 6.86783609641077
-0.558946883114376 1.22332819416237
5.11728525486132 3.74663871584768
1.91240516693351 2.95874731384062
-2.49747101306535 2.05006504756875
3.98781883213459 1.00780938946366
5.47470532716682 5.35084411045171
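  For readers who do not have R installed, a data file of the same shape can be generated in Java. The sketch below is a hypothetical stand-in for the R generator: the three cluster means and the spread are made-up values, not the parameters that produced randomData.csv.

import java.io.PrintWriter;
import java.util.Random;

public class RandomDataGenerator {
    public static void main(String[] args) throws Exception {
        Random rnd = new Random();
        // Three made-up 2-D Gaussian clusters; this plays the role of rnorm() in R.
        double[][] means = { {2.0, 2.0}, {-3.0, 1.0}, {0.0, -3.0} };
        PrintWriter out = new PrintWriter("datafile/randomData.csv");
        for (double[] mean : means) {
            for (int i = 0; i < 333; i++) {
                double x = mean[0] + rnd.nextGaussian() * 1.5;
                double y = mean[1] + rnd.nextGaussian() * 1.5;
                out.println(x + " " + y); // space-separated, as the note below explains
            }
        }
        out.close();
    }
}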
  Note: Mahout's k-means implementation uses a space (" ") as its default field delimiter, so I converted the comma-separated data file into a space-separated one. A throwaway converter is sketched below.
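  A minimal sketch of that conversion, assuming the comma-separated original is kept under a different name (randomData-comma.csv is a hypothetical file name used only for illustration):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.PrintWriter;

public class CommaToSpace {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new FileReader("datafile/randomData-comma.csv"));
        PrintWriter out = new PrintWriter(new FileWriter("datafile/randomData.csv"));
        String line;
        while ((line = in.readLine()) != null) {
            out.println(line.replace(',', ' ')); // comma -> space, one point per line
        }
        in.close();
        out.close();
    }
}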
2). Java program: KmeansHadoop.java
  For the k-means algorithm implementation itself, see Mahout in Action.

(Figure: mahout-kmeans-process.png, the Mahout k-means processing flow)


package org.conan.mymahout.cluster08;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.mahout.clustering.conversion.InputDriver;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.clustering.kmeans.RandomSeedGenerator;
import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.utils.clustering.ClusterDumper;
import org.conan.mymahout.hdfs.HdfsDAO;
import org.conan.mymahout.recommendation.ItemCFHadoop;

public class KmeansHadoop {
    private static final String HDFS = "hdfs://192.168.1.210:9000";

    public static void main(String[] args) throws Exception {
        String localFile = "datafile/randomData.csv";
        String inPath = HDFS + "/user/hdfs/mix_data";
        String seqFile = inPath + "/seqfile";
        String seeds = inPath + "/seeds";
        String outPath = inPath + "/result/";
        String clusteredPoints = outPath + "/clusteredPoints";

        JobConf conf = config();

        // Reset the HDFS working directory and upload the data file.
        HdfsDAO hdfs = new HdfsDAO(HDFS, conf);
        hdfs.rmr(inPath);
        hdfs.mkdirs(inPath);
        hdfs.copyFile(localFile, inPath);
        hdfs.ls(inPath);

        // Convert the raw text data into Mahout sequence files of VectorWritable.
        InputDriver.runJob(new Path(inPath), new Path(seqFile), "org.apache.mahout.math.RandomAccessSparseVector");

        // Randomly pick k = 3 seed points as the initial cluster centers.
        int k = 3;
        Path seqFilePath = new Path(seqFile);
        Path clustersSeeds = new Path(seeds);
        DistanceMeasure measure = new EuclideanDistanceMeasure();
        clustersSeeds = RandomSeedGenerator.buildRandom(conf, seqFilePath, clustersSeeds, k, measure);

        // Run the k-means MapReduce iterations (convergence delta 0.01, at most 10 iterations).
        KMeansDriver.run(conf, seqFilePath, clustersSeeds, new Path(outPath), measure, 0.01, 10, true, 0.01, false);

        // Dump the final clusters and the clustered points to the console.
        Path outGlobPath = new Path(outPath, "clusters-*-final");
        Path clusteredPointsPath = new Path(clusteredPoints);
        System.out.printf("Dumping out clusters from clusters: %s and clusteredPoints: %s\n", outGlobPath, clusteredPointsPath);
        ClusterDumper clusterDumper = new ClusterDumper(outGlobPath, clusteredPointsPath);
        clusterDumper.printClusters(null);
    }

    public static JobConf config() {
        JobConf conf = new JobConf(ItemCFHadoop.class);
        conf.setJobName("ItemCFHadoop");
        conf.addResource("classpath:/hadoop/core-site.xml");
        conf.addResource("classpath:/hadoop/hdfs-site.xml");
        conf.addResource("classpath:/hadoop/mapred-site.xml");
        return conf;
    }
}
3). Run the program
Console output:
Delete: hdfs://192.168.1.210:9000/user/hdfs/mix_data
Create: hdfs://192.168.1.210:9000/user/hdfs/mix_data
copy from: datafile/randomData.csv to hdfs://192.168.1.210:9000/user/hdfs/mix_data
ls: hdfs://192.168.1.210:9000/user/hdfs/mix_data
==========================================================
name: hdfs://192.168.1.210:9000/user/hdfs/mix_data/randomData.csv, folder: false, size: 36655
==========================================================
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
2013-10-14 15:39:31 org.apache.hadoop.util.NativeCodeLoader
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-10-14 15:39:31 org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
WARNING: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2013-10-14 15:39:31 org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
2013-10-14 15:39:31 org.apache.hadoop.io.compress.snappy.LoadSnappy
WARNING: Snappy native library not loaded
2013-10-14 15:39:31 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Running job: job_local_0001
2013-10-14 15:39:31 org.apache.hadoop.mapred.Task initialize
INFO: Using ResourceCalculatorPlugin : null
2013-10-14 15:39:31 org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:31 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
2013-10-14 15:39:31 org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0001_m_000000_0 is allowed to commit now
2013-10-14 15:39:31 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0001_m_000000_0' to hdfs://192.168.1.210:9000/user/hdfs/mix_data/seqfile
2013-10-14 15:39:31 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
2013-10-14 15:39:31 org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0001_m_000000_0' done.
2013-10-14 15:39:32 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: map 100% reduce 0%
2013-10-14 15:39:32 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Job complete: job_local_0001
......
2013-10-14 15:39:41 org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Down to the last merge-pass, with 1 segments left of total size: 677 bytes
2013-10-14 15:39:41 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
2013-10-14 15:39:41 org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0009_r_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:41 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
2013-10-14 15:39:41 org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0009_r_000000_0 is allowed to commit now
2013-10-14 15:39:41 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0009_r_000000_0' to hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-8
2013-10-14 15:39:41 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: reduce > reduce
2013-10-14 15:39:41 org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0009_r_000000_0' done.
2013-10-14 15:39:42 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: map 100% reduce 100%
2013-10-14 15:39:42 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Job complete: job_local_0009
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: Counters: 19
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: File Output Format Counters
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: Bytes Written=695
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: FileSystemCounters
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: FILE_BYTES_READ=27256775
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: HDFS_BYTES_READ=673669
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: FILE_BYTES_WRITTEN=28569192
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: HDFS_BYTES_WRITTEN=152767
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: File Input Format Counters
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: Bytes Read=31390
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: Map-Reduce Framework
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: Map output materialized bytes=681
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: Map input records=1000
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: Reduce shuffle bytes=0
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: Spilled Records=6
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: Map output bytes=666
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: Total committed heap usage (bytes)=1772093440
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: SPLIT_RAW_BYTES=130
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: Combine input records=0
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: Reduce input records=3
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: Reduce input groups=3
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: Combine output records=0
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: Reduce output records=3
2013-10-14 15:39:42 org.apache.hadoop.mapred.Counters log
INFO: Map output records=3
2013-10-14 15:39:42 org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
WARNING: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2013-10-14 15:39:42 org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
2013-10-14 15:39:42 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Running job: job_local_0010
2013-10-14 15:39:42 org.apache.hadoop.mapred.Task initialize
INFO: Using ResourceCalculatorPlugin : null
2013-10-14 15:39:42 org.apache.hadoop.mapred.MapTask$MapOutputBuffer
INFO: io.sort.mb = 100
2013-10-14 15:39:42 org.apache.hadoop.mapred.MapTask$MapOutputBuffer
INFO: data buffer = 79691776/99614720
2013-10-14 15:39:42 org.apache.hadoop.mapred.MapTask$MapOutputBuffer
INFO: record buffer = 262144/327680
2013-10-14 15:39:42 org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
INFO: Starting flush of map output
2013-10-14 15:39:42 org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill
INFO: Finished spill 0
2013-10-14 15:39:42 org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0010_m_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:42 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
2013-10-14 15:39:42 org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0010_m_000000_0' done.
2013-10-14 15:39:42 org.apache.hadoop.mapred.Task initialize
INFO: Using ResourceCalculatorPlugin : null
2013-10-14 15:39:42 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
2013-10-14 15:39:42 org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Merging 1 sorted segments
2013-10-14 15:39:42 org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Down to the last merge-pass, with 1 segments left of total size: 677 bytes
2013-10-14 15:39:42 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
2013-10-14 15:39:42 org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0010_r_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:42 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
2013-10-14 15:39:42 org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0010_r_000000_0 is allowed to commit now
2013-10-14 15:39:42 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0010_r_000000_0' to hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-9
2013-10-14 15:39:42 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: reduce > reduce
2013-10-14 15:39:42 org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0010_r_000000_0' done.
2013-10-14 15:39:43 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: map 100% reduce 100%
2013-10-14 15:39:43 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Job complete: job_local_0010
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: Counters: 19
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: File Output Format Counters
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: Bytes Written=695
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: FileSystemCounters
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: FILE_BYTES_READ=30544993
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: HDFS_BYTES_READ=741007
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: FILE_BYTES_WRITTEN=32013760
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: HDFS_BYTES_WRITTEN=154545
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: File Input Format Counters
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: Bytes Read=31390
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: Map-Reduce Framework
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: Map output materialized bytes=681
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: Map input records=1000
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: Reduce shuffle bytes=0
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: Spilled Records=6
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: Map output bytes=666
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: Total committed heap usage (bytes)=1966735360
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: SPLIT_RAW_BYTES=130
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: Combine input records=0
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: Reduce input records=3
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: Reduce input groups=3
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: Combine output records=0
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: Reduce output records=3
2013-10-14 15:39:43 org.apache.hadoop.mapred.Counters log
INFO: Map output records=3
2013-10-14 15:39:43 org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
WARNING: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2013-10-14 15:39:43 org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
2013-10-14 15:39:43 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Running job: job_local_0011
2013-10-14 15:39:43 org.apache.hadoop.mapred.Task initialize
INFO: Using ResourceCalculatorPlugin : null
2013-10-14 15:39:43 org.apache.hadoop.mapred.MapTask$MapOutputBuffer
INFO: io.sort.mb = 100
2013-10-14 15:39:43 org.apache.hadoop.mapred.MapTask$MapOutputBuffer
INFO: data buffer = 79691776/99614720
2013-10-14 15:39:43 org.apache.hadoop.mapred.MapTask$MapOutputBuffer
INFO: record buffer = 262144/327680
2013-10-14 15:39:43 org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
INFO: Starting flush of map output
2013-10-14 15:39:43 org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill
INFO: Finished spill 0
2013-10-14 15:39:43 org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0011_m_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:43 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
2013-10-14 15:39:43 org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0011_m_000000_0' done.
2013-10-14 15:39:43 org.apache.hadoop.mapred.Task initialize
INFO: Using ResourceCalculatorPlugin : null
2013-10-14 15:39:43 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
2013-10-14 15:39:43 org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Merging 1 sorted segments
2013-10-14 15:39:43 org.apache.hadoop.mapred.Merger$MergeQueue merge
INFO: Down to the last merge-pass, with 1 segments left of total size: 677 bytes
2013-10-14 15:39:43 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
2013-10-14 15:39:43 org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0011_r_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:43 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
2013-10-14 15:39:43 org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0011_r_000000_0 is allowed to commit now
2013-10-14 15:39:43 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0011_r_000000_0' to hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-10
2013-10-14 15:39:43 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO: reduce > reduce
2013-10-14 15:39:43 org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0011_r_000000_0' done.
2013-10-14 15:39:44 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: map 100% reduce 100%
2013-10-14 15:39:44 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Job complete: job_local_0011
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: Counters: 19
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: File Output Format Counters
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: Bytes Written=695
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: FileSystemCounters
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: FILE_BYTES_READ=33833211
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: HDFS_BYTES_READ=808345
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: FILE_BYTES_WRITTEN=35458320
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: HDFS_BYTES_WRITTEN=156323
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: File Input Format Counters
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: Bytes Read=31390
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: Map-Reduce Framework
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: Map output materialized bytes=681
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: Map input records=1000
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: Reduce shuffle bytes=0
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: Spilled Records=6
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: Map output bytes=666
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: Total committed heap usage (bytes)=2166095872
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: SPLIT_RAW_BYTES=130
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: Combine input records=0
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: Reduce input records=3
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: Reduce input groups=3
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: Combine output records=0
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: Reduce output records=3
2013-10-14 15:39:44 org.apache.hadoop.mapred.Counters log
INFO: Map output records=3
2013-10-14 15:39:44 org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
WARNING: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2013-10-14 15:39:44 org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
INFO: Total input paths to process : 1
2013-10-14 15:39:44 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Running job: job_local_0012
2013-10-14 15:39:44 org.apache.hadoop.mapred.Task initialize
INFO: Using ResourceCalculatorPlugin : null
2013-10-14 15:39:44 org.apache.hadoop.mapred.Task done
INFO: Task:attempt_local_0012_m_000000_0 is done. And is in the process of commiting
2013-10-14 15:39:44 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
2013-10-14 15:39:44 org.apache.hadoop.mapred.Task commit
INFO: Task attempt_local_0012_m_000000_0 is allowed to commit now
2013-10-14 15:39:44 org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
INFO: Saved output of task 'attempt_local_0012_m_000000_0' to hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusteredPoints
2013-10-14 15:39:44 org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
INFO:
2013-10-14 15:39:44 org.apache.hadoop.mapred.Task sendDone
INFO: Task 'attempt_local_0012_m_000000_0' done.
2013-10-14 15:39:45 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: map 100% reduce 0%
2013-10-14 15:39:45 org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Job complete: job_local_0012
2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
INFO: Counters: 11
2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
INFO: File Output Format Counters
2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
INFO: Bytes Written=41520
2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
INFO: File Input Format Counters
2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
INFO: Bytes Read=31390
2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
INFO: FileSystemCounters
2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
INFO: FILE_BYTES_READ=18560374
2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
INFO: HDFS_BYTES_READ=437203
2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
INFO: FILE_BYTES_WRITTEN=19450325
2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
INFO: HDFS_BYTES_WRITTEN=120417
2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
INFO: Map-Reduce Framework
2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
INFO: Map input records=1000
2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
INFO: Spilled Records=0
2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
INFO: Total committed heap usage (bytes)=1083047936
2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
INFO: SPLIT_RAW_BYTES=130
2013-10-14 15:39:45 org.apache.hadoop.mapred.Counters log
INFO: Map output records=1000
Dumping out clusters from clusters: hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-*-final and clusteredPoints: hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusteredPoints
CL-552{n=443 c=[1.631, -0.412] r=[1.563, 1.407]}
Weight : [props - optional]: Point:
1.0: [-2.393, 3.347]
1.0: [-4.364, 1.905]
1.0: [-3.275, 0.023]
1.0: [-2.479, 2.534]
1.0: [-0.559, 1.223]
...
CL-847{n=77 c=[-2.953, -0.971] r=[1.767, 2.189]}
Weight : [props - optional]: Point:
1.0: [-0.883, -3.320]
1.0: [-1.099, -6.063]
1.0: [-0.004, -0.610]
1.0: [-2.996, -3.610]
1.0: [3.988, 1.008]
...
CL-823{n=480 c=[0.219, 2.600] r=[1.479, 1.385]}
Weight : [props - optional]: Point:
1.0: [2.670, 1.851]
1.0: [2.177, 6.773]
1.0: [5.537, 2.651]
1.0: [5.663, 6.868]
1.0: [5.117, 3.747]
1.0: [1.912, 2.959]
...
4). Interpreting the clustering results
We can break the log above into 3 parts for interpretation:
  • environment initialization
  • algorithm execution
  • printing the clustering results
a. Environment initialization
Create the HDFS data and working directories, and upload the data file.
Delete: hdfs://192.168.1.210:9000/user/hdfs/mix_data
Create: hdfs://192.168.1.210:9000/user/hdfs/mix_data
copy from: datafile/randomData.csv to hdfs://192.168.1.210:9000/user/hdfs/mix_data
ls: hdfs://192.168.1.210:9000/user/hdfs/mix_data
==========================================================
name: hdfs://192.168.1.210:9000/user/hdfs/mix_data/randomData.csv, folder: false, size: 36655
b. Algorithm execution
The algorithm executes in 3 steps:
  • Convert the raw data randomData.csv into Mahout sequence files of VectorWritable.
  • Randomly select 3 k-means centers as the initial clusters.
  • Run the MapReduce computation for the configured number of iterations.
1): Convert the raw data randomData.csv into Mahout sequence files of VectorWritable. Source code:
InputDriver.runJob(new Path(inPath), new Path(seqFile), "org.apache.mahout.math.RandomAccessSparseVector");

Log output:
Job complete: job_local_0001
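
To check what InputDriver produced, the sequence file can be read back directly. The sketch below assumes the file holds <Text, VectorWritable> pairs and that the part file is named part-m-00000 (the name that shows up in the HDFS listing later in this article); it uses the old Hadoop 1.x SequenceFile.Reader constructor to match the environment of this series.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.VectorWritable;

public class SeqFileInspector {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://192.168.1.210:9000"), conf);
        Path path = new Path("/user/hdfs/mix_data/seqfile/part-m-00000");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        Text key = new Text();
        VectorWritable value = new VectorWritable();
        while (reader.next(key, value)) {
            // Print each input vector in Mahout's text format.
            System.out.println(key + " => " + value.get().asFormatString());
        }
        reader.close();
    }
}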
2): Randomly select the 3 k-means centers as the initial clusters. Source code:
int k = 3;
Path seqFilePath = new Path(seqFile);
Path clustersSeeds = new Path(seeds);
DistanceMeasure measure = new EuclideanDistanceMeasure();
clustersSeeds = RandomSeedGenerator.buildRandom(conf, seqFilePath, clustersSeeds, k, measure);
Log output:
Job complete: job_local_0002
3): Run the MapReduce computation for the configured number of iterations.
Source code:
// Arguments, per this Mahout version's signature: convergence delta 0.01, at most
// 10 iterations, run the clustering step after convergence, classification
// threshold 0.01, and runSequential=false (i.e. run as MapReduce).
KMeansDriver.run(conf, seqFilePath, clustersSeeds, new Path(outPath), measure, 0.01, 10, true, 0.01, false);
Log output:
Job complete: job_local_0003
Job complete: job_local_0004
Job complete: job_local_0005
Job complete: job_local_0006
Job complete: job_local_0007
Job complete: job_local_0008
Job complete: job_local_0009
Job complete: job_local_0010
Job complete: job_local_0011
Job complete: job_local_0012
c. Printing the clustering results
Dumping out clusters from clusters: hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusters-*-final and clusteredPoints: hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusteredPoints
CL-552{n=443 c=[1.631, -0.412] r=[1.563, 1.407]}
CL-847{n=77 c=[-2.953, -0.971] r=[1.767, 2.189]}
CL-823{n=480 c=[0.219, 2.600] r=[1.479, 1.385]}
Result: there are 3 centers (in the dump, n is the number of points in the cluster, c the center coordinates, and r the per-dimension radius):
Cluster 1 contains 443 points, center [1.631, -0.412]
Cluster 2 contains 77 points, center [-2.953, -0.971]
Cluster 3 contains 480 points, center [0.219, 2.600]
5). HDFS产生的目录
  1. # 根目录
  2. ~ hadoop fs -ls /user/hdfs/mix_data
  3. Found 4 items
  4. -rw-r--r-- 3 Administrator supergroup 36655 2013-10-04 15:31 /user/hdfs/mix_data/randomData.csv
  5. drwxr-xr-x - Administrator supergroup 0 2013-10-04 15:31 /user/hdfs/mix_data/result
  6. drwxr-xr-x - Administrator supergroup 0 2013-10-04 15:31 /user/hdfs/mix_data/seeds
  7. drwxr-xr-x - Administrator supergroup 0 2013-10-04 15:31 /user/hdfs/mix_data/seqfile
  8. # 输出目录
  9. ~ hadoop fs -ls /user/hdfs/mix_data/result
  10. Found 13 items
  11. -rw-r--r-- 3 Administrator supergroup 194 2013-10-04 15:31 /user/hdfs/mix_data/result/_policy
  12. drwxr-xr-x - Administrator supergroup 0 2013-10-04 15:31 /user/hdfs/mix_data/result/clusteredPoints
  13. drwxr-xr-x - Administrator supergroup 0 2013-10-04 15:31 /user/hdfs/mix_data/result/clusters-0
  14. drwxr-xr-x - Administrator supergroup 0 2013-10-04 15:31 /user/hdfs/mix_data/result/clusters-1
  15. drwxr-xr-x - Administrator supergroup 0 2013-10-04 15:31 /user/hdfs/mix_data/result/clusters-10-final
  16. drwxr-xr-x - Administrator supergroup 0 2013-10-04 15:31 /user/hdfs/mix_data/result/clusters-2
  17. drwxr-xr-x - Administrator supergroup 0 2013-10-04 15:31 /user/hdfs/mix_data/result/clusters-3
  18. drwxr-xr-x - Administrator supergroup 0 2013-10-04 15:31 /user/hdfs/mix_data/result/clusters-4
  19. drwxr-xr-x - Administrator supergroup 0 2013-10-04 15:31 /user/hdfs/mix_data/result/clusters-5
  20. drwxr-xr-x - Administrator supergroup 0 2013-10-04 15:31 /user/hdfs/mix_data/result/clusters-6
  21. drwxr-xr-x - Administrator supergroup 0 2013-10-04 15:31 /user/hdfs/mix_data/result/clusters-7
  22. drwxr-xr-x - Administrator supergroup 0 2013-10-04 15:31 /user/hdfs/mix_data/result/clusters-8
  23. drwxr-xr-x - Administrator supergroup 0 2013-10-04 15:31 /user/hdfs/mix_data/result/clusters-9
  24. # 产生的随机中心种子目录
  25. ~ hadoop fs -ls /user/hdfs/mix_data/seeds
  26. Found 1 items
  27. -rw-r--r-- 3 Administrator supergroup 599 2013-10-04 15:31 /user/hdfs/mix_data/seeds/part-randomSeed
  28. # 输入文件换成Mahout格式文件的目录
  29. ~ hadoop fs -ls /user/hdfs/mix_data/seqfile
  30. Found 2 items
  31. -rw-r--r-- 3 Administrator supergroup 0 2013-10-04 15:31 /user/hdfs/mix_data/seqfile/_SUCCESS
  32. -rw-r--r-- 3 Administrator supergroup 31390 2013-10-04 15:31 /user/hdfs/mix_data/seqfile/part-m-00000
复制代码
4. Visualizing the results with R
  Save the clustered points into separate cluster*.csv files, then plot them with R.
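  The article does not show that export step, so here is a hypothetical sketch of it. It assumes the clusteredPoints files store <IntWritable clusterId, WeightedVectorWritable point> pairs, as in the Mahout 0.7/0.8 line; if your Mahout version writes WeightedPropertyVectorWritable instead, substitute that class.

import java.io.FileWriter;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.clustering.classify.WeightedVectorWritable;
import org.apache.mahout.math.Vector;

public class ClusterPointsExporter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("hdfs://192.168.1.210:9000/user/hdfs/mix_data/result/clusteredPoints/part-m-00000");
        FileSystem fs = FileSystem.get(path.toUri(), conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        IntWritable clusterId = new IntWritable();
        WeightedVectorWritable point = new WeightedVectorWritable();
        Map<Integer, FileWriter> writers = new HashMap<Integer, FileWriter>();
        while (reader.next(clusterId, point)) {
            FileWriter w = writers.get(clusterId.get());
            if (w == null) {
                // One CSV per cluster: cluster1.csv, cluster2.csv, cluster3.csv ...
                w = new FileWriter("cluster" + (writers.size() + 1) + ".csv");
                writers.put(clusterId.get(), w);
            }
            Vector v = point.getVector();
            w.write(v.get(0) + "," + v.get(1) + "\n");
        }
        reader.close();
        for (FileWriter w : writers.values()) {
            w.close();
        }
    }
}

  With cluster1.csv, cluster2.csv, and cluster3.csv in place, the R script below draws the scatter plot.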
c1<-read.csv(file="cluster1.csv",sep=",",header=FALSE)
c2<-read.csv(file="cluster2.csv",sep=",",header=FALSE)
c3<-read.csv(file="cluster3.csv",sep=",",header=FALSE)
y<-rbind(c1,c2,c3)                                        # stack the three clusters
cols<-c(rep(1,nrow(c1)),rep(2,nrow(c2)),rep(3,nrow(c3)))
plot(y, col=c("black","blue","green")[cols])              # one color per cluster
center<-matrix(c(1.631, -0.412,-2.953, -0.971,0.219, 2.600),ncol=2,byrow=TRUE)
points(center, col="violetred", pch = 19)                 # the 3 k-means centers

(Figure: kmeans.png, the clustered points and centers plotted in R)
In the figure, the hollow points in three colors (black, blue, and green) are the original data, and the 3 solid violet points are the centers produced by Mahout's k-means run.
  Comparing this with the k-means classification and centers computed in R in the article "Building a Mahout Project with Maven", the results are not quite the same. To summarize briefly: the results k-means produces depend on the distance measure, the convergence threshold, the initial centers, and the number of iterations, so different settings yield different clusterings.
  Consequently, k-means generally gives only a rough classification standard. That standard is very helpful for getting to know an unfamiliar dataset, but it cannot serve as a precise measure of the data.












Comments (3)

hahaxixi, 2014-10-29 09:34:57:
Learned a lot, nice and very detailed!

永无止进, 2014-10-29 11:03:31:
Learned something, nice.
