
RHadoop Installation and Usage

Reading guide:
1. How do you set up an RHadoop development environment?
2. How does setting up RHadoop differ from setting up Hadoop itself?
3. How do you run an rmr2 job?
4. What are the differences between hadoop commands and the corresponding RHadoop functions?



Environment Preparation

First, prepare the environment. I chose the 64-bit version of Ubuntu 12.04; you can pick whichever Linux distribution you are most comfortable with.


However, the JDK must be Oracle's official release, downloaded from the Oracle website; the OpenJDK that ships with the operating system causes various incompatibilities. Use a JDK 1.6.x release, since JDK 1.7 also has compatibility problems.
http://www.oracle.com/technetwork/java/javase/downloads/index.html
I assume you already know how to install Hadoop itself. For R, install version 2.15 or later; 2.14 cannot support RHadoop.
If you are also on Ubuntu 12.04, update the package sources first, otherwise only R 2.14 is available.

1. Operating system: Ubuntu 12.04 x64
  ~ uname -a
  Linux domU-00-16-3e-00-00-85 3.2.0-23-generic #36-Ubuntu SMP Tue Apr 10 20:39:51 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

2. Java environment
  ~ java -version
  java version "1.6.0_29"
  Java(TM) SE Runtime Environment (build 1.6.0_29-b11)
  Java HotSpot(TM) 64-Bit Server VM (build 20.4-b02, mixed mode)

3. Hadoop environment (only Hadoop itself is needed here)
  hadoop-1.0.3  hbase-0.94.2  hive-0.9.0  pig-0.10.0  sqoop-1.4.2  thrift-0.8.0  zookeeper-3.4.4

4. R environment
  R version 2.15.3 (2013-03-01) -- "Security Blanket"
  Copyright (C) 2013 The R Foundation for Statistical Computing
  ISBN 3-900051-07-0
  Platform: x86_64-pc-linux-gnu (64-bit)

4.1 On Ubuntu 12.04, update the package sources first and then install R 2.15.3
  sh -c "echo deb http://mirror.bjtu.edu.cn/cran/bin/linux/ubuntu precise/ >>/etc/apt/sources.list"
  apt-get update
  apt-get install r-base

RHadoop Installation

RHadoop is a project from Revolution Analytics; the open-source code is available on GitHub. RHadoop consists of three R packages (rmr, rhdfs, rhbase), which correspond to the MapReduce, HDFS, and HBase parts of the Hadoop stack. Since these three packages are not on CRAN, they have to be downloaded separately.
https://github.com/RevolutionAnalytics/RHadoop/wiki


Next we need to install the dependencies of these three packages.
The first is rJava. With JDK 1.6 configured, run R CMD javareconf so that R picks up the Java configuration from the system environment. Then start R and install rJava with install.packages.


After that, install the remaining dependencies: reshape2, Rcpp, iterators, itertools, digest, RJSONIO, and functional. All of them can be installed directly with install.packages.


Next, install the rhdfs package. Add the HADOOP_CMD and HADOOP_STREAMING variables to the environment; export works for the current shell session, but for later convenience it is better to add them to the system-wide /etc/environment file. Then install the rhdfs package with R CMD INSTALL, which should complete without problems.


Installing the rmr package with R CMD INSTALL also goes through cleanly.

Finally, let's check which packages RHadoop installed.
Because my disk is external and the R library directory is mounted with mount plus a symbolic link (ln -s), my R libraries live under /disk1/system:
/disk1/system/usr/local/lib/R/site-library/
The usual R library directory is /usr/lib/R/site-library or /usr/local/lib/R/site-library; you can also run whereis R to find where R is installed on your own machine.
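If you are not sure where packages end up on your own machine, a quick check from inside R (base functions only) is:

  > .libPaths()        # the library directories R searches and installs into
  > Sys.which("R")     # full path of the R executable, similar to `whereis R`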


1. Download the three RHadoop packages


https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads
  rmr-2.1.0
  rhdfs-1.0.5
  rhbase-1.1

2. Copy the packages to the /root/R directory
  ~/R# pwd
  /root/R
  ~/R# ls
  rhbase_1.1.tar.gz  rhdfs_1.0.5.tar.gz  rmr2_2.1.0.tar.gz

3. Install the dependency packages


Run from the command line:
  ~ R CMD javareconf
  ~ R

Then, inside the R session:
  1. install.packages("rJava")
  2. install.packages("reshape2")
  3. install.packages("Rcpp")
  4. install.packages("iterators")
  5. install.packages("itertools")
  6. install.packages("digest")
  7. install.packages("RJSONIO")
  8. install.packages("functional")
复制代码
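As a small extra sketch that is not part of the original steps, you can install only the dependencies that are still missing and then load them all in one go, so a failed install shows up before you build the RHadoop packages:

  deps <- c("rJava", "reshape2", "Rcpp", "iterators", "itertools",
            "digest", "RJSONIO", "functional")
  # install whatever is not present yet
  for (p in deps) {
    if (!p %in% rownames(installed.packages())) install.packages(p)
  }
  # load them all; every element of the result should be TRUE
  sapply(deps, require, character.only = TRUE)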

4. Install the rhdfs package
  ~ export HADOOP_CMD=/root/hadoop/hadoop-1.0.3/bin/hadoop
  ~ export HADOOP_STREAMING=/root/hadoop/hadoop-1.0.3/contrib/streaming/hadoop-streaming-1.0.3.jar   # HADOOP_STREAMING is used by rmr2
  ~ R CMD INSTALL /root/R/rhdfs_1.0.5.tar.gz

4.1 It is best to add HADOOP_CMD to the system environment variables
  ~ vi /etc/environment
      HADOOP_CMD=/root/hadoop/hadoop-1.0.3/bin/hadoop
      HADOOP_STREAMING=/root/hadoop/hadoop-1.0.3/contrib/streaming/hadoop-streaming-1.0.3.jar
  . /etc/environment
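Editing /etc/environment affects the shell, which is what R CMD INSTALL and the streaming jobs need. If you only want to experiment inside an already running R session, the same two variables can also be set there with base R's Sys.setenv(); this is just an optional sketch using the same paths as above:

  Sys.setenv(HADOOP_CMD = "/root/hadoop/hadoop-1.0.3/bin/hadoop",
             HADOOP_STREAMING = "/root/hadoop/hadoop-1.0.3/contrib/streaming/hadoop-streaming-1.0.3.jar")
  Sys.getenv("HADOOP_CMD")   # verify the value was picked up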

5. Install the rmr package
  ~ R CMD INSTALL rmr2_2.1.0.tar.gz

6. All installed packages
  ~ ls /disk1/system/usr/local/lib/R/site-library/
  digest  functional  iterators  itertools  plyr  Rcpp  reshape2  rhdfs  rJava  RJSONIO  rmr2  stringr

RHadoop Usage Examples


Explanation:


With the rhdfs and rmr packages installed, we can start using R to try a few Hadoop operations.


First, basic HDFS file operations.


List an HDFS directory
hadoop command: hadoop fs -ls /user
R function: hdfs.ls("/user/")


View a data file on HDFS
hadoop command: hadoop fs -cat /user/hdfs/o_same_school/part-m-00000
R function: hdfs.cat("/user/hdfs/o_same_school/part-m-00000")
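rhdfs also wraps the usual file-transfer commands. The sketch below uses hypothetical paths (adjust them to your own cluster) and assumes the standard rhdfs wrappers hdfs.mkdir(), hdfs.put() and hdfs.get():

  > hdfs.mkdir("/user/hdfs/test")                    # like: hadoop fs -mkdir /user/hdfs/test
  > hdfs.put("/tmp/local.txt", "/user/hdfs/test/")   # like: hadoop fs -put /tmp/local.txt /user/hdfs/test/
  > hdfs.get("/user/hdfs/test/local.txt", "/tmp/")   # like: hadoop fs -get /user/hdfs/test/local.txt /tmp/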


Next, let's run an rmr MapReduce job.


The plain R version:
  > small.ints = 1:10
  > sapply(small.ints, function(x) x^2)

The MapReduce version in R:
  > small.ints = to.dfs(1:10)
  > mapreduce(input = small.ints, map = function(k, v) cbind(v, v^2))
  > from.dfs("/tmp/RtmpWnzxl4/file5deb791fcbd5")

Because MapReduce can only read from the HDFS file system, we first store the data there with to.dfs. The result of the MapReduce computation is then fetched back from HDFS with from.dfs.
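The /tmp/RtmpWnzxl4/... path above is simply the temporary output directory that mapreduce() created in my session; it will be different every time. Instead of copying the path by hand, you can pass the return value of mapreduce() straight to from.dfs(). A minimal equivalent sketch:

  > small.ints = to.dfs(1:10)
  > out = mapreduce(input = small.ints, map = function(k, v) cbind(v, v^2))
  > from.dfs(out)   # no need to hard-code the temporary HDFS path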


The second rmr example is wordcount, which counts the words in a file:
  > input <- '/user/hdfs/o_same_school/part-m-00000'
  > wordcount = function(input, output = NULL, pattern = " ") {
        wc.map = function(., lines) {
            keyval(unlist(strsplit(x = lines, split = pattern)), 1)
        }
        wc.reduce = function(word, counts) {
            keyval(word, sum(counts))
        }
        mapreduce(input = input, output = output, input.format = "text",
                  map = wc.map, reduce = wc.reduce, combine = T)
    }
  > wordcount(input)
  > from.dfs("/tmp/RtmpfZUFEa/file6cac626aa4a7")

I placed the data file /user/hdfs/o_same_school/part-m-00000 on HDFS beforehand. We define the wordcount MapReduce function, run wordcount, and finally fetch the result from HDFS with from.dfs.
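The list that from.dfs() returns can be turned into a regular data frame with rmr2's keys() and values() helpers, which makes sorting the counts easier. A small sketch along those lines (the variable names are mine):

  > out <- from.dfs(wordcount(input))
  > wc <- data.frame(word = keys(out), count = values(out), stringsAsFactors = FALSE)
  > head(wc[order(-wc$count), ])   # most frequent tokens first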


Code:


1. Using the rhdfs package


Start R:
  > library(rhdfs)
  Loading required package: rJava
  HADOOP_CMD=/root/hadoop/hadoop-1.0.3/bin/hadoop
  Be sure to run hdfs.init()
  > hdfs.init()

1.1 List a Hadoop directory from the command line
  ~ hadoop fs -ls /user
  Found 4 items
  drwxr-xr-x   - root supergroup          0 2013-02-01 12:15 /user/conan
  drwxr-xr-x   - root supergroup          0 2013-03-06 17:24 /user/hdfs
  drwxr-xr-x   - root supergroup          0 2013-02-26 16:51 /user/hive
  drwxr-xr-x   - root supergroup          0 2013-03-06 17:21 /user/root

1.2 List the same directory with rhdfs
  1. > hdfs.ls("/user/")
  2.   permission owner      group size          modtime        file
  3. 1 drwxr-xr-x  root supergroup    0 2013-02-01 12:15 /user/conan
  4. 2 drwxr-xr-x  root supergroup    0 2013-03-06 17:24  /user/hdfs
  5. 3 drwxr-xr-x  root supergroup    0 2013-02-26 16:51  /user/hive
  6. 4 drwxr-xr-x  root supergroup    0 2013-03-06 17:21  /user/root
复制代码

1.3 View a data file on HDFS from the command line
  ~ hadoop fs -cat /user/hdfs/o_same_school/part-m-00000
  10,3,tsinghua university,2004-05-26 15:21:00.0
  23,4007,北京第一七一中学,2004-05-31 06:51:53.0
  51,4016,大连理工大学,2004-05-27 09:38:31.0
  89,4017,Amherst College,2004-06-01 16:18:56.0
  92,4017,斯坦福大学,2012-11-28 10:33:25.0
  99,4017,Stanford University Graduate School of Business,2013-02-19 12:17:15.0
  113,4017,Stanford University,2013-02-19 12:17:15.0
  123,4019,St Paul's Co-educational College - Hong Kong,2004-05-27 18:04:17.0
  138,4019,香港苏浙小学,2004-05-27 18:59:58.0
  172,4020,University,2004-05-27 19:14:34.0
  182,4026,ff,2004-05-28 04:42:37.0
  183,4026,ff,2004-05-28 04:42:37.0
  189,4033,tsinghua,2011-09-14 12:00:38.0
  195,4035,ba,2004-05-31 07:10:24.0
  196,4035,ma,2004-05-31 07:10:24.0
  197,4035,southampton university,2013-01-07 15:35:18.0
  246,4067,美国史丹佛大学,2004-06-12 10:42:10.0
  254,4067,美国史丹佛大学,2004-06-12 10:42:10.0
  255,4067,美国休士顿大学,2004-06-12 10:42:10.0
  257,4068,清华大学,2004-06-12 10:42:10.0
  258,4068,北京八中,2004-06-12 17:34:02.0
  262,4068,香港中文大学,2004-06-12 17:34:02.0
  310,4070,首都师范大学初等教育学院,2004-06-14 15:35:52.0
  312,4070,北京师范大学经济学院,2004-06-14 15:35:52.0

1.4 View the same data file with rhdfs

  > hdfs.cat("/user/hdfs/o_same_school/part-m-00000")
  [1] "10,3,tsinghua university,2004-05-26 15:21:00.0"
  [2] "23,4007,北京第一七一中学,2004-05-31 06:51:53.0"
  [3] "51,4016,大连理工大学,2004-05-27 09:38:31.0"
  [4] "89,4017,Amherst College,2004-06-01 16:18:56.0"
  [5] "92,4017,斯坦福大学,2012-11-28 10:33:25.0"
  [6] "99,4017,Stanford University Graduate School of Business,2013-02-19 12:17:15.0"
  [7] "113,4017,Stanford University,2013-02-19 12:17:15.0"
  [8] "123,4019,St Paul's Co-educational College - Hong Kong,2004-05-27 18:04:17.0"
  [9] "138,4019,香港苏浙小学,2004-05-27 18:59:58.0"
  [10] "172,4020,University,2004-05-27 19:14:34.0"
  [11] "182,4026,ff,2004-05-28 04:42:37.0"
  [12] "183,4026,ff,2004-05-28 04:42:37.0"
  [13] "189,4033,tsinghua,2011-09-14 12:00:38.0"
  [14] "195,4035,ba,2004-05-31 07:10:24.0"
  [15] "196,4035,ma,2004-05-31 07:10:24.0"
  [16] "197,4035,southampton university,2013-01-07 15:35:18.0"
  [17] "246,4067,美国史丹佛大学,2004-06-12 10:42:10.0"
  [18] "254,4067,美国史丹佛大学,2004-06-12 10:42:10.0"
  [19] "255,4067,美国休士顿大学,2004-06-12 10:42:10.0"
  [20] "257,4068,清华大学,2004-06-12 10:42:10.0"
  [21] "258,4068,北京八中,2004-06-12 17:34:02.0"
  [22] "262,4068,香港中文大学,2004-06-12 17:34:02.0"
  [23] "310,4070,首都师范大学初等教育学院,2004-06-14 15:35:52.0"
  [24] "312,4070,北京师范大学经济学院,2004-06-14 15:35:52.0"

2. Using the rmr2 package


Start R:
  > library(rmr2)
  Loading required package: Rcpp
  Loading required package: RJSONIO
  Loading required package: digest
  Loading required package: functional
  Loading required package: stringr
  Loading required package: plyr
  Loading required package: reshape2
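One aside that is not in the original post: rmr2 also ships a local backend, so the same mapreduce() code can run in-process without submitting a real Hadoop job, which is handy for debugging. Assuming rmr.options() as provided by rmr2:

  > rmr.options(backend = "local")    # run jobs locally, no Hadoop needed
  > rmr.options(backend = "hadoop")   # switch back to the cluster (the default)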

2.1 Run a plain R task
  > small.ints = 1:10
  > sapply(small.ints, function(x) x^2)
  [1]   1   4   9  16  25  36  49  64  81 100

2.2 Run the same task as an rmr2 job
  > small.ints = to.dfs(1:10)
  13/03/07 12:12:55 INFO util.NativeCodeLoader: Loaded the native-hadoop library
  13/03/07 12:12:55 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
  13/03/07 12:12:55 INFO compress.CodecPool: Got brand-new compressor
  > mapreduce(input = small.ints, map = function(k, v) cbind(v, v^2))
  packageJobJar: [/tmp/RtmpWnzxl4/rmr-local-env5deb2b300d03, /tmp/RtmpWnzxl4/rmr-global-env5deb398a522b, /tmp/RtmpWnzxl4/rmr-streaming-map5deb1552172d, /root/hadoop/tmp/hadoop-unjar7838617732558795635/] [] /tmp/streamjob4380275136001813619.jar tmpDir=null
  13/03/07 12:12:59 INFO mapred.FileInputFormat: Total input paths to process : 1
  13/03/07 12:12:59 INFO streaming.StreamJob: getLocalDirs(): [/root/hadoop/tmp/mapred/local]
  13/03/07 12:12:59 INFO streaming.StreamJob: Running job: job_201302261738_0293
  13/03/07 12:12:59 INFO streaming.StreamJob: To kill this job, run:
  13/03/07 12:12:59 INFO streaming.StreamJob: /disk1/hadoop/hadoop-1.0.3/libexec/../bin/hadoop job  -Dmapred.job.tracker=hdfs://r.qa.tianji.com:9001 -kill job_201302261738_0293
  13/03/07 12:12:59 INFO streaming.StreamJob: Tracking URL: http://192.168.1.243:50030/jobdetails.jsp?jobid=job_201302261738_0293
  13/03/07 12:13:00 INFO streaming.StreamJob:  map 0%  reduce 0%
  13/03/07 12:13:15 INFO streaming.StreamJob:  map 100%  reduce 0%
  13/03/07 12:13:21 INFO streaming.StreamJob:  map 100%  reduce 100%
  13/03/07 12:13:21 INFO streaming.StreamJob: Job complete: job_201302261738_0293
  13/03/07 12:13:21 INFO streaming.StreamJob: Output: /tmp/RtmpWnzxl4/file5deb791fcbd5
  > from.dfs("/tmp/RtmpWnzxl4/file5deb791fcbd5")
  $key
  NULL
  $val
         v
  [1,]  1   1
  [2,]  2   4
  [3,]  3   9
  [4,]  4  16
  [5,]  5  25
  [6,]  6  36
  [7,]  7  49
  [8,]  8  64
  [9,]  9  81
  [10,] 10 100

2.3 Run wordcount as an rmr2 job
  > input <- '/user/hdfs/o_same_school/part-m-00000'
  > wordcount = function(input, output = NULL, pattern = " ") {
        wc.map = function(., lines) {
            keyval(unlist(strsplit(x = lines, split = pattern)), 1)
        }
        wc.reduce = function(word, counts) {
            keyval(word, sum(counts))
        }
        mapreduce(input = input, output = output, input.format = "text",
                  map = wc.map, reduce = wc.reduce, combine = T)
    }
  > wordcount(input)
  packageJobJar: [/tmp/RtmpfZUFEa/rmr-local-env6cac64020a8f, /tmp/RtmpfZUFEa/rmr-global-env6cac73016df3, /tmp/RtmpfZUFEa/rmr-streaming-map6cac7f145e02, /tmp/RtmpfZUFEa/rmr-streaming-reduce6cac238dbcf, /tmp/RtmpfZUFEa/rmr-streaming-combine6cac2b9098d4, /root/hadoop/tmp/hadoop-unjar6584585621285839347/] [] /tmp/streamjob9195921761644130661.jar tmpDir=null
  13/03/07 12:34:41 INFO util.NativeCodeLoader: Loaded the native-hadoop library
  13/03/07 12:34:41 WARN snappy.LoadSnappy: Snappy native library not loaded
  13/03/07 12:34:41 INFO mapred.FileInputFormat: Total input paths to process : 1
  13/03/07 12:34:41 INFO streaming.StreamJob: getLocalDirs(): [/root/hadoop/tmp/mapred/local]
  13/03/07 12:34:41 INFO streaming.StreamJob: Running job: job_201302261738_0296
  13/03/07 12:34:41 INFO streaming.StreamJob: To kill this job, run:
  13/03/07 12:34:41 INFO streaming.StreamJob: /disk1/hadoop/hadoop-1.0.3/libexec/../bin/hadoop job  -Dmapred.job.tracker=hdfs://r.qa.tianji.com:9001 -kill job_201302261738_0296
  13/03/07 12:34:41 INFO streaming.StreamJob: Tracking URL: http://192.168.1.243:50030/jobdetails.jsp?jobid=job_201302261738_0296
  13/03/07 12:34:42 INFO streaming.StreamJob:  map 0%  reduce 0%
  13/03/07 12:34:59 INFO streaming.StreamJob:  map 100%  reduce 0%
  13/03/07 12:35:08 INFO streaming.StreamJob:  map 100%  reduce 17%
  13/03/07 12:35:14 INFO streaming.StreamJob:  map 100%  reduce 100%
  13/03/07 12:35:20 INFO streaming.StreamJob: Job complete: job_201302261738_0296
  13/03/07 12:35:20 INFO streaming.StreamJob: Output: /tmp/RtmpfZUFEa/file6cac626aa4a7
  > from.dfs("/tmp/RtmpfZUFEa/file6cac626aa4a7")
  $key
  [1] "-"
  [2] "04:42:37.0"
  [3] "06:51:53.0"
  [4] "07:10:24.0"
  [5] "09:38:31.0"
  [6] "10:33:25.0"
  [7] "10,3,tsinghua"
  [8] "10:42:10.0"
  [9] "113,4017,Stanford"
  [10] "12:00:38.0"
  [11] "12:17:15.0"
  [12] "123,4019,St"
  [13] "138,4019,香港苏浙小学,2004-05-27"
  [14] "15:21:00.0"
  [15] "15:35:18.0"
  [16] "15:35:52.0"
  [17] "16:18:56.0"
  [18] "172,4020,University,2004-05-27"
  [19] "17:34:02.0"
  [20] "18:04:17.0"
  [21] "182,4026,ff,2004-05-28"
  [22] "183,4026,ff,2004-05-28"
  [23] "18:59:58.0"
  [24] "189,4033,tsinghua,2011-09-14"
  [25] "19:14:34.0"
  [26] "195,4035,ba,2004-05-31"
  [27] "196,4035,ma,2004-05-31"
  [28] "197,4035,southampton"
  [29] "23,4007,北京第一七一中学,2004-05-31"
  [30] "246,4067,美国史丹佛大学,2004-06-12"
  [31] "254,4067,美国史丹佛大学,2004-06-12"
  [32] "255,4067,美国休士顿大学,2004-06-12"
  [33] "257,4068,清华大学,2004-06-12"
  [34] "258,4068,北京八中,2004-06-12"
  [35] "262,4068,香港中文大学,2004-06-12"
  [36] "312,4070,北京师范大学经济学院,2004-06-14"
  [37] "51,4016,大连理工大学,2004-05-27"
  [38] "89,4017,Amherst"
  [39] "92,4017,斯坦福大学,2012-11-28"
  [40] "99,4017,Stanford"
  [41] "Business,2013-02-19"
  [42] "Co-educational"
  [43] "College"
  [44] "College,2004-06-01"
  [45] "Graduate"
  [46] "Hong"
  [47] "Kong,2004-05-27"
  [48] "of"
  [49] "Paul's"
  [50] "School"
  [51] "University"
  [52] "university,2004-05-26"
  [53] "university,2013-01-07"
  [54] "University,2013-02-19"
  [55] "310,4070,首都师范大学初等教育学院,2004-06-14"
  $val
  [1] 1 2 1 2 1 1 1 4 1 1 2 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  [39] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Comments (2)
anyhuayong, 2014-10-31 08:28:42:
Nice article, thanks for the effort.

llp, 2016-3-11 15:49:40:
The cluster we use has no internet access. Could you provide download links for the dependency packages? They are hard to find.