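Guiding questions
1. How do you go about troubleshooting Hadoop problems?
2. What are the key points to get right when setting up a Hadoop cluster?
3. Once the cluster is set up, how do you test that it works properly?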
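Recently I needed to set up a Hadoop cluster in a sandbox environment for building indexes. Installing Hadoop is no longer difficult, so below I (Sanxian) only post the important configuration. The version installed here is Hadoop 1.2. If you are unsure how to install it, you can refer to this article; I will not repeat the detailed steps here. The key points are the configuration of the hadoop-nd, hadoop-dd and tmp directories. Sample configuration files follow:
core-site.xml configuration: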
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://h1:8020</value>
  </property>
  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec</value>
    <final>true</final>
  </property>
</configuration>
hdfs-site.xml configuration:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://h1:8020</value>
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>
  </property>
  <property>
    <name>dfs.namenode.handler.count</name>
    <value>10</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/search/hadoop-nd</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/search/hadoop-dd</value>
  </property>
  <!-- Kept as in the original. Hadoop 1.x normally takes its temp directory
       from hadoop.tmp.dir (set in core-site.xml); dfs.tmp.dir does not appear
       to be a standard key. -->
  <property>
    <name>dfs.tmp.dir</name>
    <value>/home/search/tmp</value>
  </property>
  <property>
    <name>dfs.web.ugi</name>
    <value>search,search</value>
  </property>
  <property>
    <name>dfs.balance.bandwidthPerSec</name>
    <value>10485760</value>
  </property>
  <property>
    <name>dfs.support.append</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>
mapred-site.xml configuration:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>h1:8021</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.map.child.java.opts</name>
    <value>-Xmx512M</value>
  </property>
  <property>
    <name>mapred.reduce.child.java.opts</name>
    <value>-Xmx512M</value>
  </property>
</configuration>
hadoop-env.sh: configure as needed. On a first install you must set the JDK path (JAVA_HOME).
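The raw numbers in the configuration files above are easier to judge in human units. The following is a small sketch, not part of the original setup, that converts the byte values and estimates how much task heap one TaskTracker could request if every slot is busy. The function names are hypothetical; the values are copied from the configs shown above.

```python
# Values copied from the hdfs-site.xml and mapred-site.xml listings above.
DFS_BLOCK_SIZE = 134217728      # dfs.block.size, in bytes
BALANCE_BANDWIDTH = 10485760    # dfs.balance.bandwidthPerSec, in bytes per second
MAP_SLOTS = 2                   # mapred.tasktracker.map.tasks.maximum
REDUCE_SLOTS = 2                # mapred.tasktracker.reduce.tasks.maximum
CHILD_HEAP_MB = 512             # -Xmx512M for both map and reduce children


def to_mib(n_bytes):
    """Convert a byte count to mebibytes (hypothetical helper)."""
    return n_bytes / (1024 * 1024)


def max_task_heap_mb(map_slots, reduce_slots, heap_mb):
    """Upper bound on child JVM heap if every slot on one node is in use."""
    return (map_slots + reduce_slots) * heap_mb


if __name__ == "__main__":
    print("block size (MiB):", to_mib(DFS_BLOCK_SIZE))
    print("balancer bandwidth (MiB/s):", to_mib(BALANCE_BANDWIDTH))
    print("max task heap per node (MB):",
          max_task_heap_mb(MAP_SLOTS, REDUCE_SLOTS, CHILD_HEAP_MB))
```

The heap figure only counts the -Xmx setting of the child JVMs; the DataNode and TaskTracker daemons themselves need memory on top of that.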
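Now for the main points.
Once the cluster is installed:
(1) First run the jps command to check whether all Hadoop processes started normally. If any are missing, check the corresponding log files.
(2) If all processes are up, open the corresponding web ports and check the cluster status pages (in Hadoop 1.x the defaults are port 50070 for the NameNode and 50030 for the JobTracker).
(3) If the pages look normal too, run a benchmark to really verify that the cluster can compute. The benchmark covers two things: generating files, which exercises the Map side, and sorting the output, which exercises the Reduce side. For Hadoop 1.2.x the following commands can be used; note that they must be run from the Hadoop root directory: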
1. Generate the data file:
hadoop jar hadoop-examples-1.2.1.jar teragen 10000000 input
2. Sort the output:
hadoop jar hadoop-examples-1.2.1.jar terasort input output
For Hadoop 2.x, run the benchmark like this instead:
./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar randomwriter rand
./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar sort rand sort-rand
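The first command generates unsorted data in the rand directory. The second command reads that data, sorts it, and writes the result to the sort-rand directory.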
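Step (1) above, checking the jps output for missing daemons, can be automated roughly as follows. This is a minimal sketch: the function names and the sample PIDs are made up for illustration, and the expected set assumes a single-node Hadoop 1.x layout where all five daemons run on one machine. On a multi-node cluster you would pass a different set for master and worker nodes.

```python
import subprocess

# Daemons of a Hadoop 1.x installation when everything runs on one node.
HADOOP1_DAEMONS = {"NameNode", "SecondaryNameNode", "DataNode",
                   "JobTracker", "TaskTracker"}


def running_processes(jps_output):
    """Return the set of process names found in `jps` output text."""
    names = set()
    for line in jps_output.splitlines():
        parts = line.split()
        # A normal line looks like "<pid> <name>"; skip lines without a name.
        if len(parts) >= 2:
            names.add(parts[1])
    return names


def missing_daemons(jps_output, expected=HADOOP1_DAEMONS):
    """Return the expected daemons that do not appear in the jps output."""
    return sorted(set(expected) - running_processes(jps_output))


def read_jps():
    """Run the JDK's jps tool and return its output (requires jps on PATH)."""
    result = subprocess.run(["jps"], capture_output=True, text=True, check=True)
    return result.stdout


# Hypothetical sample output; the PIDs are invented for illustration.
SAMPLE_JPS = "2101 NameNode\n2290 SecondaryNameNode\n2371 JobTracker\n2502 Jps\n"

if __name__ == "__main__":
    print("missing:", missing_daemons(SAMPLE_JPS))
```

On a real node you would call missing_daemons(read_jps()) and then look at the log file of whichever daemon is reported missing.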
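A benchmark is a very important way to verify that a Hadoop cluster is working properly. When I ran it, generating the files went fine, but the sort benchmark hung in the reduce phase: after map reached 100%, reduce made no progress, even though memory, CPU and other resources were plentiful. The logs showed a problem with the mapped address being fetched, and the reduce task's page in the web UI showed that the address was being resolved incorrectly.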
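Note the address at the bottom left of that page. Normally this link should point to an address under the local machine's IP, but resolved the way it was, the data could not be fetched. In the reduce phase, a reducer has to pull data from all nodes before sorting. If a network error occurs during that fetch, the task keeps blocking and retrying, so the reduce phase fails or runs very slowly. Having found the likely cause, I went back to Linux and checked the hostname and the /etc/hosts configuration, pinged the machine's own hostname and the hostnames listed in the hosts file, and checked that DNS resolution was correct. Once everything checked out, I synced the hosts file to the other machines in the cluster so that they were all identical, then shut down the cluster, reformatted HDFS (this wipes existing data, so it only suits a fresh cluster like this one), restarted, and ran the benchmark again. This time it ran normally: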
[search@apsaras-server5 ~/hadoop]$ hadoop jar hadoop-examples-1.2.1.jar terasort input output
14/10/28 15:23:29 INFO terasort.TeraSort: starting
14/10/28 15:23:29 INFO mapred.FileInputFormat: Total input paths to process : 2
14/10/28 15:23:29 WARN snappy.LoadSnappy: Snappy native library is available
14/10/28 15:23:29 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/10/28 15:23:29 INFO snappy.LoadSnappy: Snappy native library loaded
14/10/28 15:23:29 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
14/10/28 15:23:29 INFO compress.CodecPool: Got brand-new compressor
Making 1 from 100000 records
Step size is 100000.0
14/10/28 15:23:30 INFO mapred.FileInputFormat: Total input paths to process : 2
14/10/28 15:23:30 INFO mapred.JobClient: Running job: job_201410281520_0002
14/10/28 15:23:31 INFO mapred.JobClient: map 0% reduce 0%
14/10/28 15:23:41 INFO mapred.JobClient: map 25% reduce 0%
14/10/28 15:23:42 INFO mapred.JobClient: map 75% reduce 0%
14/10/28 15:23:51 INFO mapred.JobClient: map 100% reduce 0%
14/10/28 15:23:55 INFO mapred.JobClient: map 100% reduce 16%
14/10/28 15:23:58 INFO mapred.JobClient: map 100% reduce 66%
14/10/28 15:24:01 INFO mapred.JobClient: map 100% reduce 72%
14/10/28 15:24:04 INFO mapred.JobClient: map 100% reduce 75%
14/10/28 15:24:07 INFO mapred.JobClient: map 100% reduce 79%
14/10/28 15:24:11 INFO mapred.JobClient: map 100% reduce 82%
14/10/28 15:24:14 INFO mapred.JobClient: map 100% reduce 86%
14/10/28 15:24:17 INFO mapred.JobClient: map 100% reduce 89%
14/10/28 15:24:20 INFO mapred.JobClient: map 100% reduce 92%
14/10/28 15:24:23 INFO mapred.JobClient: map 100% reduce 96%
14/10/28 15:24:26 INFO mapred.JobClient: map 100% reduce 99%
14/10/28 15:24:27 INFO mapred.JobClient: map 100% reduce 100%
14/10/28 15:24:29 INFO mapred.JobClient: Job complete: job_201410281520_0002
14/10/28 15:24:29 INFO mapred.JobClient: Counters: 31
14/10/28 15:24:29 INFO mapred.JobClient: Job Counters
14/10/28 15:24:29 INFO mapred.JobClient: Launched reduce tasks=1
14/10/28 15:24:29 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=74679
14/10/28 15:24:29 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/10/28 15:24:29 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/10/28 15:24:29 INFO mapred.JobClient: Rack-local map tasks=3
14/10/28 15:24:29 INFO mapred.JobClient: Launched map tasks=8
14/10/28 15:24:29 INFO mapred.JobClient: Data-local map tasks=5
14/10/28 15:24:29 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=45667
14/10/28 15:24:29 INFO mapred.JobClient: File Input Format Counters
14/10/28 15:24:29 INFO mapred.JobClient: Bytes Read=1000024576
14/10/28 15:24:29 INFO mapred.JobClient: File Output Format Counters
14/10/28 15:24:29 INFO mapred.JobClient: Bytes Written=1000000000
14/10/28 15:24:29 INFO mapred.JobClient: FileSystemCounters
14/10/28 15:24:29 INFO mapred.JobClient: FILE_BYTES_READ=2040001344
14/10/28 15:24:29 INFO mapred.JobClient: HDFS_BYTES_READ=1000025344
14/10/28 15:24:29 INFO mapred.JobClient: FILE_BYTES_WRITTEN=3060519016
14/10/28 15:24:29 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1000000000
14/10/28 15:24:29 INFO mapred.JobClient: Map-Reduce Framework
14/10/28 15:24:29 INFO mapred.JobClient: Map output materialized bytes=1020000048
14/10/28 15:24:29 INFO mapred.JobClient: Map input records=10000000
14/10/28 15:24:29 INFO mapred.JobClient: Reduce shuffle bytes=1020000048
14/10/28 15:24:29 INFO mapred.JobClient: Spilled Records=30000000
14/10/28 15:24:29 INFO mapred.JobClient: Map output bytes=1000000000
14/10/28 15:24:29 INFO mapred.JobClient: Total committed heap usage (bytes)=1232338944
14/10/28 15:24:29 INFO mapred.JobClient: CPU time spent (ms)=79710
14/10/28 15:24:29 INFO mapred.JobClient: Map input bytes=1000000000
14/10/28 15:24:29 INFO mapred.JobClient: SPLIT_RAW_BYTES=768
14/10/28 15:24:29 INFO mapred.JobClient: Combine input records=0
14/10/28 15:24:29 INFO mapred.JobClient: Reduce input records=10000000
14/10/28 15:24:29 INFO mapred.JobClient: Reduce input groups=10000000
14/10/28 15:24:29 INFO mapred.JobClient: Combine output records=0
14/10/28 15:24:29 INFO mapred.JobClient: Physical memory (bytes) snapshot=1721982976
14/10/28 15:24:29 INFO mapred.JobClient: Reduce output records=10000000
14/10/28 15:24:29 INFO mapred.JobClient: Virtual memory (bytes) snapshot=10064424960
14/10/28 15:24:29 INFO mapred.JobClient: Map output records=10000000
14/10/28 15:24:29 INFO terasort.TeraSort: done
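Beyond seeing reduce reach 100%, the counters in a log like the one above can be sanity-checked: a sort should emit exactly as many records as it read. The following sketch parses JobClient counter lines and applies that check. The regular expression is an assumption based only on the line format visible in this log, and the function names are hypothetical.

```python
import re

# Matches lines such as "... INFO mapred.JobClient: Map input records=10000000".
# Assumption: every counter line ends in "<name>=<integer>", as in the log above.
COUNTER_RE = re.compile(r"mapred\.JobClient:\s+(.+?)=(\d+)\s*$")

REQUIRED = ("Map input records", "Map output records",
            "Reduce input records", "Reduce output records")


def parse_counters(log_text):
    """Collect name -> integer value for every counter line in the log text."""
    counters = {}
    for line in log_text.splitlines():
        match = COUNTER_RE.search(line)
        if match:
            counters[match.group(1).strip()] = int(match.group(2))
    return counters


def sort_looks_consistent(counters):
    """True if no records were lost between map input and reduce output."""
    if any(name not in counters for name in REQUIRED):
        return False
    return (counters["Map input records"] == counters["Reduce output records"]
            and counters["Map output records"] == counters["Reduce input records"])


# A few lines copied from the terasort log above.
SAMPLE_LOG = """\
14/10/28 15:24:27 INFO mapred.JobClient: map 100% reduce 100%
14/10/28 15:24:29 INFO mapred.JobClient: Counters: 31
14/10/28 15:24:29 INFO mapred.JobClient: Map input records=10000000
14/10/28 15:24:29 INFO mapred.JobClient: Reduce input records=10000000
14/10/28 15:24:29 INFO mapred.JobClient: Reduce output records=10000000
14/10/28 15:24:29 INFO mapred.JobClient: Map output records=10000000
14/10/28 15:24:29 INFO mapred.JobClient: CPU time spent (ms)=79710
"""

if __name__ == "__main__":
    parsed = parse_counters(SAMPLE_LOG)
    print("consistent:", sort_looks_consistent(parsed))
```

Progress lines and the "Counters: 31" header are ignored because they contain no name=value pair, so the whole console output can be fed in unchanged.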
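Summary:
The cause of this failure was that the hosts file contained too many mapped names, the local machine's own hostname was not configured in it, and the hosts files on the other machines were not consistent with it. Together these led to the problem described above. When something goes wrong, start from the logs, find the clues, and work through them step by step.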
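The hosts-file checks described above can be sketched in a few lines. This is an illustration under assumptions, not the procedure actually used: the function names are hypothetical, the sample IP addresses are invented, and only the simple "IP name [name ...]" line format of /etc/hosts is handled.

```python
def parse_hosts(text):
    """Map each hostname in hosts-file text to the set of IPs it is bound to."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping.setdefault(name, set()).add(ip)
    return mapping


def hostname_problems(hosts_text, hostname):
    """List the problems that could break shuffle fetches for this hostname."""
    ips = parse_hosts(hosts_text).get(hostname, set())
    problems = []
    if not ips:
        problems.append("no entry for %s" % hostname)
    if any(ip.startswith("127.") or ip == "::1" for ip in ips):
        problems.append("%s maps to a loopback address" % hostname)
    if len(ips) > 1:
        problems.append("%s maps to several addresses" % hostname)
    return problems


def same_mappings(hosts_a, hosts_b):
    """True if two hosts files define identical name-to-IP mappings."""
    return parse_hosts(hosts_a) == parse_hosts(hosts_b)


# Hypothetical hosts file; h1 is the master name used in the configs above,
# the IP addresses are invented.
SAMPLE_HOSTS = """\
127.0.0.1    localhost
192.168.1.10 h1   # master
192.168.1.11 h2
"""

if __name__ == "__main__":
    print(hostname_problems(SAMPLE_HOSTS, "h1"))
    print(hostname_problems(SAMPLE_HOSTS, "h3"))
```

In practice you would read /etc/hosts from each node, run hostname_problems with that node's own hostname, and use same_mappings to confirm the files agree across the cluster.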