Questions to consider:
1. What are the differences between Hadoop 1.x and Hadoop 2.x?
2. What should we pay attention to when using the new features of Hadoop 2.x?
Hadoop 2.x is quite different from 1.x and is more general-purpose for both storage and computation. Hadoop 2.x introduces the YARN framework for managing cluster resources, so it can serve any computation that uses HDFS-based storage. MapReduce is now just one pluggable computation framework on top of it, and you can develop or choose whichever framework fits your needs. At present MapReduce is still the best-supported option, since the MapReduce framework itself is reasonably mature; other YARN-based frameworks are still under development.
The core of YARN is resource management, allocation and scheduling. It allocates resources at a finer granularity than Hadoop 1.x and is far more flexible, so its outlook is promising. That flexibility, however, also makes configuration and day-to-day use somewhat harder. In my view YARN is still maturing: there are rough edges, problems show up fairly often, reference material is relatively scarce, and the official documentation is not always up to date. For large-scale data processing in production, YARN may not yet be ready; if the workload is purely MapReduce, the more mature Hadoop 1.x line is still the safer choice for production.
Below, we install and configure a cluster on 4 machines running 64-bit CentOS 6.4: one master node and three slave nodes.
Host configuration plan
Edit the /etc/hosts file and add the following address mappings:
- 10.95.3.48 m1
- 10.95.3.54 s1
- 10.95.3.59 s2
- 10.95.3.66 s3
Set the hostname on each machine by editing /etc/sysconfig/network; for example, on node s1 the file contains:
- NETWORKING=yes
- HOSTNAME=s1
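To make the new hostname take effect immediately without a reboot, it can also be set for the current session (a standard CentOS command, shown here as an optional extra step):
- sudo hostname s1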
m1 is the cluster master node; s1, s2 and s3 are slave nodes. For host resources we created 4 virtual machines with VMware, configured as follows: the master node has 1 core and 1 GB of RAM; each slave node has 1 core and 2 GB of RAM.
Directory plan
The Hadoop program directory is /home/shirdrn/cloud/programs/hadoop-2.2.0, and the related data directories (logs, storage, etc.) are under /home/shirdrn/cloud/storage/hadoop-2.2.0. Keeping the program and data directories separate makes it easier to synchronize configuration.
The directories are prepared and laid out as follows; the configuration below refers to this layout, and a sketch of the corresponding commands follows this list:
- On every node, create the program directory /home/shirdrn/cloud/programs/hadoop-2.2.0 to hold the Hadoop program files.
- On every node, create the data directory /home/shirdrn/cloud/storage/hadoop-2.2.0/hdfs to hold cluster data.
- On the master node m1, create /home/shirdrn/cloud/storage/hadoop-2.2.0/hdfs/name to hold the file system metadata.
- On every slave node, create /home/shirdrn/cloud/storage/hadoop-2.2.0/hdfs/data to hold the actual data blocks.
- The log directory on all nodes is /home/shirdrn/cloud/storage/hadoop-2.2.0/logs.
- The temporary directory on all nodes is /home/shirdrn/cloud/storage/hadoop-2.2.0/tmp.
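For example, on the master node m1 the layout could be created as follows (a minimal sketch, assuming the shirdrn user; on the slave nodes create hdfs/data instead of hdfs/name):
- mkdir -p /home/shirdrn/cloud/programs
- mkdir -p /home/shirdrn/cloud/storage/hadoop-2.2.0/hdfs/name
- mkdir -p /home/shirdrn/cloud/storage/hadoop-2.2.0/logs
- mkdir -p /home/shirdrn/cloud/storage/hadoop-2.2.0/tmp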
Environment variable configuration
First, using Sun's JDK, edit the ~/.bashrc file and add:
- export JAVA_HOME=/usr/java/jdk1.6.0_45/
- export PATH=$PATH:$JAVA_HOME/bin
- export CLASSPATH=$JAVA_HOME/lib/*.jar:$JAVA_HOME/jre/lib/*.jar
Then configure the Hadoop installation directory and related environment variables:
- export HADOOP_HOME=/home/shirdrn/cloud/programs/hadoop-2.2.0
- export PATH=$PATH:$HADOOP_HOME/bin
- export PATH=$PATH:$HADOOP_HOME/sbin
- export HADOOP_LOG_DIR=/home/shirdrn/cloud/storage/hadoop-2.2.0/logs
- export YARN_LOG_DIR=$HADOOP_LOG_DIR
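After editing, reload the file so the variables take effect in the current shell:
- source ~/.bashrc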
Passwordless SSH login configuration
On each node, execute the following command:
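This is the standard RSA key generation (shown here as an assumed example, since the next step just accepts the default prompts):
- ssh-keygen -t rsa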
Then just keep pressing Enter through all the prompts.
On the master node m1, execute the command:
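A typical choice here is to append m1's own public key to its authorized_keys file (an assumed example):
- cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys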
This ensures that you can log in to the local m1 node without a password.
Append m1's public key to the ~/.ssh/authorized_keys file on s1, s2 and s3. Also check the permissions on ~/.ssh/authorized_keys: it must not be writable by the group; if it is, run:
- chmod g-w ~/.ssh/authorized_keys
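One way to append the key as described above is with ssh-copy-id (a sketch, assuming the same shirdrn account exists on every node):
- ssh-copy-id -i ~/.ssh/id_rsa.pub shirdrn@s1
- ssh-copy-id -i ~/.ssh/id_rsa.pub shirdrn@s2
- ssh-copy-id -i ~/.ssh/id_rsa.pub shirdrn@s3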
At this point, on node m1, the following commands should all succeed without prompting for a password:
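For example, an assumed test is simply to log in to each node in turn:
- ssh m1
- ssh s1
- ssh s2
- ssh s3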
Hadoop configuration files
The configuration files live in /home/shirdrn/cloud/programs/hadoop-2.2.0/etc/hadoop; edit the corresponding files there.
Contents of core-site.xml:
- <?xml version="1.0" encoding="UTF-8"?>
- <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
- <configuration>
- <property>
- <name>fs.defaultFS</name>
- <value>hdfs://m1:9000/</value>
- <description>The name of the default file system. A URI whose scheme
- and authority determine the FileSystem implementation. The uri's
- scheme determines the config property (fs.SCHEME.impl) naming the
- FileSystem implementation class. The uri's authority is used to
- determine the host, port, etc. for a filesystem.</description>
- </property>
- <property>
- <name>dfs.replication</name>
- <value>3</value>
- </property>
- <property>
- <name>hadoop.tmp.dir</name>
- <value>/home/shirdrn/cloud/storage/hadoop-2.2.0/tmp/hadoop-${user.name}</value>
- <description>A base for other temporary directories.</description>
- </property>
- </configuration>
Contents of hdfs-site.xml:
- <?xml version="1.0" encoding="UTF-8"?>
- <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
- <configuration>
- <property>
- <name>dfs.namenode.name.dir</name>
- <value>/home/shirdrn/cloud/storage/hadoop-2.2.0/hdfs/name</value>
- <description>Path on the local filesystem where the NameNode stores
- the namespace and transactions logs persistently.</description>
- </property>
- <property>
- <name>dfs.datanode.data.dir</name>
- <value>/home/shirdrn/cloud/storage/hadoop-2.2.0/hdfs/data</value>
- <description>Comma separated list of paths on the local filesystem of a DataNode where it should store its blocks.</description>
- </property>
- <property>
- <name>dfs.permissions</name>
- <value>false</value>
- </property>
- </configuration>
Contents of yarn-site.xml:
- <?xml version="1.0"?>
- <configuration>
- <property>
- <name>yarn.resourcemanager.resource-tracker.address</name>
- <value>m1:8031</value>
- <description>host is the hostname of the resource manager and
- port is the port on which the NodeManagers contact the Resource Manager.
- </description>
- </property>
- <property>
- <name>yarn.resourcemanager.scheduler.address</name>
- <value>m1:8030</value>
- <description>host is the hostname of the resourcemanager and port is
- the port
- on which the Applications in the cluster talk to the Resource Manager.
- </description>
- </property>
- <property>
- <name>yarn.resourcemanager.scheduler.class</name>
- <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
- <description>In case you do not want to use the default scheduler</description>
- </property>
- <property>
- <name>yarn.resourcemanager.address</name>
- <value>m1:8032</value>
- <description>the host is the hostname of the ResourceManager and the
- port is the port on
- which the clients can talk to the Resource Manager.
- </description>
- </property>
- <property>
- <name>yarn.nodemanager.local-dirs</name>
- <value>${hadoop.tmp.dir}/nodemanager/local</value>
- <description>the local directories used by the nodemanager</description>
- </property>
- <property>
- <name>yarn.nodemanager.address</name>
- <value>0.0.0.0:8034</value>
- <description>the nodemanagers bind to this port</description>
- </property>
- <property>
- <name>yarn.nodemanager.resource.cpu-vcores</name>
- <value>1</value>
- <description></description>
- </property>
- <property>
- <name>yarn.nodemanager.resource.memory-mb</name>
- <value>2048</value>
- <description>Defines total available resources on the NodeManager to be made available to running containers</description>
- </property>
- <property>
- <name>yarn.nodemanager.remote-app-log-dir</name>
- <value>${hadoop.tmp.dir}/nodemanager/remote</value>
- <description>directory on hdfs where the application logs are moved to </description>
- </property>
- <property>
- <name>yarn.nodemanager.log-dirs</name>
- <value>${hadoop.tmp.dir}/nodemanager/logs</value>
- <description>the directories used by Nodemanagers as log directories</description>
- </property>
- <property>
- <name>yarn.application.classpath</name>
- <value>$HADOOP_HOME,$HADOOP_HOME/share/hadoop/common/*,
- $HADOOP_HOME/share/hadoop/common/lib/*,
- $HADOOP_HOME/share/hadoop/hdfs/*,$HADOOP_HOME/share/hadoop/hdfs/lib/*,
- $HADOOP_HOME/share/hadoop/yarn/*,$HADOOP_HOME/share/hadoop/yarn/lib/*,
- $HADOOP_HOME/share/hadoop/mapreduce/*,$HADOOP_HOME/share/hadoop/mapreduce/lib/*</value>
- <description>Classpath for typical applications.</description>
- </property>
- <!-- Use mapreduce_shuffle instead of mapreduce.suffle (YARN-1229)-->
- <property>
- <name>yarn.nodemanager.aux-services</name>
- <value>mapreduce_shuffle</value>
- <description>shuffle service that needs to be set for Map Reduce to run </description>
- </property>
- <property>
- <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
- <value>org.apache.hadoop.mapred.ShuffleHandler</value>
- </property>
- <property>
- <name>yarn.scheduler.minimum-allocation-mb</name>
- <value>256</value>
- </property>
- <property>
- <name>yarn.scheduler.maximum-allocation-mb</name>
- <value>6144</value>
- </property>
- <property>
- <name>yarn.scheduler.minimum-allocation-vcores</name>
- <value>1</value>
- </property>
- <property>
- <name>yarn.scheduler.maximum-allocation-vcores</name>
- <value>3</value>
- </property>
- </configuration>
Contents of mapred-site.xml:
- <?xml version="1.0"?>
- <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
-
- <configuration>
- <property>
- <name>mapreduce.framework.name</name>
- <value>yarn</value>
- <description>Execution framework set to Hadoop YARN.</description>
- </property>
- <property>
- <name>mapreduce.map.memory.mb</name>
- <value>512</value>
- <description>Resource limit for maps (the default is 1024 MB).</description>
- </property>
- <property>
- <name>mapreduce.map.cpu.vcores</name>
- <value>1</value>
- <description></description>
- </property>
- <property>
- <name>mapreduce.reduce.memory.mb</name>
- <value>512</value>
- <description>Resource limit for reduces.</description>
- </property>
- <property>
- <name>mapreduce.reduce.shuffle.parallelcopies</name>
- <value>5</value>
- <description>Higher number of parallel copies run by reduces to fetch outputs from very large number of maps.</description>
- </property>
- <property>
- <name>mapreduce.jobhistory.address</name>
- <value>m1:10020</value>
- <description>MapReduce JobHistory Server host:port, default port is 10020.</description>
- </property>
- <property>
- <name>mapreduce.jobhistory.webapp.address</name>
- <value>m1:19888</value>
- <description>MapReduce JobHistory Server Web UI host:port, default port is 19888.</description>
- </property>
- </configuration>
Configure the hadoop-env.sh, yarn-env.sh and mapred-env.sh scripts
In each of these scripts, just set the JAVA_HOME variable, as shown below:
- export JAVA_HOME=/usr/java/jdk1.6.0_45/
Configure the slaves file
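The slaves file simply lists the slave node hostnames, one per line; with the plan above it would contain:
- s1
- s2
- s3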
Synchronize and distribute the program files
From the master node m1, copy the configured program files out to each slave node:
- scp -r /home/shirdrn/cloud/programs/hadoop-2.2.0 shirdrn@s1:/home/shirdrn/cloud/programs/
- scp -r /home/shirdrn/cloud/programs/hadoop-2.2.0 shirdrn@s2:/home/shirdrn/cloud/programs/
- scp -r /home/shirdrn/cloud/programs/hadoop-2.2.0 shirdrn@s3:/home/shirdrn/cloud/programs/
Start the HDFS cluster
With the configuration above in place, the HDFS cluster can be started.
To avoid problems during startup, manually turn off the firewall on every node:
- sudo service iptables stop
Or disable the firewall permanently:
- sudo chkconfig iptables off
- sudo chkconfig ip6tables off
On the master node m1, first format the file system by running:
- hadoop namenode -format
Then start the HDFS cluster:
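With $HADOOP_HOME/sbin on the PATH as configured earlier, the standard Hadoop 2.2.0 start script can be used (shown as an assumed example):
- start-dfs.sh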
Check the startup logs to confirm whether the HDFS cluster started successfully:
- tail -100f /home/shirdrn/cloud/storage/hadoop-2.2.0/logs/hadoop-shirdrn-namenode-m1.log
- tail -100f /home/shirdrn/cloud/storage/hadoop-2.2.0/logs/hadoop-shirdrn-secondarynamenode-m1.log
- tail -100f /home/shirdrn/cloud/storage/hadoop-2.2.0/logs/hadoop-shirdrn-datanode-s1.log
- tail -100f /home/shirdrn/cloud/storage/hadoop-2.2.0/logs/hadoop-shirdrn-datanode-s2.log
- tail -100f /home/shirdrn/cloud/storage/hadoop-2.2.0/logs/hadoop-shirdrn-datanode-s3.log
Or check the corresponding processes:
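For example, run the JDK's jps tool on each node (an assumed check; m1 should show NameNode and SecondaryNameNode, and the slaves should show DataNode):
- jps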
You can also check the HDFS cluster status from the web console at the following address:
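In Hadoop 2.2.0 the NameNode web UI listens on port 50070 by default, so the address should be:
- http://m1:50070/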
To be continued; see the next post.