The following are all pseudo-distributed setup walkthroughs. Each has its own strengths and weaknesses; use them as references when building your own pseudo-distributed environment.
Walkthrough 1:
(1) This walkthrough covers setting up a Hadoop pseudo-distributed environment. All operations below are performed as the root user.
I. Software environment
1. VM: VMware-Workstation-v7.1.4
2. OS: ubuntu-11.04
3. JDK: jdk1.6.0_27
4. Hadoop: hadoop-0.20.2
5. ssh
II. Install the JDK
1. Download the JDK (jdk-6u27-linux-i586.bin) and put it in the directory where the JDK is to be installed.
2. Unpack and install it:
root@ubuntu:/usr/java# ./jdk-6u27-linux-i586.bin
3. Configure the environment variables
Open /etc/profile with:
root@ubuntu:/# gvim /etc/profile
Append the following at the end of the file:
export JAVA_HOME=/usr/java/jdk1.6.0_27
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$CLASSPATH
Save and close the file, then run source so the changes take effect:
root@ubuntu:~# source /etc/profile
4. Test the JDK:
root@ubuntu:~# java -version
java version "1.6.0_27"
Java(TM) SE Runtime Environment (build 1.6.0_27-b07)
Java HotSpot(TM) Client VM (build 20.2-b06, mixed mode, sharing)
------------------------------------------------------------------------------------------
OK! Success!
III. Install and configure SSH
1. Install ssh:
root@ubuntu:~# apt-get install ssh
2. Set up passwordless ssh:
root@ubuntu:~# ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
root@ubuntu:~# cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
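If ssh still prompts for a password after this, the usual culprit is sshd's StrictModes check rejecting loose file permissions; tightening them is a safe extra step (not part of the original walkthrough):
root@ubuntu:~# chmod 700 ~/.ssh
root@ubuntu:~# chmod 600 ~/.ssh/authorized_keys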
3. Verify that you can ssh to localhost without a password:
root@ubuntu:~# ssh localhost
Welcome to Ubuntu 11.04 (GNU/Linux 2.6.38-8-generic i686)
*Documentation: https://help.ubuntu.com/
225 packages can be updated.
75 updates are security updates.
Last login: Tue Sep 27 03:00:30 2011 from ip6-localhost
------------------------------------------------------------------------------
OK! Success!
You can check the login sessions with the who command:
root@ubuntu:~# who
4. Check that ssh is installed:
root@ubuntu:~# dpkg --list|grep ssh
5. Check that ssh is running:
root@ubuntu:~# ps -ef|grep ssh
IV. Install and configure Hadoop
1. Download the earlier stable release hadoop-0.20.2.tar.gz and copy it to the directory where it will be installed.
2. Switch to the installation directory and unpack the archive.
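For example, assuming the /usr/hadoop directory that the commands later in this walkthrough use:
root@ubuntu:~# cd /usr/hadoop
root@ubuntu:/usr/hadoop# tar xzf hadoop-0.20.2.tar.gz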
3. Configuration
hadoop-env.sh:
Uncomment JAVA_HOME and change it as follows:
export JAVA_HOME=/usr/java/jdk1.6.0_27
The other settings can be changed as needed.
conf/core-site.xml:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://hadoop-test1:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
</configuration>
conf/mapred-site.xml:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>hadoop-test1:9001</value>
</property>
</configuration>
conf/hdfs-site.xml:
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/home/hadoop/dfs/name</value>
<description>Determines where on the local filesystem the DFS name node should store the name table.</description>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/hadoop/dfs/data</value>
<description>Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, data is stored in all of them.</description>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication. The actual number of replicas can be specified when the file is created; this default is used otherwise.</description>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
4. Run
Format HDFS:
root@ubuntu:/usr/hadoop/hadoop-0.20.2# bin/hadoop namenode -format
Start the hadoop daemons:
root@ubuntu:/usr/hadoop/hadoop-0.20.2# bin/start-all.sh
Check hadoop's status through a browser (the NameNode and JobTracker web UIs; the URLs are listed in Walkthrough 2 below).
Copy local files to the HDFS input directory:
root@ubuntu:/usr/hadoop/hadoop-0.20.2# bin/hadoop fs -put conf input
Run one of the examples that ship with hadoop:
root@ubuntu:/usr/hadoop/hadoop-0.20.2# bin/hadoop jar hadoop-0.20.2-examples.jar grep input output 'dfs[a-z.]+'
List the DFS output files:
root@ubuntu:/usr/hadoop/hadoop-0.20.2# bin/hadoop fs -ls output
Copy the DFS files to the local filesystem and view them there:
root@ubuntu:/usr/hadoop/hadoop-0.20.2# bin/hadoop fs -get output output
root@ubuntu:/usr/hadoop/hadoop-0.20.2# cat output/*
Or view the DFS files directly:
root@ubuntu:/usr/hadoop/hadoop-0.20.2# bin/hadoop fs -cat output/*
Stop the hadoop daemons:
root@ubuntu:/usr/hadoop/hadoop-0.20.2# bin/stop-all.sh
Walkthrough 2:
Hadoop was installed in pseudo-distributed mode on a test machine; the steps are recorded here for future reference.
1. Download the JDK and install it to /usr/java/jdk1.6.0_26/
2. Download openssh-5.5p1.tar.gz and install it to /usr/local/hdpssh (its configuration file is /usr/local/hdpssh/etc/sshd_config)
3. Download hadoop-0.20.203.0rc1.tar.gz and unpack it to /data3/hadoop-0.20.203.0
With the software in place, move on to configuration.
Step 1: SSH configuration
Download openssh-5.5p1.tar.gz and install it to /usr/local/hdpssh
Edit the openssh configuration file /usr/local/hdpssh/etc/sshd_config:
Port 30433 # port 22 on this machine is taken by the tunnel/jump service, so listen on the new port 30433
RSAAuthentication yes
PubkeyAuthentication yes
AuthorizedKeysFile .ssh/authorized_keys
Subsystem sftp /usr/local/hdpssh/libexec/sftp-server
Start sshd:
/usr/local/hdpssh/sbin/sshd -f /usr/local/hdpssh/etc/sshd_config
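To confirm that the new sshd is actually listening on port 30433, a quick check (a suggested verification, not in the original notes; netstat options may vary by distribution):
netstat -tlnp | grep 30433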
Create a new SSH key with an empty passphrase to enable passwordless login:
[root@localhost]# ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
[root@localhost]# cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Test it:
[root@localhost]# ssh -p 30433 -i ~/.ssh/id_rsa localhost
Last login: Thu Jul 28 10:41:18 2011 from localhost
[root@localhost]# who
xiaozhen pts/3 2011-07-28 09:41 (***.106.182.***)
root pts/8 2011-07-28 12:35 (localhost)
SSH configuration is complete.
Step 2: Configure hadoop for pseudo-distributed mode
[root@localhost]# vi conf/hadoop-env.sh
export JAVA_HOME=/usr/java/jdk1.6.0_26/
export JRE_HOME=/usr/java/jdk1.6.0_26/jre
export HADOOP_HEAPSIZE=512
export HADOOP_SSH_OPTS="-p 30433 -i /root/.ssh/id_rsa"
[root@localhost]# vi conf/core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/data3/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
</configuration>
[root@localhost]# vi conf/hdfs-site.xml
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/data3/hadoop/filesystem/name</value>
<description>Determines where on the local filesystem the DFS name node should store the name table.</description>
</property>
<property>
<name>dfs.data.dir</name>
<value>/data3/hadoop/data</value>
<description>Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, data is stored in all of them.</description>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication. The actual number of replicas can be specified when the file is created; this default is used otherwise.</description>
</property>
</configuration>
[root@localhost]# vi conf/mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
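Before formatting, it can help to make sure the local directories named in the configs above exist and are writable; hadoop normally creates them itself, so this is only a precaution against permission problems (paths taken from the configuration files above):
mkdir -p /data3/hadoop/tmp /data3/hadoop/filesystem/name /data3/hadoop/data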
Format the namenode:
bin/hadoop namenode -format
Start the services:
bin/start-all.sh
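To confirm that all five daemons came up, jps (which ships with the JDK) is a quick check; this is a suggested verification, not part of the original notes:
/usr/java/jdk1.6.0_26/bin/jps
# expected: NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker (plus Jps itself)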
Check that the dfs service is working:
bin/hadoop dfs -ls /
Check hadoop's status through the web UIs:
Cluster status: http://localhost:50070/dfshealth.jsp
Job status: http://localhost:50030/jobtracker.jsp
Problems encountered at startup:
1. DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/tmg/testdir/core-site.xml could only be replicated to 0 nodes, instead of 1
The two steps below resolved the problem, though the root cause still needs further investigation:
Take HDFS out of safe mode: hadoop dfsadmin -safemode leave
Stop the firewall: /etc/init.d/iptables stop
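The "could only be replicated to 0 nodes" error generally means no live DataNode had registered with the NameNode. After the two steps above, the following should report one live datanode (a suggested check, not in the original notes):
bin/hadoop dfsadmin -report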
2. Starting hadoop with ./start-all.sh reports an error:
localhost: Unrecognized option: -jvm
localhost: Could not create the Java virtual machine.
When hadoop is started as the root user, the -jvm option is enabled by default and has to be removed.
Looking at the hadoop/bin/hadoop source:
if [[ $EUID -eq 0 ]]; then
HADOOP_OPTS="$HADOOP_OPTS -jvm server $HADOOP_DATANODE_OPTS"
else
HADOOP_OPTS="$HADOOP_OPTS -server $HADOOP_DATANODE_OPTS"
fi
Change it to:
#if [[ $EUID -eq 0 ]]; then
# HADOOP_OPTS="$HADOOP_OPTS -jvm server $HADOOP_DATANODE_OPTS"
#else
HADOOP_OPTS="$HADOOP_OPTS -server $HADOOP_DATANODE_OPTS"
#fi
Step 3: Test with the example code
Copy the sample files into the input directory:
bin/hadoop dfs -put conf input
Run the example code:
bin/hadoop jar hadoop-examples-0.20.203.0.jar grep input output 'dfs[a-z.]+'
11/07/28 14:15:13 INFO mapred.FileInputFormat: Total input paths to process : 15
11/07/28 14:15:13 INFO mapred.JobClient: Running job: job_201107281127_0011
11/07/28 14:15:14 INFO mapred.JobClient: map 0% reduce 0%
11/07/28 14:15:31 INFO mapred.JobClient: map 13% reduce 0%
11/07/28 14:15:43 INFO mapred.JobClient: map 26% reduce 0%
11/07/28 14:15:53 INFO mapred.JobClient: map 40% reduce 8%
11/07/28 14:16:02 INFO mapred.JobClient: map 53% reduce 13%
11/07/28 14:16:11 INFO mapred.JobClient: map 66% reduce 13%
11/07/28 14:16:14 INFO mapred.JobClient: map 66% reduce 17%
11/07/28 14:16:20 INFO mapred.JobClient: map 80% reduce 22%
Check the results:
bin/hadoop dfs -ls output
Found 3 items
-rw-r--r-- 1 root supergroup 0 2011-07-28 14:17 /user/root/output/_SUCCESS
drwxr-xr-x - root supergroup 0 2011-07-28 14:16 /user/root/output/_logs
-rw-r--r-- 1 root supergroup 82 2011-07-28 14:17 /user/root/output/part-00000
Download the file to the local filesystem and inspect it:
bin/hadoop dfs -get output/part-00000 my_result_log
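Then view it locally, mirroring the cat step in Walkthrough 1:
cat my_result_log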
Shut down the services:
bin/stop-all.sh
Walkthrough 3: Installing hadoop in single-machine pseudo-distributed mode on ubuntu
1. Update the deb package lists:
$ sudo apt-get update
2. Install the jdk:
$ sudo apt-get install sun-java6-jdk // if this goes wrong, see the article "Problems installing the jdk and their solutions" on my blog
3. Set the CLASSPATH and JAVA_HOME system environment variables:
$ sudo gedit /etc/environment
Add the following two lines:
CLASSPATH=".:/usr/lib/jvm/java-6-sun/lib"
JAVA_HOME="/usr/lib/jvm/java-6-sun"
(the new variables take effect at the next login)
4. Download hadoop-*.tar.gz to /home/shiep205/ // shiep205 is the username
$ cd ~ // use the default path
$ sudo tar xzf hadoop-0.20.0.tar.gz // unpack into the current directory
$ mv hadoop-0.20.0 hadoop // rename it to hadoop
$ sudo chown -R shiep205:shiep205 hadoop // give shiep205 ownership
5. Update the hadoop environment variables:
$ gedit hadoop/conf/hadoop-env.sh
Change the line #export JAVA_HOME=/usr/lib/jvm/java-6-sun
to export JAVA_HOME=/usr/lib/jvm/java-6-sun // i.e. export JAVA_HOME=****/****, with the * parts replaced by your own path
6. Configure SSH // nothing here needs changing; just enter the commands into a terminal one by one
$ sudo apt-get install ssh
$ sudo apt-get install rsync // remote sync; the latest version may already be installed
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
$ ssh localhost // verify that the configuration works
7. Pseudo-distributed mode runs on a single machine, with each hadoop daemon running as a separate java process.
(1) Edit three configuration files. First enter the conf folder inside your hadoop folder (in this example hadoop is installed in shiep205/hadoop):
$ cd ~
$ cd hadoop/conf
$ nano core-site.xml
Modify the configuration files one by one as follows. conf/core-site.xml:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Modify conf/hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Modify conf/mapred-site.xml:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
(2) Format HDFS. Go into the bin directory under the hadoop home directory:
$ cd hadoop/bin
$ ./hadoop namenode -format // format the hadoop namenode; when the namenode refuses to start, reformatting it is often worth a try
$ ./start-all.sh // start the hadoop daemons
The namenode and jobtracker can be checked at http://localhost:50070 and http://localhost:50030.
$ ./stop-all.sh // stop the hadoop daemons
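As a final smoke test, a round trip through HDFS from the same bin directory; this is a suggested check, not part of the original walkthrough, and hello.txt is a hypothetical file name:
$ echo "hello hadoop" > /tmp/hello.txt
$ ./hadoop fs -put /tmp/hello.txt hello.txt // copy into HDFS (lands under /user/<username>/)
$ ./hadoop fs -cat hello.txt // should print: hello hadoop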