Setting up Hadoop and an Eclipse development environment

Last edited by pig2 on 2014-1-7 22:52

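1.1 Goals

The aim is simple: for research and study, deploy a working Hadoop environment and set up an environment for developing and testing Hadoop programs.

The specific goals are:

Deploy Hadoop on Ubuntu.

From Windows, use Eclipse to connect to the Hadoop deployment on Ubuntu for development and testing.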

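1.2 Hardware and software requirements

Note:

Use exactly the Hadoop and Eclipse versions given here.

The latest Hadoop release at the time of writing is hadoop-0.20.203. I never managed to connect to a hadoop-0.20.203 environment on Ubuntu from Eclipse on Windows (I tried both 3.6 and 3.3.2). Developing and testing programs still works, but you need to watch out for permission problems.

To reduce permission problems, run Hadoop on Ubuntu under the same user name that you use on Windows.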

1.3 Environment topology

[Figure: environment topology diagram]

Note: the host ubuntu acts as NameNode, DataNode, and JobTracker at the same time.

2. Ubuntu installation

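Install Ubuntu 11.04 Server; the details are omitted here.

I first installed the operating system in a virtual machine and installed and configured Hadoop on it, then cloned that machine twice, changed the host name and IP address of each clone, and finally configured SSH between the hosts.

If the machine serves only as a Hadoop runtime and development environment, there is no need to install many system or network services; anything missing can be added later with apt-get install. The SSH service, however, is required.

3. Hadoop installation

The steps below install Hadoop on the host ubuntu.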

3.1 Download and install JDK 1.6

The version installed is jdk-6u26-linux-i586.bin; I installed it to /opt/jdk1.6.0_26.

3.2 Download and unpack Hadoop

The package is hadoop-0.20.2.tar.gz.

$ tar -zxvf hadoop-0.20.2.tar.gz

$ mv hadoop-0.20.2 /opt/hadoop

3.3 Edit the system environment configuration files

Switch to the root user.

- Edit the environment file /etc/profile and add:

export JAVA_HOME=/opt/jdk1.6.0_26

export JRE_HOME=/opt/jdk1.6.0_26/jre

export CLASSPATH=.:$JAVA_HOME/lib:$JRE_HOME/lib:$CLASSPATH

export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$PATH

export HADOOP_HOME=/opt/hadoop

export PATH=$HADOOP_HOME/bin:$PATH

- Edit the address resolution file /etc/hosts and add:

192.168.69.231 ubuntu

192.168.69.232 ubuntu1

192.168.69.233 ubuntu2

Edit the host name file /etc/hostname

The content differs on each machine; on ubuntu1, for example, set it to ubuntu1.

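3.4 Edit the Hadoop configuration files

Switch to the hadoop user.

- Edit conf/hadoop-env.sh under the Hadoop directory

Add the root path of the Java installation:

export JAVA_HOME=/opt/jdk1.6.0_26

- Change conf/core-site.xml under the Hadoop directory to the following: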

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/hadoop</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://ubuntu:9000</value>
    <description>The name of the default file system. A URI whose
    scheme and authority determine the FileSystem implementation. The
    uri's scheme determines the config property (fs.SCHEME.impl) naming
    the FileSystem implementation class. The uri's authority is used to
    determine the host, port, etc. for a filesystem.</description>
  </property>
  <!-- Do not include this section -->
  <property>
    <name>dfs.hosts.exclude</name>
    <value>excludes</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/hadoop/name</value>
    <description>Determines where on the local filesystem the DFS name node should store the name table. If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.</description>
  </property>
</configuration>

- Change conf/hdfs-site.xml under the Hadoop directory to the following:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/hadoop/data</value>
    <description>Determines where on the local filesystem an DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.</description>
  </property>
  <property>
    <name>dfs.replication</name>
    <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
  </property>
</configuration>

- Change conf/mapred-site.xml under the Hadoop directory to the following:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>ubuntu:9001</value>
    <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
  </property>
</configuration>

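Note:

Do not forget the hadoop.tmp.dir, dfs.name.dir, and dfs.data.dir parameters. They name the directories where Hadoop stores its data files, the namespace, and so on, and formatting the distributed file system formats these directories.

They point to /hadoop here, so create that directory as well and make it owned by hadoop:hadoop.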
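A missing <configuration> or <property> tag is an easy mistake to make in these files. As a rough check, the sketch below parses a site file with the JDK's own XML parser and lists the name/value pairs it finds; a malformed file fails with a parse exception. The class name SiteConfigReader is made up for this illustration, and this is not how Hadoop itself loads its configuration.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Hadoop.
class SiteConfigReader {

    /** Parses a *-site.xml stream into name -> value pairs, in file order. */
    static Map<String, String> readProperties(InputStream in) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        Map<String, String> props = new LinkedHashMap<String, String>();
        NodeList nodes = doc.getElementsByTagName("property");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element property = (Element) nodes.item(i);
            String name = childText(property, "name");
            if (name != null) {
                // The value is null when the <value> element is missing.
                props.put(name, childText(property, "value"));
            }
        }
        return props;
    }

    /** Returns the trimmed text of the first child element with this tag, or null. */
    private static String childText(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? null : list.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        // Pass the path of a site file, e.g. /opt/hadoop/conf/core-site.xml
        InputStream in = new FileInputStream(args[0]);
        try {
            for (Map.Entry<String, String> e : readProperties(in).entrySet()) {
                System.out.println(e.getKey() + " = " + e.getValue());
            }
        } finally {
            in.close();
        }
    }
}
```

Run it once per file (core-site.xml, hdfs-site.xml, mapred-site.xml) and compare the printed directories with the ones you created under /hadoop.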

- Change conf/masters under the Hadoop directory to the following:

ubuntu

- Change conf/slaves under the Hadoop directory to the following:

ubuntu

ubuntu1

ubuntu2

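3.5 Distribute the Hadoop installation

I used VMware's clone feature to make two full clones of the host ubuntu, named ubuntu1 and ubuntu2, and then changed their host names and IP addresses. This is an easy way to keep the basic Hadoop configuration identical on all hosts.

On physical machines, copy the JDK installed on ubuntu, the system files /etc/hosts and /etc/profile, and the Hadoop installation directory to the corresponding locations on ubuntu1 and ubuntu2.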

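3.6 Passwordless SSH configuration

Switch to the hadoop user and create a .ssh directory in its home directory:

$ cd
$ mkdir .ssh

Generate a key pair on the master node (the host ubuntu):

$ ssh-keygen -t rsa

Press [Enter] at every prompt to accept the defaults. The private key is saved in .ssh/id_rsa and the public key in .ssh/id_rsa.pub.

Then run:

$ cd ~/.ssh
$ cp id_rsa.pub authorized_keys
$ scp authorized_keys ubuntu1:/home/hadoop/.ssh
$ scp authorized_keys ubuntu2:/home/hadoop/.ssh

Open SSH connections from ubuntu to ubuntu1 and ubuntu2. The first login asks for a password; later ones do not.

$ ssh ubuntu1
$ ssh ubuntu2

Passwordless SSH is only needed from the master to the slaves. With this setup, however, the Hadoop services can only be started or stopped from the master (the host ubuntu).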

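3.7 Run Hadoop

Work as the hadoop user.

Note first that Hadoop commands and their arguments are case-sensitive. Use upper and lower case exactly as shown, or the command will fail.

Format the distributed file system:

$ hadoop namenode -format

Start the Hadoop daemons on ubuntu:

$ start-all.sh

Stop the Hadoop daemons with:

$ stop-all.sh

List the running processes on ubuntu: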

$ jps

2971 SecondaryNameNode

3043 JobTracker

2857 DataNode

4229 Jps

3154 TaskTracker

2737 NameNode

List the running processes on ubuntu1:

$ jps

1005 DataNode

2275 Jps

1090 TaskTracker

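For other commands, see the relevant documentation.

To view Hadoop information from Windows through the web interface, edit the file C:\WINDOWS\system32\drivers\etc\hosts and add the host name to IP mappings: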

192.168.69.231 ubuntu

192.168.69.232 ubuntu1

192.168.69.233 ubuntu2

Open http://ubuntu:50030 to see the status of the JobTracker:

[Screenshot: JobTracker status page]

Open http://ubuntu:50070 to see the status of the NameNode and of the whole distributed file system:

[Screenshot: NameNode status page]
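If a page does not load, it helps to know whether the host name resolves and the port accepts connections at all. Here is a minimal sketch using only the JDK; the class name PortCheck is made up, and the ports are the two web ports above plus the ones from fs.default.name (9000) and mapred.job.tracker (9001) in this setup.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of Hadoop.
class PortCheck {

    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        Socket socket = new Socket();
        try {
            // An unresolvable host name also ends up in the catch block.
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        } finally {
            try {
                socket.close();
            } catch (IOException ignored) {
                // nothing useful to do here
            }
        }
    }

    public static void main(String[] args) {
        String host = "ubuntu"; // the master's name from the hosts file
        int[] ports = {50030, 50070, 9000, 9001};
        for (int port : ports) {
            System.out.println(host + ":" + port + " reachable: " + canConnect(host, port, 2000));
        }
    }
}
```

A false result for every port usually points at the hosts file or the network; a false result for a single port suggests that the corresponding daemon is not running.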

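3.8 Run the WordCount example

WordCount is an example that ships with Hadoop. It counts how often each word occurs in a set of text files and writes the result to the given output directory. The job fails if the output directory already exists.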

$ cd /opt/hadoop

$ hadoop fs -mkdir input

$ hadoop fs -copyFromLocal /opt/hadoop/*.txt input/

$ hadoop jar hadoop-0.20.2-examples.jar wordcount input output

$ hadoop fs -cat output/* # finally, view the result
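To get a feel for what the job computes, here is a small local sketch of the same counting in plain Java, with no Hadoop involved. It assumes words are separated by whitespace and prints word and count separated by a tab, sorted by word; the class name LocalWordCount is made up, and this is not the code of the bundled example.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical local illustration, not the Hadoop example itself.
class LocalWordCount {

    /** Counts whitespace-separated tokens; the TreeMap keeps the words sorted. */
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String token : text.split("\\s+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token for leading whitespace
            }
            Integer old = counts.get(token);
            counts.put(token, old == null ? 1 : old + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count("hello hadoop hello hdfs");
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

The MapReduce version spreads the same work over the cluster: map tasks emit a count of 1 per word and reduce tasks add the counts up per word.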

4. Eclipse development environment on Windows

4.1 System environment configuration

As in section 3.7, edit C:\WINDOWS\system32\drivers\etc\hosts on Windows and add the host name to IP mappings:

192.168.69.231 ubuntu

192.168.69.232 ubuntu1

192.168.69.233 ubuntu2

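4.2 Install the Hadoop development plugin

Copy hadoop\contrib\eclipse-plugin\hadoop-0.20.2-eclipse-plugin.jar from the Hadoop package into Eclipse's plugins directory.

The plugin version (and the version of every jar imported for development later) must match the running Hadoop version; otherwise an EOFException may occur.

Restart Eclipse and open Window -> Open Perspective -> Other -> Map/Reduce to see the Map/Reduce development perspective.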

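4.3 Set the connection parameters

Open the Window -> Show View -> Other -> Map/Reduce Locations view, click the elephant icon, and fill in the parameters in the dialog that appears (General tab):

[Screenshot: Map/Reduce location dialog, General tab]

The parameters are:

Location name: any name.

Map/Reduce Master: must match mapred.job.tracker in mapred-site.xml.

DFS Master: must match fs.default.name in core-site.xml.

User name: the user that runs the Hadoop services on the server.

Next, open the "Advanced parameters" tab and adjust the parameters there. The values entered above are also reflected in the corresponding parameters here.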

The main parameters to check are:

fs.default.name: must match fs.default.name in core-site.xml.

mapred.job.tracker: must match mapred.job.tracker in mapred-site.xml.

dfs.replication: must match dfs.replication in hdfs-site.xml.

hadoop.tmp.dir: must match hadoop.tmp.dir in core-site.xml.

hadoop.job.ugi: this is not a user name and password but a user name and a group name, so enter hadoop,hadoop here.

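Note: hadoop.job.ugi and dfs.replication may be absent the first time you set things up. That is fine; confirm and save. Open the DFS Locations node in the Project Explorer and you should see the structure of the file system. However, there is no permission to view /hadoop/mapred/system, as shown below:

[Screenshot: no permission to view /hadoop/mapred/system]

Deleting a file also reports an error:

[Screenshot: error when deleting a file]

The cause is that I was operating on the remote Hadoop system as the local user Administrator (I log in to Windows as the administrator), who has no permission there.

Open the "Advanced parameters" tab again; hadoop.job.ugi should now be visible. It defaults to the local operating system user name. If that differs from the remote Hadoop user, change it: put hadoop first, separated from the rest by a comma.

Save the configuration and restart Eclipse. The contents of /hadoop/mapred/system are now visible, and deleting files works too.

[Screenshot: /hadoop/mapred/system visible after the change]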

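4.4 Run a Hadoop program

First add all the jar files from the Hadoop package to the Eclipse project.

Then create a class, DFSOperator.java, with four basic methods: create a file, delete a file, read a file's content into a string, and write a string to a file. It also has a main method that you can modify for testing: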

package com.kingdee.hadoop;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * The utilities to operate file on hadoop hdfs.
 *
 * @author luolihui 2011-07-18
 */
public class DFSOperator {

    private static final String ROOT_PATH = "hdfs:///";
    private static final int BUFFER_SIZE = 4096;

    /**
     * construct.
     */
    public DFSOperator(){}

    /**
     * Create a file on hdfs. The root path is /.<br>
     * for example: DFSOperator.createFile("/lory/test1.txt", true);
     * @param path the file name to open
     * @param overwrite if a file with this name already exists, then if true, the file will be overwritten
     * @return true if create is successful else IOException.
     * @throws IOException
     */
    public static boolean createFile(String path, boolean overwrite) throws IOException
    {
        //String uri = "hdfs://192.168.1.100:9000";
        //FileSystem fs1 = FileSystem.get(URI.create(uri), conf);
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path f = new Path(ROOT_PATH + path);
        // Close the returned stream right away; only the empty file is wanted.
        fs.create(f, overwrite).close();
        fs.close();
        return true;
    }

    /**
     * Delete a file on hdfs. The root path is /. <br>
     * for example: DFSOperator.deleteFile("/user/hadoop/output", true);
     * @param path the path to delete
     * @param recursive if path is a directory and set to true, the directory is deleted else throws an exception. In case of a file the recursive can be set to either true or false.
     * @return true if delete is successful else IOException.
     * @throws IOException
     */
    public static boolean deleteFile(String path, boolean recursive) throws IOException
    {
        //String uri = "hdfs://192.168.1.100:9000";
        //FileSystem fs1 = FileSystem.get(URI.create(uri), conf);
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path f = new Path(ROOT_PATH + path);
        fs.delete(f, recursive);
        fs.close();
        return true;
    }

    /**
     * Read a file to string on hadoop hdfs. From stream to string. <br>
     * for example: System.out.println(DFSOperator.readDFSFileToString("/user/hadoop/input/test3.txt"));
     * @param path the path to read
     * @return the file content, or null if the file does not exist.
     * @throws IOException
     */
    public static String readDFSFileToString(String path) throws IOException
    {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path f = new Path(ROOT_PATH + path);
        try {
            if (!fs.exists(f))
            {
                return null;
            }
            StringBuilder sb = new StringBuilder(BUFFER_SIZE);
            BufferedReader bf = new BufferedReader(new InputStreamReader(fs.open(f)));
            try {
                String str;
                while ((str = bf.readLine()) != null)
                {
                    sb.append(str);
                    sb.append("\n");
                }
            } finally {
                // Closing the reader also closes the underlying HDFS stream.
                bf.close();
            }
            return sb.toString();
        } finally {
            fs.close();
        }
    }

    /**
     * Write string to a hadoop hdfs file. <br>
     * for example: DFSOperator.writeStringToDFSFile("/lory/test1.txt", "You are a bad man.\nReally!\n");
     * @param path the file where the string to write in.
     * @param string the context to write in a file.
     * @return true if write is successful else IOException.
     * @throws IOException
     */
    public static boolean writeStringToDFSFile(String path, String string) throws IOException
    {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path f = new Path(ROOT_PATH + path);
        FSDataOutputStream os = fs.create(f, true);
        // writeBytes keeps only the low byte of each char, so this suits ASCII text only.
        os.writeBytes(string);
        os.close();
        fs.close();
        return true;
    }

    public static void main(String[] args)
    {
        try {
            DFSOperator.createFile("/lory/test1.txt", true);
            DFSOperator.deleteFile("/dfs_operator.txt", true);
            DFSOperator.writeStringToDFSFile("/lory/test1.txt", "You are a bad man.\nReally?\n");
            System.out.println(DFSOperator.readDFSFileToString("/lory/test1.txt"));
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println("===end===");
    }
}

Then choose Run As -> Run on Hadoop -> Choose an existing server from the list below -> Finish.

The result is simple (ignore the warning):

11/07/16 18:44:32 WARN conf.Configuration: DEPRECATED: hadoop-site.xml found in the classpath. Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, mapred-site.xml and hdfs-site.xml to override properties of core-default.xml, mapred-default.xml and hdfs-default.xml respectively

You are a bad man.

Really?

===end===

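You can also run the WordCount program that ships with Hadoop: find its source code, import it, set the input and output arguments, and again choose "Run on Hadoop". The steps are not shown here.

Every "Run on Hadoop" generates a temporary jar under workspace\.metadata\.plugins\org.apache.hadoop.eclipse. Only the first run needs Run on Hadoop; after that, just click the green Run button.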

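5. Errors and fixes

5.1 Safe mode

When I deleted a folder on DFS from Eclipse, the following error appeared:

[Screenshot: safe mode error when deleting a DFS folder]

The error message is fairly clear: the NameNode is in safe mode. It also gives the fix.

Similarly, running a Hadoop program sometimes reports this error:

org.apache.hadoop.dfs.SafeModeException: Cannot delete /user/hadoop/input. Name node is in safe mode

To leave safe mode:

bin/hadoop dfsadmin -safemode leave

Safe mode is controlled with dfsadmin -safemode value, where value is one of:

enter - enter safe mode

leave - force the NameNode to leave safe mode

get - report whether safe mode is on

wait - wait until safe mode ends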
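If you want to issue the same command from a Java test harness instead of typing it, one option is to shell out to the hadoop script. The sketch below only wraps the command line shown above; the class name SafeModeCommand and the hadoopBin parameter (the path to the hadoop script) are made up, and it assumes the script is available on the machine where the code runs.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper around the dfsadmin command line, not a Hadoop API.
class SafeModeCommand {

    private static final List<String> ACTIONS = Arrays.asList("enter", "leave", "get", "wait");

    /** Builds the argument list for "hadoop dfsadmin -safemode <action>". */
    static List<String> build(String hadoopBin, String action) {
        if (!ACTIONS.contains(action)) {
            throw new IllegalArgumentException("unknown safemode action: " + action);
        }
        return Arrays.asList(hadoopBin, "dfsadmin", "-safemode", action);
    }

    /** Runs the command, echoes its output, and returns the exit code. */
    static int run(String hadoopBin, String action) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(build(hadoopBin, action));
        pb.redirectErrorStream(true); // merge stderr into stdout
        Process process = pb.start();
        BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream()));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } finally {
            reader.close();
        }
        return process.waitFor();
    }

    public static void main(String[] args) {
        // Only prints the command; call run(...) on a machine with Hadoop installed.
        System.out.println(build("bin/hadoop", "get"));
    }
}
```

Validating the action first keeps a typo from reaching the cluster. A typical use is to call run with wait before a job so that it does not start while the NameNode is still in safe mode.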

5.2 Permission denied during development

org.apache.hadoop.security.AccessControlException: org.apache.hadoop.security.AccessControlException: Permission denied: user=Administrator, access=WRITE, inode="test1.txt":hadoop:supergroup:rw-r--r--

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)

at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)

at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)

at java.lang.reflect.Constructor.newInstance(Constructor.java:513)

at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:96)

at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:58)

at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:2710)

at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:492)

at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:195)

at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:484)

at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:465)

at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:372)

at com.kingdee.hadoop.DFSOperator.createFile(DFSOperator.java:46)

at com.kingdee.hadoop.DFSOperator.main(DFSOperator.java:134)

The fix is to set the hadoop.job.ugi parameter on the "Advanced parameters" tab and add the hadoop user to it.

[Screenshot: hadoop.job.ugi before the change]

becomes:

[Screenshot: hadoop.job.ugi after the change]

Then run the program again with "Run on Hadoop".

Another way is to change the permissions of the file being accessed.

Permission denied: user=Administrator, access=WRITE, inode="test1.txt":hadoop:supergroup:rw-r--r--

This message means: the file test1.txt has the permissions rw-r--r--, belongs to the group supergroup, and is owned by the user hadoop. The user Administrator tried to access test1.txt for WRITE and was refused.
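The fields of such a message can also be pulled apart in code, which is handy when scanning logs. The sketch below is a small parser for exactly the message format shown above; the class name PermissionDenied is made up, and other Hadoop versions may format the message differently, in which case parse returns null.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for reading the message, not part of Hadoop.
class PermissionDenied {

    // user, access, inode name, owner, group, nine permission characters
    private static final Pattern FORMAT = Pattern.compile(
            "Permission denied: user=([^,]+), access=(\\w+), "
            + "inode=\"([^\"]*)\":([^:]+):([^:]+):([rwx-]{9})");

    final String user, access, inode, owner, group, mode;

    private PermissionDenied(Matcher m) {
        user = m.group(1);
        access = m.group(2);
        inode = m.group(3);
        owner = m.group(4);
        group = m.group(5);
        mode = m.group(6);
    }

    /** Returns the parsed fields, or null if the message has another format. */
    static PermissionDenied parse(String message) {
        Matcher m = FORMAT.matcher(message);
        return m.find() ? new PermissionDenied(m) : null;
    }

    boolean requesterIsOwner() { return user.equals(owner); }
    String ownerBits() { return mode.substring(0, 3); }
    String groupBits() { return mode.substring(3, 6); }
    String otherBits() { return mode.substring(6, 9); }

    public static void main(String[] args) {
        PermissionDenied d = parse("Permission denied: user=Administrator, access=WRITE, "
                + "inode=\"test1.txt\":hadoop:supergroup:rw-r--r--");
        System.out.println(d.user + " requested " + d.access + " on " + d.inode);
        System.out.println("owner " + d.owner + ": " + d.ownerBits()
                + ", group " + d.group + ": " + d.groupBits()
                + ", others: " + d.otherBits());
    }
}
```

Here the requester is not the owner, and neither the group nor the other permissions include w, which is why the WRITE request is refused unless the user, the ownership, or the mode changes.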

So you can change the permissions of test1.txt:

$ hadoop fs -chmod 777 /lory/test1.txt

$ hadoop fs -chmod 777 /lory # or the parent directory

The -chown command works as well.


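xioaxu790 · posted 2014-2-25 21:31:27

Great work, moderator! Strongly recommended.

xioaxu790 · posted 2014-2-25 20:27:30

Which section exactly should be omitted? Could you post a screenshot of it?

pig2 · posted 2014-2-25 20:42:36

xioaxu790 posted 2014-2-25 20:27
Which section exactly should be omitted? Could you post a screenshot of it?

Omit the part in red.

烽火佳人 · posted 2017-8-24 12:30:30

Very detailed. I learned a lot.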
