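I. Preparation
1. Change the Linux hostname
2. Set the IP address
3. Map hostnames to IP addresses (in /etc/hosts)
4. Disable the firewall
5. Set up passwordless SSH login
6. Install the JDK and configure the environment variables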
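For step 3, the mapping can be generated rather than typed by hand. The sketch below prints the /etc/hosts entries for the three nodes, using the hostnames and IPs from the cluster plan in this document. The helper name print_hosts_entries is hypothetical, and appending the output to /etc/hosts is left as a manual step to run as root.

```shell
#!/bin/sh
# Print the /etc/hosts entries for the three cluster nodes.
# Hostnames and IPs come from the cluster plan in this document;
# print_hosts_entries is a hypothetical helper name.
print_hosts_entries() {
    cat <<'EOF'
192.168.198.201 spark001
192.168.198.202 spark002
192.168.198.203 spark003
EOF
}

# Review the output first, then append it on every node as root, e.g.:
#   print_hosts_entries >> /etc/hosts
print_hosts_entries
```

The same three lines must be present on every node so that each host can resolve the others by name.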
II. Cluster plan
Hostname  IP               Installed software                    Running processes
spark001  192.168.198.201  jdk, hadoop, scala, zookeeper, spark  QuorumPeerMain, NameNode, DFSZKFailoverController, DataNode, NodeManager, JournalNode
spark002  192.168.198.202  jdk, hadoop, scala, zookeeper, spark  QuorumPeerMain, NameNode, DFSZKFailoverController, DataNode, NodeManager, JournalNode
spark003  192.168.198.203  jdk, hadoop, scala, zookeeper, spark  QuorumPeerMain, ResourceManager, DataNode, NodeManager, JournalNode
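III. Installation steps:
1. Install and configure the ZooKeeper cluster
1.1 Extract the archive (the later steps assume it unpacks to zookeeper-3.4.5-cdh5.0.0)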
tar -zxvf zookeeper-3.4.5.tar.gz -C /home/hadoop/soft
1.2 Edit the configuration
cd /home/hadoop/soft/zookeeper-3.4.5-cdh5.0.0/conf
cp zoo_sample.cfg zoo.cfg
vim zoo.cfg
Change: dataDir=/home/hadoop/soft/zookeeper-3.4.5-cdh5.0.0/data
Append at the end of the file:
server.1=spark001:2888:3888
server.2=spark002:2888:3888
server.3=spark003:2888:3888
Save and exit.
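The server.N lines follow a fixed pattern: N is the node's ID, 2888 is the port followers use to connect to the leader, and 3888 is the leader-election port. As a rough sketch, the lines can be generated from a host list; zk_server_lines is a hypothetical helper, and it assumes IDs are assigned in the order the hosts are given.

```shell
#!/bin/sh
# Print one "server.N=host:2888:3888" line per host, numbering from 1.
# zk_server_lines is a hypothetical helper; the ports are the ones used
# in this document's zoo.cfg.
zk_server_lines() {
    id=1
    for host in "$@"; do
        printf 'server.%d=%s:2888:3888\n' "$id" "$host"
        id=$((id + 1))
    done
}

# The three lines appended to zoo.cfg in this guide:
zk_server_lines spark001 spark002 spark003
```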
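Then create the data directory:
mkdir /home/hadoop/soft/zookeeper-3.4.5-cdh5.0.0/data
Create an empty myid file in it:
touch /home/hadoop/soft/zookeeper-3.4.5-cdh5.0.0/data/myid
Finally, write this node's ID into the file (1 on spark001, matching server.1):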
echo 1 > /home/hadoop/soft/zookeeper-3.4.5-cdh5.0.0/data/myid
1.3 Copy the configured ZooKeeper directory to the other nodes
scp -r /home/hadoop/soft/zookeeper-3.4.5-cdh5.0.0/ spark002:/home/hadoop/soft/
scp -r /home/hadoop/soft/zookeeper-3.4.5-cdh5.0.0/ spark003:/home/hadoop/soft/
Note: on spark002 and spark003, change the contents of /home/hadoop/soft/zookeeper-3.4.5-cdh5.0.0/data/myid so that each ID matches that node's server.N entry in zoo.cfg.
spark002:
echo 2 > /home/hadoop/soft/zookeeper-3.4.5-cdh5.0.0/data/myid
spark003:
echo 3 > /home/hadoop/soft/zookeeper-3.4.5-cdh5.0.0/data/myid
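Because the copied directory carries spark001's myid, a wrong ID on spark002 or spark003 is an easy mistake. One way to avoid it is to derive the ID from zoo.cfg itself. The sketch below is an assumption-laden illustration: myid_for_host is a hypothetical helper, it expects the server.N=host:port:port layout shown above, and it treats the hostname as a regular expression (fine for names such as spark002 that contain no dots).

```shell
#!/bin/sh
# Print the ID N from the "server.N=<host>:..." line for the given host.
# $1 = path to zoo.cfg, $2 = hostname. myid_for_host is a hypothetical helper.
myid_for_host() {
    sed -n "s/^server\.\([0-9][0-9]*\)=$2:.*/\1/p" "$1"
}

# Demo against a temporary copy of the server lines from this guide.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
server.1=spark001:2888:3888
server.2=spark002:2888:3888
server.3=spark003:2888:3888
EOF

myid_for_host "$cfg" spark002   # prints 2
rm -f "$cfg"
```

On a real node the result could be redirected into data/myid, for example with "$(hostname)" as the second argument, provided the hostname matches the name used in zoo.cfg.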
1.4 Configure the ZooKeeper environment variables
vim /etc/profile
export ZOOKEEPER=/home/hadoop/soft/zookeeper-3.4.5-cdh5.0.0
export PATH=$ZOOKEEPER/bin:$SCALA_HOME/bin:$JAVA_HOME/bin:$PATH
source /etc/profile
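This guide edits /etc/profile several times, and re-running the steps can leave duplicate export lines behind. A minimal sketch of an idempotent append follows; append_once is a hypothetical helper, and the demo writes to a temporary file rather than the real /etc/profile.

```shell
#!/bin/sh
# Append a line to a file only if that exact line is not already present.
# $1 = file, $2 = line. append_once is a hypothetical helper.
append_once() {
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Demo on a temporary file standing in for /etc/profile.
profile=$(mktemp)
append_once "$profile" 'export ZOOKEEPER=/home/hadoop/soft/zookeeper-3.4.5-cdh5.0.0'
append_once "$profile" 'export ZOOKEEPER=/home/hadoop/soft/zookeeper-3.4.5-cdh5.0.0'
cat "$profile"    # the export line appears once, not twice
rm -f "$profile"
```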
1.5 Start the ZooKeeper service (run on every node)
zkServer.sh start
1.6 Check the ZooKeeper service status
zkServer.sh status
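The status command reports each node's role on a line beginning with "Mode:". With three nodes, a healthy ensemble should show one leader and two followers. The sketch below extracts the role and checks that combination; zk_mode and ensemble_ok are hypothetical helpers, and the demo input is illustrative text, not captured output.

```shell
#!/bin/sh
# Extract the role from "zkServer.sh status" output read on stdin.
# zk_mode is a hypothetical helper.
zk_mode() {
    sed -n 's/^Mode: //p'
}

# Read one role per line on stdin; succeed only if there is exactly one
# leader and every other line is a follower. Hypothetical helper.
ensemble_ok() {
    awk '$1 == "leader"   { l++ }
         $1 == "follower" { f++ }
         END { exit !(l == 1 && f == NR - 1) }'
}

# Demo with illustrative input.
printf 'Mode: follower\n' | zk_mode
printf 'leader\nfollower\nfollower\n' | ensemble_ok && echo "ensemble looks healthy"
```

On a node, the role could be obtained with zkServer.sh status 2>&1 piped into zk_mode. Collecting the three roles across nodes is left out here because it depends on how SSH is set up.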
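2. Install and configure the Hadoop cluster
2.1 Extract the archive (the later steps assume it unpacks to hadoop-2.3.0-cdh5.0.0)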
tar -zxvf hadoop-2.2.0.tar.gz -C /home/hadoop/soft/
2.2 Configure HDFS (in Hadoop 2.x, all configuration files are under $HADOOP_HOME/etc/hadoop)
Add Hadoop to the environment variables:
vim /etc/profile
export HADOOP_HOME=/home/hadoop/soft/hadoop-2.3.0-cdh5.0.0
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
source /etc/profile
cd /home/hadoop/soft/hadoop-2.3.0-cdh5.0.0
2.2.1 Edit hadoop-env.sh
export JAVA_HOME=/home/hadoop/soft/jdk1.8.0_112
2.2.2 Edit core-site.xml
<configuration>
<!-- Set the HDFS nameservice to ns1 -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://ns1</value>
</property>
<!-- Hadoop temporary directory -->
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/soft/hadoop-2.3.0-cdh5.0.0/tmp</value>
</property>
<!-- ZooKeeper quorum addresses -->
<property>
<name>ha.zookeeper.quorum</name>
<value>spark001:2181,spark002:2181,spark003:2181</value>
</property>
</configuration>
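After editing, it helps to confirm that a property really has the intended value before copying the files to other nodes. The sketch below is a simple text-based lookup, not an XML parser: conf_value is a hypothetical helper that assumes the one-tag-per-line layout used in this document, with the value element following its name element.

```shell
#!/bin/sh
# Print the <value> that follows <name>NAME</name> in a Hadoop *-site.xml.
# $1 = file, $2 = property name. conf_value is a hypothetical helper and
# assumes each <name> and <value> element sits on its own line.
conf_value() {
    awk -v name="$2" '
        index($0, "<name>" name "</name>") { found = 1; next }
        found && match($0, /<value>.*<\/value>/) {
            # skip the 7-character <value> and drop the 8-character </value>
            print substr($0, RSTART + 7, RLENGTH - 15)
            exit
        }
    ' "$1"
}

# Demo on a temporary file holding part of the core-site.xml above.
site=$(mktemp)
cat > "$site" <<'EOF'
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://ns1</value>
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>spark001:2181,spark002:2181,spark003:2181</value>
</property>
</configuration>
EOF

conf_value "$site" fs.defaultFS   # prints hdfs://ns1
rm -f "$site"
```

On a node where Hadoop is already on the PATH, hdfs getconf -confKey fs.defaultFS reports the value Hadoop itself resolves, which is the more reliable check.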
2.2.3 Edit hdfs-site.xml
<configuration>
<!-- Number of replicas kept for each block; the default is 3 -->
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<!-- Set the HDFS nameservice to ns1; must match the value in core-site.xml -->
<property>
<name>dfs.nameservices</name>
<value>ns1</value>
</property>
<!-- ns1 has two NameNodes: nn1 and nn2 -->
<property>
<name>dfs.ha.namenodes.ns1</name>
<value>nn1,nn2</value>
</property>
<!-- RPC address of nn1 -->
<property>
<name>dfs.namenode.rpc-address.ns1.nn1</name>
<value>spark001:9000</value>
</property>
<!-- HTTP address of nn1 -->
<property>
<name>dfs.namenode.http-address.ns1.nn1</name>
<value>spark001:50070</value>
</property>
<!-- RPC address of nn2 -->
<property>
<name>dfs.namenode.rpc-address.ns1.nn2</name>
<value>spark002:9000</value>
</property>
<!-- HTTP address of nn2 -->
<property>
<name>dfs.namenode.http-address.ns1.nn2</name>
<value>spark002:50070</value>
</property>
<!-- Where the NameNode metadata (shared edit log) is stored on the JournalNodes -->
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://spark001:8485;spark002:8485;spark003:8485/ns1</value>
</property>
<!-- Local directory where each JournalNode stores its data -->
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/home/hadoop/soft/hadoop-2.3.0-cdh5.0.0/journal</value>
</property>
<!-- Enable automatic NameNode failover -->
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<!-- Class that implements automatic failover on the client side -->
<property>
<name>dfs.client.failover.proxy.provider.ns1</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<!-- Fencing method -->
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<!-- The fencing method requires passwordless SSH -->
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/root/.ssh/id_rsa</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
</configuration>
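The dfs.namenode.shared.edits.dir value packs two things into one URI: a semicolon-separated list of JournalNode host:port pairs, and a trailing journal ID (ns1 here, matching the nameservice). An edit is committed once a majority of JournalNodes acknowledge it, so three nodes tolerate the loss of one. The sketch below only splits the URI into hostnames; journal_hosts is a hypothetical helper.

```shell
#!/bin/sh
# Print the JournalNode hostnames contained in a qjournal:// URI, one per line.
# $1 = URI such as qjournal://h1:8485;h2:8485;h3:8485/ns1
# journal_hosts is a hypothetical helper.
journal_hosts() {
    printf '%s\n' "$1" \
        | sed -e 's|^qjournal://||' -e 's|/.*$||' \
        | tr ';' '\n' \
        | sed 's/:[0-9]*$//'
}

# The URI configured in hdfs-site.xml above:
journal_hosts 'qjournal://spark001:8485;spark002:8485;spark003:8485/ns1'
```

Each host printed should be one where a JournalNode is started in step 2.5.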
2.2.4 Edit slaves
spark001
spark002
spark003
2.3 Configure YARN
2.3.1 Edit yarn-site.xml
<configuration>
<!-- ResourceManager host -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>spark003</value>
</property>
<!-- Auxiliary service loaded by the NodeManager at startup: the shuffle server -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
2.3.2 Edit mapred-site.xml
<configuration>
<!-- Run the MapReduce framework on YARN -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
2.4 Copy the configured Hadoop directory to the other nodes
scp -r /home/hadoop/soft/hadoop-2.3.0-cdh5.0.0/ spark002:/home/hadoop/soft/
scp -r /home/hadoop/soft/hadoop-2.3.0-cdh5.0.0/ spark003:/home/hadoop/soft/
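2.5 Start the JournalNodes (needed on spark001, spark002 and spark003)
hadoop-daemons.sh start journalnode
(hadoop-daemons.sh runs the command on every host listed in slaves, so one invocation covers all three nodes. Run jps on each node to verify that a JournalNode process has appeared.)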
2.6 Format ZK (running it on spark001 alone is enough)
hdfs zkfc -formatZK
2.7 Format HDFS
Run on spark001:
hadoop namenode -format
Formatting creates files under the directory configured as hadoop.tmp.dir in core-site.xml, which here is /home/hadoop/soft/hadoop-2.3.0-cdh5.0.0/tmp. Copy that tmp directory to /home/hadoop/soft/hadoop-2.3.0-cdh5.0.0/ on spark002 so that the second NameNode starts from the same metadata:
scp -r tmp/ spark002:/home/hadoop/soft/hadoop-2.3.0-cdh5.0.0/
2.8 Start HDFS (run on spark001)
sbin/start-dfs.sh
2.9 Start YARN (run on spark003; the command must be run on the machine that hosts the ResourceManager)
sbin/start-yarn.sh
NameNode web UI: http://192.168.198.201:50070/
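Once HDFS and YARN are up, the processes on each node can be compared with the cluster plan. The sketch below reads jps output on stdin and prints any expected process that is absent. missing_processes is a hypothetical helper, and the sample input uses made-up PIDs purely for illustration.

```shell
#!/bin/sh
# Print every expected process name that does not appear in jps output.
# $1 = space-separated expected names, stdin = output of jps ("PID Name").
# missing_processes is a hypothetical helper.
missing_processes() {
    jps_out=$(cat)
    for p in $1; do
        printf '%s\n' "$jps_out" \
            | awk -v p="$p" '$2 == p { ok = 1 } END { exit !ok }' \
            || printf '%s\n' "$p"
    done
}

# Illustrative input (PIDs are invented); NodeManager is deliberately absent.
sample_jps='2101 QuorumPeerMain
2250 JournalNode
2388 DataNode
2502 ResourceManager
2760 Jps'

# Expected list for spark003, taken from the cluster plan.
printf '%s\n' "$sample_jps" \
    | missing_processes "QuorumPeerMain ResourceManager DataNode NodeManager JournalNode"
```

On a real node the input would come from jps directly, and empty output means every planned process is running.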
3. Install the Spark cluster
3.1 Configure the Spark environment variables
vim /etc/profile
export SPARK_HOME=/home/hadoop/soft/spark-1.6.0-bin-hadoop2.3
export PATH=$SPARK_HOME/bin:$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
source /etc/profile
3.2 Configure spark-env.sh by adding the following settings
export JAVA_HOME=/home/hadoop/soft/jdk1.8.0_112
export SCALA_HOME=/home/hadoop/soft/scala-2.10.4
export HADOOP_HOME=/home/hadoop/soft/hadoop-2.3.0-cdh5.0.0
export HADOOP_CONF_DIR=/home/hadoop/soft/hadoop-2.3.0-cdh5.0.0/etc/hadoop
export SPARK_MASTER_IP=spark003
export SPARK_WORKER_MEMORY=1g
export SPARK_EXECUTOR_MEMORY=1g
export SPARK_DRIVER_MEMORY=1g
export SPARK_WORKER_CORES=1
3.3 Configure slaves
cp slaves.template slaves
Edit its contents to:
spark001
spark002
spark003
3.4 Configure spark-defaults.conf
spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
spark.eventLog.enabled true
spark.eventLog.dir hdfs://spark001:9000/historyserverforSpark
spark.yarn.historyServer.address spark001:18080
spark.history.fs.logDirectory hdfs://spark001:9000/historyserverforSpark
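Applications write their event logs to spark.eventLog.dir, and the history server reads from spark.history.fs.logDirectory, so the two settings need to point at the same directory. The sketch below checks that on a conf file; spark_conf_value is a hypothetical helper that assumes whitespace-separated "key value" lines, and the demo uses a temporary file with the two entries from this guide.

```shell
#!/bin/sh
# Print the value of a key from a spark-defaults.conf style file.
# $1 = file, $2 = key. spark_conf_value is a hypothetical helper and
# only handles values without embedded whitespace.
spark_conf_value() {
    awk -v k="$2" '$1 == k { print $2; exit }' "$1"
}

# Demo on a temporary file with the two entries from this guide.
conf=$(mktemp)
cat > "$conf" <<'EOF'
spark.eventLog.enabled true
spark.eventLog.dir hdfs://spark001:9000/historyserverforSpark
spark.history.fs.logDirectory hdfs://spark001:9000/historyserverforSpark
EOF

write_dir=$(spark_conf_value "$conf" spark.eventLog.dir)
read_dir=$(spark_conf_value "$conf" spark.history.fs.logDirectory)
if [ "$write_dir" = "$read_dir" ]; then
    echo "event log directories match: $write_dir"
else
    echo "mismatch: $write_dir vs $read_dir"
fi
rm -f "$conf"
```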
3.5 Copy the configured Spark directory to the other nodes
scp -r /home/hadoop/soft/spark-1.6.0-bin-hadoop2.3/ spark002:/home/hadoop/soft/
scp -r /home/hadoop/soft/spark-1.6.0-bin-hadoop2.3/ spark003:/home/hadoop/soft/
3.6 Create the historyserverforSpark directory on HDFS
hadoop fs -mkdir /historyserverforSpark
3.7 Start Spark on spark003
sbin/start-all.sh
Spark Master web UI: http://192.168.198.203:8080/
3.8 Test
spark-submit --master spark://spark003:7077 --class org.apache.spark.examples.SparkPi --name Spark-Pi /home/hadoop/soft/spark-1.6.0-bin-hadoop2.3/lib/spark-examples-1.6.0-hadoop2.3.0.jar 1000
spark-shell --master spark://spark003:7077
Then, at the spark-shell prompt, run a word count sorted by descending frequency:
sc.textFile("/zookeeper.out").flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_+_).map(pair => (pair._2, pair._1)).sortByKey(false, 1).map(pair => (pair._2, pair._1)).saveAsTextFile("/dt_spark_clicked1")
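To sanity-check the Spark result on a small input, roughly the same count can be reproduced locally with standard tools. This is only a sketch for comparison: word_count is a hypothetical helper, it drops empty tokens (whereas split(" ") in the job above keeps them when spaces repeat), and it breaks ties alphabetically, which the Spark job does not guarantee.

```shell
#!/bin/sh
# Count space-separated words in a file and print "word<TAB>count",
# most frequent first, ties in alphabetical order.
# $1 = input file. word_count is a hypothetical helper.
word_count() {
    tr ' ' '\n' < "$1" \
        | grep -v '^$' \
        | sort \
        | uniq -c \
        | awk '{ print $2 "\t" $1 }' \
        | sort -t "$(printf '\t')" -k2,2nr -k1,1
}

# Demo on a tiny temporary file.
input=$(mktemp)
printf 'a b a\nc a b\n' > "$input"
word_count "$input"
rm -f "$input"
```

For the demo input this lists a with 3, b with 2 and c with 1, in that order.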