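CloudStack notes series

Introduction:

This post is a collection of working notes, offered for reference only.

CloudStack 4.21 + KVM: multiple management nodes (this setup still has some problems; for reference only)

Multiple CloudStack management nodes

When installing the second node, manage2, and setting up the database, omit the "--deploy-as=root:password" option;

do not import the system VM template;

everything else is the same as for the first management node, manage1.

If manage1 goes down, the KVM hosts show as disconnected.

On each KVM host:
# vi /etc/cloudstack/agent/agent.properties
Change "host=" to the IP address of manage2

# /etc/init.d/cloudstack-agent restart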
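
The manual edit above could be scripted. This is only a minimal sketch: set_agent_host is a hypothetical helper name, and 10.10.10.12 is assumed to be manage2's address, as in the host table later in these notes.

```shell
# Sketch: point a CloudStack agent at another management server.
# set_agent_host is a hypothetical helper: set_agent_host <properties-file> <new-ip>
set_agent_host() {
    file=$1
    new_host=$2
    # Rewrite only the "host=" line; every other key is left untouched.
    # Writing back with cat keeps the original file's owner and permissions.
    sed "s/^host=.*/host=${new_host}/" "$file" > "$file.tmp" \
        && cat "$file.tmp" > "$file" \
        && rm -f "$file.tmp"
}

# Possible use on a KVM host (assumed address), followed by an agent restart:
# set_agent_host /etc/cloudstack/agent/agent.properties 10.10.10.12
# /etc/init.d/cloudstack-agent restart
```

The helper changes nothing if the file has no "host=" line, so it is worth checking the file afterwards.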

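CloudStack 4.21 + KVM: building a system template whose VM password can be reset from the CS console

Inside the VM that will become the template (CentOS in this example; other distributions may differ slightly), run the following: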

# wget http://download.cloud.com/templa ... t-guest-password.in

# mv cloud-set-guest-password.in cloud-set-guest-password

# cp cloud-set-guest-password /etc/init.d/

# chmod +x /etc/init.d/cloud-set-guest-password

# chkconfig --add cloud-set-guest-password

Then turn the VM into a template.

For Windows, download and install this inside the VM:
http://nchc.dl.sourceforge.net/p ... InstanceManager.msi

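The version changes from time to time; check the Apache_CloudStack-x.x.x-Admin_Guide for the matching version.

Notes:

The VM gets a new password by having the password script send a request to the virtual router. (The script uses wget; a minimal CentOS install does not include wget, so install it.)

The script takes the virtual router's address from the VM's DHCP client lease file (it simply assumes that the DHCP server is always the virtual router, which is unfortunate).

If the VM does not get its address through DHCP, create a lease file in the VM by hand:

# vi /var/lib/dhclient/dhclient.lease  (the file name does not really matter, but the path must be correct)

Add one line:

dhcp-server-identifier 10.10.10.51  (the virtual router's address)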
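
To check what address a lease file would yield, a lookup along these lines can help. It is a rough sketch and not the code of cloud-set-guest-password itself; lease_server is a hypothetical helper name.

```shell
# Sketch: print the DHCP server address recorded in a dhclient lease file.
# lease_server is a hypothetical helper: lease_server <lease-file>
lease_server() {
    # Use the last dhcp-server-identifier line, keep its final field,
    # and strip the trailing ';' that real lease files carry.
    grep 'dhcp-server-identifier' "$1" | tail -n 1 | awk '{print $NF}' | tr -d ';'
}

# Possible use inside the VM:
# lease_server /var/lib/dhclient/dhclient.lease
```

It handles both the hand-written one-line file above and a real lease file with "option dhcp-server-identifier x.x.x.x;" entries.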

In CloudStack 4.3 guest addresses are assigned randomly; the virtual router can be given a fixed IP address by editing the nics table in the database.

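CloudStack 4.21 + KVM: setting up HA

1. Global settings: set ha.tag to hahosttag

2. Add a new host with the host tag hahosttag; it becomes the HA host

3. Add a new compute offering with HA enabled (checked); leave the offering's host tag empty

4. Assign the HA compute offering to each VM that should be highly available

5. Test:
Log in to an HA-enabled VM and shut it down from inside (not with the stop button in the CS console), then check whether it restarts automatically;
unplug the network cable of another host and check whether the VMs running on it move to the HA host automatically.

6. Caveats:
An HA host cannot start ordinary VMs, only HA-enabled ones, so do not tag every host with hahosttag;

ordinary VMs cannot be live-migrated to an HA host (HA-enabled VMs can);

system VMs such as the SSVM must also be switched to an HA-enabled system offering (create one yourself); otherwise, once the system VMs go down, no other VM can start.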

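CloudStack 4.21 + KVM: restoring from a snapshot

1. In the CS console, go to Storage > Volumes and create a snapshot (or set up recurring snapshots to have them taken on a schedule);

2. Under Storage > Snapshots, create a template from the snapshot you want to restore;

3. Create a new instance from that template; the new VM is in the state it had when the snapshot was taken.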

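Building a highly available, low-cost storage cluster with GlusterFS, NFS and CTDB

* Disclaimer: I am still feeling my way through this, and some of what I do may well be wrong. These notes are not a tutorial; they are for reference only.

   This setup is still at the testing stage and has not been proven in production, so be careful: do not install it straight onto production servers, or the consequences may be disastrous.

* Also, if money is no object, just buy a proper storage appliance and spare yourself the trouble. Nobody will remember the money you saved, but any failure will certainly be your responsibility.  -- 蓝蜻蜓

Features:

1. High availability: built on the GlusterFS distributed file system, with data files spread across nodes and good scalability;

2. Volumes can be exported either through glusterfs-fuse (no single point of failure) or through NFS (for machines where installing the glusterfs client is inconvenient); both are simple to use;

3. CTDB manages the NFS export through a virtual IP; when one server goes down the address moves to another server automatically, which removes the NFS single point of failure;

4. Ordinary servers can do this as a side job while still running other applications (as long as their disks are not very busy). In a word: cheap.

My GlusterFS cluster runs as a side job on the servers of the CloudStack system.

I built this storage cluster to go with CloudStack, but it can serve as storage for other systems too.

CloudStack primary storage is mounted with the glusterfs client on every KVM host (the directory name must be identical everywhere) and then added to the cloud as SharedMountPoint;

CloudStack secondary storage is mounted with the glusterfs client on the two management servers (again with identical directory names) and then added to the cloud through NFS.

(An NFS mount is effectively a second mount on top of the first and costs performance; use the glusterfs client wherever you can.)

What GlusterFS, CTDB and NFS are, and what each is good at, is not covered here; these notes are about how the three work together.

Environment: CentOS 6.5 (64-bit)

Hardware: four ordinary servers:

10.10.10.15 nfs01.mycloud.com.cn  (virtual IP address created by CTDB; no physical server)
10.10.10.11 gls01.mycloud.com.cn  (also CloudStack manage1)
10.10.10.12 gls02.mycloud.com.cn  (also CloudStack manage2)
10.10.10.21 gls03.mycloud.com.cn  (also CloudStack KVM host1)
10.10.10.22 gls04.mycloud.com.cn  (also CloudStack KVM host2)

Commands below that begin with gluster need to be entered on only one of the servers; every other command must be run on each server.

With more servers the configuration is similar.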

#  vi /etc/yum.repos.d/glusterfs.repo
[glusterfs]
name=glusterfs
#baseurl=http://download.gluster.org/pub/gluster/glusterfs/3.4/3.4.3/CentOS/epel-6.5/x86_64/
baseurl=http://10.10.10.12/glusterfs/glusterfs3.4.3/  (a local mirror of the files at the URL above)
enabled=1
gpgcheck=0

#  yum install glusterfs glusterfs-server

#  chkconfig glusterd on

#  service glusterd start

#  vi /etc/hosts
10.10.10.15 nfs01.mycloud.com.cn
10.10.10.11 gls01.mycloud.com.cn
10.10.10.12 gls02.mycloud.com.cn
10.10.10.21 gls03.mycloud.com.cn
10.10.10.22 gls04.mycloud.com.cn

Create the volume for CloudStack primary storage (mounted with the glusterfs client):

#  mkdir -p /glusterfs/nfsp  (brick directory for primary storage)

# gluster peer probe gls01.mycloud.com.cn  (add a node)

# gluster peer probe gls02.mycloud.com.cn

# gluster peer probe gls03.mycloud.com.cn

# gluster peer probe gls04.mycloud.com.cn

# gluster peer status

# gluster volume info

# gluster volume create nfsp replica 2 gls01.mycloud.com.cn:/glusterfs/nfsp gls02.mycloud.com.cn:/glusterfs/nfsp force

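(This creates the volume that is exported to clients. Only two nodes are used here; if more hosts join later, just expand the volume.

The distribute, stripe and replica modes can be combined:

"replica 2" keeps two copies of every file and needs at least 2 servers.

"stripe 2 replica 2" cuts every file into stripes stored in 2 places and keeps two copies of each, so it needs at least 4 servers.

If the brick directory is on the root partition, append force.)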
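
The server counts quoted above are just the product of the stripe and replica counts. A small sketch of that arithmetic follows; min_bricks is a hypothetical helper name, and a mode that is not used is passed as 1.

```shell
# Sketch: smallest number of bricks for a stripe/replica combination.
# min_bricks is a hypothetical helper: min_bricks <stripe-count> <replica-count>
min_bricks() {
    stripe=$1
    replica=$2
    # Each stripe piece is stored replica times, so the counts multiply.
    echo $((stripe * replica))
}

# min_bricks 1 2   -> "replica 2" alone
# min_bricks 2 2   -> "stripe 2 replica 2"
```

Larger volumes use a multiple of this number of bricks.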

# gluster volume start nfsp

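# gluster volume set nfsp auth.allow 10.10.10.*  (addresses allowed to connect; separate multiple entries with commas)

# gluster volume info

The volume is ready; test it:

# mkdir /nfsp

# mount -t glusterfs 10.10.10.11:/nfsp /nfsp/  (mount it on a local directory and work there; never write directly into /glusterfs/nfsp. Any node's IP works; the local one is recommended)

# vi /nfsp/test1.txt
testtesttest

On another node, check that the file has been replicated into the brick directory:

# ls -lh /glusterfs/nfsp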

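Create the volume for CloudStack secondary storage (exported over NFS):

#  mkdir -p /glusterfs/nfss  (brick directory for secondary storage)

# gluster volume create nfss gls03.mycloud.com.cn:/glusterfs/nfss gls04.mycloud.com.cn:/glusterfs/nfss force

# gluster volume start nfss

# gluster volume set nfss auth.allow 10.10.10.*  (addresses allowed to connect; separate multiple entries with commas)

# gluster volume info

The volume is ready; test it:

# mkdir /nfss

# mount -t glusterfs 10.10.10.21:/nfss /nfss/  (mount it on a local directory; this directory is exported over NFS below)

# vi /nfss/test1.txt
testtesttest

On the brick nodes, check that the file shows up in the brick directory (this volume is distribute-only, so the file lands on just one of the two bricks):

# ls -lh /glusterfs/nfss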

# yum install nfs-utils

# yum install ctdb

# mkdir /glusterfs/ctdb

# gluster volume create ctdb replica 4 gls01.mycloud.com.cn:/glusterfs/ctdb gls02.mycloud.com.cn:/glusterfs/ctdb gls03.mycloud.com.cn:/glusterfs/ctdb gls04.mycloud.com.cn:/glusterfs/ctdb force
(This volume holds the CTDB and NFS configuration files; they are tiny, so keeping extra copies costs little.)

# gluster volume start ctdb

# gluster volume set ctdb auth.allow 10.10.10.*

# gluster volume info

# mkdir /ctdb

# mount -t glusterfs 10.10.10.11:/ctdb /ctdb/

Create the NFS configuration file

# vi /etc/sysconfig/nfs

CTDB_MANAGES_NFS=yes
NFS_TICKLE_SHARED_DIRECTORY=/ctdb/nfs-tickles
STATD_PORT=595
STATD_OUTGOING_PORT=596
MOUNTD_PORT=597
RQUOTAD_PORT=598
LOCKD_UDPPORT=599
LOCKD_TCPPORT=599
STATD_SHARED_DIRECTORY=/ctdb/lock/nfs-state
NFS_HOSTNAME="gls.mycloud.com.cn"
STATD_HOSTNAME="$NFS_HOSTNAME -P "$STATD_SHARED_DIRECTORY/$PUBLIC_IP" -H /etc/ctdb/statd-callout -p 97"
RPCNFSDARGS="-N 4"

# mv /etc/sysconfig/nfs /ctdb/  (keep it on the shared volume so the other servers use the same file)

# vi /etc/exports

/nfss  *(fsid=1235,insecure,rw,async,no_root_squash,no_subtree_check)

# mv /etc/exports /ctdb/

# cd /etc/

# ln -s /ctdb/exports exports

# cd sysconfig/

# ln -s /ctdb/nfs nfs

Create the CTDB configuration file; it also lives on the shared volume so the other servers use the same file

# vi /ctdb/ctdb

CTDB_RECOVERY_LOCK=/ctdb/lockfile
CTDB_PUBLIC_INTERFACE=eth0
CTDB_PUBLIC_ADDRESSES=/ctdb/public_addresses
CTDB_MANAGES_NFS=yes
CTDB_NODES=/ctdb/nodes
CTDB_DEBUGLEVEL=ERR

# cd /etc/sysconfig/

# ln -s /ctdb/ctdb ctdb

# vi /ctdb/public_addresses

10.10.10.15/24 eth0  (the public virtual IP)

# vi /ctdb/nodes

10.10.10.21
10.10.10.22

# chkconfig ctdb on  (start ctdb at boot)

# chkconfig nfs off  (do not start nfs at boot; ctdb manages it)

# /etc/init.d/ctdb start

# ctdb status  (show cluster status)

# ctdb ping -n all

# ctdb ip
Public IPs on node 1
10.10.10.15 node[0] active[] available[eth0] configured[eth0]  (the address 10.10.10.15 is currently served by node 0)

# ctdb pnn
PNN:1    (this machine is node 1)

Test from another machine; if it cannot connect, check whether a firewall is blocking it

# mkdir /test

# mount -t nfs 10.10.10.15:/nfss /test

# cd /test

Mount the glusterfs volumes automatically at boot:

# vi /etc/rc.local

/bin/sleep 60s
/bin/mount -t glusterfs 10.10.10.11:/ctdb /ctdb/
/bin/mount -t glusterfs 10.10.10.11:/nfss /nfssecond/
/etc/init.d/ctdb restart

Why not put them in /etc/fstab? Because the glusterfs service takes a while to come up at boot, so mounts listed in /etc/fstab fail.
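
Instead of a fixed 60-second sleep, the mounts in /etc/rc.local could be retried until the glusterfs service answers. The following is only a sketch: retry is a hypothetical helper, and the attempt count and delay in the usage lines are arbitrary assumptions.

```shell
# Sketch: run a command until it succeeds, with a bounded number of attempts.
# retry is a hypothetical helper: retry <attempts> <delay-seconds> <command...>
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        # Stop as soon as the command succeeds.
        "$@" && return 0
        n=$((n + 1))
        sleep "$delay"
    done
    # Every attempt failed.
    return 1
}

# Possible use in /etc/rc.local (mount points as in the notes above):
# retry 30 5 /bin/mount -t glusterfs 10.10.10.11:/ctdb /ctdb/
# retry 30 5 /bin/mount -t glusterfs 10.10.10.11:/nfss /nfssecond/
# /etc/init.d/ctdb restart
```

This waits only as long as needed on a fast boot and keeps trying longer on a slow one.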

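Next ... test it

Read and write large numbers of files over and over ...

Pull network cables while it is running ...

Abuse it as much as you can and see whether it dies ...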

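* Note *

I installed ctdb on the two CloudStack management servers. Installed on the CloudStack KVM hosts it caused problems in the cloud, possibly by interfering with bridged networking; try it yourself, you may have better luck than I did.

ctdb brings a bonus: when one CloudStack management server goes down, the other takes over the virtual address automatically.

(For this to help, the management server must be reached through the virtual address, and /etc/cloudstack/agent/agent.properties on the KVM hosts must point at the virtual address too.)

Also, GlusterFS behaves very differently under CloudStack depending on the volume mode.

On gigabit Ethernet, primary storage in replica mode writes at roughly 70 MB/s; secondary storage in distribute mode writes at roughly 50 MB/s.

When I tested secondary storage in replica mode, snapshots were written at only 1 MB/s; in stripe mode the snapshot files could not be written at all.

Secondary storage matters less than primary storage, though, so doing without replica mode there is not a big problem (and it saves space).

However, when secondary storage goes down, system VMs such as the SSVM cannot restart. To work around this I created two secondary-storage volumes, nfss1 and nfss2 (make sure they use different nodes),

mounted them on /nfssecond on manage1 and on /nfssecond on manage2 respectively (that is not a mistake: the two directories share a name, so CloudStack sees one store although there are really two separate glusterfs volumes),

put the template files for the SSVM and the other system VMs into /nfssecond on manage1, and copied them into /nfssecond on manage2 as well.

With ctdb, only one nfssecond is in use at a time; when it fails, clients switch to the other, the system VM templates are still there, and the SSVM and the other system VMs start normally.

The snapshot files are lost and must be regenerated, but that hardly matters, since snapshots are usually scheduled so that new ones replace old ones.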

* Another note *

With some versions, qemu-kvm and glusterfs compete for the same port

My versions:

# rpm -q qemu-kvm
qemu-kvm-0.12.1.2-2.415.el6_5.8.x86_64

# rpm -q glusterfs
glusterfs-3.4.3-3.el6.x86_64

Migrating a VM then fails with the following error

qemu-kvm: Migrate: socket bind failed: Address already in use Migration failed. Exit code tcp:[::]:49152(-1), exiting.

To fix it, run the following on every machine with glusterfs installed:

# vi /etc/glusterfs/glusterd.vol

Add a line before "end-volume":

option base-port 50152  (any other port works too, as long as it is not 49152)
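
The edit to glusterd.vol has to be repeated on every machine, so it may be worth scripting. This is a minimal sketch; set_base_port is a hypothetical helper name, and it assumes "end-volume" sits on a line of its own, as in the stock file.

```shell
# Sketch: insert "option base-port <port>" before the end-volume line.
# set_base_port is a hypothetical helper: set_base_port <glusterd.vol> <port>
set_base_port() {
    file=$1
    port=$2
    # Do nothing if a base-port option is already present.
    grep -q 'option base-port' "$file" && return 0
    awk -v port="$port" '
        /^[ \t]*end-volume/ { print "    option base-port " port }
        { print }
    ' "$file" > "$file.tmp" && cat "$file.tmp" > "$file" && rm -f "$file.tmp"
}

# Possible use:
# set_base_port /etc/glusterfs/glusterd.vol 50152
```

Running it twice leaves a single option line, so it is safe to include in a setup script.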

Shut down all VMs in the cloud,

then stop the related services

# /etc/init.d/cloudstack-management stop

# /etc/init.d/cloudstack-agent stop

# /etc/init.d/ctdb stop

# /etc/init.d/glusterd stop

# /etc/init.d/glusterfsd stop

Start the services again

# /etc/init.d/glusterd start

# /etc/init.d/ctdb start

# /etc/init.d/cloudstack-management start

# /etc/init.d/cloudstack-agent start

Start the VMs in the cloud and test migration.

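** Important **

In replica mode, if the network fails (a broken switch, a cable knocked loose) while both machines stay up, each keeps reading and writing its own copy of the data.

When the network comes back, each believes its own data is correct and the other's is wrong. This is what is known as split-brain.

Neither side gives way, so file reads return errors and the system stops working properly.

/var/log/glusterfs/glustershd.log then contains errors like these: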

Unable to self-heal contents of '<gfid:6526e766-cb26-434c-9605-eacb21316447>' (possible split-brain).
Please delete the file from all but the preferred subvolume

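How to handle it:

Pick one of the machines (if one machine's cable fell out, pick that one; if the switch failed they are all equal, so pick any) and treat it as the patient.

First stop the cloud-related processes on it: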

# /etc/init.d/cloudstack-agent stop

# virsh shutdown xx-xx-VM  (if shutdown does not work, use destroy)

# virsh destroy  xx-xx-VM

# /etc/init.d/libvirtd stop

Reconnect it to the network, then on a healthy machine run:

# gluster volume status nfsp  (check whether the node is online)

# gluster volume heal nfsp full  (start a full heal)

# gluster volume heal nfsp info  (list files that need healing)

# gluster volume heal nfsp info healed  (list files healed successfully)

# gluster volume heal nfsp info heal-failed  (list files that failed to heal)

# gluster volume heal nfsp info split-brain  (list files in split-brain)

Gathering Heal info on volume nfsp has been successful

Brick gls03.mycloud.com.cn:/glusterfs/nfsp
Number of entries: 24
at                   path on brick
-----------------------------------
2014-05-30 10:22:20 /36c741b8-2de2-46e9-9e3c-8c7475e4dd10
......
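
With dozens of entries it helps to reduce this listing to a plain list of paths. The following is a sketch only: split_brain_paths is a hypothetical helper, and it assumes the "date time path" entry layout shown above and paths without spaces.

```shell
# Sketch: extract the file paths from "heal <vol> info split-brain" output.
# split_brain_paths is a hypothetical helper that reads the output on stdin.
split_brain_paths() {
    # Entry lines start with a YYYY-MM-DD date; the path is the third field.
    # sort -u drops entries that are reported more than once.
    awk '$1 ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/ && NF >= 3 { print $3 }' | sort -u
}

# Possible use:
# gluster volume heal nfsp info split-brain | split_brain_paths
```

The paths are relative to the brick directory, not to the mount point.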

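On the sick machine, delete the split-brain files:

(Careful! Delete the files in the brick directory shown by gluster volume info nfsp, not the files in the mounted directory, or you will regret it.)

Find the hard links to each file; they must be deleted as well: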

# find /glusterfs/nfsp/ -samefile /glusterfs/nfsp/36c741b8-2de2-46e9-9e3c-8c7475e4dd10 -print

/glusterfs/nfsp/36c741b8-2de2-46e9-9e3c-8c7475e4dd10
/glusterfs/nfsp/.glusterfs/65/26/6526e766-cb26-434c-9605-eacb21316447  (note how the directory names of the hard link correspond to the gfid in the log)
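
The correspondence visible in that find output can be written down directly: the first two and the next two characters of the gfid become the two directory levels under .glusterfs. A small sketch, with gfid_path as a hypothetical helper name:

```shell
# Sketch: build the .glusterfs hard-link path for a gfid on a brick.
# gfid_path is a hypothetical helper: gfid_path <brick-dir> <gfid>
gfid_path() {
    brick=${1%/}                                  # drop a trailing slash, if any
    gfid=$2
    d1=$(printf '%s' "$gfid" | cut -c1-2)         # first two characters
    d2=$(printf '%s' "$gfid" | cut -c3-4)         # next two characters
    echo "$brick/.glusterfs/$d1/$d2/$gfid"
}

# Possible use with the gfid from glustershd.log:
# gfid_path /glusterfs/nfsp 6526e766-cb26-434c-9605-eacb21316447
```

This gives the path to remove without running find over the whole brick, though find -samefile remains the safer cross-check.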

Delete them:
# rm /glusterfs/nfsp/36c741b8-2de2-46e9-9e3c-8c7475e4dd10
# rm /glusterfs/nfsp/.glusterfs/65/26/6526e766-cb26-434c-9605-eacb21316447

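On a healthy machine run

# tail -1 /nfsprimary/36c741b8-2de2-46e9-9e3c-8c7475e4dd10  (read the file to trigger a heal)

# ls -l /glusterfs/nfsp/36c741b8-2de2-46e9-9e3c-8c7475e4dd10  (compare the data on the two machines by hand)

Handle the other split-brain files the same way.

If everything looks fine, remount the directory: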

# umount /nfsprimary

# /bin/mount -t glusterfs 10.10.10.21:/nfsp /nfsprimary/

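Start libvirtd and cloudstack-agent so the host rejoins the cloud.

If the cloud flatly refuses to recognize this cloudstack-agent,

open the vm_instance table in the database and check whether any VM's recorded state (the state column) differs from reality (for example, a VM that should be Stopped is recorded as Running),

correct the wrong entries, save, and restart cloudstack-management.

If a disconnected host is left alone, its VMs may keep running. After it rejoins the cloud, any VM on it that is already running on another host is shut down by CloudStack as a duplicate,

but it is still better to stop all VMs on the host before it rejoins; otherwise the same VM may end up running on two hosts at once.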

* Note *

The split-brain procedure above is not the official one. The official procedure is quite involved; read it if you are interested.
https://access.redhat.com/site/d ... _Guide/ch10s11.html

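A GlusterFS split-brain is miserable to deal with and can take half a day without being resolved; server quorum can be used to prevent it.

(It only prevents; if split-brain does happen, it still has to be repaired by hand.)

# gluster volume set nfsp cluster.server-quorum-type server

# gluster volume set all cluster.server-quorum-ratio 51%  (required share of nodes present)

When a node goes offline or the network partitions, the system takes a roll call. The nodes that can still reach one another form a group; if the group reaches 51% of all nodes, it is considered valid and keeps running.

The nodes on the other side of the partition (or the offline node) form their own group, cannot reach 51% of the total, conclude that they have been cut off from the world, and stop accepting writes.

With only two nodes in total this causes trouble: when one machine drops out, the survivor cannot reach 51% and the whole system stops being valid;

a ratio of 50% or lower, on the other hand, is always satisfied and therefore pointless. To solve this, add one more machine to the storage pool just to be counted (it does not store any data).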
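
The roll call can be worked through with a few lines of arithmetic. This sketch assumes quorum holds when the reachable share of nodes is at least the configured ratio, which matches the two-node behaviour described above; has_quorum is a hypothetical helper, not a gluster command.

```shell
# Sketch: would a group of reachable nodes keep server quorum?
# has_quorum is a hypothetical helper: has_quorum <reachable> <total> <ratio-percent>
has_quorum() {
    alive=$1
    total=$2
    ratio=$3
    # Compare alive/total with ratio/100 using integers only.
    [ $((alive * 100)) -ge $((total * ratio)) ]
}

# Two nodes, one lost, ratio 51%: the survivor has no quorum.
# has_quorum 1 2 51 || echo "no quorum"
# Three nodes (two storage nodes plus one extra peer), one lost: quorum holds.
# has_quorum 2 3 51 && echo "quorum"
```

With the extra peer in the pool, either storage node can fail without the other losing quorum, while an isolated single node (1 of 3) stops accepting writes.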

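* Miscellaneous *

The GlusterFS volume configuration files are under /var/lib/glusterd/vols/ and can be adjusted by hand.

GlusterFS performance tuning:

# gluster volume set nfsp performance.read-ahead on  (enable background file read-ahead)

# gluster volume set nfsp performance.readdir-ahead on  (enable background directory read-ahead; supported only from GlusterFS 3.5)

# gluster volume set nfsp performance.cache-size 256MB  (adjust the volume's cache size)

# gluster volume set nfsp cluster.stripe-block-size 128KB  (stripe size, default 128K; only effective in stripe mode)

# gluster volume set nfsp performance.write-behind on  (enable background write aggregation; this speeds up large-file writes considerably but does little for small files)

# gluster volume set nfsp performance.write-behind-window-size 16MB

# gluster volume set nfsp performance.io-thread-count 16  (number of I/O threads; with low concurrency keep this small, just enough for the load; too high a value causes crashes)

# gluster volume set nfsp performance.flush-behind on  (in my tests writes actually got slightly slower with this on; perhaps I was doing something wrong)

More GlusterFS tuning options:
http://gluster.org/community/doc ... ning_Volume_Options

If you create a volume, delete it, and create it again, you get this error
volume create: nfsp: failed: /glusterfs/nfsp or a prefix of it is already part of a volume

Run the following on every node the volume involves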

# setfattr -x trusted.glusterfs.volume-id /glusterfs/nfsp

# setfattr -x trusted.gfid /glusterfs/nfsp

# rm -rf /glusterfs/nfsp/.glusterfs

# /etc/init.d/glusterd restart

Troubleshooting ctdb:

# ctdb status
Number of nodes:2
pnn:0 10.10.10.21    OK
pnn:1 10.10.10.22    UNHEALTHY (THIS NODE)  (this node has a problem)
Generation:957954854
Size:2
hash:0 lmaster:0
hash:1 lmaster:1
Recovery mode:NORMAL (0)
Recovery master:0

# ctdb scriptstatus
12 scripts were executed last monitor cycle
00.ctdb             Status:OK Duration:0.008 Mon May 12 11:38:40 2014
01.reclock          Status:OK Duration:0.014 Mon May 12 11:38:40 2014
10.interface       Status:OK Duration:0.020 Mon May 12 11:38:40 2014
11.natgw          Status:OK Duration:0.006 Mon May 12 11:38:40 2014
11.routing          Status:OK Duration:0.006 Mon May 12 11:38:40 2014
13.per_ip_routing Status:OK Duration:0.006 Mon May 12 11:38:40 2014
20.multipathd       Status:OK Duration:0.006 Mon May 12 11:38:40 2014
31.clamd          Status:OK Duration:0.007 Mon May 12 11:38:40 2014
40.vsftpd          Status:OK Duration:0.007 Mon May 12 11:38:40 2014
41.httpd          Status:OK Duration:0.007 Mon May 12 11:38:40 2014
50.samba          Status:OK Duration:0.006 Mon May 12 11:38:40 2014
60.nfs             Status:ERROR Duration:0.030 Mon May 12 11:38:40 2014
OUTPUT:rpcinfo: RPC: Program not registered
ERROR: NFS not responding to rpc requests

Fix:
# gluster volume set nfsp nfs.disable on  (turn off GlusterFS's built-in NFS server)
# gluster volume set nfss nfs.disable on
# gluster volume set ctdb nfs.disable on

# /etc/init.d/ctdb restart
# ctdb status
Number of nodes:2
pnn:0 10.10.10.21    OK
pnn:1 10.10.10.22    OK (THIS NODE)
Generation:1287560823
Size:2
hash:0 lmaster:0
hash:1 lmaster:1
Recovery mode:NORMAL (0)
Recovery master:0
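
When a node shows UNHEALTHY, the failing event script can be picked out of the ctdb scriptstatus output mechanically. A minimal sketch, assuming the "name Status:XXX" line layout shown above; failed_ctdb_scripts is a hypothetical helper name.

```shell
# Sketch: list event scripts whose status is anything other than OK.
# failed_ctdb_scripts is a hypothetical helper that reads the output on stdin.
failed_ctdb_scripts() {
    # Only script lines contain "Status:"; the script name is the first field.
    awk '/Status:/ && !/Status:OK/ { print $1 }'
}

# Possible use:
# ctdb scriptstatus | failed_ctdb_scripts
```

For the output shown earlier this points at 60.nfs, the script whose OUTPUT line carries the actual error.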

* The following is copied from other people; I have not tried it myself

Commonly used gluster administration commands:

# Delete the volume gfs
gluster volume stop gfs
gluster volume delete gfs

# Remove a machine from the cluster
gluster peer detach agent22.kisops.org

# Expand the gfs volume (the replica count is 2, so machines must be added in multiples of 2: 2, 4, 6, 8, ...)
gluster peer probe agent23.kisops.org # add a node
gluster peer probe agent24.kisops.org # add a node
gluster volume add-brick gfs agent23.kisops.org:/data/glusterfs agent24.kisops.org:/data/glusterfs

# Migrate a brick
gluster peer probe agent25.kisops.org # to move the data from agent31.kisops.org to agent25.kisops.org, first add agent25.kisops.org to the cluster
gluster volume replace-brick gfs agent31.kisops.org:/data/glusterfs agent25.kisops.org:/data/glusterfs start # start the migration
gluster volume replace-brick gfs agent31.kisops.org:/data/glusterfs agent25.kisops.org:/data/glusterfs status # check migration status
gluster volume replace-brick gfs agent31.kisops.org:/data/glusterfs agent25.kisops.org:/data/glusterfs commit # commit once the data has been migrated
gluster volume replace-brick gfs agent31.kisops.org:/data/glusterfs agent22.kisops.org:/data/glusterfs commit force # if agent31.kisops.org has failed and can no longer run, force the commit

gluster volume heal gfs full # resynchronize the whole gfs volume

Other operations
1. GlusterFS quotas
# gluster volume quota test-volume enable    -- enable quotas
# gluster volume quota test-volume disable    -- disable quotas
# gluster volume quota test-volume limit-usage /data 10GB  -- limit the /exp2/data directory
# gluster volume quota test-volume list       -- list quota information
# gluster volume quota test-volume list /data  -- quota information for one directory
# gluster volume set test-volume features.quota-timeout 5  -- set the timeout for cached quota information
# gluster volume quota test-volume remove /data  -- remove a directory's quota
Notes:
1) Quotas limit the space used by a directory under the mount point, such as /mnt/glusterfs/data; they do not limit the bricks that make up the volume, such as /exp2 or /exp3.
2) gluster volume set test-volume features.quota-timeout mainly concerns clients: it sets when a client re-reads the quota configuration.
The quota is defined on the servers but enforced at the mount point, on the client, so the client must be told when to re-read the configuration from the servers.

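GlusterFS failure handling

When a node fails:

Find an identical machine (at least the number and size of disks must match), install the OS, give it the failed machine's IP, install the gluster software with the same configuration, and on a healthy node run gluster peer status to look up the failed server's UUID: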

gluster peer status

Number of Peers: 2

Hostname: 10.10.10.172

Uuid: 64b345d4-6c9c-43d8-82ef-68c228c4b7ed

State: Peer in Cluster (Connected)

Hostname: 10.10.10.176

Uuid: 9133d139-f9c4-484d-acdf-d11f0452878a

State: Peer in Cluster (Disconnected)

On the replacement machine, edit /var/lib/glusterd/glusterd.info so that it matches the failed machine's:

cat /var/lib/glusterd/glusterd.info

UUID=9133d139-f9c4-484d-acdf-d11f0452878a

Then on any node run

[root@drbd01 ~]# gluster volume heal test-volume full

Launching Heal operation on volume test-volume has been successful

Synchronization then starts automatically, but it affects the performance of the whole system while it runs.

Check progress with

[root@drbd01 ~]# gluster volume heal test-volume info

Gathering Heal info on volume test-volume has been successful

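CloudStack 4.21 + KVM troubleshooting

These notes cover troubleshooting only; for installation, this guide is recommended: http://my.oschina.net/u/572653/blog/145013

*** Note ***

When something goes wrong, read the logs first

The CS management server logs to /var/log/cloudstack/management/management-server.log by default

Always read the logs; that is where the cause of an error is found.

Staring at the console alone gets you nowhere.

1. Secondary storage fails to load

Check the secondary storage entries for anything abnormal, delete the bad entry, and add it again.

Check the cloud.image_store table in the database and clear out the bad secondary-storage records.

Open the firewall ports on the NFS server not just for the management server and the KVM hosts but also for the system VMs (such as the SSVM), and keep in mind that the SSVM's IP may change every time it restarts.

2. Clearing zombie VMs

After a host is powered off its VMs are dead, but the console still shows them as alive, and they cannot be migrated, restarted or stopped.

In the database, edit the matching rows in cloud.vm_instance and set their state to Destroyed.

3. NFS cannot be mounted
internal error Child process (/bin/mount 10.10.10.11:/11exportprimary /mnt/2318917e-fb4e-3531-bd98-7c0f432f7347) unexpected exit status 32: mount.nfs: access denied by server while mounting 10.10.10.11:/11exportprimary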

# vi /etc/exports
/11exportprimary  *(insecure,rw,async,no_root_squash,no_subtree_check)
/11exportsecondary  *(insecure,rw,async,no_root_squash,no_subtree_check)

# /etc/init.d/nfs restart

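4. Removing a host
Re-adding a host after removing it can fail because of leftover records.
Rename the host before adding it again.

5. Snapshots cannot be created: cloudstack Failed to create snapshot due to an interna
Change the global setting kvm.snapshot.enabled

6. System VMs fail to start
Edit /etc/idmapd.conf on both the management server and the KVM hosts
# vi /etc/idmapd.conf  //uncomment the Domain line and replace company.com with the real domain of the management server and the hosts
Domain = mycloud.com.cn

Keep an eye on the following files; their settings sometimes change on their own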
/etc/libvirt/qemu.conf
/etc/idmapd.conf
/etc/libvirt/libvirtd.conf
/etc/sysconfig/libvirtd

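7. Global settings:

secstorage.allowed.internal.sites: set to 10.10.10.0/24 (the real storage network)

management.network.cidr: set to 10.10.10.0/24 (the real management network)

Other global settings worth watching:

expunge: settings for how long a destroyed instance lingers before it is expunged

ha.tag: high-availability host tag

ha.workers: number of HA worker threads

overprovisioning: resource over-provisioning settings

snapshot: snapshot settings
kvm.snapshot.enabled: must be true when KVM is the hypervisor

allocated.capacity: capacity thresholds; once a threshold is exceeded, VMs can no longer be created or started.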

8. SSVM cannot mount secondary storage

The log shows: Unable to mount 10.10.10.11:/11exportsecondary at /mnt/SecStorage/b9bca14d-cc9e-364f-a5a6-130854b94f1e due to mount.nfs

Log in to the SSVM console and run:

# /usr/local/cloud/systemvm/ssvm-check.sh

Check for NFS mount problems, and try mounting by hand to see whether it reports an error:

# mount -t nfs 10.10.10.11:/11exportsecondary /mnt/SecStorage/b9bca14d-cc9e-364f-a5a6-130854b94f1e

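Check the network settings, and open the NFS server's firewall ports for the system VMs

A telltale symptom of the SSVM failing to mount secondary storage is that **the reported secondary storage capacity is wrong**; this is the most common problem CS beginners run into.

How to log in to the SSVM

In the CS console, look up the host the SSVM runs on and its link-local IP (for example 169.254.1.123), then on that host log in to the SSVM with: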

ssh -i /root/.ssh/id_rsa.cloud -p 3922 root@169.254.1.123

9. Cannot create an image
com.cloud.utils.exception.CloudRuntimeException: Failed to backup 1670c7fd-2e66-42b5-8155-1564cc2c4e3a for disk /mnt/0373c9c2-5fd2-3ec4-b7be-128c11a0114b/e2341ae2-4d9a-49fb-b6b9-286918932bb0 to /mnt/7c3f19d1-784c-3585-a8bd-e881a6ac312c/snapshots/2/10

This is a qemu-img version compatibility problem. On the KVM host run:

# yum install qemu-img

# rpm -q qemu-img

qemu-img-0.12.1.2-2.415.el6_5.7.x86_64

# mkdir cloud-qemu-img

# cd cloud-qemu-img

# wget http://vault.centos.org/6.4/upda ... l6_4_4.1.x86_64.rpm

# rpm2cpio qemu-img-0.12.1.2-2.355.el6_4_4.1.x86_64.rpm |cpio -idmv

# cp ./usr/bin/qemu-img /usr/bin/cloud-qemu-img

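10. A VM fails to start and /var/log/cloudstack/agent/agent.log shows:

Requested operation is not valid: domain 'r-4-VM' is already being started

Start the VM by hand:

# virsh start r-4-VM

then shut it down by hand:

# virsh shutdown r-4-VM

If shutdown does not stop it, use destroy

Then start the VM from the CS console.

Note: a VM with a minimal install of CentOS or a similar system cannot be stopped with virsh shutdown; for that to work, install acpid in the VM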

# yum -y install acpid

# service acpid restart

# chkconfig acpid on

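11. A KVM host comes back online after a failure, but the cloud flatly refuses to recognize its cloudstack-agent.

Open the vm_instance table in the database and check whether any VM's recorded state (the state column) differs from reality (for example, a VM that should be Stopped is recorded as Running),

correct the wrong entries, save, and restart cloudstack-management.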
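
Comparing the table with reality by eye gets tedious on a busy host. The sketch below assumes two text files prepared beforehand: one with "name state" pairs exported from vm_instance (for example its instance_name and state columns) and one with the domains actually running on the host, one per line, as printed by virsh list --name. stale_running and both file names are hypothetical.

```shell
# Sketch: print VMs the database calls Running that are not running on the host.
# stale_running is a hypothetical helper: stale_running <db-states-file> <running-names-file>
stale_running() {
    # First pass (db unset): remember the names that are really running.
    # Second pass (db=1): report database rows in state Running that are missing.
    awk '!db { if ($1 != "") running[$1] = 1; next }
         $2 == "Running" && !($1 in running) { print $1 }' "$2" db=1 "$1"
}

# Possible use (file names are placeholders):
# virsh list --name > /tmp/running.txt
# stale_running /tmp/db_states.txt /tmp/running.txt
```

Only rows for VMs that belong on this host should go into the first file; a VM running elsewhere would otherwise be reported as stale.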

12. Resources are available, yet VMs fail to start with a report of insufficient resources
The management log shows:
hostId: 1 is in avoid set, skipping this and trying other available hosts
hostId: 5 is in avoid set, skipping this and trying other available hosts
...........................
No suitable hosts found
No suitable hosts found under this Cluster: 1
Could not find suitable Deployment Destination for this VM under any clusters, returning.
...........................
Removing from the clusterId list these clusters from avoid set: [1]
...........................
No clusters found after removing disabled clusters and clusters in avoid list, returning.
...........................

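Check whether some hosts are not actually online (look for errors in the cloudstack-agent log; what the console shows is sometimes misleading)

13. A host cannot join the system properly and keeps disconnecting

The cloudstack-agent log shows: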
Connected to 10.10.10.15:8250
Proccess agent startup answer, agent id = 0
Set agent id 0
Startup Response Received: agent id = 0
Connected to the server
Lost connection to the server. Dealing with the remaining commands...

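Compare the host id in the management log with the agent id in the cloudstack-agent log. If they do not match, delete the host and add it again.

If it cannot be deleted, stop the management server and the agent, remove the rows for this host from the host_details and host tables (along with related rows in other tables if the database complains about them), and restart the management server and the agent.

14. On a system that had been running fine, after a host dropped out and reconnected, primary storage, secondary storage and the firewall all check out, yet system VMs such as the SSVM keep restarting and never come up

Check agent.log for an entry like this: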

ERROR [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-4:null) org.libvirt.LibvirtException: ?????? '/nfsprimary/afe3bed3-c0ad-4ea5-9a39-b59644043966': ??????

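The VM may have turned into a zombie; shut it down

Look up the volume name for afe3bed3-c0ad-4ea5-9a39-b59644043966 in the volumes table and shut down the corresponding VM (the number in the volume name equals the number in the VM name)

15. After an unexpected server shutdown and reboot, one VM fails to start and the log shows "no usable volumes found for the VM", although the VM's volume file is accessible.

In the database, the state column of this VM's row in the volumes table reads Snapshotting. Evidently the VM was taking a snapshot when the server went down; the VM was frozen, and the frozen state was not cleared after the reboot.

Set the state column to Ready and the VM starts normally.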

# Some Global Params you'll want to Check or Set:
check.pod.cidrs
consoleproxy.capacity.standby
consoleproxy.launch.max
external.firewall.default.capacity
guest.domain.suffix
host
management.network.cidr
max.account.*
max.project.*
max.template.iso.size
network.guest.cidr.limit
remote.access.vpn.client.iprange
remote.access.vpn.psk.length
remote.access.vpn.user.limit
router.cpu.mhz
sdn.ovs.controller.default.label
secstorage.allowed.internal.sites
secstorage.capacity.standby
secstorage.session.max
system.vm.default.hypervisor
usage.execution.timezone