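The connection count is only one aspect; there are actually many possible causes. Another, for example, is connections that are opened and never released, which leaves the client side of the connection in TIME_WAIT while the server side remains ESTABLISHED.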
#############################################
After Redis had been running in production for a while (deployed as a three-node Sentinel cluster), the following problem appeared:
[mw_shl_code=bash,true]Caused by: org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'redisTemplate' defined in class path resource [application.xml]: Cannot resolve reference to bean 'jedisConnectionFactory' while setting bean property 'connectionFactory'; nested exception is org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'jedisConnectionFactory' defined in class path resource [application.xml]: Invocation of init method failed; nested exception is redis.clients.jedis.exceptions.JedisDataException: ERR max number of clients reached
at org.springframework.beans.factory.support.BeanDefinitionValueResolver.resolveReference(BeanDefinitionValueResolver.java:328) ~[spring-beans-3.1.0.RELEASE.jar:3.1.0.RELEASE]
at org.springframework.beans.factory.support.BeanDefinitionValueResolver.resolveValueIfNecessary(BeanDefinitionValueResolver.java:106) ~[spring-beans-3.1.0.RELEASE.jar:3.1.0.RELEASE][/mw_shl_code]
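Checking the connection count on the Redis server showed 10004 connections. Restarting Sentinel cleared the error, but the number of Sentinel connections kept growing.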
[mw_shl_code=bash,true]netstat -nap|grep 26379|wc -l
[/mw_shl_code]
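A plain `wc -l` only gives a total. Since the diagnosis later depends on telling ESTABLISHED connections apart from TIME_WAIT ones, it can help to group the same `netstat -nap` output by state. The following is a minimal sketch, assuming the usual Linux column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State, PID/Program); the class name `NetstatStateCount` and the sample lines are made up for illustration and are not output from the incident.

```java
import java.util.Map;
import java.util.TreeMap;

class NetstatStateCount {

    /** Counts TCP connections per state for lines whose local or foreign address ends with ":port". */
    static Map<String, Integer> countByState(String netstatOutput, int port) {
        Map<String, Integer> counts = new TreeMap<>();
        String suffix = ":" + port;
        for (String line : netstatOutput.split("\n")) {
            String[] f = line.trim().split("\\s+");
            // Assumed columns: Proto Recv-Q Send-Q Local Foreign State [PID/Program]
            if (f.length < 6 || !f[0].startsWith("tcp")) {
                continue;
            }
            if (f[3].endsWith(suffix) || f[4].endsWith(suffix)) {
                counts.merge(f[5], 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample input, for illustration only.
        String sample =
              "tcp 0 0 192.0.2.10:26379 192.0.2.20:40001 ESTABLISHED 100/redis-sentinel\n"
            + "tcp 0 0 192.0.2.10:26379 192.0.2.20:40002 ESTABLISHED 100/redis-sentinel\n"
            + "tcp 0 0 192.0.2.20:40003 192.0.2.10:26379 TIME_WAIT -\n";
        System.out.println(countByState(sample, 26379));
    }
}
```

Running it on the server side and on the client side separately makes it easier to spot the mismatch described in this thread, where one end still lists a connection that the other end has already dropped.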
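On the WebLogic server the connection count was 15. According to the code, redisTemplate initializes one JedisSentinelPool, and JedisSentinelPool starts three threads that subscribe to the sentinels, each thread holding one connection to a sentinel. WebLogic hosts three applications, so there should be 9 connections in total, yet the process with PID 27710 alone held 9 connections to Sentinel.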
[mw_shl_code=bash,true]for (String sentinel : sentinels) {
    final HostAndPort hap = toHostAndPort(Arrays.asList(sentinel.split(":")));
    MasterListener masterListener = new MasterListener(masterName, hap.getHost(), hap.getPort());
    masterListeners.add(masterListener);
    masterListener.start();
}[/mw_shl_code]
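Continuing the investigation, I took a heap dump on the WebLogic server with the command below and loaded test.hprof into jvisualvm.exe. It showed two JedisSentinelPool classes and three instances.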
[mw_shl_code=bash,true]jrcmd 27710 hprofdump filename=test.hprof
27710:[/mw_shl_code]
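Going back to the project code, I found that a second JedisSentinelPool was initialized separately, without a destroy-method. When WebLogic redeployed the application, the MasterListener threads were not stopped, so every redeployment added three more connections. After adding destroy-method and redeploying, the client-side Sentinel connection count was correct: 9 in total, three per application.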
[mw_shl_code=bash,true]<bean id="jedisSentinelPool"
class="redis.clients.jedis.JedisSentinelPool" destroy-method="destroy">[/mw_shl_code]
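Why the missing destroy-method leaks connections can be sketched without Jedis. In the example below, `Listener` and `Pool` are hypothetical stand-ins for a subscriber thread and the pool that owns it; they are not the Jedis classes, and the `Thread.sleep` call merely stands in for a blocking subscribe. The point is only that background threads keep running (and keep their sockets) until something explicitly stops them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

class ListenerLifecycleSketch {

    /** Hypothetical stand-in for a thread subscribed to one sentinel. */
    static class Listener extends Thread {
        private final AtomicBoolean running = new AtomicBoolean(true);

        Listener(String name) {
            super(name);
            setDaemon(true);
        }

        @Override
        public void run() {
            while (running.get()) {
                try {
                    Thread.sleep(50); // stands in for a blocking subscribe call
                } catch (InterruptedException e) {
                    // Woken up by shutdown(); the loop condition decides whether to exit.
                }
            }
        }

        void shutdown() {
            running.set(false);
            interrupt();
        }
    }

    /** Hypothetical pool that starts one listener per sentinel address. */
    static class Pool {
        final List<Listener> listeners = new ArrayList<>();

        Pool(List<String> sentinels) {
            for (String s : sentinels) {
                Listener l = new Listener("listener-" + s);
                listeners.add(l);
                l.start();
            }
        }

        /** What a destroy-method hook would invoke when the application is undeployed. */
        void destroy() {
            for (Listener l : listeners) {
                l.shutdown();
            }
            for (Listener l : listeners) {
                try {
                    l.join(1000);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }

        long aliveListeners() {
            return listeners.stream().filter(Thread::isAlive).count();
        }
    }

    public static void main(String[] args) {
        Pool pool = new Pool(Arrays.asList("s1:26379", "s2:26379", "s3:26379"));
        System.out.println("alive before destroy: " + pool.aliveListeners());
        pool.destroy();
        System.out.println("alive after destroy: " + pool.aliveListeners());
    }
}
```

If `destroy()` is never called, the three listeners outlive the object that created them, which matches the three extra connections seen after each redeployment.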
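At this point the problem seemed solved, but the Sentinel connection count still kept growing. Checking the process's file descriptors with the commands below showed that many descriptors had been created within certain time windows, meaning that after the client's Sentinel connections were reclaimed, many new ones were created. It looked as if the client re-established the connection after a Sentinel connection failed, while the server never released the old one. Sentinel turned out to have no heartbeat check, so broken connections are not released automatically.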
[mw_shl_code=bash,true]lsof -i:26379
ll /proc/{PID}/fd | grep {file descriptor}[/mw_shl_code]
In theory JedisSentinelPool only connects to Sentinel during initialization, so I looked at the JedisSentinelPool source:
[mw_shl_code=bash,true]running.set(true);
while (running.get()) {
    j = new Jedis(host, port);
    try {
        j.subscribe(new JedisPubSub() {
            @Override
            public void onMessage(String channel, String message) {
                log.fine("Sentinel " + host + ":" + port + " published: " + message + ".");
                String[] switchMasterMsg = message.split(" ");
                if (switchMasterMsg.length > 3) {
                    if (masterName.equals(switchMasterMsg[0])) {
                        initPool(toHostAndPort(Arrays.asList(switchMasterMsg[3], switchMasterMsg[4])));
                    } else {
                        log.fine("Ignoring message on +switch-master for master name "
                            + switchMasterMsg[0] + ", our master name is " + masterName);
                    }
                } else {
                    log.severe("Invalid message received on Sentinel " + host + ":" + port
                        + " on channel +switch-master: " + message);
                }
            }
        }, "+switch-master");
    } catch (JedisConnectionException e) {
        if (running.get()) {
            log.severe("Lost connection to Sentinel at " + host + ":" + port
                + ". Sleeping 5000ms and retrying.");
            try {
                Thread.sleep(subscribeRetryWaitTimeMillis);
            } catch (InterruptedException e1) {
                e1.printStackTrace();
            }
        } else {
            log.fine("Unsubscribing from Sentinel at " + host + ":" + port);
        }
    }
}[/mw_shl_code]
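Sure enough, when the subscription fails, a new Sentinel connection is created. But why would the subscription fail? The WebLogic logs showed that the Sentinel connection was re-established every 7875 s. For a while that was a dead end. Testing the JedisSentinelPool listener thread showed that as long as no failover happens, the subscription never receives a message; in other words, the Sentinel connection is idle. I guessed that a firewall might be breaking the Sentinel connections, and indeed production sat behind a Juniper firewall. By default a Juniper firewall keeps a session for 30 minutes (TCP) or 1 minute (UDP), and the state table entry is cleared once that timeout expires. With that, everything fell into place: after changing the firewall policy, the problem was solved.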
One last question remained: why 7875 s? Looking at the TCP settings, Linux has keepalive parameters:
[mw_shl_code=bash,true]sysctl -a |grep keep
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200[/mw_shl_code]
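This means a TCP connection must be idle for 7200 s before the first keepalive probe is sent; probes are then sent every 75 s, up to 9 times, which adds up to exactly 7200 + 9 × 75 = 7875 s. Jedis sets keepalive=true when it creates a connection, but Redis's keepalive defaults to 0 (disabled), and Sentinel, being a special mode of Redis, starts with the same Redis keepalive parameter, so Sentinel never sends keepalive probes to the client. The client sends its first probe after two hours, but by then the firewall has already invalidated the connection. The client's connection then goes to TIME_WAIT, while the server-side connection stays ESTABLISHED and is never released.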
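The arithmetic above, and the reason the firewall always wins the race, can be written down as a small sketch. The class and method names are hypothetical; the numbers are the sysctl values shown above and the 30-minute Juniper TCP session default mentioned earlier.

```java
class KeepaliveTiming {

    /** Seconds of idleness after which the kernel gives up on a peer that never answers. */
    static int secondsUntilDeadPeerDetected(int keepaliveTime, int keepaliveIntvl, int keepaliveProbes) {
        return keepaliveTime + keepaliveIntvl * keepaliveProbes;
    }

    /** True if the first keepalive probe goes out before the firewall purges the idle session. */
    static boolean firstProbeBeatsFirewall(int keepaliveTime, int firewallIdleTimeout) {
        return keepaliveTime < firewallIdleTimeout;
    }

    public static void main(String[] args) {
        int time = 7200;            // net.ipv4.tcp_keepalive_time
        int intvl = 75;             // net.ipv4.tcp_keepalive_intvl
        int probes = 9;             // net.ipv4.tcp_keepalive_probes
        int firewallIdle = 30 * 60; // firewall TCP session timeout in seconds

        System.out.println(secondsUntilDeadPeerDetected(time, intvl, probes)); // prints 7875
        System.out.println(firstProbeBeatsFirewall(time, firewallIdle));       // prints false
    }
}
```

With these defaults the first probe is due after 7200 s, long after the 1800 s session entry has been cleared, so either the firewall timeout has to be raised (the fix applied here) or the keepalive idle time has to drop below it.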