
A very strange HBase startup problem that has tormented me for a whole day. Please help!

nobileamir posted on 2015-4-13 18:20:28

Cause: Yesterday morning the office suddenly lost power. When the power came back, I clumsily started the cluster as root, then shut it down again.

Problems this caused:

        Phase 1: HMaster exited by itself right after starting. From what I could find, this is usually a permissions problem caused by starting the daemons as root, so I reset the file ownership, but that alone did not solve it.
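For reference, an ownership reset of this kind is sketched below. The hadoop user and group and the install paths are assumptions for illustration, not details from this thread, so they need to be adjusted to the real layout on every node.

  # reset ownership of the Hadoop/HBase installs and their logs (assumed paths)
  sudo chown -R hadoop:hadoop /home/hadoop/hadoop /home/hadoop/hbase
  # pid files written to /tmp by the root-started daemons can also block a clean restart as 'hadoop'
  sudo rm -f /tmp/hadoop-root-*.pid /tmp/hbase-root-*.pid

That still did not bring the HMaster back, so it was time to read the logs properly. The master log contained this section: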

  2015-04-13 11:16:42,457 INFO org.apache.hadoop.hbase.master.SplitLogManager: found 0 orphan tasks and 0 rescan nodes
  2015-04-13 11:16:42,479 INFO org.apache.hadoop.hdfs.DFSClient: No node available for block: blk_-5251943924290716300_1002 file=/hbase/hbase.version
  2015-04-13 11:16:42,479 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_-5251943924290716300_1002 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
  2015-04-13 11:16:45,482 INFO org.apache.hadoop.hdfs.DFSClient: No node available for block: blk_-5251943924290716300_1002 file=/hbase/hbase.version
  2015-04-13 11:16:45,482 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_-5251943924290716300_1002 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
  2015-04-13 11:16:48,483 INFO org.apache.hadoop.hdfs.DFSClient: No node available for block: blk_-5251943924290716300_1002 file=/hbase/hbase.version
  2015-04-13 11:16:48,484 INFO org.apache.hadoop.hdfs.DFSClient: Could not obtain block blk_-5251943924290716300_1002 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
  2015-04-13 11:16:51,487 WARN org.apache.hadoop.hdfs.DFSClient: DFS Read: java.io.IOException: Could not obtain block: blk_-5251943924290716300_1002 file=/hbase/hbase.version
          at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:2266)
          at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:2060)
          at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2221)
          at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2149)
          at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:337)
          at java.io.DataInputStream.readUTF(DataInputStream.java:589)
          at org.apache.hadoop.hbase.util.FSUtils.getVersion(FSUtils.java:289)
          at org.apache.hadoop.hbase.util.FSUtils.checkVersion(FSUtils.java:327)
          at org.apache.hadoop.hbase.master.MasterFileSystem.checkRootDir(MasterFileSystem.java:444)
          at org.apache.hadoop.hbase.master.MasterFileSystem.createInitialFileSystemLayout(MasterFileSystem.java:148)
          at org.apache.hadoop.hbase.master.MasterFileSystem.<init>(MasterFileSystem.java:133)
          at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:549)
          at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:408)
          at java.lang.Thread.run(Thread.java:745)
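The repeated "No live nodes contain current block" for /hbase/hbase.version means HDFS could not serve that block from any DataNode, either because the DataNodes had not reported in yet or because the block's replicas were really gone. A quick way to tell the two apart, sketched here with standard Hadoop 1.x tools:

  hadoop dfsadmin -report                                      # how many DataNodes have actually reported in
  hadoop fsck /hbase/hbase.version -files -blocks -locations   # whether the block still has any live replica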
  Later I found a folder named Slave1.Hadoop,60020,1428640214680-splitting under .logs whose log file was the problem, so I deleted it, restarted the cluster, and that issue was gone (a less destructive alternative is sketched below). But a new problem showed up right away.
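For reference, a minimal sketch of how such a -splitting directory can be moved out of the way instead of being deleted outright; the /hbase prefix assumes the default hbase.rootdir and is not stated anywhere in the post:

  hadoop fs -ls /hbase/.logs
  # park the suspect directory somewhere recoverable rather than removing it
  hadoop fs -mv '/hbase/.logs/Slave1.Hadoop,60020,1428640214680-splitting' /hbase-splitting-backup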

          Phase 2: The cluster started normally, but this is what the master web UI at http://192.168.1.2:60010 showed:

[screenshot: 60010.png]


I quickly checked the cluster processes, and they left me just as confused:

  [hadoop@Master ~]$ ./hadoopCtrl.sh list
  3118 NameNode
  4829 Jps
  3533 QuorumPeerMain
  3288 SecondaryNameNode
  3380 JobTracker
  3820 HMaster
  ======= Master.Hadoop  ==========
  2896 DataNode
  3000 TaskTracker
  3627 Jps
  3121 QuorumPeerMain
  3230 HRegionServer
  ======= Slave1.Hadoop  ==========
  2511 Jps
  2186 HRegionServer
  1968 TaskTracker
  1874 DataNode
  2105 QuorumPeerMain
  ======= Slave2.Hadoop  ==========
  2067 DataNode
  2522 Jps
  2386 HRegionServer
  2308 QuorumPeerMain
  2172 TaskTracker
  ======= Slave3.Hadoop  ==========

Then I tried the HBase shell. The result:

  hbase(main):002:0> list
  TABLE
  COLUMNSTABLE
  PERSONALINFO
  configtable
  3 row(s) in 0.1230 seconds
  hbase(main):002:0> scan 'configtable'
  ROW                                         COLUMN+CELL
  ERROR: org.apache.hadoop.hbase.client.NoServerForRegionException: Unable to find region for configtable,,99999999999999 after 7 tries.
  Here is some help for this command:
  Scan a table; pass table name and optionally a dictionary of scanner
  specifications.  Scanner specifications may include one or more of:
  TIMERANGE, FILTER, LIMIT, STARTROW, STOPROW, TIMESTAMP, MAXLENGTH,
  or COLUMNS, CACHE
  If no columns are specified, all columns will be scanned.
  To scan all members of a column family, leave the qualifier empty as in
  'col_family:'.
  The filter can be specified in two ways:
  1. Using a filterString - more information on this is available in the
  Filter Language document attached to the HBASE-4176 JIRA
  2. Using the entire package name of the filter.
  Some examples:
    hbase> scan '.META.'
    hbase> scan '.META.', {COLUMNS => 'info:regioninfo'}
    hbase> scan 't1', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}
    hbase> scan 't1', {COLUMNS => 'c1', TIMERANGE => [1303668804, 1303668904]}
    hbase> scan 't1', {FILTER => "(PrefixFilter ('row2') AND (QualifierFilter (>=, 'binary:xyz'))) AND (TimestampsFilter ( 123, 456))"}
    hbase> scan 't1', {FILTER => org.apache.hadoop.hbase.filter.ColumnPaginationFilter.new(1, 0)}
  For experts, there is an additional option -- CACHE_BLOCKS -- which
  switches block caching for the scanner on (true) or off (false).  By
  default it is enabled.  Examples:
    hbase> scan 't1', {COLUMNS => ['c1', 'c2'], CACHE_BLOCKS => false}
  Also for experts, there is an advanced option -- RAW -- which instructs the
  scanner to return all cells (including delete markers and uncollected deleted
  cells). This option cannot be combined with requesting specific COLUMNS.
  Disabled by default.  Example:
    hbase> scan 't1', {RAW => true, VERSIONS => 10}

   So the HBase cluster was up, but it did not behave like a cluster at all. Back to the logs. The HMaster log:

  2015-04-13 17:00:38,737 INFO org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: The identifier of this process is 3820@Master.Hadoop
  2015-04-13 17:00:38,738 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server Slave1.Hadoop/192.168.1.3:2181. Will not attempt to authenticate using SASL (unknown error)
  2015-04-13 17:00:38,738 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to Slave1.Hadoop/192.168.1.3:2181, initiating session
  2015-04-13 17:00:38,755 WARN org.apache.zookeeper.ClientCnxnSocket: Connected to an old server; r-o mode will be unavailable
  2015-04-13 17:00:38,755 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server Slave1.Hadoop/192.168.1.3:2181, sessionid = 0x24cb1fcf74e0004, negotiated timeout = 40000
  2015-04-13 17:00:39,178 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 267374 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
  2015-04-13 17:00:40,680 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 268876 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
  2015-04-13 17:00:42,182 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 270378 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
  2015-04-13 17:00:43,684 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 271880 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
  2015-04-13 17:00:45,186 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 273382 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
  2015-04-13 17:00:46,688 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 274884 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
  2015-04-13 17:00:48,190 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 276386 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
  2015-04-13 17:00:49,692 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 277888 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
  2015-04-13 17:00:51,194 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 279390 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
  2015-04-13 17:00:52,696 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 280892 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
  2015-04-13 17:00:54,199 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 282395 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
  2015-04-13 17:00:55,701 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 283897 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
  2015-04-13 17:00:57,203 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 285399 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.

That is what the HMaster log turned into after a supposedly normal startup.
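jps shows HRegionServer processes on all three slaves, yet the master believes zero region servers have checked in. A hedged way to see what is actually registered in ZooKeeper, assuming the default zookeeper.znode.parent of /hbase:

  hbase zkcli ls /hbase/rs        # one ephemeral znode per region server that has checked in with ZooKeeper
  hbase zkcli get /hbase/master   # the address the active master registered for itself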
The HRegionServer log looked like this:

  2015-04-13 16:10:03,292 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Attempting connect to Master server at localhost,60000,1428912179738
  2015-04-13 16:11:03,326 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to connect to master. Retrying. Error was:
  java.net.ConnectException: Connection refused
          at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
          at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
          at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
          at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:656)
          at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupConnection(HBaseClient.java:390)
          at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:436)
          at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1124)
          at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:974)
          at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:86)
          at com.sun.proxy.$Proxy8.getProtocolVersion(Unknown Source)
          at org.apache.hadoop.hbase.ipc.WritableRpcEngine.getProxy(WritableRpcEngine.java:138)
          at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:208)
          at org.apache.hadoop.hbase.regionserver.HRegionServer.getMaster(HRegionServer.java:1995)
          at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:2041)
          at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:736)
          at java.lang.Thread.run(Thread.java:745)
  2015-04-13 16:11:03,527 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Attempting connect to Master server at localhost,60000,1428912179738
  2015-04-13 16:12:03,562 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to connect to master. Retrying. Error was:
  org.apache.hadoop.hbase.ipc.HBaseClient$FailedServerException: This server is in the failed servers list: localhost/192.168.1.3:60000
          at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:425)
          at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1124)
          at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:974)
          at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:86)
          at com.sun.proxy.$Proxy8.getProtocolVersion(Unknown Source)
          at org.apache.hadoop.hbase.ipc.WritableRpcEngine.getProxy(WritableRpcEngine.java:138)
          at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:208)
          at org.apache.hadoop.hbase.regionserver.HRegionServer.getMaster(HRegionServer.java:1995)
          at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:2041)
          at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:736)
          at java.lang.Thread.run(Thread.java:745)
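One detail worth noting in this log, offered as an observation rather than something confirmed later in the thread: the region servers are trying to reach the master at localhost,60000, which usually means the master registered itself in ZooKeeper under a hostname that resolves to the loopback address. A minimal check on the master node, using the host names from the jps listing above:

  hostname
  # Master.Hadoop should map to its LAN address (192.168.1.2 per the web UI URL), not to 127.0.0.1
  grep -n 'Master.Hadoop\|localhost' /etc/hosts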

I also tried repairing .META., but that did not work either:

  hbase hbck
  1. Repair the HBase .META. table:
  hbase hbck -fixMeta
  2. Re-assign the .META. regions to the region servers:
  hbase hbck -fixAssignments

But it kept printing the following (note the empty serverName on the last line):

  15/04/13 18:13:31 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=192.168.1.4:2181,192.168.1.3:2181,192.168.1.5:2181 sessionTimeout=180000 watcher=hconnection
  15/04/13 18:13:31 INFO zookeeper.RecoverableZooKeeper: The identifier of this process is 5476@Master.Hadoop
  15/04/13 18:13:31 INFO zookeeper.ClientCnxn: Opening socket connection to server Slave2.Hadoop/192.168.1.4:2181. Will not attempt to authenticate using SASL (unknown error)
  15/04/13 18:13:31 INFO zookeeper.ClientCnxn: Socket connection established to Slave2.Hadoop/192.168.1.4:2181, initiating session
  15/04/13 18:13:31 WARN zookeeper.ClientCnxnSocket: Connected to an old server; r-o mode will be unavailable
  15/04/13 18:13:31 INFO zookeeper.ClientCnxn: Session establishment complete on server Slave2.Hadoop/192.168.1.4:2181, sessionid = 0x34cb1fcf83e0000, negotiated timeout = 40000
  15/04/13 18:14:31 DEBUG client.HConnectionManager$HConnectionImplementation: Looked up root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@735404c6; serverName=
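That "Looked up root region location ... serverName=" line with nothing after the equals sign means hbck could not find any server hosting -ROOT-. A hedged way to confirm that directly, again assuming the default znode layout under /hbase:

  hbase zkcli get /hbase/root-region-server   # the znode that records which region server is hosting -ROOT-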

Could someone more experienced please point me in the right direction? I would be very grateful. Thanks in advance!




3 replies

desehawk replied on 2015-4-14 00:33:31
The hbase shell and the HBase master are not the same thing; even if the master is down, the hbase shell itself can still be used.
Judging from the hbase shell error and the log errors below it, the HMaster has gone down and cannot be connected to.

nobileamir replied on 2015-4-14 09:06:40

nobileamir replied on 2015-4-14 12:03:51
It turned out my HBase configuration had a problem. I edited hbase-site.xml and added:
  <property>
    <name>hbase.rpc.timeout</name>
    <value>1200000</value>
  </property>
  <property>
    <name>hbase.snapshot.master.timeoutMillis</name>
    <value>1200000</value>
  </property>
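For completeness, a sketch of how the changed hbase-site.xml would typically be pushed to the other nodes and HBase restarted so the new timeouts take effect; the install path /home/hadoop/hbase and the use of scp are assumptions, not steps described in the post:

  for h in Slave1.Hadoop Slave2.Hadoop Slave3.Hadoop; do
      scp /home/hadoop/hbase/conf/hbase-site.xml hadoop@$h:/home/hadoop/hbase/conf/
  done
  # restart HBase so every daemon picks up the new configuration
  /home/hadoop/hbase/bin/stop-hbase.sh
  /home/hadoop/hbase/bin/start-hbase.sh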


I had also overlooked this error:
  WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to connect to master. Retrying. Error was:
  java.net.ConnectException: Connection refused
          at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
          at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
          at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
          at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:489)
          at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupConnection(HBaseClient.java:328)
          at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:362)
          at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1046)
          at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:898)


Thanks to: http://blog.csdn.net/xiaolang85/article/details/8018112

