ckaiwj1314 posted on 2015-1-5 12:36:28

Secure Hadoop: recurring authentication failures

I installed a CDH 5.2 Hadoop cluster from RPM packages and secured it with Kerberos. When the NameNode process starts, it reports that Kerberos authentication succeeded, but one day later (24 hours) authentication starts failing. It is very regular: each time I restart the process it works for exactly 24 hours, then starts reporting authentication failures. Could anyone help?


Here is the log:

2015-01-01 16:37:53,116 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for hdfs/hadoop2@EXAMPLE.COM (auth:KERBEROS)
2015-01-01 16:37:53,119 INFO SecurityLogger.org.apache.hadoop.security.authorize.ServiceAuthorizationManager: Authorization successful for hdfs/hadoop2@EXAMPLE.COM (auth:KERBEROS) for protocol=interface org.apache.hadoop.hdfs.server.protocol.NamenodeProtocol
2015-01-01 16:37:53,119 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 10.80.14.22
2015-01-01 16:37:53,119 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Rolling edit logs
2015-01-01 16:37:53,119 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Ending log segment 12668
2015-01-01 16:37:53,119 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 2 Total time for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of syncs: 1 SyncTimes(ms): 22 74
2015-01-01 16:37:53,127 WARN org.apache.hadoop.security.UserGroupInformation: Not attempting to re-login since the last re-login was attempted less than 600 seconds before.
2015-01-01 16:37:53,127 WARN org.apache.hadoop.security.UserGroupInformation: Not attempting to re-login since the last re-login was attempted less than 600 seconds before.
2015-01-01 16:37:54,764 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 1644ms to send a batch of 1 edits (17 bytes) to remote journal 10.80.14.21:8485
2015-01-01 16:37:56,004 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 2883ms to send a batch of 1 edits (17 bytes) to remote journal 10.80.14.22:8485
2015-01-01 16:37:56,010 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 2 Total time for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of syncs: 2 SyncTimes(ms): 2908 79
2015-01-01 16:37:56,037 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /hadoopdata/hadoop/name/current/edits_inprogress_0000000000000012668 -> /hadoopdata/hadoop/name/current/edits_0000000000000012668-0000000000000012669
2015-01-01 16:37:56,037 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 12670
2015-01-01 16:37:56,419 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 3299ms to send a batch of 1 edits (17 bytes) to remote journal 10.80.14.26:8485
2015-01-01 16:38:09,685 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-01-01 16:38:09,686 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
2015-01-01 16:38:19,785 INFO org.apache.hadoop.hdfs.server.namenode.ImageServlet: ImageServlet allowing checkpointer: hdfs/hadoop2@EXAMPLE.COM
2015-01-01 16:38:19,825 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Transfer took 0.04s at 100.00 KB/s
2015-01-01 16:38:19,825 INFO org.apache.hadoop.hdfs.server.namenode.TransferFsImage: Downloaded file fsimage.ckpt_0000000000000012669 size 4704 bytes.
2015-01-01 16:38:19,873 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Going to retain 2 images with txid >= 12609
2015-01-01 16:38:19,873 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Purging old image FSImageFile(file=/hadoopdata/hadoop/name/current/fsimage_0000000000000012549, cpktTxId=0000000000000012549)
2015-01-01 16:38:39,685 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-01-01 16:38:39,685 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
2015-01-01 16:39:09,685 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2015-01-01 16:39:09,686 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).

It breaks very abruptly right here: there is no ERROR beforehand, authentication just starts failing. The process is still running but is no longer usable:
2015-01-01 16:39:23,958 WARN SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for 10.80.14.21:45417:null (GSS initiate failed)
2015-01-01 16:39:23,959 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 8020: readAndProcess from client 10.80.14.21 threw exception ]
2015-01-01 16:39:24,567 WARN SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for 10.80.14.21:36313:null (GSS initiate failed)
2015-01-01 16:39:24,567 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 8020: readAndProcess from client 10.80.14.21 threw exception ]
2015-01-01 16:39:24,704 WARN SecurityLogger.org.apache.hadoop.ipc.Server: Auth failed for 10.80.14.21:34757:null (GSS initiate failed)
2015-01-01 16:39:24,704 INFO org.apache.hadoop.ipc.Server: Socket Reader #1 for port 8020: readAndProcess from client 10.80.14.21 threw exception ]
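The two "Not attempting to re-login" warnings above come from org.apache.hadoop.security.UserGroupInformation, i.e. the daemon's own keytab re-login path. When chasing this kind of failure, it is also worth confirming what the keytab actually contains (a sketch; the path matches the crontab shown later in the thread):

$ klist -k -t /etc/hadoop/conf/hdfs.keytab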


tntzbzc posted on 2015-1-5 14:39:44

Check whether the local ticket cache has disappeared.
For details, see:
YARN & HDFS2: installing and configuring Kerberos
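A one-line check for the default cache location on Linux:

$ ls -l /tmp/krb5cc_*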


ckaiwj1314 posted on 2015-1-5 17:45:46

tntzbzc posted on 2015-1-5 14:39
Check whether the local ticket cache has disappeared.
For details, see:
YARN & HDFS2: installing and configuring Kerberos

How do I inspect this cache?

bioger_hit posted on 2015-1-5 18:11:31

ckaiwj1314 posted on 2015-1-5 17:45
How do I inspect this cache?

You can use the klist command. Here is an example and what its output means:
$ klist
Ticket cache: FILE:/tmp/krb5cc_500
Default principal: hadoop@DIANPING.COM
Valid starting     Expires            Service principal
09/11/13 15:25:34  09/12/13 15:25:34  krbtgt/DIANPING.COM@DIANPING.COM
        renew until 09/12/13 15:25:34


Here /tmp/krb5cc_500 is the Kerberos ticket cache. By default the file is created under /tmp and named "krb5cc_" plus a uid; the 500 here is the uid of the hadoop account.
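If a service keeps its cache somewhere else, klist can also be pointed at a specific cache file (reusing the path from the example above):

$ klist -c /tmp/krb5cc_500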




Run kinit to obtain a TGT (ticket-granting ticket); -r asks for a renewable ticket:

$ kinit -r 24h -k -t /home/hadoop/.keytab hadoop
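Once a renewable TGT is in the cache, it can later be extended without touching the keytab (standard kinit behavior, valid up to the renewable lifetime):

$ kinit -R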







The 500 matches the hadoop account's uid, which getent confirms:

$ getent passwd hadoop
hadoop:x:500:500::/home/hadoop:/bin/bash

You can also set export KRB5CCNAME=/tmp/krb5cc_500 in the environment to point tools at a specific ticket cache path.
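For example (reusing the uid-500 cache from above):

$ export KRB5CCNAME=/tmp/krb5cc_500
$ klist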

ckaiwj1314 posted on 2015-1-5 18:17:53

Since I installed from RPM packages, the daemons have to be started as root:
service hadoop-hdfs-namenode start
# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: hdfs/hadoop1@EXAMPLE.COM

Valid starting     Expires            Service principal
01/05/15 18:00:02  01/06/15 18:00:02  krbtgt/EXAMPLE.COM@EXAMPLE.COM
        renew until 01/12/15 18:00:02

That's my ticket info. I also set up a crontab entry to keep re-authenticating:
# crontab -l
00 * * * * /usr/bin/kinit -k -t /etc/hadoop/conf/hdfs.keytab hdfs/hadoop1
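Worth noting: a Hadoop daemon logs in from its keytab through UserGroupInformation inside its own JVM, so a root crontab that refreshes /tmp/krb5cc_0 does not feed the NameNode's login. A failure exactly every 24 hours is also what you would see if the TGT's maximum ticket life is 24h and its renewable life is effectively zero. On an MIT KDC both can be checked, and raised if needed (a sketch; the principal names are taken from the logs above, and "7days" is only an example value):

# on the KDC: look at "Maximum ticket life" and "Maximum renewable life"
$ kadmin.local -q "getprinc hdfs/hadoop1@EXAMPLE.COM"

# if the renewable life is 0, renewals always fail; it is usually raised on both
# the service principal and the realm's TGT principal:
$ kadmin.local -q "modprinc -maxrenewlife 7days hdfs/hadoop1@EXAMPLE.COM"
$ kadmin.local -q "modprinc -maxrenewlife 7days krbtgt/EXAMPLE.COM@EXAMPLE.COM"

After changing maxrenewlife, the ticket has to be re-obtained with kinit before the new renewable life takes effect.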




muyannian posted on 2015-1-5 20:39:46

ckaiwj1314 posted on 2015-1-5 18:17
Since I installed from RPM packages, the daemons have to be started as root
service hadoop-hdfs-namenode start
#...

Watch the permissions here; getting them wrong is one of the classic pitfalls in a cluster.
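For example, if a keytab is readable only by root while a daemon runs as hdfs, the keytab login fails. A quick check (keytab path as in the crontab above):

$ ls -l /etc/hadoop/conf/hdfs.keytab
$ sudo -u hdfs klist -k -t /etc/hadoop/conf/hdfs.keytab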

ckaiwj1314 posted on 2015-1-7 12:00:10

The cluster does come up normally, and at startup the logs show authentication succeeding.

墨魂 posted on 2015-2-6 11:55:32

This post was last edited by 墨魂 on 2015-2-6 12:01

ckaiwj1314 posted on 2015-1-7 12:00
The cluster does come up normally, and at startup the logs show authentication succeeding.
1. Check the startup log and whether the environment at startup has KRB5CCNAME pointing at the expected location.
2. Check that the principal's renew lifetime is set correctly, and also verify the renewal times configured in krb5.conf.
3. Check the configuration and make sure authentication is driven by the standard settings in hdfs-site.xml, not by a JAAS config passed in through JAVA_OPTS. (A quick way to eyeball items 2 and 3 is sketched below.)
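A rough check for items 2 and 3 (the paths assume a stock CDH layout; the two property names are Hadoop's standard NameNode security settings):

# renewal-related defaults in the Kerberos client config:
$ grep -E 'ticket_lifetime|renew_lifetime' /etc/krb5.conf

# the keytab and principal the NameNode should be logging in from:
$ grep -A1 -E 'dfs.namenode.keytab.file|dfs.namenode.kerberos.principal' /etc/hadoop/conf/hdfs-site.xml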

PS: The problem I ran into myself was with HBase. The KDC was a Windows Server, and every ticket had a uniform 10-hour lifetime (it had to be renewed within 10 hours). While testing my configuration I had set -Djava.security.auth.login.config=jaas.conf, i.e. authentication went through Java's Krb5LoginModule (with useTicketCache set to false in that config). As a result the RegionServer neither read the ticket cache nor renewed its tickets, so it died exactly every 10 hours. After I fixed the configuration it ran normally.
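If you suspect the same pitfall, a quick (illustrative) check is to look for that flag on the running daemon's command line:

$ ps -ef | grep -o 'java.security.auth.login.config=[^ ]*'

If this prints anything, the process authenticates through Krb5LoginModule, and the settings in the referenced file (useKeyTab, useTicketCache, and so on) determine whether its tickets can ever be renewed.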