Preface
To integrate HDFS with Ranger, our department runs Hadoop 2.7.6, while Ranger 1.2 officially recommends building against Hadoop 2.7.1. After the integration, running `hdfs dfs -ls /` threw a NullPointerException, so I tracked the bug down by debugging the source code.
Enabling debug mode to inspect the logs
Modify the following setting in hadoop-daemon.sh and restart HDFS to turn on debug logging:
export HADOOP_ROOT_LOGGER=${HADOOP_ROOT_LOGGER:-"DEBUG,RFA"}
After running `hdfs dfs -ls /`, the log reports the following error:
2018-03-04 15:27:32,158 DEBUG org.apache.hadoop.ipc.Server: Successfully authorized userInfo {
  effectiveUser: "tidb"
}
protocol: "org.apache.hadoop.hdfs.protocol.ClientProtocol"
2018-03-04 15:27:32,158 DEBUG org.apache.hadoop.ipc.Server: got #0
2018-03-04 15:27:32,158 DEBUG org.apache.hadoop.ipc.Server: IPC Server handler 1 on 8020: org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from xxxx:47550 Call#0 Retry#0 for RpcKind RPC_PROTOCOL_BUFFER
2018-03-04 15:27:32,159 DEBUG org.apache.hadoop.security.UserGroupInformation: PrivilegedAction as:tidb (auth:SIMPLE) from:org.apache.hadoop.ipc.Server$Handler.run(Server.java:2213)
2018-03-04 15:27:32,160 DEBUG org.apache.hadoop.security.UserGroupInformation: ACCESS CHECK: org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker@19d49e04, doCheckOwner=false, ancestorAccess=null, parentAccess=null, access=null, subAccess=null, ignoreEmptyDir=false
2018-03-04 15:27:32,160 DEBUG org.apache.hadoop.ipc.Server: Served: getFileInfo queueTime= 1 procesingTime= 1 exception= NullPointerException
2018-03-04 15:27:32,160 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from xx.xx.xx.xx.7:47550 Call#0 Retry#0
java.lang.NullPointerException
        at org.apache.hadoop.hdfs.DFSUtil.bytes2String(DFSUtil.java:314)
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.getINodeAttrs(FSPermissionChecker.java:238)
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:183)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1752)
        at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getFileInfo(FSDirStatAndListingOp.java:100)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3831)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1012)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:855)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2217)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2213)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1758)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2213)
Source code tracing
Debugging the source code
From the stack trace we can see that the NullPointerException is thrown at line 238 of FSPermissionChecker.java:
  private INodeAttributes getINodeAttrs(byte[][] pathByNameArr, int pathIdx,
      INode inode, int snapshotId) {
    INodeAttributes inodeAttrs = inode.getSnapshotINode(snapshotId);
    if (getAttributesProvider() != null) {
      String[] elements = new String[pathIdx + 1];
      for (int i = 0; i < elements.length; i++) {
        // FSPermissionChecker.java:238 -- DFSUtil.bytes2String throws an NPE
        // when pathByNameArr[i] is null
        elements[i] = DFSUtil.bytes2String(pathByNameArr[i]);
      }
      inodeAttrs = getAttributesProvider().getAttributes(elements, inodeAttrs);
    }
    return inodeAttrs;
  }
Looking closer, the NPE comes from the call to DFSUtil.bytes2String: the incoming pathByNameArr is never validated, and an element of this byte[][] array can be null, which triggers the NullPointerException. Printing pathByNameArr inside the loop shows that when `hdfs dfs -ls /` is executed, pathByNameArr[0] is null. I filed an issue, and the HDFS PMC in the community had already discovered and fixed this bug in Hadoop 3.0.0-beta; see HDFS-12614.
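To make the failure mode concrete, here is a minimal, self-contained sketch. The bytes2String stand-in below is a simplified assumption for illustration, not the real DFSUtil implementation, but it dereferences the array the same way and fails identically on the root path's null component.

import java.nio.charset.StandardCharsets;

public class RootPathNpeDemo {

    // Simplified stand-in for DFSUtil.bytes2String (an assumption for this demo):
    // like the real helper, it reads bytes.length without a null check.
    static String bytes2String(byte[] bytes) {
        return new String(bytes, 0, bytes.length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // For "hdfs dfs -ls /" the resolved path components arrive as a single
        // null element, i.e. {null}.
        byte[][] pathByNameArr = new byte[][] { null };

        // Same conversion loop shape as getINodeAttrs -> NullPointerException here.
        String[] elements = new String[pathByNameArr.length];
        for (int i = 0; i < elements.length; i++) {
            elements[i] = bytes2String(pathByNameArr[i]);
        }
    }
}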
Fix
Fixed code
  private INodeAttributes getINodeAttrs(byte[][] pathByNameArr, int pathIdx,
      INode inode, int snapshotId) {
    INodeAttributes inodeAttrs = inode.getSnapshotINode(snapshotId);
    if (getAttributesProvider() != null) {
      String[] elements = new String[pathIdx + 1];
      for (int i = 0; i < elements.length; i++) {
        // see HDFS-12614: the root-only path "/" resolves to a single null component
        if (pathByNameArr.length == 1 && pathByNameArr[0] == null) {
          elements[0] = "";
        } else {
          elements[i] = DFSUtil.bytes2String(pathByNameArr[i]);
        }
      }
      inodeAttrs = getAttributesProvider().getAttributes(elements, inodeAttrs);
    }
    return inodeAttrs;
  }
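As a quick sanity check of the patched branch, the sketch below extracts the element-building loop into a plain helper (buildPathElements is a hypothetical name used only for this illustration, not an HDFS method). It shows that the root path now yields a single empty string instead of an NPE, while ordinary paths are converted as before.

import java.nio.charset.StandardCharsets;

public class PatchedRootPathCheck {

    // Hypothetical helper mirroring the patched loop in getINodeAttrs.
    static String[] buildPathElements(byte[][] pathByNameArr, int pathIdx) {
        String[] elements = new String[pathIdx + 1];
        for (int i = 0; i < elements.length; i++) {
            // see HDFS-12614: the root-only path "/" arrives as a single null component
            if (pathByNameArr.length == 1 && pathByNameArr[0] == null) {
                elements[0] = "";
            } else {
                elements[i] = new String(pathByNameArr[i], StandardCharsets.UTF_8);
            }
        }
        return elements;
    }

    public static void main(String[] args) {
        // "hdfs dfs -ls /": a single null component now becomes [""] instead of an NPE.
        String[] root = buildPathElements(new byte[][] { null }, 0);
        System.out.println(root.length + " component(s), first = \"" + root[0] + "\"");

        // An ordinary path such as "/a": the root component plus "a" -> ["", "a"].
        String[] ab = buildPathElements(new byte[][] {
                "".getBytes(StandardCharsets.UTF_8),
                "a".getBytes(StandardCharsets.UTF_8) }, 1);
        System.out.println(ab.length + " component(s), last = \"" + ab[1] + "\"");
    }
}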
When `hdfs dfs -ls /` is entered and the path is split into components, pathByNameArr has length 1 and pathByNameArr[0] is null; with the new check in place, elements[0] is assigned the empty string "". The following comment from the HDFS PMC explains what makes this bug awkward to fix upstream:
I've not liked the inconsistency with whether the root inode's name is considered null or empty string. However I've been
leery to touch it since it is a public method and inevitably something somewhere always breaks when fixing/changing semantics.
I'd be more comfortable with the change in the call to the attr provider in the code you reference above.
I'll take this chance to rant a bit about how enabling an attribute provider ruins a lot of the work I put into reducing all
the string/byte conversions. Those aren't cheap. The interface is fundamentally flawed: an attr provider requires converting
each byte[] component of the path back into a string multiple times. Ie. The path "/a/b/c" requires calling the attr provider
with: "", "", "a", "", "a","b", "", "a","b", "c". Every single one of those strings were freshly (re)converted from a byte[].
Bonus points if you avoid all the redundant conversions. I punted on it because at the time we had no intention of using the
attr provider but now we might.
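To put a rough number on the cost that comment describes, the sketch below is my own illustration, not HDFS code. It assumes the attribute provider is called once per path depth with every component from the root down to that depth, exactly as in the quoted "/a/b/c" example, so an N-component path costs N*(N+1)/2 conversions.

public class AttrProviderConversionCount {

    // Counts byte[]-to-String conversions under the assumption above:
    // "", then "","a", then "","a","b", then "","a","b","c", ...
    static int conversionsFor(int componentCount) {
        int conversions = 0;
        for (int depth = 1; depth <= componentCount; depth++) {
            conversions += depth; // every component up to this depth is re-converted
        }
        return conversions; // equals componentCount * (componentCount + 1) / 2
    }

    public static void main(String[] args) {
        // "/a/b/c" has four components including the root: "", "a", "b", "c"
        System.out.println(conversionsFor(4));  // 10 conversions for one permission check
        System.out.println(conversionsFor(10)); // 55 for a ten-component path
    }
}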
Summary
Our data platform team currently runs Hadoop 2.7.6, which is several releases behind the community, and we plan a rolling upgrade to a more recent Hadoop version later on. The bug fix has been applied in our test environment; after an observation period it will be rolled out to production.
References
- https://issues.apache.org/jira/browse/HDFS-12614
- https://blog.csdn.net/heshang285170811/article/details/51242788