Preface
To integrate HDFS with Ranger, our department runs Hadoop 2.7.6, while Ranger 1.2 officially recommends building against Hadoop 2.7.1. After the integration, running `hdfs dfs -ls /` threw a NullPointerException, so I tracked the bug down by debugging the source code.
Enabling debug mode to inspect the logs
Modify the following setting in hadoop-daemon.sh and restart HDFS to turn on debug logging:
export HADOOP_ROOT_LOGGER=${HADOOP_ROOT_LOGGER:-"DEBUG,RFA"}
After running `hdfs dfs -ls /`, the log reports the following error:
2018-03-04 15:27:32,158 DEBUG org.apache.hadoop.ipc.Server: Successfully authorized userInfo {
  effectiveUser: "tidb"
}
protocol: "org.apache.hadoop.hdfs.protocol.ClientProtocol"
2018-03-04 15:27:32,158 DEBUG org.apache.hadoop.ipc.Server: got #0
2018-03-04 15:27:32,158 DEBUG org.apache.hadoop.ipc.Server: IPC Server handler 1 on 8020: org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from xxxx:47550 Call#0 Retry#0 for RpcKind RPC_PROTOCOL_BUFFER
2018-03-04 15:27:32,159 DEBUG org.apache.hadoop.security.UserGroupInformation: PrivilegedAction as:tidb (auth:SIMPLE) from:org.apache.hadoop.ipc.Server$Handler.run(Server.java:2213)
2018-03-04 15:27:32,160 DEBUG org.apache.hadoop.security.UserGroupInformation: ACCESS CHECK: org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker@19d49e04, doCheckOwner=false, ancestorAccess=null, parentAccess=null, access=null, subAccess=null, ignoreEmptyDir=false
2018-03-04 15:27:32,160 DEBUG org.apache.hadoop.ipc.Server: Served: getFileInfo queueTime= 1 procesingTime= 1 exception= NullPointerException
2018-03-04 15:27:32,160 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from xx.xx.xx.xx.7:47550 Call#0 Retry#0
java.lang.NullPointerException
        at org.apache.hadoop.hdfs.DFSUtil.bytes2String(DFSUtil.java:314)
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.getINodeAttrs(FSPermissionChecker.java:238)
        at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:183)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1752)
        at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getFileInfo(FSDirStatAndListingOp.java:100)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFileInfo(FSNamesystem.java:3831)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getFileInfo(NameNodeRpcServer.java:1012)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getFileInfo(ClientNamenodeProtocolServerSideTranslatorPB.java:855)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2217)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2213)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1758)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2213)
Source code tracing
Debugging the source code
From the stack trace we can see that the NullPointerException is thrown at line 238 of FSPermissionChecker.java:
  private INodeAttributes getINodeAttrs(byte[][] pathByNameArr, int pathIdx,
      INode inode, int snapshotId) {
    INodeAttributes inodeAttrs = inode.getSnapshotINode(snapshotId);
    if (getAttributesProvider() != null) {
      String[] elements = new String[pathIdx + 1];
      for (int i = 0; i < elements.length; i++) {
        // FSPermissionChecker.java:238 -- DFSUtil.bytes2String throws an NPE
        // when pathByNameArr[i] is null
        elements[i] = DFSUtil.bytes2String(pathByNameArr[i]);
      }
      inodeAttrs = getAttributesProvider().getAttributes(elements, inodeAttrs);
    }
    return inodeAttrs;
  }
Looking closer, the NPE comes from the call to DFSUtil.bytes2String: the incoming pathByNameArr is never validated, and an element of this byte[][] array can be null, which triggers the NullPointerException. Printing pathByNameArr inside the loop shows that when `hdfs dfs -ls /` is executed, pathByNameArr[0] is null. I filed an issue, and the HDFS PMC in the community had already discovered and fixed this bug in Hadoop 3.0.0-beta; see HDFS-12614.
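To make the failure mode concrete, here is a minimal, self-contained sketch. The bytes2String stand-in below is a simplified assumption for illustration, not the real DFSUtil implementation, but it dereferences the array the same way and fails identically on the root path's null component.

import java.nio.charset.StandardCharsets;

public class RootPathNpeDemo {

    // Simplified stand-in for DFSUtil.bytes2String (an assumption for this demo):
    // like the real helper, it reads bytes.length without a null check.
    static String bytes2String(byte[] bytes) {
        return new String(bytes, 0, bytes.length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // For "hdfs dfs -ls /" the resolved path components arrive as a single
        // null element, i.e. {null}.
        byte[][] pathByNameArr = new byte[][] { null };

        // Same conversion loop shape as getINodeAttrs -> NullPointerException here.
        String[] elements = new String[pathByNameArr.length];
        for (int i = 0; i < elements.length; i++) {
            elements[i] = bytes2String(pathByNameArr[i]);
        }
    }
}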
Fix
Fixed code
  private INodeAttributes getINodeAttrs(byte[][] pathByNameArr, int pathIdx,
      INode inode, int snapshotId) {
    INodeAttributes inodeAttrs = inode.getSnapshotINode(snapshotId);
    if (getAttributesProvider() != null) {
      String[] elements = new String[pathIdx + 1];
      for (int i = 0; i < elements.length; i++) {
        // see HDFS-12614: the root-only path "/" resolves to a single null component
        if (pathByNameArr.length == 1 && pathByNameArr[0] == null) {
          elements[0] = "";
        } else {
          elements[i] = DFSUtil.bytes2String(pathByNameArr[i]);
        }
      }
      inodeAttrs = getAttributesProvider().getAttributes(elements, inodeAttrs);
    }
    return inodeAttrs;
  }
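As a quick sanity check of the patched branch, the sketch below extracts the element-building loop into a plain helper (buildPathElements is a hypothetical name used only for this illustration, not an HDFS method). It shows that the root path now yields a single empty string instead of an NPE, while ordinary paths are converted as before.

import java.nio.charset.StandardCharsets;

public class PatchedRootPathCheck {

    // Hypothetical helper mirroring the patched loop in getINodeAttrs.
    static String[] buildPathElements(byte[][] pathByNameArr, int pathIdx) {
        String[] elements = new String[pathIdx + 1];
        for (int i = 0; i < elements.length; i++) {
            // see HDFS-12614: the root-only path "/" arrives as a single null component
            if (pathByNameArr.length == 1 && pathByNameArr[0] == null) {
                elements[0] = "";
            } else {
                elements[i] = new String(pathByNameArr[i], StandardCharsets.UTF_8);
            }
        }
        return elements;
    }

    public static void main(String[] args) {
        // "hdfs dfs -ls /": a single null component now becomes [""] instead of an NPE.
        String[] root = buildPathElements(new byte[][] { null }, 0);
        System.out.println(root.length + " component(s), first = \"" + root[0] + "\"");

        // An ordinary path such as "/a": the root component plus "a" -> ["", "a"].
        String[] ab = buildPathElements(new byte[][] {
                "".getBytes(StandardCharsets.UTF_8),
                "a".getBytes(StandardCharsets.UTF_8) }, 1);
        System.out.println(ab.length + " component(s), last = \"" + ab[1] + "\"");
    }
}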
When `hdfs dfs -ls /` is entered and the path is split into components, pathByNameArr has length 1 and pathByNameArr[0] is null; with the new check in place, elements[0] is assigned the empty string "". The following comment from the HDFS PMC explains what makes this bug awkward to fix upstream:
I've not liked the inconsistency with whether the root inode's name is considered null or empty string. However I've been
leery to touch it since it is a public method and inevitably something somewhere always breaks when fixing/changing semantics.
I'd be more comfortable with the change in the call to the attr provider in the code you reference above.
I'll take this chance to rant a bit about how enabling an attribute provider ruins a lot of the work I put into reducing all
the string/byte conversions. Those aren't cheap. The interface is fundamentally flawed: an attr provider requires converting
each byte[] component of the path back into a string multiple times. Ie. The path "/a/b/c" requires calling the attr provider
with: "", "", "a", "", "a","b", "", "a","b", "c". Every single one of those strings were freshly (re)converted from a byte[].
Bonus points if you avoid all the redundant conversions. I punted on it because at the time we had no intention of using the
attr provider but now we might.
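To put a rough number on the cost that comment describes, the sketch below is my own illustration, not HDFS code. It assumes the attribute provider is called once per path depth with every component from the root down to that depth, exactly as in the quoted "/a/b/c" example, so an N-component path costs N*(N+1)/2 conversions.

public class AttrProviderConversionCount {

    // Counts byte[]-to-String conversions under the assumption above:
    // "", then "","a", then "","a","b", then "","a","b","c", ...
    static int conversionsFor(int componentCount) {
        int conversions = 0;
        for (int depth = 1; depth <= componentCount; depth++) {
            conversions += depth; // every component up to this depth is re-converted
        }
        return conversions; // equals componentCount * (componentCount + 1) / 2
    }

    public static void main(String[] args) {
        // "/a/b/c" has four components including the root: "", "a", "b", "c"
        System.out.println(conversionsFor(4));  // 10 conversions for one permission check
        System.out.println(conversionsFor(10)); // 55 for a ten-component path
    }
}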
Summary
Our data platform team currently runs Hadoop 2.7.6, which is several releases behind the community, and we plan a rolling upgrade to a more recent Hadoop version later on. The bug fix has been applied in our test environment; after an observation period it will be rolled out to production.
References
- https://issues.apache.org/jira/browse/HDFS-12614
- https://blog.csdn.net/heshang285170811/article/details/51242788