beat_it 2017-11-21 11:51

Why does the Hadoop NameNode die?

Case 1:
Remote journal 192.168.8.195:8485 failed to write txns 1698499-1698499. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 48 is less than the last promised epoch 49
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:429)
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkWriteRequest(Journal.java:457)
at org.apache.hadoop.hdfs.qjournal.server.Journal.journal(Journal.java:352)
at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.journal(JournalNodeRpcServer.java:149)

Case 2:
2017-11-21 19:26:01,859 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 43505 milliseconds
2017-11-21 19:26:01,860 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 21624 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
2017-11-21 19:26:01,861 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [192.168.8.191:8485, 192.168.8.192:8485, 192.168.8.193:8485, 192.168.8.194:8485, 192.168.8.195:8485], stream=QuorumOutputStream starting at txid 110343))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:659)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:593)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInt(FSNamesystem.java:4070)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:4053)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.delete(NameNodeRpcServer.java:845)
at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.delete(AuthorizationProviderProxyClientProtocol.java:308)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.delete(ClientNamenodeProtocolServerSideTranslatorPB.java:603)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2216)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2212)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2210)
2017-11-21 19:26:01,867 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Aborting QuorumOutputStream starting at txid 110343
2017-11-21 19:26:01,863 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 4 millisecond(s).
2017-11-21 19:26:01,870 INFO BlockStateChange: BLOCK* BlockManager: ask 192.168.8.191:50010 to delete [blk_1073750549_9840]
2017-11-21 19:26:01,947 INFO BlockStateChange: BLOCK* BlockManager: ask 192.168.8.194:50010 to delete [blk_1073750549_9840]
2017-11-21 19:26:01,947 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 21713ms to send a batch of 1 edits (127 bytes) to remote journal 192.168.8.193:8485
2017-11-21 19:26:01,945 WARN org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 21487ms
GC pool 'ParNew' had collection(s): count=1 time=21603ms
2017-11-21 19:26:01,944 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 21709ms to send a batch of 1 edits (127 bytes) to remote journal 192.168.8.192:8485
2017-11-21 19:26:01,938 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 21703ms to send a batch of 1 edits (127 bytes) to remote journal 192.168.8.194:8485
2017-11-21 19:26:01,938 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 21703ms to send a batch of 1 edits (127 bytes) to remote journal 192.168.8.195:8485
2017-11-21 19:26:01,934 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 21700ms to send a batch of 1 edits (127 bytes) to remote journal 192.168.8.191:8485
2017-11-21 19:26:01,960 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 610 Total time for transactions(ms): 110 Number of transactions batched in Syncs: 179 Number of syncs: 0 SyncTimes(ms):
2017-11-21 19:26:01,997 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2017-11-21 19:26:02,121 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:

1 answer

  • lshen01 2023-03-16 09:53
    Based on GPT and my own reasoning:

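    In both cases the NameNode died because its communication with the JournalNodes broke down.

    In case 1, the JournalNode at 192.168.8.195:8485 rejected the write of txn 1698499 because the request carried epoch 48 while that JournalNode had already promised epoch 49. A higher promised epoch usually means another NameNode has since taken over as the active writer, so writes from this NameNode are fenced off.

    In case 2, the NameNode waited more than 20000 ms for a quorum of JournalNodes to acknowledge sendEdits. The flush failed on a required journal, and the NameNode exited with status 1. A JournalNode fault or network latency can cause this, but the same log also shows JvmPauseMonitor reporting a pause of about 21487 ms (a ParNew collection of 21603 ms), and all five JournalNodes report nearly the same delay of about 21700 ms. That pattern suggests the NameNode's own GC pause, rather than the JournalNodes, may be what exceeded the timeout.

    So check the state of each JournalNode to locate the problem, and check the network, GC behavior, and general host health to confirm the environment is stable. In production, also keep the JournalNode quorum redundant and monitored so that no single failure takes the cluster down.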
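
To tell these two failure modes apart quickly, the checks above can be sketched as a small Python script. This is only a minimal sketch: it assumes the log lines look like the excerpts in the question, and the names `diagnose` and `SAMPLE` are illustrative helpers, not part of Hadoop.

```python
import re

# Patterns copied from the log messages quoted in the question.
PAUSE_RE = re.compile(
    r"Detected pause in JVM or host machine.*?pause of approximately (\d+)ms"
)
TIMEOUT_RE = re.compile(r"Timed out waiting (\d+)ms for a quorum of nodes to respond")
EPOCH_RE = re.compile(r"IPC's epoch (\d+) is less than the last promised epoch (\d+)")


def diagnose(log_text):
    """Return a list of findings for a NameNode log excerpt (hypothetical helper)."""
    pauses = [int(m.group(1)) for m in PAUSE_RE.finditer(log_text)]
    findings = []

    # Case 1: a JournalNode rejected a write that carried an older epoch.
    for sent, promised in EPOCH_RE.findall(log_text):
        findings.append(
            f"stale epoch: writer used {sent}, JournalNode promised {promised}"
        )

    # Case 2: compare each quorum timeout with the longest local pause seen.
    longest = max(pauses, default=0)
    for m in TIMEOUT_RE.finditer(log_text):
        timeout_ms = int(m.group(1))
        if longest >= timeout_ms:
            findings.append(
                f"quorum timeout {timeout_ms} ms is covered by a local pause of {longest} ms"
            )
        else:
            findings.append(
                f"quorum timeout {timeout_ms} ms with no local pause that long"
            )
    return findings


# Three lines taken from the logs in the question.
SAMPLE = """\
org.apache.hadoop.ipc.RemoteException(java.io.IOException): IPC's epoch 48 is less than the last promised epoch 49
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
2017-11-21 19:26:01,945 WARN org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 21487ms
"""

if __name__ == "__main__":
    for finding in diagnose(SAMPLE):
        print(finding)
```

If the timeout turns out to be covered by a local pause, tuning the NameNode heap and GC is the more direct fix. The 20000 ms limit in the log corresponds to the QJM write timeout, which is normally set through `dfs.qjournal.write-txn.timeout.ms`; raising it only hides the pause.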
