weixin_39619433
2020-11-25 23:02

Trinity crash

What is wrong?

Trinity crashed; the logs are unclear as to why it shut down.

https://gist.githubusercontent.com/pipermerriam/f33802d8d8890af3c6c7a128e2745e75/raw/a272bab381ed7b2c8dc68573cc3ac53058ca4c8b/trinity-strange-crash.log

How can it be fixed?

Still need to try and diagnose...

This question comes from the open source project: ethereum/py-evm


29 replies

  • weixin_39619433 5 months ago

    It looks like an OperationCancelled escaped during the DAO fork check, which the discovery service interpreted as an upstream cancellation.

    I think we need some sort of guard that ensures that we only silence that exception if the triggering token is in the inheritance chain of the service's local cancel token.

  • weixin_39611666 5 months ago

    Seems very reasonable for OperationCancelled to contain a reference to the token that triggered it, and then:

    except OperationCancelled as exc:
        if self.cancel_token.is_child_of(exc.token):
            pass
        else:
            raise
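
    A minimal sketch of how that could look. The token attribute on OperationCancelled and the is_child_of() helper are assumptions taken from the snippet above, not the existing cancel-token API:

    class OperationCancelled(Exception):
        def __init__(self, message: str, token: 'CancelToken') -> None:
            super().__init__(message)
            # hypothetical: keep a reference to the token whose trigger caused this
            self.token = token

    class CancelToken:
        def __init__(self, name: str, parent: 'CancelToken' = None) -> None:
            self.name = name
            self._parent = parent  # simplified single-parent chain for the sketch

        def is_child_of(self, other: 'CancelToken') -> bool:
            # True if `other` is this token or one of the tokens it was derived from
            token = self
            while token is not None:
                if token is other:
                    return True
                token = token._parent
            return False

    With that shape, the except block above only silences cancellations that originated from the service's own token chain and re-raises anything triggered elsewhere.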
    
  • weixin_39611666 5 months ago

    Also, I guess even swallowed OperationCancelled exceptions could be logged at the trace level, which I'm assuming you had turned on.

  • weixin_39619433 5 months ago

    We should be sure to do a thorough audit of everywhere that we catch OperationCancelled while fixing this.

  • weixin_39619433 5 months ago

    Good thoughts, and I believe I agree with that approach. Unfortunately, it may be closer to the end of this week before I can really spend much time on this. I'm closing #1255. If anyone else wants to take a stab at this :+1:

  • weixin_39619433 5 months ago

    I'm thinking that we should do something like this:

    class Peer:
        ready = asyncio.Event()  # but not as a class variable :)

        async def _run(self):
            # do dao check
            await self.ensure_same_side_of_dao()
            self.ready.set()

    class PeerPool:
        async def start_peer(self, peer):
            self.run_child_service(peer)
            await peer.ready.wait()
            # do the rest of adding peer to pool
            # also: handle failure case however that works.

    This should properly start the peer, do the dao check, and keep the peer away from the subscriber API until the dao check has been fully completed.

  • weixin_39752352 5 months ago

    I'll have a go at this, but I'm not sure what you suggest above is the best alternative, for a few reasons:

    1. It feels wrong to run the Peer service and after that wait for it to be ready, but maybe that's just a naming issue
    2. We're waiting on a Peer attribute, which happens to be just an asyncio.Event now, but if it's changed into something that can raise OperationCancelled, we'll have the same problem as we have now
    3. It doesn't consider the fact that the messages received during the DAO fork check need to be collected, but maybe you meant to keep the with peer.collect_sub_proto_messages() as buffer: there? (A possible shape is sketched below.)

    Anyway, I'll have to think about this a bit more, just thought it was worth mentioning those points
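
    One way to handle point 3, sketched under the assumption that peer.collect_sub_proto_messages() keeps working the way it does today (placeholder code, not the actual implementation):

    class PeerPool:
        async def start_peer(self, peer):
            # buffer sub-protocol messages that arrive while the peer boots and
            # performs the DAO fork check, so they are not lost
            with peer.collect_sub_proto_messages() as buffer:
                self.run_child_service(peer)
                # Peer._run() does the DAO check and then sets peer.ready
                await peer.ready.wait()
            # hand the buffered messages over to subscribers and finish adding
            # the peer to the pool; failure handling omitted in this sketch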

  • weixin_39619433 5 months ago

    My suggestion was merely me dumping my thoughts. Take this in whatever direction you think is right :+1:

  • weixin_39752352 5 months ago

    If you squint, I think the constraint implies that a BaseService subclass should never have any async functions that are public (i.e. not prefixed with _). Maybe that's slightly easier to check for.

    Indeed it would, but it just occurred to me that the await other_service.method() pattern is actually not a problem in itself if it runs in the background (e.g. via BaseService.run_task() & co). It's only code that runs inside a BaseService._run() that has to avoid that pattern.
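
    A small illustration of that distinction, assuming BaseService's run_task() helper; other_service and do_something() are placeholders:

    class SomeService(BaseService):
        async def _run(self) -> None:
            # risky inside _run(): if other_service is cancelled, its
            # OperationCancelled bubbles up here and takes this service down too
            await self.other_service.do_something()

            # safer: run the call in the background; a cancellation coming from
            # other_service then only ends that task, not this _run()
            self.run_task(self.other_service.do_something())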

  • weixin_39619433 5 months ago

    Alright, new cleaned up version of my fix: https://github.com/ethereum/py-evm/pull/1265

  • weixin_39611666 5 months ago

    I don't have the answer, but this seems like the critical bit:

    
       DEBUG  08-31 17:48:55        peer  Malformed message from peer ETHPeer <node>: CMD:BlockHeaders Error: MalformedMessage('Malformed %s message: %r',)
       DEBUG  08-31 17:48:55        peer  Disconnecting from remote peer; reason: bad_protocol
       DEBUG  08-31 17:48:55     service  Cancelling ETHPeer <node>
       DEBUG  08-31 17:48:55     service  <responsecandidatestream>, <class>)> finished: Cancellation requested by ETHPeer:PeerPool:Server:MainnetFullNode token

    That first debug log line happens right before the exception is re-raised. So I guess the exception is somehow bubbling all the way up to somewhere it shouldn't and causing a shutdown. If we find where that catch-all exception handler is, we should log at the exception level to get the full stack trace.
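
    For reference, a generic sketch of what that would look like wherever the catch-all lives (the logger name and placement are assumptions):

    try:
        await self._run()
    except Exception:
        # logger.exception() records the full traceback at ERROR level instead
        # of silently swallowing the error
        self.logger.exception("Unexpected error, shutting down")
        raise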

  • weixin_39611666 5 months ago

    BTW, I see the "Malformed %s message: %r" message is not formatted correctly; I'll fix that in the next PR I'm working on.
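
    Hedged reading of what the log above shows: the bare format string is passed to the exception, so the %s/%r placeholders are never interpolated. The fix would be along these lines (cmd_name and err are placeholder names, not the actual variables):

    # instead of: raise MalformedMessage('Malformed %s message: %r')
    raise MalformedMessage('Malformed %s message: %r' % (cmd_name, err))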

  • weixin_39752352 5 months ago

    This caught my attention:

    
       DEBUG  08-31 17:48:55     service  Cancelling ETHPeer <node>
       DEBUG  08-31 17:48:55     service  <responsecandidatestream>, <class>)> finished: Cancellation requested by ETHPeer:PeerPool:Server:MainnetFullNode token
        INFO  08-31 17:48:55   discovery  stopping discovery
       DEBUG  08-31 17:48:55     service  ETHPeer <node> finished: Cancellation requested by Server:MainnetFullNode token

    It's curious that the Server token gets triggered immediately after that peer's .cancel() method is called (which calls peer.cancel_token.trigger()). It suggests a bug in the cancelling of chained tokens, or maybe something is keeping a reference to the server's cancel_token instead of chaining it to create its own.
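
    To illustrate the distinction (a sketch against the asyncio-cancel-token API as I understand it; the token names are made up):

    from cancel_token import CancelToken

    server_token = CancelToken('Server')

    # correct: the peer chains its own token onto the server's; triggering the
    # peer's token cancels only the peer, while triggering server_token cancels both
    peer_token = CancelToken('ETHPeer').chain(server_token)
    peer_token.trigger()
    assert not server_token.triggered

    # suspect pattern: holding a direct reference to the server's token means
    # "cancelling the peer" actually triggers the server-wide token
    bad_peer_token = server_token
    bad_peer_token.trigger()
    assert server_token.triggered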

  • weixin_39752352 5 months ago

    These two lines indicate the exception has been caught in Peer._run(), which then calls .disconnect() and .cancel():

       DEBUG  08-31 17:48:55        peer  Malformed message from peer ETHPeer : CMD:BlockHeaders Error: MalformedMessage('Malformed %s message: %r',)
       DEBUG  08-31 17:48:55        peer  Disconnecting from remote peer; reason: bad_protocol

  • weixin_39619433 5 months ago

    More logs: https://gist.githubusercontent.com/pipermerriam/f33802d8d8890af3c6c7a128e2745e75/raw/1a9019afccd98647c32bbe3f31e4599e69f8a6d3/trinity-discovery-crash.log

    This validates thoughts here https://github.com/ethereum/py-evm/issues/1234#issuecomment-418193499

    Specifically this section which indicates that the ResponseCandidateStream is part of the cancellation chain that causes the DiscoveryService to halt.

    
     83    DEBUG  09-03 15:45:26     service  <responsecandidatestream>, <class .blockheaders>)> finished: Cancellation requested by ETHPeer:PeerPool:Server:MainnetFullNode token
     84    DEBUG  09-03 15:45:26     service  <p2p.discovery.discoveryservice object at> finished: Cancellation requested by ResponseCandidateStream:ETHPeer:PeerPool:Server:MainnetFullNode token
     85     INFO  09-03 15:45:26   discovery  stopping discovery

    The chain of tokens that caused the discovery cancellation has the ResponseCandidateStream token in it.

  • weixin_39619433 5 months ago

    Also, this might indicate that we need to do something to fix the cancel token chaining API, because honestly I've sometimes had trouble figuring out what the right order should be. Maybe we need something like token.create_child_token() rather than actually combining two tokens and thus having to figure out the right ordering.
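
    For example, a child-token helper could look roughly like this; create_child_token() does not exist in asyncio-cancel-token today, this is just the proposed shape:

    from cancel_token import CancelToken

    def create_child_token(parent: CancelToken, name: str) -> CancelToken:
        # the child fires when either it or the parent fires, and triggering the
        # child never propagates up to the parent, so callers no longer need to
        # remember which way round to chain
        return CancelToken(name).chain(parent)

    server_token = CancelToken('Server')
    peer_token = create_child_token(server_token, 'ETHPeer')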

  • weixin_39619433 5 months ago

    Found part of it: https://github.com/ethereum/py-evm/blob/master/trinity/protocol/common/managers.py#L209

    The ResponseCandidateStream is being run using run_daemon, which isn't exactly wrong since we do need this process to keep running. When this crashes (I haven't yet dug into why that is happening), the peer cancels itself. Still looking for the link that explains why this cascades up to the PeerPool.

  • weixin_39619433 5 months ago

    Not yet confirmed, but this may have been a local error on my machine. I had a version of asyncio-cancel-token installed from source locally, which I'd been mucking around with a few weeks ago. Running an experiment now to see if it is fixed.

  • weixin_39752352 5 months ago

    I think it makes sense that ResponseCandidateStream is executed with run_daemon(), as we probably want the Peer to stop if it terminates unexpectedly, right? One thing that is not right is for ExchangeManager (not ResponseCandidateStream, as I first wrote) to keep a direct reference to the cancel token passed to it (https://github.com/ethereum/py-evm/blob/master/trinity/protocol/common/managers.py#L200), as that means it will always cancel its parent when terminating, but that is a decision that belongs to the parent. This still does not explain how a peer terminating would trigger the server's token, though.

  • weixin_39752352 5 months ago

    The log lines from https://github.com/ethereum/py-evm/issues/1234#issuecomment-418220374 indicate that the token was correctly chained by ResponseCandidateStream, but when it was triggered it ended up cancelling one of the parent tokens in the chain, which doesn't seem to happen with the released version (https://github.com/ethereum/asyncio-cancel-token/pull/9), so it's very possible this was all caused by your locally modified version of asyncio-cancel-token.

  • weixin_39611666 5 months ago

    One thing that is not right is for ExchangeManager to keep a direct reference to the cancel token passed to it (https://github.com/ethereum/py-evm/blob/master/trinity/protocol/common/managers.py#L200), as that means it will always cancel its parent when terminating but that is a decision that belongs to the parent.

    ExchangeManager isn't a service; it keeps the cancel token so that it can hand it to the ResponseCandidateStream on init. It doesn't otherwise use _cancel_token.

  • weixin_39619433 5 months ago

    Closing until I can reproduce with a fresh virtual environment.

  • weixin_39619433 5 months ago

    I am pretty sure that I'm now seeing this in the v0.1.0-alpha.15 release running on one of our VPS instances. Should know for sure pretty soon.

  • weixin_39619433 5 months ago

    I've confirmed this bug is not isolated to my machine. Encountered during light sync on a VPS node; I expect I'll see it in full sync too.

  • weixin_39619433 5 months ago

    More info:

    Looks like DiscoveryProtocol retains a direct link to the Server cancel token

    https://github.com/ethereum/py-evm/blob/v0.2.0-alpha.32/trinity/server.py#L150-L151

    https://github.com/ethereum/py-evm/blob/v0.2.0-alpha.32/p2p/discovery.py#L145

    That token is then triggered when the DiscoveryService cleans up, because its cleanup calls proto.stop(), which triggers the token:

    https://github.com/ethereum/py-evm/blob/v0.2.0-alpha.32/p2p/discovery.py#L819-L820

    https://github.com/ethereum/py-evm/blob/v0.2.0-alpha.32/p2p/discovery.py#L397-L399

  • weixin_39752352 5 months ago

    I can reproduce it when running in light mode.

    As for the fact that DiscoveryProtocol retains a direct link to the Server cancel token, it is indeed something that ought to be fixed, but that token is only triggered when DiscoveryProtocol.stop() is called, and we still don't know why DiscoveryService is terminating

  • weixin_39619433 5 months ago

    Check out my branch piper/weird-crash-thing for some additional logging output that I've injected to try and diagnose this. I still haven't been able to see the crash with that logging enabled.

  • weixin_39619433 5 months ago

    For anyone trying to reproduce this, it does appear that running in light mode is more likely to trigger it.

  • weixin_39619433 5 months ago

    GOT IT!!!!!

    
       DEBUG  09-04 18:00:00     service  STACKTRACE:
    Traceback (most recent call last):
      File "/home/piper/projects/py-evm/p2p/service.py", line 126, in run
        await self._run()
      File "/home/piper/projects/py-evm/p2p/discovery.py", line 780, in _run
        await self.maybe_connect_to_more_peers()
      File "/home/piper/projects/py-evm/p2p/discovery.py", line 800, in maybe_connect_to_more_peers
        self.proto.get_nodes_to_connect(self.peer_pool.max_peers))
      File "/home/piper/projects/py-evm/p2p/peer.py", line 838, in connect_to_nodes
        await self.start_peer(peer)
      File "/home/piper/projects/py-evm/p2p/peer.py", line 742, in start_peer
        await self.ensure_same_side_on_dao_fork(peer)
      File "/home/piper/projects/py-evm/p2p/peer.py", line 864, in ensure_same_side_on_dao_fork
        timeout=CHAIN_SPLIT_CHECK_TIMEOUT,
      File "/home/piper/projects/py-evm/trinity/protocol/les/exchanges.py", line 62, in __call__
        timeout,
      File "/home/piper/projects/py-evm/trinity/protocol/common/exchanges.py", line 83, in get_result
        timeout,
      File "/home/piper/projects/py-evm/trinity/protocol/common/managers.py", line 233, in get_result
        async for payload in stream.payload_candidates(request, tracker, timeout=timeout):
      File "/home/piper/projects/py-evm/trinity/protocol/common/managers.py", line 104, in payload_candidates
        yield await self._get_payload(timeout_remaining)
      File "/home/piper/projects/py-evm/trinity/protocol/common/managers.py", line 151, in _get_payload
        payload = await self.wait(future, timeout=timeout)
      File "/home/piper/projects/py-evm/p2p/cancellable.py", line 19, in wait
        return await self.wait_first(awaitable, token=token, timeout=timeout)
      File "/home/piper/projects/py-evm/p2p/cancellable.py", line 41, in wait_first
        return await token_chain.cancellable_wait(*awaitables, timeout=timeout)
      File "/home/piper/python-environments/evm/lib/python3.6/site-packages/cancel_token/token.py", line 150, in cancellable_wait
        "Cancellation requested by {} token".format(self.triggered_token)
    cancel_token.exceptions.OperationCancelled: Cancellation requested by ResponseCandidateStream:LESPeer:PeerPool:LightServer:MainnetLightNode token
    