weixin_39645268
2020-12-29 09:17

Fix: all traffic ingress rule triggers fatal nil dereference

What this PR does / why we need it:

This change does two things:

  1. Handles ingress rules that don't have a "port" notion. Counterintuitively, ICMP and ICMPv6 traffic does have a "port" notion in AWS-land.
  2. Fixes a bug in deleting ingress rules that have no description, which AWS considers distinct from rules with an empty description (a rough sketch of both hazards follows this list).
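
For context, here is a minimal sketch of the two hazards, written against the aws-sdk-go v1 types this provider builds on; the helper names are illustrative only, not the actual diff:

```go
package example

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// portsEqual compares the port range of two permissions. "All traffic"
// rules (IpProtocol "-1") carry no FromPort/ToPort at all, so the pointers
// must not be dereferenced blindly; aws.Int64Value maps nil to 0.
func portsEqual(a, b *ec2.IpPermission) bool {
	return aws.Int64Value(a.FromPort) == aws.Int64Value(b.FromPort) &&
		aws.Int64Value(a.ToPort) == aws.Int64Value(b.ToPort)
}

// sameDescription treats a nil Description and an empty one as equivalent,
// since AWS stores them as distinct values and that difference broke deletes.
func sameDescription(a, b *ec2.IpRange) bool {
	return aws.StringValue(a.Description) == aws.StringValue(b.Description)
}
```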

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):

Special notes for your reviewer:

These patches fix the crasher, but the story isn't yet fully told. We found ourselves in this situation after spinning up a Service of Type: LoadBalancer, which caused the cloud provider to helpfully add a rule to the nodes' security group for "all traffic" coming from the ELB.

This change will enable the management tooling to handle that new rule, decide it's not spec'd, and clean it up properly. At which point, I expect, the cloud provider will put it back. We're still investigating what to do about that tug of war, but I wanted to open the PR for collaboration.

Release note:

```release-note
prevent fatal crash when handling "all traffic" rules
```

This question comes from the open source project: kubernetes-sigs/cluster-api-provider-aws


10 replies

  • weixin_39878401 4 months ago

    Hi sethp-nr. Thanks for your PR.

    I'm waiting for a kubernetes-sigs or kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

    Once the patch is verified, the new status will be reflected by the ok-to-test label.

    I understand the commands that are listed here.

    Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
  • weixin_39878401 4 months ago

    [APPROVALNOTIFIER] This PR is APPROVED

    This pull-request has been approved by: sethp-nr, vincepri

    The full list of commands accepted by this bot can be found here.

    The pull request process is described here

    Needs approval from an approver in each of these files:

    - ~~[OWNERS](https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/master/OWNERS)~~ [vincepri]

    Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.
  • weixin_39645268 4 months ago

    I've been ruminating on what to do about the trouble with type: LoadBalancer that caused this crasher and that this fix would break. So far my ideas fall into one of two camps:

    1. Be more permissive with ingress rules. In my mind, that breaks down into either going "add-only" and not worrying about removing rules (outside of deletion), or trying to identify the LoadBalancer rules and either slurping them back into the Spec or ignoring them entirely (a rough sketch of the "ignore" variant follows this list).
    2. Separate the security group(s) owned by cluster-api from those owned by the cluster itself: remove the "owned" tag from the security group spun up by cluster-api. The tricky bit then becomes the cloud provider config: either we create (but don't update) a security group for the rules to be added to, or maybe we pre-configure something like: https://medium.com//kubernetes-and-aws-elb-what-to-do-when-you-reach-the-security-group-limit-in-aws-45207e423e
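
    To make the "ignore them entirely" variant of option 1 concrete, here's a rough, hypothetical sketch (filterELBInjected and elbGroupIDs are made-up names, not code in this repo): skip ingress permissions whose only source is an ELB-owned security group, so the reconciler neither deletes nor re-adds them.

    ```go
    package example

    import (
        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/service/ec2"
    )

    // filterELBInjected is purely illustrative: it drops permissions whose
    // source security groups are all ELB-owned (elbGroupIDs is a hypothetical
    // set of group IDs the cloud provider created for Services of type
    // LoadBalancer), leaving only the rules the reconciler should manage.
    func filterELBInjected(perms []*ec2.IpPermission, elbGroupIDs map[string]bool) []*ec2.IpPermission {
        var keep []*ec2.IpPermission
        for _, p := range perms {
            injected := len(p.UserIdGroupPairs) > 0
            for _, pair := range p.UserIdGroupPairs {
                if !elbGroupIDs[aws.StringValue(pair.GroupId)] {
                    injected = false
                    break
                }
            }
            if !injected {
                keep = append(keep, p)
            }
        }
        return keep
    }
    ```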

    What do you think?

  • weixin_39645268 4 months ago

    I'd also like to know your preference on trying to fix that issue in this PR vs. fixing the crasher and opening a new issue/PR for the LoadBalancer thing.

  • weixin_39837139 4 months ago

    > ICMP requires a port range (though I'm mystified what that could map to), and for ICMPv6 "ports" are apparently overloaded to restrict certain types of messages?

    For both ICMP and ICMPv6, from/to port will actually correspond to the ICMP types allowed.
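
    As a tiny illustration (aws-sdk-go v1 types, made-up CIDR): for ICMP, FromPort is reused as the ICMP type and ToPort as the ICMP code, with -1 meaning "any".

    ```go
    package example

    import (
        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/service/ec2"
    )

    // allowPing permits only ICMP echo requests (type 8, any code) from a CIDR;
    // FromPort/ToPort are not TCP/UDP ports here but ICMP type and code.
    var allowPing = &ec2.IpPermission{
        IpProtocol: aws.String("icmp"),
        FromPort:   aws.Int64(8),  // ICMP type 8 = echo request
        ToPort:     aws.Int64(-1), // ICMP code -1 = any
        IpRanges:   []*ec2.IpRange{{CidrIp: aws.String("10.0.0.0/16")}},
    }
    ```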

    > I've been ruminating on what to do about the trouble with type: LoadBalancer that caused this crasher and that this fix would break.

    Hmm, given AWS still hasn't updated the limit of 5 security groups per network interface (raised to 10 by request), I guess we need to coalesce our rules into the smallest number of SGs possible, so as to maximise the number of possible LB attachments.

    So, 1 security group per node, regardless of type, and the merger of control plane & worker node rules for the control plane instance.

    This only helps us in the case that the cloud provider doesn't try to inject rules into an SG managed by Cluster API. This relates to https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/608#issuecomment-467900867, where it was found that the provider doesn't set k8s.io/cluster-provider-aws/managed=true and only checks for the cluster tag. The cloud provider can't tell the difference between resources Cluster API has provisioned and resources that belong to the cloud provider itself.

    Feels like if we do something in this repo, it's a hack, but getting changes made to the provider may take time.

  • weixin_39645268 4 months ago

    > I guess we need to coalesce our rules into the smallest number of SGs possible, so as to maximise the number of possible LB attachments.

    Yeah, that 5 limit is a lot lower than I'd like. I guess my question is whether "smallest number" of SGs in this case is one or two.

    > The cloud provider can't tell the difference between resources Cluster API has provisioned and resources that belong to the cloud provider itself.

    Cluster-api tells the cloud provider it owns the resource: https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/master/pkg/cloud/aws/tags/tags.go#L94

    It seems like either CAPA should allow the provider-injected rules to exist in the shared group, or it should create a separate group for the cloud provider to inject its rules into. I'm presently leaning toward the latter for its simplicity, but that does mean the carrying cost of a CAPA cluster is 2/5 security group slots.

  • weixin_39837139 4 months ago

    > It seems like either CAPA should allow the provider-injected rules to exist in the shared group, or it should create a separate group for the cloud provider to inject its rules into. I'm presently leaning toward the latter for its simplicity, but that does mean the carrying cost of a CAPA cluster is 2/5 security group slots.

    I don't think that was ever our intent, though maybe we can clarify. We're tagging that cluster-api-provider-aws is managing the resource (sigs.k8s.io/cluster-api-provider-aws/managed=true), and we are also tagging everything that is managed by the provider with the cluster ID (kubernetes.io/cluster/blah=owned) because we use it for reconciliation. Unfortunately, the cloud provider only checks for the cluster ID tag kubernetes.io/cluster/blah=owned.

    Either we make the AWS cloud provider aware of tooling that uses kubernetes.io/cluster/blah=owned on resources it is not managing, or we may ultimately have to stop using kubernetes.io/cluster/blah=owned if we expect further clashes with the cloud provider.
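
    To spell out the mismatch (the helper names are made up, and the tag keys follow this thread rather than the actual tags package):

    ```go
    package example

    // capaTags approximates the tags CAPA puts on the security groups it creates.
    func capaTags(clusterName string) map[string]string {
        return map[string]string{
            "sigs.k8s.io/cluster-api-provider-aws/managed": "true",
            "kubernetes.io/cluster/" + clusterName:         "owned",
        }
    }

    // cloudProviderClaims mirrors what the AWS cloud provider checks today: only
    // the cluster tag, so CAPA-managed groups look identical to its own.
    func cloudProviderClaims(tags map[string]string, clusterName string) bool {
        return tags["kubernetes.io/cluster/"+clusterName] == "owned"
    }
    ```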

  • weixin_39645268 4 months ago

    /retest

  • weixin_39837139 4 months ago

    lgtm from me

  • weixin_39789327 4 months ago

    /lgtm
