pengdott 2024-11-07 15:01 · Acceptance rate: 75%
410 views
Closed

Calico keeps failing during binary Kubernetes cluster deployment

1. Following the official Calico docs, the installation went like this:

(1) Downloaded tigera-operator.yaml and custom-resources.yaml to the server.

(2) Ran kubectl create -f tigera-operator.yaml; no errors during installation.

(3) Edited custom-resources.yaml and added the configuration below, after confirming every node's NIC is ens33 (a fuller sketch of the file appears after these steps):

```yaml
nodeAddressAutodetectionV4:
  interface: ens33
```

(4) Ran kubectl create -f custom-resources.yaml; no errors during installation.

(5) Checked the Calico pods; all were in the Running state:

```
NAME                                       READY   STATUS    RESTARTS   AGE
calico-kube-controllers-6696b5fc97-hlb84   1/1     Running   0          2m33s
calico-node-28flc                          1/1     Running   0          2m34s
calico-node-p9tcg                          1/1     Running   0          2m34s
calico-typha-9f54f8447-sgpnl               1/1     Running   0          2m34s
csi-node-driver-67h28                      2/2     Running   0          2m33s
csi-node-driver-wgwvs                      2/2     Running   0          2m33s
```

(6) Ran:

```
kubectl taint nodes --all node-role.kubernetes.io/control-plane-
```

The result was:

```
taint "node-role.kubernetes.io/control-plane" not found
taint "node-role.kubernetes.io/control-plane" not found
```

Checking afterwards, the pod status was still normal.
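For reference, a minimal sketch of what the edited custom-resources.yaml from step (3) would look like. Everything other than the interface line follows the sample manifest's defaults; the pool CIDR in particular is an assumption and must match your cluster plan:

```yaml
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    # the autodetection setting sits under spec.calicoNetwork
    nodeAddressAutodetectionV4:
      interface: ens33
    ipPools:
      - blockSize: 26
        cidr: 192.168.0.0/16       # sample default; must not overlap the node network
        encapsulation: VXLANCrossSubnet
        natOutgoing: Enabled
        nodeSelector: all()
```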

2. Describing the Calico pods, the events always show errors:

```
Normal   Pulling    27m   kubelet            Pulling image "docker.io/calico/pod2daemon-flexvol:v3.28.2"
Normal   Scheduled  27m   default-scheduler  Successfully assigned calico-system/calico-node-rhrj9 to dk8s-work1
Normal   Pulled     27m   kubelet            Successfully pulled image "docker.io/calico/pod2daemon-flexvol:v3.28.2" in 10.195808051s
Normal   Created    27m   kubelet            Created container flexvol-driver
Normal   Started    27m   kubelet            Started container flexvol-driver
Normal   Pulling    27m   kubelet            Pulling image "docker.io/calico/cni:v3.28.2"
Normal   Started    26m   kubelet            Started container install-cni
Normal   Pulled     26m   kubelet            Successfully pulled image "docker.io/calico/cni:v3.28.2" in 51.718175648s
Normal   Created    26m   kubelet            Created container install-cni
Normal   Pulling    26m   kubelet            Pulling image "docker.io/calico/node:v3.28.2"
Normal   Pulled     25m   kubelet            Successfully pulled image "docker.io/calico/node:v3.28.2" in 50.979018706s
Normal   Created    25m   kubelet            Created container calico-node
Normal   Started    25m   kubelet            Started container calico-node
Warning  Unhealthy  25m   kubelet            Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused
Warning  Unhealthy  25m   kubelet            Readiness probe failed: 2024-11-06 07:59:09.401 [INFO][231] confd/health.go 202: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 192.168.100.186
```

Or:

```
Warning  Unhealthy  42m   kubelet            Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/bird/bird.ctl: connect: no such file or directory
```


Additional notes:
The same error occurs even when custom-resources.yaml does not contain the nodeAddressAutodetectionV4: interface: ens33 setting.

Checking the ports on a worker node, it looks as though a connection has already been established:

```
netstat -ltunp | grep 179
tcp        0      0 0.0.0.0:179             0.0.0.0:*               LISTEN      12607/bird

lsof -i:179
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
bird    12607 root    7u  IPv4  68970      0t0  TCP *:bgp (LISTEN)
bird    12607 root    8u  IPv4  76064      0t0  TCP dk8s-work1:bgp->dk8s-work2:55595 (ESTABLISHED)
```

The file /var/run/bird/bird.ctl exists right after Calico is deployed, but is gone after the node is rebooted.

According to ip link there are no stray virtual NICs with a br prefix, and deleting some of the interfaces in state DOWN did not help either; the BIRD error persists. (The commands used are sketched below.)
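For reference, the inspection and cleanup described above was roughly along these lines (the interface name in the delete command is illustrative):

```sh
# list all interfaces; no stray br* bridges were present
ip link show

# show only interfaces that are down
ip link show | grep 'state DOWN'

# delete a stale virtual interface (example name; never delete a physical NIC)
ip link delete cali0abc123def4
```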
Version information:
OS: CentOS 7
Calico: v3.28.2
Kubernetes: v1.21.10

I have tried several fixes found online (such as adding the nodeAddressAutodetectionV4 interface setting), none of which worked. How should I handle this error?

Additional information:

1. The IP address is correctly configured on ens33 on each node, with no duplicates in the cluster:

```
[root@dk8s-work1 ~]# ip addr show ens33
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:0c:29:1b:c2:fb brd ff:ff:ff:ff:ff:ff
    inet 192.168.100.185/24 brd 192.168.100.255 scope global noprefixroute ens33
       valid_lft forever preferred_lft forever
    inet6 fe80::89d2:8fb4:d1c9:74f/64 scope link noprefixroute
       valid_lft forever preferred_lft forever

[root@dk8s-work2 ~]# ip addr show ens33
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:0c:29:7e:94:ac brd ff:ff:ff:ff:ff:ff
    inet 192.168.100.186/24 brd 192.168.100.255 scope global noprefixroute ens33
       valid_lft forever preferred_lft forever
    inet6 fe80::f470:3449:c28b:1f02/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
```
2. Inter-node communication is normal:

```
[root@dk8s-work1 ~]# ping dk8s-work2
PING dk8s-work2 (192.168.100.186) 56(84) bytes of data.
64 bytes from dk8s-work2 (192.168.100.186): icmp_seq=1 ttl=64 time=0.455 ms
64 bytes from dk8s-work2 (192.168.100.186): icmp_seq=2 ttl=64 time=0.357 ms
64 bytes from dk8s-work2 (192.168.100.186): icmp_seq=3 ttl=64 time=0.288 ms
64 bytes from dk8s-work2 (192.168.100.186): icmp_seq=4 ttl=64 time=0.298 ms
64 bytes from dk8s-work2 (192.168.100.186): icmp_seq=5 ttl=64 time=0.287 ms

[root@dk8s-work2 ~]# ping dk8s-work1
PING dk8s-work1 (192.168.100.185) 56(84) bytes of data.
64 bytes from dk8s-work1 (192.168.100.185): icmp_seq=1 ttl=64 time=0.246 ms
64 bytes from dk8s-work1 (192.168.100.185): icmp_seq=2 ttl=64 time=0.300 ms
64 bytes from dk8s-work1 (192.168.100.185): icmp_seq=3 ttl=64 time=0.303 ms
64 bytes from dk8s-work1 (192.168.100.185): icmp_seq=4 ttl=64 time=0.240 ms
```
3. Port 179 is reachable between the nodes:

```
[root@dk8s-work1 ~]# telnet 192.168.100.186 179
Trying 192.168.100.186...
Connected to 192.168.100.186.
Escape character is '^]'.
        ?    d " @xA EFG

[root@dk8s-work2 ~]# telnet 192.168.100.185 179
Trying 192.168.100.185...
Connected to 192.168.100.185.
Escape character is '^]'.
        ?    d " @xA EFG
```

4. Logs

(1) calico-node pod logs:

```
2024-11-14 01:42:12.192 [INFO][58] felix/route_table.go 1153: Failed to get interface; it's down/gone. error=Link not found ifaceName="vxlan.calico" ifaceRegex="^vxlan.calico$" ipVersion=0x4
2024-11-14 01:42:12.192 [ERROR][58] felix/route_table.go 920: Failed to get link attributes error=interface not present ifaceRegex="^vxlan.calico$" ipVersion=0x4
2024-11-14 01:42:12.240 [INFO][58] felix/vxlan_mgr.go 425: Failed to get VXLAN tunnel device, assuming it isn't present error=Link not found
```

(2) calico-kube-controllers pod logs: no errors.

5. bird cannot be started as a host service, and the birdc command is not available on the host:

```
[root@dk8s-work1 ~]# systemctl start bird
Failed to start bird.service: Unit not found.
[root@dk8s-work1 ~]# birdc
bash: birdc: command not found...
```

6. Downgrading Calico did not help; as far as I could find, the Calico version in use matches this Kubernetes version.

7. Reverting custom-resources.yaml to the default NIC auto-detection still produces the same error.


30 answers

  • 铭渊老黄 (Rising Star creator: Python) 2024-11-07 17:44
    Awarded a ¥0.30 bounty for this question
    (Answer compiled by 清水白石008 with reference to the GPT-4o mini model.)
    

    Calico's BIRD service (its BGP routing daemon) is responsible for inter-node routing in Kubernetes. When Calico reports "unable to connect to BIRDv4 socket" or fails to establish BGP sessions, the cause is usually the network configuration or how the BIRD process starts. The following steps should help:

    1. Make sure the BIRD configuration is correct

    BIRD relies on Calico's BGP settings to establish routes with the other nodes. If you set interface: ens33 in custom-resources.yaml, make sure the NIC name is correct and identical on every node:

```yaml
nodeAddressAutodetectionV4:
  interface: ens33
```

    A wrong nodeAddressAutodetectionV4 value can prevent BIRD from peering. If the problem persists, comment the setting out or fall back to automatic detection:

```yaml
nodeAddressAutodetectionV4: {}
```
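    After editing, the change can be applied and verified along these lines (a sketch; the sample manifest names the Installation resource default):

```sh
kubectl apply -f custom-resources.yaml

# confirm the operator has picked up the setting
kubectl get installation default -o yaml | grep -A 2 nodeAddressAutodetectionV4
```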
    

    2. Check BIRD and the NIC configuration

    BIRD uses TCP port 179. If connections exist but BGP neighbours still fail to come up, the BGP peering between nodes may not be set up correctly:

    • Run kubectl get nodes -o wide and confirm each node's IP address is correct.

    • Make sure the nodes can reach each other and that the firewall rules allow the BGP port (179):

```sh
sudo iptables -A INPUT -p tcp --dport 179 -j ACCEPT
```
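    Since the nodes here run CentOS 7, the rules may be managed by firewalld rather than raw iptables; in that case, a sketch assuming firewalld is active:

```sh
sudo firewall-cmd --permanent --add-port=179/tcp   # BGP
sudo firewall-cmd --reload
```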
      

    3. Make sure the BIRD socket file is healthy

    Calico communicates with the BIRD process through /var/run/calico/bird.ctl. If the socket file is missing, BIRD probably failed to start or is misconfigured. Try the following:

    1. Manually check the BIRD status in Calico

      • Exec into the calico-node container and query BIRD (the client shipped in the calico-node image is birdcl rather than birdc):

```sh
kubectl exec -it -n calico-system calico-node-<pod-name> -- /bin/bash
birdcl -s /var/run/calico/bird.ctl show protocols
```

    2. Restart the Calico pods

      • If BIRD cannot start or the socket file keeps disappearing, try restarting the calico-node DaemonSet:

```sh
kubectl rollout restart daemonset/calico-node -n calico-system
```

    3. Check the BIRD logs

      • Inspect the calico-node logs for more detail:

```sh
kubectl logs -n calico-system calico-node-<pod-name> -c calico-node
```
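    To check whether the socket the readiness probe complains about actually exists, the same checks can also be run non-interactively against a specific pod (the pod name is a placeholder):

```sh
# does the BIRD control socket exist inside the container?
kubectl exec -n calico-system calico-node-<pod-name> -c calico-node -- ls -l /var/run/calico/bird.ctl

# query BIRD through that socket
kubectl exec -n calico-system calico-node-<pod-name> -c calico-node -- birdcl -s /var/run/calico/bird.ctl show protocols
```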
        

    4. Adjust the Calico configuration

    If the problem persists, adjust the calicoNetwork section of custom-resources.yaml to make sure BGP is configured correctly, for example:

```yaml
calicoNetwork:
  bgp: Enabled
  ipPools:
    - cidr: "192.168.0.0/16"  # adjust to your network
      encapsulation: IPIP
      natOutgoing: Enabled
      nodeSelector: all()
```

    Note: if the cluster spans multiple subnets, try setting encapsulation to VXLAN, which lets traffic cross subnet boundaries.
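    Incidentally, the felix log quoted in the question already complains about a missing vxlan.calico device. A VXLAN variant of the pool would look roughly like this (a sketch; with VXLAN the BGP peering can be turned off, and the CIDR is again the sample default):

```yaml
calicoNetwork:
  bgp: Disabled              # VXLAN forwarding does not rely on BGP peering
  ipPools:
    - cidr: "192.168.0.0/16" # adjust to your cluster
      encapsulation: VXLAN
      natOutgoing: Enabled
      nodeSelector: all()
```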

    5. Make sure the K8s nodes have no unexpected taints

    The taint "node-role.kubernetes.io/control-plane" not found message from kubectl taint nodes --all node-role.kubernetes.io/control-plane- just means that taint was never set on those nodes, so the warning can be ignored. If you want to be sure the worker nodes carry no special markers, you can run:

```sh
kubectl taint nodes --all node.kubernetes.io/not-ready- || true
```
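    To see which taints are actually present before removing anything (a sketch using kubectl's custom-columns output):

```sh
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
```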
    

    6. Monitor and debug the network

    1. Use calicoctl to check the Calico status and the BGP neighbours (sample output follows this list):

```sh
calicoctl node status
```

    2. Make sure all nodes run the same Calico version and configuration; mismatched versions can prevent BGP sessions from establishing.
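    When BGP is healthy, every peer shows up as Established; the output looks roughly like this (addresses and timestamps are illustrative):

```
Calico process is running.

IPv4 BGP status
+-----------------+-------------------+-------+----------+-------------+
|  PEER ADDRESS   |     PEER TYPE     | STATE |  SINCE   |    INFO     |
+-----------------+-------------------+-------+----------+-------------+
| 192.168.100.186 | node-to-node mesh | up    | 01:42:12 | Established |
+-----------------+-------------------+-------+----------+-------------+
```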

    These steps should help you track down the Calico BIRD problem. If it still misbehaves, please update the question with more detail so we can dig further.



Question events

  • Closed by the system on Nov 15
  • Question edited on Nov 14
  • Question edited on Nov 14
  • Question created on Nov 7