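Today I set up six CentOS 6.5 machines as three pairs, each pair a MySQL + keepalived cluster using master-master replication with both keepalived nodes in BACKUP state. The keepalived.conf on the primary, 107, is as follows: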
! Configuration File for keepalived
global_defs {
notification_email {
acassen@firewall.loc
failover@firewall.loc
sysadmin@firewall.loc
}
notification_email_from Alexandre.Cassen@firewall.loc
smtp_server 127.0.0.1
smtp_connect_timeout 30
router_id mysql_ha
vrrp_skip_check_adv_addr
vrrp_strict
vrrp_garp_interval 0
vrrp_gna_interval 0
}
vrrp_instance VI_1 {
state BACKUP
interface eth0
virtual_router_id 117
priority 100
advert_int 1
nopreempt
authentication {
auth_type PASS
auth_pass 1111
}
virtual_ipaddress {
10.40.6.117
}
}
virtual_server 10.40.6.117 3306 {
delay_loop 2
#lb_algo wrr
#lb_kind DR
persistence_timeout 60
protocol TCP
real_server 10.40.6.107 3306 {
weight 3
notify_down /usr/local/etc/keepalived/mysql.sh
TCP_CHECK {
connect_timeout 3
nb_get_retry 3
delay_before_retry 3
connect_port 3306
}
}
}
The keepalived.conf on the standby, 108, is as follows:
! Configuration File for keepalived
global_defs {
notification_email {
acassen@firewall.loc
failover@firewall.loc
sysadmin@firewall.loc
}
notification_email_from Alexandre.Cassen@firewall.loc
smtp_server 127.0.0.1
smtp_connect_timeout 30
router_id mysql_ha
vrrp_skip_check_adv_addr
vrrp_strict
vrrp_garp_interval 0
vrrp_gna_interval 0
}
vrrp_instance VI_1 {
state BACKUP
interface eth0
virtual_router_id 117
priority 90
advert_int 1
authentication {
auth_type PASS
auth_pass 1111
}
virtual_ipaddress {
10.40.6.117
}
}
virtual_server 10.40.6.117 3306 {
delay_loop 2
#lb_algo wrr
#lb_kind DR
persistence_timeout 60
protocol TCP
real_server 10.40.6.108 3306 {
weight 3
notify_down /usr/local/etc/keepalived/mysql.sh
TCP_CHECK {
connect_timeout 3
nb_get_retry 3
delay_before_retry 3
connect_port 3306
}
}
}
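One of the pairs behaves strangely. The real IPs are 107 and 108, and the VIP is 117. Once MySQL and keepalived are both up, everything looks normal and 107 holds the 117 virtual IP. I then tested failover by stopping MySQL on 107. In theory, when the health check on port 3306 fails, keepalived should run my mysql.sh script, which simply does pkill keepalived so that the standby takes over the VIP. In practice the 117 VIP never moves to the standby and stays on the primary. The messages log keeps repeating these errors: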
Jul 12 16:09:12 hs-10-40-6-107 Keepalived_healthcheckers[9204]: TCP connection to [10.40.6.107]:3306 failed.
Jul 12 16:09:15 hs-10-40-6-107 Keepalived_healthcheckers[9204]: TCP connection to [10.40.6.107]:3306 failed.
Jul 12 16:09:15 hs-10-40-6-107 Keepalived_healthcheckers[9204]: Check on service [10.40.6.107]:3306 failed after 1 retry.
Jul 12 16:09:15 hs-10-40-6-107 Keepalived_healthcheckers[9204]: Removing service [10.40.6.107]:3306 from VS [10.40.6.117]:3306
Jul 12 16:09:15 hs-10-40-6-107 Keepalived_healthcheckers[9204]: IPVS: Service not defined
Jul 12 16:09:15 hs-10-40-6-107 Keepalived_healthcheckers[9204]: SMTP connection ERROR to [127.0.0.1]:25.
Jul 12 16:09:17 hs-10-40-6-107 Keepalived_healthcheckers[9204]: TCP connection to [10.40.6.107]:3306 failed.
Jul 12 16:09:20 hs-10-40-6-107 Keepalived_healthcheckers[9204]: TCP connection to [10.40.6.107]:3306 failed.
Jul 12 16:09:20 hs-10-40-6-107 Keepalived_healthcheckers[9204]: Check on service [10.40.6.107]:3306 failed after 1 retry.
Jul 12 16:09:20 hs-10-40-6-107 Keepalived_healthcheckers[9204]: Removing service [10.40.6.107]:3306 from VS [10.40.6.117]:3306
Jul 12 16:09:20 hs-10-40-6-107 Keepalived_healthcheckers[9204]: IPVS: Service not defined
Jul 12 16:09:20 hs-10-40-6-107 Keepalived_healthcheckers[9204]: SMTP connection ERROR to [127.0.0.1]:25.
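The notify_down script is only described above, not shown. A minimal sketch of what such a script might look like follows, with a log line added so it is easy to tell whether keepalived ever invoked it. The log path, the STOP_CMD variable, and the function name are assumptions made for illustration; this is not the actual mysql.sh.

```shell
#!/bin/sh
# Hypothetical sketch of /usr/local/etc/keepalived/mysql.sh. The real script
# is not shown in the post; it is only described as running "pkill keepalived".

# Assumed log path and stop command; both can be overridden from the environment.
LOG_FILE="${LOG_FILE:-/var/log/keepalived_notify_down.log}"
STOP_CMD="${STOP_CMD:-pkill keepalived}"

notify_down_handler() {
    # Record the invocation first, so the log shows whether keepalived ran us at all.
    echo "$(date '+%Y-%m-%d %H:%M:%S') notify_down fired on $(uname -n)" >> "$LOG_FILE"
    # Stop keepalived so the peer node can take over the VIP.
    $STOP_CMD
}

# Uncomment when installing this as the notify_down script:
# notify_down_handler
```

If the log file stays empty after MySQL is stopped, the script was never called, which would match the missing "Executing" line in the log on 107.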
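I then restored all services and tested failover from 108 to 107, which works fine: after MySQL on 108 is stopped, the notify_down script runs and kills the keepalived process, so VIP 117, previously held by 108, is taken over by 107. The system log on 108 is as follows: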
Jul 12 14:18:40 hs-10-40-6-108 Keepalived_healthcheckers[6258]: TCP connection to [10.40.6.108]:3306 failed.
Jul 12 14:18:43 hs-10-40-6-108 Keepalived_healthcheckers[6258]: TCP connection to [10.40.6.108]:3306 failed.
Jul 12 14:18:43 hs-10-40-6-108 Keepalived_healthcheckers[6258]: Check on service [10.40.6.108]:3306 failed after 1 retry.
Jul 12 14:18:43 hs-10-40-6-108 Keepalived_healthcheckers[6258]: Removing service [10.40.6.108]:3306 from VS [10.40.6.117]:3306
Jul 12 14:18:43 hs-10-40-6-108 Keepalived_healthcheckers[6258]: IPVS: No such destination
Jul 12 14:18:43 hs-10-40-6-108 Keepalived_healthcheckers[6258]: Executing [/usr/local/etc/keepalived/mysql.sh] for service [10.40.6.108]:3306 in VS [10.40.6.117]:3306
Jul 12 14:18:43 hs-10-40-6-108 Keepalived_healthcheckers[6258]: Lost quorum 1-0=1 > 0 for VS [10.40.6.117]:3306
Jul 12 14:18:43 hs-10-40-6-108 Keepalived_healthcheckers[6258]: SMTP connection ERROR to [127.0.0.1]:25.
Jul 12 14:18:43 hs-10-40-6-108 Keepalived_vrrp[6259]: VRRP_Instance(VI_1) sent 0 priority
Jul 12 14:18:43 hs-10-40-6-108 Keepalived[6257]: Stopping
Jul 12 14:18:43 hs-10-40-6-108 Keepalived_vrrp[6259]: VRRP_Instance(VI_1) removing protocol VIPs.
Jul 12 14:18:43 hs-10-40-6-108 Keepalived_healthcheckers[6258]: Netlink reflector reports IP 10.40.6.117 removed
Jul 12 14:18:43 hs-10-40-6-108 Keepalived_healthcheckers[6258]: IPVS: No such file or directory
Jul 12 14:18:43 hs-10-40-6-108 Keepalived_healthcheckers[6258]: Stopped
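The visible difference between the two logs is that 108 logs an "Executing [...]" line after "Removing service", while 107 only repeats the removal. A small sketch for comparing the two counts on each node is given below; the function names are made up for illustration, and it assumes the syslog text is supplied on stdin (for example from /var/log/messages).

```shell
# Count how many times keepalived logged that it ran a notify script.
# Reads syslog text on stdin and prints the number of matching lines.
count_notify_runs() {
    # grep -c exits non-zero when the count is 0, so mask the exit status.
    grep -c 'Keepalived_healthcheckers.*Executing \[' || true
}

# Count how many times the health checker tried to remove a real server.
count_removals() {
    grep -c 'Keepalived_healthcheckers.*Removing service' || true
}
```

Usage would be something like `count_notify_runs < /var/log/messages` and `count_removals < /var/log/messages`. On the excerpts above, 107 shows two removals and no script executions, while 108 shows one of each.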
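I installed six machines today, and only this pair has a problem when failing over from primary to standby: the notify_down script never runs and the errors shown in the 107 log keep appearing. Does anyone know the cause?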