etcd failure: recovering backend from snapshot error: failed to find database snapshot file
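1. Problem Description
The server went down unexpectedly and k8s would not start. The kubelet log reported node "master" not found, and the etcd log showed the errors below.
etcd cannot find the database snapshot it needs at startup, so its on-disk data was evidently damaged by the crash.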
[root@master01 ~]# journalctl -u etcd -f
...
Oct 08 19:13:40 master01 etcd[66468]: recovering backend from snapshot error: failed to find database snapshot file (snap: snapshot file doesn't exist)
Oct 08 19:13:40 master01 etcd[66468]: panic: recovering backend from snapshot error: failed to find database snapshot file (snap: snapshot file doesn't exist)
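Before changing anything, it can help to see what is actually left in the data directory. The following is a minimal sketch, assuming the usual etcd v3 layout of `<data-dir>/member/snap` and `<data-dir>/member/wal`; the function name `inspect_etcd_data_dir` is made up for illustration.

```shell
#!/usr/bin/env bash
# Summarize an etcd data directory: is the db file present and non-empty,
# and how many raft snapshot (.snap) and WAL (.wal) files exist?
# Assumed layout: <data-dir>/member/snap and <data-dir>/member/wal.
inspect_etcd_data_dir() {
  local data_dir="$1"
  local snap_dir="$data_dir/member/snap"
  local wal_dir="$data_dir/member/wal"
  local db="missing" snaps wals
  if [ -s "$snap_dir/db" ]; then
    db="present"
  fi
  # A missing directory simply yields a count of 0.
  snaps=$(find "$snap_dir" -maxdepth 1 -type f -name '*.snap' 2>/dev/null | wc -l)
  wals=$(find "$wal_dir" -maxdepth 1 -type f -name '*.wal' 2>/dev/null | wc -l)
  echo "db=$db snap_files=$((snaps)) wal_files=$((wals))"
}

# Usage (the path is the data directory used in this article):
#   inspect_etcd_data_dir /var/lib/etcd
```

Running it on each of the three nodes gives a quick side-by-side view of which members have lost files.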
2. Solution
Further inspection showed that etcd1 and etcd2 both report the error above, while etcd3 does not.
2.1 Backup
Run on etcd1 and etcd2:
mv /var/lib/etcd/member /opt/
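If `/opt/member` already exists from an earlier attempt, a plain `mv` would nest the new copy inside it. A small sketch of a timestamped backup avoids that; `backup_member_dir` and the backup location are hypothetical names, not part of the original procedure.

```shell
#!/usr/bin/env bash
# Move <data-dir>/member to <backup-root>/etcd-member-<timestamp> and
# print the backup path. Fails if there is nothing to back up or if the
# target already exists.
backup_member_dir() {
  local data_dir="$1" backup_root="$2"
  local src="$data_dir/member"
  local dest
  dest="$backup_root/etcd-member-$(date +%Y%m%d-%H%M%S)"
  if [ ! -d "$src" ]; then
    echo "no member directory at $src" >&2
    return 1
  fi
  mkdir -p "$backup_root" || return 1
  if [ -e "$dest" ]; then
    echo "backup target $dest already exists" >&2
    return 1
  fi
  mv "$src" "$dest" && echo "$dest"
}

# Usage:
#   backup_member_dir /var/lib/etcd /opt
```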
2.2 Copy the data
Copy the healthy data from etcd3 to etcd1 and etcd2. The member path is a directory, so scp needs -r:
scp -r /var/lib/etcd/member 10.0.0.87:/var/lib/etcd/
scp -r /var/lib/etcd/member 10.0.0.97:/var/lib/etcd/
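To confirm that the copy arrived intact, one option is to compute a single digest over the whole directory on both sides and compare the two values. This is only a sketch: `dir_digest` is a hypothetical helper, it assumes GNU coreutils and findutils (`sha256sum`, `sort -z`), and the comparison is only meaningful if etcd is not writing to the source directory while it is being copied.

```shell
#!/usr/bin/env bash
# Print one SHA-256 digest that covers every file under a directory.
# File names are sorted so the result does not depend on directory order.
dir_digest() {
  local dir="$1"
  (
    cd "$dir" &&
      find . -type f -print0 |
      LC_ALL=C sort -z |
      xargs -0 sha256sum |
      sha256sum |
      cut -d' ' -f1
  )
}

# Usage: run on the source and on the destination, then compare the output.
#   dir_digest /var/lib/etcd/member
```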
2.3 Start etcd1 and etcd2
[root@master01 etcd]# systemctl start etcd
[root@master01 etcd]# systemctl status etcd
● etcd.service - Etcd Service
Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; vendor preset: disabled)
Active: active (running) since Sun 2023-10-08 19:24:37 CST; 27min ago
Docs: https://coreos.com/etcd/docs/latest/
Main PID: 71124 (etcd)
Tasks: 11
Memory: 84.5M
CGroup: /system.slice/etcd.service
└─71124 /usr/local/bin/etcd --config-file=/etc/etcd/etcd.config.yml
-------------------------------------
[root@master02 etcd]# systemctl start etcd
[root@master02 etcd]# systemctl status etcd
● etcd.service - Etcd Service
Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; vendor preset: disabled)
Active: active (running) since Sun 2023-10-08 19:24:37 CST; 27min ago
Docs: https://coreos.com/etcd/docs/latest/
Main PID: 4643 (etcd)
Tasks: 12
Memory: 72.2M
CGroup: /system.slice/etcd.service
└─4643 /usr/local/bin/etcd --config-file=/etc/etcd/etcd.config.yml
2.4 Start etcd3
However, starting etcd3 failed:
[root@master03 etcd]# systemctl start etcd
Job for etcd.service failed because the control process exited with error code. See "systemctl status etcd.service" and "journalctl -xe" for details.
Since etcd1 and etcd2 are now healthy, the data on etcd3 can be deleted so that it resynchronizes from the other two members.
[root@master03 etcd]# pwd
/var/lib/etcd
[root@master03 etcd]# rm -rf ./*
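`rm -rf ./*` depends entirely on the current working directory and cannot be undone. As a more cautious variant, the contents can be moved aside instead of deleted. The sketch below assumes GNU `mv` (for `-t`) and a destination outside the data directory; `set_aside_data_dir` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Empty an etcd data directory by moving its contents (dotfiles included)
# to <aside-root>/etcd-data-<timestamp>. Prints the destination path.
set_aside_data_dir() {
  local data_dir="$1" aside_root="$2"
  local dest
  # Refuse obviously dangerous targets.
  case "$data_dir" in
    ""|"/"|".")
      echo "refusing to touch '$data_dir'" >&2
      return 1
      ;;
  esac
  if [ ! -d "$data_dir" ]; then
    echo "$data_dir is not a directory" >&2
    return 1
  fi
  dest="$aside_root/etcd-data-$(date +%Y%m%d-%H%M%S)"
  mkdir -p "$dest" || return 1
  # Move every top-level entry out of the data directory.
  find "$data_dir" -mindepth 1 -maxdepth 1 -exec mv -t "$dest" {} + && echo "$dest"
}

# Usage:
#   set_aside_data_dir /var/lib/etcd /opt
```

Once the cluster has been verified, the set-aside copy can be removed by hand.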
# Verify
[root@master03 etcd]# journalctl -u etcd -f
-- Logs begin at Sat 2023-10-07 20:57:40 CST. --
Oct 08 19:54:41 master03 etcd[12512]: store.index: compact 2706
Oct 08 19:54:41 master03 etcd[12512]: finished scheduled compaction at 2706 (took 1.3901ms)
Oct 08 19:54:41 master03 etcd[12512]: store.index: compact 3302
Oct 08 19:54:41 master03 etcd[12512]: finished scheduled compaction at 3302 (took 2.115ms)
Oct 08 19:54:41 master03 etcd[12512]: published {Name:master03 ClientURLs:[https://10.0.0.107:2379]} to cluster 514c88d14c1a2aa1
Oct 08 19:54:41 master03 etcd[12512]: ready to serve client requests
Oct 08 19:54:41 master03 etcd[12512]: ready to serve client requests
Oct 08 19:54:41 master03 etcd[12512]: serving insecure client requests on 127.0.0.1:2379, this is strongly discouraged!
Oct 08 19:54:41 master03 systemd[1]: Started Etcd Service.
Oct 08 19:54:41 master03 etcd[12512]: serving client requests on 10.0.0.107:2379
Oct 08 19:54:42 master03 etcd[12512]: updated the cluster version from 3.0 to 3.4
Oct 08 19:54:42 master03 etcd[12512]: enabled capabilities for version 3.4
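Reading the journal by eye works, but the same check can be scripted. The sketch below scans log text on standard input and reports whichever of the two messages seen in this article appeared last: the startup panic or "ready to serve client requests". The function name `etcd_last_state` is an assumption for illustration.

```shell
#!/usr/bin/env bash
# Read etcd log lines from stdin and print the most recent state:
#   ready   - last relevant line was "ready to serve client requests"
#   panic   - last relevant line contained "panic:"
#   unknown - neither message was seen
# Exit status is 0 only for "ready".
etcd_last_state() {
  awk '
    /panic:/                         { state = "panic" }
    /ready to serve client requests/ { state = "ready" }
    END {
      print (state ? state : "unknown")
      exit (state == "ready" ? 0 : 1)
    }
  '
}

# Usage:
#   journalctl -u etcd --since "10 minutes ago" --no-pager | etcd_last_state
```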
[root@master03 etcd]# systemctl status etcd
● etcd.service - Etcd Service
Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; vendor preset: disabled)
Active: active (running) since Sun 2023-10-08 19:54:41 CST; 2min 41s ago
Docs: https://coreos.com/etcd/docs/latest/
Main PID: 12512 (etcd)
Tasks: 10
Memory: 58.7M
CGroup: /system.slice/etcd.service
└─12512 /usr/local/bin/etcd --config-file=/etc/etcd/etcd.config.yml
3. Verify the Data
3.1 Verify the etcd cluster
[root@master01 init]# export ETCDCTL_API=3
[root@master01 init]# etcdctl --endpoints="10.0.0.87:2379,10.0.0.97:2379,10.0.0.107:2379" --cacert=/etc/kubernetes/pki/etcd/etcd-ca.pem --cert=/etc/kubernetes/pki/etcd/etcd.pem --key=/etc/kubernetes/pki/etcd/etcd-key.pem endpoint status --write-out=table
+-----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT     |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 10.0.0.87:2379  | 949b9ccaa465bea8 |  3.4.13 |  3.9 MB |      true |      false |        14 |       5239 |               5239 |        |
| 10.0.0.97:2379  | 795272eff6c8418e |  3.4.13 |  3.8 MB |     false |      false |        14 |       5239 |               5239 |        |
| 10.0.0.107:2379 | 41172b80a9c89e7f |  3.4.13 |  3.9 MB |     false |      false |        14 |       5239 |               5239 |        |
+-----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
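What matters in this table is that exactly one member is the leader and that the raft indexes agree. A rough sketch of that check, parsing the `--write-out=table` output from standard input, follows. The column positions are taken from the header shown here (etcdctl 3.4.13) and may differ in other versions; `check_endpoint_status` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Parse `etcdctl endpoint status --write-out=table` from stdin.
# Prints the member count, leader count, and the spread between the
# highest and lowest RAFT INDEX. Exit status is 0 only if at least one
# member is listed and exactly one of them is the leader.
check_endpoint_status() {
  awk -F'|' '
    index($0, "|") == 1 && $2 !~ /ENDPOINT/ {
      # Trim the padding around every cell.
      for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
      members++
      if ($6 == "true") leaders++          # column 6: IS LEADER
      idx = $9 + 0                         # column 9: RAFT INDEX
      if (members == 1 || idx < min) min = idx
      if (members == 1 || idx > max) max = idx
    }
    END {
      printf "members=%d leaders=%d raft_index_spread=%d\n", members, leaders, max - min
      exit (members > 0 && leaders == 1 ? 0 : 1)
    }
  '
}

# Usage: pipe the endpoint status command shown above into the function.
#   etcdctl ... endpoint status --write-out=table | check_endpoint_status
```

A small non-zero spread on a busy cluster is normal, because the members are queried one after another; a spread that keeps growing would suggest a member is not catching up.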
3.2 Verify the k8s cluster
[root@master01 ~]# kubectl get node
NAME       STATUS   ROLES    AGE   VERSION
master01   Ready    <none>   35m   v1.20.0
master02   Ready    <none>   35m   v1.20.0
master03   Ready    <none>   35m   v1.20.0
node02     Ready    <none>   35m   v1.20.0
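On a larger cluster it is easier to list only the nodes that are not Ready. A minimal sketch that filters the `kubectl get node` output from standard input is shown below; `not_ready_nodes` is a hypothetical name, and a status such as `Ready,SchedulingDisabled` is treated as Ready.

```shell
#!/usr/bin/env bash
# Read `kubectl get node` output from stdin and print the name of every
# node whose STATUS is not Ready. Exit status is 1 if any were found.
not_ready_nodes() {
  awk '
    $1 == "NAME" { next }                       # skip the header row
    NF >= 2 && $2 !~ /^Ready(,|$)/ { print $1; bad = 1 }
    END { exit (bad ? 1 : 0) }
  '
}

# Usage:
#   kubectl get node | not_ready_nodes && echo "all nodes Ready"
```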
[root@master01 dashboard]# kubectl get pod -A
NAMESPACE              NAME                                         READY   STATUS    RESTARTS   AGE
kube-system            calico-kube-controllers-5f6d4b864b-jf7sl     1/1     Running   1          50m
kube-system            calico-node-5vkdg                            1/1     Running   2          50m
kube-system            calico-node-k4jtq                            1/1     Running   1          50m
kube-system            calico-node-l27hd                            1/1     Running   1          50m
kube-system            calico-node-vt9jf                            1/1     Running   2          50m
kube-system            calico-node-w7x9b                            1/1     Running   1          50m
kube-system            coredns-867d46bfc6-6rx4r                     1/1     Running   2          38m
kube-system            metrics-server-595f65d8d5-wvp8s              1/1     Running   1          36m
kubernetes-dashboard   dashboard-metrics-scraper-79c5968bdc-ghmzr   1/1     Running   2          31m
kubernetes-dashboard   kubernetes-dashboard-9f9799597-vqzmq         1/1     Running   1          24m