etcd failure - recovering backend from snapshot error: failed to find database snapshot file

1. Problem Description

The server went down unexpectedly and the Kubernetes cluster would not come back up. The kubelet logs reported node "master" not found, and the etcd logs showed the errors below, which clearly indicate that the database files are damaged:
[root@master01 ~]# journalctl -u etcd -f
...
Oct 08 19:13:40 master01 etcd[66468]: recovering backend from snapshot error: failed to find database snapshot file (snap: snapshot file doesn't exist)
Oct 08 19:13:40 master01 etcd[66468]: panic: recovering backend from snapshot error: failed to find database snapshot file (snap: snapshot file doesn't exist)
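Before doing anything else, it is worth looking at what is actually left in the member's data directory, since the panic means etcd expects a snapshot file that is no longer on disk. A diagnostic sketch, assuming the data directory is /var/lib/etcd as in the rest of this post:

# list the snapshot and WAL files of the broken member
ls -l /var/lib/etcd/member/snap /var/lib/etcd/member/wal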

2. Fixing the Problem

Further inspection showed that etcd1 and etcd2 both hit the error above, while etcd3 did not.
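A quick way to tell which members are affected is to repeat the same log check on each etcd node (a sketch; it assumes every member runs as the systemd unit named etcd):

# run on master01 / master02 / master03 in turn
journalctl -u etcd -n 50 --no-pager | grep -i "snapshot"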

2.1 Back up the damaged data

On etcd1 and etcd2, move the damaged member directory aside:

mv /var/lib/etcd/member /opt/
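The write-up does not show it, but it is usually safer to stop the failing service first so systemd does not keep restarting the crashing member while its data is being moved (treat this extra step as an assumption):

systemctl stop etcd            # stop the crash-looping member first
mv /var/lib/etcd/member /opt/  # then move the damaged data aside, as above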

2.2 Copy the data

Copy the intact data from etcd3 to etcd1 and etcd2 (run these on etcd3):

scp -r /var/lib/etcd/member  10.0.0.87:/var/lib/etcd/
scp -r /var/lib/etcd/member  10.0.0.97:/var/lib/etcd/
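As an optional sanity check before restarting anything, the copied directory size on etcd1 and etcd2 can be compared against etcd3 (not part of the original steps):

# run on each node; the sizes should roughly match
du -sh /var/lib/etcd/member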

2.3 Start etcd1 and etcd2

[root@master01 etcd]# systemctl start etcd
[root@master01 etcd]# systemctl status etcd
● etcd.service - Etcd Service
   Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2023-10-08 19:24:37 CST; 27min ago
     Docs: https://coreos.com/etcd/docs/latest/
 Main PID: 71124 (etcd)
    Tasks: 11
   Memory: 84.5M
   CGroup: /system.slice/etcd.service
           └─71124 /usr/local/bin/etcd --config-file=/etc/etcd/etcd.config.yml

-------------------------------------
[root@master02 etcd]# systemctl start etcd
[root@master02 etcd]# systemctl status etcd
● etcd.service - Etcd Service
   Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2023-10-08 19:24:37 CST; 27min ago
     Docs: https://coreos.com/etcd/docs/latest/
 Main PID: 4643 (etcd)
    Tasks: 12
   Memory: 72.2M
   CGroup: /system.slice/etcd.service
           └─4643 /usr/local/bin/etcd --config-file=/etc/etcd/etcd.config.yml
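If needed, the logs on both restarted members can be tailed the same way as in the problem description to confirm the snapshot panic is really gone (assumption: same journalctl pattern as before):

journalctl -u etcd -f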

2.4 Start etcd3

Starting etcd3, however, failed with an error:

[root@master03 etcd]# systemctl start etcd
Job for etcd.service failed because the control process exited with error code. See "systemctl status etcd.service" and "journalctl -xe" for details.

Since etcd1 and etcd2 are back to normal, etcd3's data can simply be deleted so that it resynchronizes from the cluster on its own:

[root@master03 etcd]# pwd
/var/lib/etcd
[root@master03 etcd]# rm -rf ./*
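With the data directory cleared, etcd is started again on master03 so the member rejoins the cluster and pulls its data back from the healthy members. The restart command is not shown in the original output, so treat it as an assumption; it relies on the cluster membership already being defined in /etc/etcd/etcd.config.yml:

[root@master03 etcd]# systemctl start etcd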

# Verify
[root@master03 etcd]# journalctl -u etcd -f
-- Logs begin at Sat 2023-10-07 20:57:40 CST. --
Oct 08 19:54:41 master03 etcd[12512]: store.index: compact 2706
Oct 08 19:54:41 master03 etcd[12512]: finished scheduled compaction at 2706 (took 1.3901ms)
Oct 08 19:54:41 master03 etcd[12512]: store.index: compact 3302
Oct 08 19:54:41 master03 etcd[12512]: finished scheduled compaction at 3302 (took 2.115ms)
Oct 08 19:54:41 master03 etcd[12512]: published {Name:master03 ClientURLs:[https://10.0.0.107:2379]} to cluster 514c88d14c1a2aa1
Oct 08 19:54:41 master03 etcd[12512]: ready to serve client requests
Oct 08 19:54:41 master03 etcd[12512]: ready to serve client requests
Oct 08 19:54:41 master03 etcd[12512]: serving insecure client requests on 127.0.0.1:2379, this is strongly discouraged!
Oct 08 19:54:41 master03 systemd[1]: Started Etcd Service.
Oct 08 19:54:41 master03 etcd[12512]: serving client requests on 10.0.0.107:2379
Oct 08 19:54:42 master03 etcd[12512]: updated the cluster version from 3.0 to 3.4
Oct 08 19:54:42 master03 etcd[12512]: enabled capabilities for version 3.4

[root@master03 etcd]# systemctl status etcd
● etcd.service - Etcd Service
   Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2023-10-08 19:54:41 CST; 2min 41s ago
     Docs: https://coreos.com/etcd/docs/latest/
 Main PID: 12512 (etcd)
    Tasks: 10
   Memory: 58.7M
   CGroup: /system.slice/etcd.service
           └─12512 /usr/local/bin/etcd --config-file=/etc/etcd/etcd.config.yml

3. Verifying the Data

3.1 Verify the etcd cluster

[root@master01 init]# export ETCDCTL_API=3
[root@master01 init]# etcdctl --endpoints="10.0.0.87:2379,10.0.0.97:2379,10.0.0.107:2379" --cacert=/etc/kubernetes/pki/etcd/etcd-ca.pem --cert=/etc/kubernetes/pki/etcd/etcd.pem --key=/etc/kubernetes/pki/etcd/etcd-key.pem  endpoint status --write-out=table
+-----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT     |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|  10.0.0.87:2379 | 949b9ccaa465bea8 |  3.4.13 |  3.9 MB |      true |      false |        14 |       5239 |               5239 |        |
|  10.0.0.97:2379 | 795272eff6c8418e |  3.4.13 |  3.8 MB |     false |      false |        14 |       5239 |               5239 |        |
| 10.0.0.107:2379 | 41172b80a9c89e7f |  3.4.13 |  3.9 MB |     false |      false |        14 |       5239 |               5239 |        |
+-----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
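The same endpoints and certificate flags also work for a quick health probe of each member (an optional extra check, assuming the certificate paths above):

[root@master01 init]# etcdctl --endpoints="10.0.0.87:2379,10.0.0.97:2379,10.0.0.107:2379" --cacert=/etc/kubernetes/pki/etcd/etcd-ca.pem --cert=/etc/kubernetes/pki/etcd/etcd.pem --key=/etc/kubernetes/pki/etcd/etcd-key.pem  endpoint health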

3.2 Verify the Kubernetes cluster

[root@master01 ~]# kubectl get node
NAME       STATUS   ROLES    AGE   VERSION
master01   Ready    <none>   35m   v1.20.0
master02   Ready    <none>   35m   v1.20.0
master03   Ready    <none>   35m   v1.20.0
node02     Ready    <none>   35m   v1.20.0

[root@master01 dashboard]# kubectl get pod -A
NAMESPACE              NAME                                         READY   STATUS    RESTARTS   AGE
kube-system            calico-kube-controllers-5f6d4b864b-jf7sl     1/1     Running   1          50m
kube-system            calico-node-5vkdg                            1/1     Running   2          50m
kube-system            calico-node-k4jtq                            1/1     Running   1          50m
kube-system            calico-node-l27hd                            1/1     Running   1          50m
kube-system            calico-node-vt9jf                            1/1     Running   2          50m
kube-system            calico-node-w7x9b                            1/1     Running   1          50m
kube-system            coredns-867d46bfc6-6rx4r                     1/1     Running   2          38m
kube-system            metrics-server-595f65d8d5-wvp8s              1/1     Running   1          36m
kubernetes-dashboard   dashboard-metrics-scraper-79c5968bdc-ghmzr   1/1     Running   2          31m
kubernetes-dashboard   kubernetes-dashboard-9f9799597-vqzmq         1/1     Running   1          24m