Symptoms

  • The redis container in the Harbor service fails to start

    [root@acp2-master-1 ~]# kubectl get po -n default
    NAME READY STATUS RESTARTS AGE
    docker-registry-fb854474f-jmwq5 1/1 Running 16 212d
    gitlab-ce-gitlab-ce-5c7b984fc-85clk 1/1 Running 8 9d
    gitlab-ce-gitlab-ce-database-8f7d789ff-hm2rf 1/1 Running 8 9d
    gitlab-ce-gitlab-ce-redis-c6b479b95-t5rjr 1/1 Running 8 9d
    harbor-harbor-chartmuseum-7bfd86c887-7dvnt 1/1 Running 0 33m
    harbor-harbor-clair-5d6bd4fdf-nxcw8 1/1 Running 3 33m
    harbor-harbor-core-d95c5f884-5x2cm 0/1 CrashLoopBackOff 5 14m
    harbor-harbor-database-0 1/1 Running 0 32m
    harbor-harbor-jobservice-f5d9c4995-nh8qk 1/1 Running 6 14m
    harbor-harbor-nginx-774f9569cb-njxtn 1/1 Running 0 33m
    harbor-harbor-notary-server-867d58d99f-hdq8v 1/1 Running 0 33m
    harbor-harbor-notary-signer-6f6955b4fc-z99sh 1/1 Running 0 33m
    harbor-harbor-portal-65fd74dcbb-8pctx 1/1 Running 0 33m
    harbor-harbor-redis-bd8dbdf49-kttnr 0/1 CrashLoopBackOff 6 7m36s
    harbor-harbor-registry-7bbf6cb89f-ncxz9 2/2 Running 0 33m
  • Errors in the log

    [root@acp2-master-1 ~]# kubectl logs --tail 20 -f -n default harbor-harbor-redis-bd8dbdf49-kttnr
    [Redis ASCII-art startup banner: Running in standalone mode, Port: 6379, PID: 1, http://redis.io]

    1:M 22 Apr 2020 06:14:17.117 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
    1:M 22 Apr 2020 06:14:17.117 # Server initialized
    1:M 22 Apr 2020 06:14:17.117 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
    1:M 22 Apr 2020 06:14:17.120 * Reading RDB preamble from AOF file...
    1:M 22 Apr 2020 06:14:17.730 * Reading the remaining AOF tail...
    1:M 22 Apr 2020 06:14:18.095 # Bad file format reading the append only file: make a backup of your AOF file, then use ./redis-check-aof --fix <filename>

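    Four issues are covered below: three warnings and one error (only two of the warnings appear in the excerpt above). The one that actually keeps redis from starting is the last line: "Bad file format reading the append only file: make a backup of your AOF file, then use ./redis-check-aof --fix <filename>". In other words, the append-only file (AOF) has a bad format; the advice is to back it up and then repair it with ./redis-check-aof --fix. The AOF file redis uses here is named "appendonly.aof".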
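When the pod list is long, the failing pods can be picked out automatically. The following is a minimal sketch, not part of the original procedure: `unhealthy_pods` is a hypothetical helper name, and it assumes the default column layout of `kubectl get po` (NAME, READY, STATUS, ...).

```shell
#!/bin/sh
# List pods whose STATUS is not Running or whose READY count is incomplete.
# Reads `kubectl get po` output on stdin (one header line, then one pod per line).
unhealthy_pods() {
    awk 'NR > 1 {
        split($2, ready, "/")            # READY column, e.g. "0/1"
        if ($3 != "Running" || ready[1] != ready[2])
            print $1, $3
    }'
}

# Usage (assumes kubectl access to the cluster):
#   kubectl get po -n default | unhealthy_pods
```

Fed the listing above, this would print only the harbor-harbor-core and harbor-harbor-redis pods, both in CrashLoopBackOff. Note that pods in other states such as Completed would be reported as well.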

Solution (the WARNING messages do not have to be fixed)

  • First warning: WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.

    Method 1, temporary (lost on reboot): sysctl -w net.core.somaxconn=511
    Method 2, permanent: add the following line to /etc/sysctl.conf
    net.core.somaxconn = 511
    then run
    sysctl -p
  • Second warning: WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.

    Solution
    Method 1, temporary (lost on reboot): sysctl -w vm.overcommit_memory=1
    Method 2, permanent: add the following line to /etc/sysctl.conf
    vm.overcommit_memory = 1
    then run
    sysctl -p

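    Additional notes:
    About the overcommit_memory parameter:
    It sets the kernel's memory allocation policy (optional; choose it according to the server's actual situation).
    /proc/sys/vm/overcommit_memory
    Possible values: 0, 1, 2.
    0: heuristic overcommit (the default). The kernel estimates whether enough memory is available for the request; if so the allocation is allowed, otherwise it fails and the error is returned to the application.
    1: always overcommit. The kernel grants every allocation regardless of the current memory state.
    2: never overcommit. The total committed memory may not exceed swap plus a configurable share of physical RAM (vm.overcommit_ratio); allocations beyond that limit fail.
    Note: when redis dumps data it forks a child process. In theory the child occupies as much memory as the parent: if the parent uses 8G, the kernel has to be willing to commit another 8G for the child, even though copy-on-write means most pages are shared in practice. If that memory cannot be provided, the redis server may go down, or I/O load may rise and performance drop. The recommended policy here is therefore 1 (always overcommit).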
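Editing /etc/sysctl.conf by hand is easy to get wrong when the key already exists. As an illustration only, the sketch below sets a key idempotently: it replaces an existing entry or appends a new one. `set_sysctl_conf` is a hypothetical helper, not a standard tool, and it rewrites the file through a temporary copy, so try it on a scratch file first.

```shell
#!/bin/sh
# Set "key = value" in a sysctl-style config file: replace the existing
# entry for the key, or append one if the key is absent.
set_sysctl_conf() {
    conf=$1 key=$2 value=$3
    # Escape dots so the key is matched literally in the regular expressions.
    pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
    if grep -Eq "^[[:space:]]*${pattern}[[:space:]]*=" "$conf" 2>/dev/null; then
        sed "s|^[[:space:]]*${pattern}[[:space:]]*=.*|${key} = ${value}|" "$conf" > "$conf.tmp" &&
            mv "$conf.tmp" "$conf"
    else
        printf '%s = %s\n' "$key" "$value" >> "$conf"
    fi
}

# Usage (as root), covering both sysctl warnings:
#   set_sysctl_conf /etc/sysctl.conf net.core.somaxconn 511
#   set_sysctl_conf /etc/sysctl.conf vm.overcommit_memory 1
#   sysctl -p
```

Running it twice with the same arguments leaves a single entry in the file, which is the point of the replace-or-append logic.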
  • Third warning: WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.

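    Solution: the message itself contains the fix. Running 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root is enough, but it only affects the running system; after a reboot the setting has to be applied again, so add the command to the boot sequence.
    Edit /etc/rc.local and add: echo never > /sys/kernel/mm/transparent_hugepage/enabled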
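To confirm whether THP is actually disabled, read the same file back: the kernel marks the active mode with square brackets, for example "[always] madvise never". A small sketch follows; `thp_mode` is a hypothetical helper name, and the optional file argument exists only so the function can be tried on a sample file.

```shell
#!/bin/sh
# Print the active THP mode, i.e. the bracketed word in a line such as
# "[always] madvise never". Reads the kernel file unless a path is given.
thp_mode() {
    sed -n 's/.*\[\([a-z]*\)\].*/\1/p' "${1:-/sys/kernel/mm/transparent_hugepage/enabled}"
}

# Usage (as root): disable THP only if it is not already off.
#   [ "$(thp_mode)" = never ] || echo never > /sys/kernel/mm/transparent_hugepage/enabled
```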

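    That covers the three warnings. Fixing them is optional and they do not affect the service. Next comes the problem that actually keeps redis from starting.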

  • Run docker inspect on the redis container to find where its data lives on the host

    },
    "Mounts": [
        {
            "Type": "bind",
            "Source": "/var/lib/kubelet/pods/965118f1-a115-43f1-b517-870028afb64f/volumes/kubernetes.io~cephfs/pvc-37d4ca75-5316-11ea-b6ed-525400d63265",
            "Destination": "/bitnami/redis",
            "Mode": "",
            "RW": true,
            "Propagation": "rprivate"
        },
        {
            "Type": "bind",
            "Source": "/var/lib/kubelet/pods/965118f1-a115-43f1-b517-870028afb64f/volumes/kubernetes.io~secret/default-token-tv2rp",
            "Destination": "/var/run/secrets/kubernetes.io/serviceaccount",
            "Mode": "ro",
            "RW": false,
            "Propagation": "rprivate"
        },
        {
            "Type": "bind",
            "Source": "/var/lib/kubelet/pods/965118f1-a115-43f1-b517-870028afb64f/etc-hosts",
            "Destination": "/etc/hosts",
            "Mode": "",
            "RW": true,
            "Propagation": "rprivate"
        },
        {
            "Type": "bind",
            "Source": "/var/lib/kubelet/pods/965118f1-a115-43f1-b517-870028afb64f/containers/harbor-harbor/fdc4d83f",
            "Destination": "/dev/termination-log",
            "Mode": "",
            "RW": true,
            "Propagation": "rprivate"
        }
    ],
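Instead of reading through the full inspect output, the mount source can be pulled out directly with a Go template passed to `docker inspect --format`. This is a sketch under assumptions: `mount_source` is a hypothetical helper, and it relies on the `Mounts` entries having `Source` and `Destination` fields as in the output above.

```shell
#!/bin/sh
# Print the host path that is mounted at a given destination inside a container.
mount_source() {
    container=$1 dest=$2
    docker inspect \
        --format "{{range .Mounts}}{{if eq .Destination \"$dest\"}}{{.Source}}{{end}}{{end}}" \
        "$container"
}

# Usage, with the redis container ID taken from `docker ps -a`:
#   mount_source <container-id> /bitnami/redis
```

For the redis container in this incident, the destination of interest is /bitnami/redis, whose source is the cephfs volume directory under /var/lib/kubelet/pods.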
  • On the host, go to /var/lib/kubelet/pods/965118f1-a115-43f1-b517-870028afb64f/ and use find to locate the AOF file appendonly.aof

    [root@acp2-node-1 ~]# cd /var/lib/kubelet/pods/965118f1-a115-43f1-b517-870028afb64f/volumes
    [root@acp2-node-1 volumes]# find ./* -name appendonly.aof
    ./kubernetes.io~cephfs/pvc-37d4ca75-5316-11ea-b6ed-525400d63265/data/appendonly.aof
  • Then locate the repair tool redis-check-aof, again with find

    [root@acp2-node-1 pods]# find /* -name redis-check-aof
    /var/lib/docker/overlay2/b7d53e5601b2f83193463f47c11ac363cb15d4e7152935484944ff94d8ff0e49/diff/opt/bitnami/redis/bin/redis-check-aof
    [root@acp2-node-1 pods]# cd /var/lib/docker/overlay2/b7d53e5601b2f83193463f47c11ac363cb15d4e7152935484944ff94d8ff0e49/diff/opt/bitnami/redis/bin/
    [root@acp2-node-1 bin]# ll
    total 6732
    -rwxrwxr-x 1 root root 679512 Sep 21 2019 redis-benchmark
    -rwxrwxr-x 1 root root 1786296 Sep 21 2019 redis-check-aof
    -rwxrwxr-x 1 root root 1786296 Sep 21 2019 redis-check-rdb
    -rwxrwxr-x 1 root root 841624 Sep 21 2019 redis-cli
    -rwxrwxr-x 1 root root 1786296 Sep 21 2019 redis-server
    [root@acp2-node-1 bin]#
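  • Then run the following command to repair the file (as the error message advises, make a backup of the AOF file first)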

    [root@acp2-node-1 bin]# ./redis-check-aof --fix /var/lib/kubelet/pods/965118f1-a115-43f1-b517-870028afb64f/volumes/kubernetes.io~cephfs/pvc-37d4ca75-5316-11ea-b6ed-525400d63265/data/appendonly.aof
    The AOF appears to start with an RDB preamble.
    Checking the RDB preamble to start:
    [offset 0] Checking RDB file --fix
    [offset 26] AUX FIELD redis-ver = '5.0.5'
    [offset 40] AUX FIELD redis-bits = '64'
    [offset 52] AUX FIELD ctime = '1587523879'
    [offset 67] AUX FIELD used-mem = '79175024'
    [offset 83] AUX FIELD aof-preamble = '1'
    [offset 85] Selecting DB ID 0
    [offset 30115710] Selecting DB ID 1
    [offset 30117290] Selecting DB ID 2
    [offset 30596730] Checksum OK
    [offset 30596730] \o/ RDB looks OK! \o/
    [info] 472735 keys read
    [info] 470549 expires
    [info] 470549 already expired
    RDB preamble is OK, proceeding with AOF tail...
    0x 2b98f71: Expected prefix '*', got: '
    AOF analyzed: size=45715470, ok_up_to=45715313, diff=157
    This will shrink the AOF from 45715470 bytes, with 157 bytes, to 45715313 bytes
    Continue? [y/N]: y
    Successfully truncated AOF
    [root@acp2-node-1 bin]#
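Because `--fix` truncates the file in place (157 bytes in the run above), it is worth taking the backup that the redis error message recommends before answering `y`. A minimal sketch: `backup_aof` is a hypothetical helper, and the `.bak.<timestamp>` suffix is just one possible naming convention.

```shell
#!/bin/sh
# Copy the AOF file to a timestamped backup next to it and print the backup path.
backup_aof() {
    aof=$1
    backup="$aof.bak.$(date +%Y%m%d%H%M%S)"
    cp -p "$aof" "$backup" && printf '%s\n' "$backup"
}

# Usage (run from the directory that contains redis-check-aof):
#   aof=/path/to/appendonly.aof
#   backup_aof "$aof" && ./redis-check-aof --fix "$aof"
```

Chaining the two commands with `&&` means the repair only runs if the copy succeeded.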
  • Start the container

    [root@acp2-node-1 bin]# docker ps -a | grep harbor-harbor-redis
    9c917c4164c8 180c2ecb6e22 "/entrypoint.sh /run…" About a minute ago Exited (1) About a minute ago k8s_harbor-harbor_harbor-harbor-redis-bd8dbdf49-kttnr_default_965118f1-a115-43f1-b517-870028afb64f_6
    cb1ce5f0547c 10.0.129.100:60080/claas/pause:3.1 "/pause" About an hour ago Up About an hour k8s_POD_harbor-harbor-redis-bd8dbdf49-kttnr_default_965118f1-a115-43f1-b517-870028afb64f_0
    [root@acp2-node-1 bin]# docker restart 9c917c4164c8
    9c917c4164c8
    [root@acp2-node-1 bin]# docker ps -a | grep harbor-harbor-redis
    08ecd6aa40b1 180c2ecb6e22 "/entrypoint.sh /run…" About a minute ago Up About a minute k8s_harbor-harbor_harbor-harbor-redis-bd8dbdf49-kttnr_default_965118f1-a115-43f1-b517-870028afb64f_7
    cb1ce5f0547c 10.0.129.100:60080/claas/pause:3.1 "/pause" About an hour ago Up About an hour k8s_POD_harbor-harbor-redis-bd8dbdf49-kttnr_default_965118f1-a115-43f1-b517-870028afb64f_0
    [root@acp2-node-1 bin]#
  • Verify
    Testing shows the problem is gone and all pods are healthy again

    [root@acp2-master-1 ~]# kubectl get po -n default | grep harbor
    harbor-harbor-chartmuseum-7bfd86c887-7dvnt 1/1 Running 0 101m
    harbor-harbor-clair-5d6bd4fdf-nxcw8 1/1 Running 3 101m
    harbor-harbor-core-d95c5f884-b9cp6 1/1 Running 0 28m
    harbor-harbor-database-0 1/1 Running 0 100m
    harbor-harbor-jobservice-f5d9c4995-ljz8p 1/1 Running 0 28m
    harbor-harbor-nginx-774f9569cb-njxtn 1/1 Running 0 101m
    harbor-harbor-notary-server-867d58d99f-hdq8v 1/1 Running 0 101m
    harbor-harbor-notary-signer-6f6955b4fc-z99sh 1/1 Running 0 101m
    harbor-harbor-portal-65fd74dcbb-8pctx 1/1 Running 0 101m
    harbor-harbor-redis-bd8dbdf49-kttnr 1/1 Running 7 75m
    harbor-harbor-registry-7bbf6cb89f-ncxz9 2/2 Running 0 101m
    [root@acp2-master-1 ~]#