
Corrupted containerd/config.toml after several operator install/uninstall iterations #481

Open
ldoktor opened this issue Jan 7, 2025 · 8 comments · May be fixed by #482

Comments


ldoktor commented Jan 7, 2025

Describe the bug
During coco/operator installation, /etc/containerd/config.toml is modified (at least) twice. When I run a loop of install/stabilize/uninstall, I always end up quite quickly (2-6 iterations) with a NotReady node. After investigation I noticed the containerd service was not running, always complaining about a broken config file. Once it was `imports = [ , ]`, another time it was `imports = ["/etc/containerd/config.toml.d/nydus-snapshotter.toml"]` while the nydus-snapshotter.toml file itself was missing.

To Reproduce
Steps to reproduce the behavior:

```sh
cd tests/e2e
ansible-playbook -i localhost, -c local --tags untagged ansible/main.yaml
bash -xc './cluster/up.sh'
export KUBECONFIG=/etc/kubernetes/admin.conf
bash -xc './operator.sh'
I=0
while :; do
    echo $I
    ./operator.sh uninstall
    ./operator.sh install
    ./operator.sh wait_for_stabilization
    ((I++)) || true
done
```

Describe the results you expected
The loop should run indefinitely with no errors.

Describe the results you received:
After 2-6 iterations I get timeout errors, kubectl get nodes reports the node as NotReady, and systemctl status containerd shows the service as down.

Additional context
Using upstream coco/operator e7d4946 on a fresh Ubuntu 22.04 VM.


ldoktor commented Jan 7, 2025

This time it survived 8 iterations but then corrupted the imports again, this time with multiple entries: `imports = ["/etc/containerd/config.toml.d/nydus-snapshotter.toml", "/etc/containerd/config.toml.d/nydus-snapshotter.toml", "/etc/containerd/config.toml.d/nydus-snapshotter.toml", "/opt/kata/containerd/config.d/kata-deploy.toml"]`.

After two more iterations I was left without /etc/containerd/config.toml at all (while /etc/containerd/config.toml.d/nydus-snapshotter.toml was still present).


ldoktor commented Jan 8, 2025

There is one more thing I noticed which troubles me a bit and might be slightly related. Please let me know whether I should create a separate issue about it, track it here or simply ignore it. The events output on uninstall always complains about: confidential-containers-system 10s Warning FailedKillPod pod/cc-operator-daemon-install-vstff error killing pod: [failed to "KillContainer" for "cc-runtime-install-pod" with KillContainerError: "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /var/run/containerd/containerd.sock: connect: no such file or directory\"", failed to "KillPodSandbox" for "f95e46f4-e5db-49b6-8903-7e535d784919" with KillPodSandboxError: "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /var/run/containerd/containerd.sock: connect: no such file or directory\""]

which is logical, as containerd (and thus containerd.sock) is being restarted. After this there is no visible activity for a long time, and usually all pods are removed within a few minutes, so I guess kubelet either retries or simply force-removes things after a timeout. It would perhaps be nice to include some kind of barrier or retry so that the kill is attempted again once containerd is back up.

Anyway, Go is still quite foreign to me and it's hard for me to find the places where these things happen; perhaps it's already handled somehow and this is a harmless warning. I'm just mentioning it here since it's related to the containerd handling.
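The barrier/retry idea above could look roughly like the following Python sketch. It is purely illustrative: the function name, the retry counts, and the way the socket is probed are all assumptions, not kubelet's or the operator's actual logic.

```python
import os
import time

def with_containerd_barrier(op, sock="/var/run/containerd/containerd.sock",
                            retries=30, delay=1.0):
    """Retry `op` until the containerd socket exists, instead of failing
    the very first attempt while containerd is being restarted."""
    for _ in range(retries):
        if os.path.exists(sock):
            try:
                return op()
            except ConnectionError:
                pass  # containerd may be up but not yet serving
        time.sleep(delay)
    raise TimeoutError(f"{sock} did not come back within {retries} tries")
```

With a wrapper like this, the kill would simply be deferred until containerd is back, instead of surfacing a FailedKillPod warning on the first attempt.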

Full output of a successful uninstall

confidential-containers-system   19m         Normal    Created                   pod/cc-operator-daemon-install-vstff                   Created container cc-runtime-install-pod
confidential-containers-system   19m         Normal    Started                   pod/cc-operator-daemon-install-vstff                   Started container cc-runtime-install-pod
confidential-containers-system   19m         Normal    Pulled                    pod/cc-operator-daemon-install-vstff                   Successfully pulled image "quay.io/kata-containers/kata-deploy-ci:kata-containers-latest" in 5m8.199s (5m8.199s including waiting). Image size: 1029942957 bytes.
confidential-containers-system   2m22s       Normal    SuccessfulCreate          daemonset/cc-operator-daemon-uninstall                 Created pod: cc-operator-daemon-uninstall-8cm7l
confidential-containers-system   2m22s       Normal    Scheduled                 pod/cc-operator-daemon-uninstall-8cm7l                 Successfully assigned confidential-containers-system/cc-operator-daemon-uninstall-8cm7l to e2e
confidential-containers-system   2m21s       Normal    Pulling                   pod/cc-operator-daemon-uninstall-8cm7l                 Pulling image "quay.io/kata-containers/kata-deploy-ci:kata-containers-latest"
confidential-containers-system   2m21s       Normal    Pulled                    pod/cc-operator-daemon-uninstall-8cm7l                 Successfully pulled image "quay.io/kata-containers/kata-deploy-ci:kata-containers-latest" in 503ms (503ms including waiting). Image size: 1029942957 bytes.   
confidential-containers-system   2m21s       Normal    Created                   pod/cc-operator-daemon-uninstall-8cm7l                 Created container cc-runtime-install-pod
confidential-containers-system   2m21s       Normal    Started                   pod/cc-operator-daemon-uninstall-8cm7l                 Started container cc-runtime-install-pod
confidential-containers-system   82s         Normal    SuccessfulCreate          daemonset/cc-operator-post-uninstall-daemon            Created pod: cc-operator-post-uninstall-daemon-vmwbh
confidential-containers-system   82s         Normal    Scheduled                 pod/cc-operator-post-uninstall-daemon-vmwbh            Successfully assigned confidential-containers-system/cc-operator-post-uninstall-daemon-vmwbh to e2e
confidential-containers-system   81s         Normal    Created                   pod/cc-operator-post-uninstall-daemon-vmwbh            Created container cc-runtime-post-uninstall-pod
confidential-containers-system   81s         Normal    Pulling                   pod/cc-operator-post-uninstall-daemon-vmwbh            Pulling image "localhost:5000/reqs-payload"
confidential-containers-system   81s         Normal    Pulled                    pod/cc-operator-post-uninstall-daemon-vmwbh            Successfully pulled image "localhost:5000/reqs-payload" in 78ms (78ms including waiting). Image size: 85426452 bytes.
confidential-containers-system   81s         Normal    Started                   pod/cc-operator-post-uninstall-daemon-vmwbh            Started container cc-runtime-post-uninstall-pod
confidential-containers-system   22s         Normal    Killing                   pod/cc-operator-daemon-uninstall-8cm7l                 Stopping container cc-runtime-install-pod
confidential-containers-system   11s         Normal    Killing                   pod/cc-operator-daemon-install-vstff                   Stopping container cc-runtime-install-pod
confidential-containers-system   22s         Normal    Killing                   pod/cc-operator-post-uninstall-daemon-vmwbh            Stopping container cc-runtime-post-uninstall-pod
confidential-containers-system   22s         Normal    Killing                   pod/cc-operator-pre-install-daemon-cgkb4               Stopping container cc-runtime-pre-install-pod
confidential-containers-system   11s         Warning   FailedPreStopHook         pod/cc-operator-daemon-install-vstff                   PreStopHook failed 
confidential-containers-system   10s         Warning   FailedKillPod             pod/cc-operator-daemon-install-vstff                   error killing pod: [failed to "KillContainer" for "cc-runtime-install-pod" with KillContainerError: "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /var/run/containerd/containerd.sock: connect: no such file or directory\"", failed to "KillPodSandbox" for "f95e46f4-e5db-49b6-8903-7e535d784919" with KillPodSandboxError: "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /var/run/containerd/containerd.sock: connect: no such file or directory\""]

For comparison full output of an unsuccessful uninstall where containerd config got corrupted:

confidential-containers-system   19m         Normal    SuccessfulCreate    replicaset/cc-operator-controller-manager-6f5466fb6b   Created pod: cc-operator-controller-manager-6f5466fb6b-92njx
confidential-containers-system   19m         Normal    Scheduled           pod/cc-operator-controller-manager-6f5466fb6b-92njx    Successfully assigned confidential-containers-system/cc-operator-controller-manager-6f5466fb6b-92njx to e2e
confidential-containers-system   19m         Normal    ScalingReplicaSet   deployment/cc-operator-controller-manager              Scaled up replica set cc-operator-controller-manager-6f5466fb6b to 1
confidential-containers-system   19m         Normal    Created             pod/cc-operator-controller-manager-6f5466fb6b-92njx    Created container kube-rbac-proxy
confidential-containers-system   19m         Normal    Started             pod/cc-operator-controller-manager-6f5466fb6b-92njx    Started container kube-rbac-proxy
confidential-containers-system   19m         Normal    Pulling             pod/cc-operator-controller-manager-6f5466fb6b-92njx    Pulling image "localhost:5000/cc-operator:latest"
confidential-containers-system   19m         Normal    Pulled              pod/cc-operator-controller-manager-6f5466fb6b-92njx    Successfully pulled image "localhost:5000/cc-operator:latest" in 50ms (50ms including waiting). Image size: 26581550 bytes.
confidential-containers-system   19m         Normal    Created             pod/cc-operator-controller-manager-6f5466fb6b-92njx    Created container manager
confidential-containers-system   19m         Normal    Started             pod/cc-operator-controller-manager-6f5466fb6b-92njx    Started container manager
confidential-containers-system   19m         Normal    Pulled              pod/cc-operator-controller-manager-6f5466fb6b-92njx    Container image "gcr.io/kubebuilder/kube-rbac-proxy:v0.13.1" already present on machine
confidential-containers-system   19m         Normal    LeaderElection      lease/69bf4d38.confidentialcontainers.org              cc-operator-controller-manager-6f5466fb6b-92njx_fc2400dc-25ca-4af6-b095-042240d88f6d became leader
confidential-containers-system   19m         Normal    Scheduled           pod/cc-operator-pre-install-daemon-hkt7d               Successfully assigned confidential-containers-system/cc-operator-pre-install-daemon-hkt7d to e2e
confidential-containers-system   19m         Normal    SuccessfulCreate    daemonset/cc-operator-pre-install-daemon               Created pod: cc-operator-pre-install-daemon-hkt7d
confidential-containers-system   19m         Normal    Pulling             pod/cc-operator-pre-install-daemon-hkt7d               Pulling image "localhost:5000/reqs-payload"
confidential-containers-system   19m         Normal    Pulled              pod/cc-operator-pre-install-daemon-hkt7d               Successfully pulled image "localhost:5000/reqs-payload" in 70ms (70ms including waiting). Image size: 85426452 bytes.
confidential-containers-system   19m         Normal    Created             pod/cc-operator-pre-install-daemon-hkt7d               Created container cc-runtime-pre-install-pod
confidential-containers-system   19m         Normal    Started             pod/cc-operator-pre-install-daemon-hkt7d               Started container cc-runtime-pre-install-pod
confidential-containers-system   19m         Normal    SuccessfulCreate    daemonset/cc-operator-daemon-install                   Created pod: cc-operator-daemon-install-dvzm7
confidential-containers-system   19m         Normal    Scheduled           pod/cc-operator-daemon-install-dvzm7                   Successfully assigned confidential-containers-system/cc-operator-daemon-install-dvzm7 to e2e
confidential-containers-system   19m         Normal    Created             pod/cc-operator-daemon-install-dvzm7                   Created container cc-runtime-install-pod
confidential-containers-system   19m         Normal    Started             pod/cc-operator-daemon-install-dvzm7                   Started container cc-runtime-install-pod
confidential-containers-system   19m         Normal    Pulled              pod/cc-operator-daemon-install-dvzm7                   Successfully pulled image "quay.io/kata-containers/kata-deploy-ci:kata-containers-latest" in 568ms (568ms including waiting). Image size: 1029942957 bytes.
confidential-containers-system   19m         Normal    Pulling             pod/cc-operator-daemon-install-dvzm7                   Pulling image "quay.io/kata-containers/kata-deploy-ci:kata-containers-latest"
default                          14s         Warning   ContainerGCFailed   node/e2e                                               rpc error: code = Unimplemented desc = unknown service runtime.v1.RuntimeService
default                          19m         Normal    NodeNotReady        node/e2e                                               Node e2e status is now: NodeNotReady
default                          4m15s       Warning   ImageGCFailed       node/e2e                                               rpc error: code = Unimplemented desc = unknown service runtime.v1.ImageService


ldoktor commented Jan 13, 2025

Hello @fidencio, I went ahead and tried to make some sense of this, and IIUC the main problem here is that the operator uses the `katacontainers.io/kata-runtime: "cleanup"` label for two purposes:

  1. node selector for daemon uninstall
  2. label to detect kata is already uninstalled

So what it does is:

  1. start uninstallation
  2. handleFinalizers - checks for nodes with "cleanup" and sees []
  3. handleFinalizers - calls setCleanupNodeLabels which sets cleanup labels
  4. this results in daemon uninstall having nodes to run on and starts the cleanup (kata-deploy cleanup)
  5. handleFinalizers - checks again for nodes with cleanup; this succeeds even though kata-deploy hasn't finished yet, so postUninstall runs prematurely
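The five steps above can be reproduced with a toy Python model. Only the label key is real; the node name, function names, and the completion flag are illustrative stand-ins for the operator's Go code.

```python
# Toy model of the race: the same label both schedules the uninstall
# daemonset and is read back as proof that cleanup already finished.
node_labels = {"e2e": {}}            # hypothetical single-node cluster
kata_deploy_cleanup_done = False     # the daemonset has only been scheduled

def set_cleanup_node_labels():
    # step 3: the operator marks nodes for the uninstall daemonset
    for labels in node_labels.values():
        labels["katacontainers.io/kata-runtime"] = "cleanup"

def nodes_with_cleanup_label():
    # steps 2 and 5: the very same query, used both before and after
    return [n for n, l in node_labels.items()
            if l.get("katacontainers.io/kata-runtime") == "cleanup"]

assert nodes_with_cleanup_label() == []          # step 2: nothing yet
set_cleanup_node_labels()                        # step 3
# step 5: the check passes although kata-deploy has not finished,
# so postUninstall would start too early
assert nodes_with_cleanup_label() and not kata_deploy_cleanup_done
```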

To confirm my assumption I tried a modified kata-deploy that sets a `katacontainers.io/kata-runtime2: "cleanup"` label, and modified the condition in the operator to check for it, and it worked well: kata-runtime=cleanup is set by the operator, then the uninstall daemonset is scheduled, and after it finishes it sets kata-runtime2=cleanup, which lets the operator start postUninstall; afterwards the operator cleans up both the kata-runtime and kata-runtime2 labels.

As for the proper fix, I'm very new to this area, but I think the operator should introduce its own label to use as the node selector for the daemon uninstall deployment (e.g. cc-uninstall/run=true) and set that one in setCleanupNodeLabels, rather than interfering with the katacontainers.io/kata-runtime label (which it does not own in the first place). What do you think?


ldoktor commented Jan 13, 2025

Btw, setCleanupNodeLabels could actually have been the source of all the hangs: if the uninstall runs before kata-deploy has set kata-runtime=true, this function detects that the label is missing on all nodes and does not set it to cleanup, which results in the operator's uninstall daemonset having no nodes to run on and therefore never uninstalling kata at all.

I think it'd be safer to introduce a cc/status label that tracks the stage:

  • installing
  • ready (after kata-runtime=true)
  • uninstalling (when we tell the operator to uninstall; this would also serve as the uninstall daemon node selector)
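A tiny state machine sketch (Python, purely illustrative; the stage names are the ones proposed above, everything else is an assumption) shows how such a label would make illegal orderings detectable instead of silently racing:

```python
# Illustrative lifecycle for a hypothetical cc/status node label.
ALLOWED = {
    None: {"installing"},
    "installing": {"ready"},
    "ready": {"uninstalling"},
    "uninstalling": {None},  # label removed after postUninstall completes
}

def transition(current, new):
    # Refuse transitions the lifecycle does not allow, e.g. starting
    # postUninstall before the uninstall daemonset has finished.
    if new not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current!r} -> {new!r}")
    return new

state = transition(None, "installing")
state = transition(state, "ready")          # after kata-runtime=true
state = transition(state, "uninstalling")   # uninstall daemon node selector
```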


ldoktor commented Jan 14, 2025

Well, the kata-runtime=true label is only one part of the puzzle; the second (just confirmed) is repeated operator installs. From time to time the install fails and is re-executed, which results in multiple nydus-snapshotter imports and a broken cleanup.
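Re-execution would be harmless if the containerd config edit were idempotent, i.e. skipped entries that are already present. The following Python sketch illustrates that idea only: the operator's actual install scripts work differently, and the regex-based TOML handling here is a deliberate simplification.

```python
import re

ENTRY = "/etc/containerd/config.toml.d/nydus-snapshotter.toml"

def add_import(config_text, entry=ENTRY):
    """Add `entry` to the top-level imports list, doing nothing if it is
    already there, so a re-scheduled install pod cannot duplicate it."""
    m = re.search(r'^imports\s*=\s*\[(.*?)\]', config_text, re.M | re.S)
    if m is None:
        # no imports list yet: create one
        return f'imports = ["{entry}"]\n' + config_text
    current = re.findall(r'"([^"]*)"', m.group(1))
    if entry in current:
        return config_text            # idempotent: already present
    current.append(entry)
    new_list = "imports = [" + ", ".join(f'"{i}"' for i in current) + "]"
    return config_text[:m.start()] + new_list + config_text[m.end():]

cfg = add_import("version = 2\n")
cfg = add_import(cfg)                 # second run is a no-op
assert cfg.count(ENTRY) == 1
```

A symmetric check on the uninstall path (remove the entry only if present, and only drop the .toml file afterwards) would cover the dangling-import case reported above.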

ldoktor added a commit to ldoktor/coco-operator that referenced this issue Jan 14, 2025
The install pod can fail and be re-scheduled. Modify the containerd
setup to allow re-execution without breaking the setup.

Fixes: confidential-containers#481

Signed-off-by: Lukáš Doktor <[email protected]>

ldoktor commented Jan 14, 2025

@fidencio I went ahead and removed the sleep. It seems stable, but the solution is ugly. Please let me know whether you think my assumptions were correct and whether I should spend the cycles to introduce a proper solution, which in my eyes is to create a new operator label and use that one to start the uninstall, while keeping the kata label as the signal that the cleanup is done.

A demonstration which contains all the fixes is here: #483

With this PR my setup/uninstall time drops to about 200s (from about 300-600s; note I haven't used wait_for_stabilization, as I'm not getting restarts on my system, so the install finishes when the kata-qemu runtime is ready).

Note I'm still running a loop of tests to see whether it really resolves the issue, but so far it looks fine.


ldoktor commented Jan 14, 2025

Without waiting for stabilization it survived 20 iterations and then hung with a "cc-operator-daemon-install-XXXXX" pod running, so I re-added the wait_for_stabilization; I'm now on the 57th iteration and still going (using #483).


ldoktor commented Jan 15, 2025

So I interrupted it in the morning after 199 successful iterations, which had never happened since I joined this team (the most I had seen was 52 successful iterations before a hang). Note that again I only used the hacky #483; I'm not sure whether #482 is similarly stable (I briefly tested it for only about 10 iterations).

ldoktor added a commit to ldoktor/coco-operator that referenced this issue Jan 15, 2025
the install pod can fail and be re-scheduled. Modify the containerd
setup to allow re-execution without breaking the setup.

Fixes: confidential-containers#481

Signed-off-by: Lukáš Doktor <[email protected]>