Corrupted containerd/config.toml after several operator install/uninstall iterations #481
This time it survived 8 iterations, but then it corrupted the `imports` again, this time with multiple duplicate entries. After two more iterations I was left without the entry at all.
There is one more thing I noticed which troubles me a bit and might be slightly related. Please let me know whether I should create a separate issue about it, track it here, or simply ignore it. The events output shows errors about `containerd.sock`, which is logical as we are restarting `containerd`. After this there is no visible activity for a long time, and usually all pods are removed within a few minutes. So I guess it either retries or simply wipes things ungracefully after a timeout. It would perhaps be nice to include some kind of barrier or try/catch to retry once `containerd` is up again. Anyway, Go is still quite new to me and it's hard for me to find the places where these things happen; perhaps it's already handled somehow and this is a meaningless warning. I'm just mentioning it here since it's related to the `containerd` handling.

Full output of a successful uninstall:
For comparison, full output of an unsuccessful uninstall where the containerd config got corrupted:
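The barrier/retry suggested above might be sketched as a small helper like the one below. This is only a sketch under my assumptions: the timeout value, and probing containerd with `ctr version`, are invented for illustration and are not the operator's actual code.

```shell
#!/bin/sh
# Sketch of a "wait until containerd is back" barrier, so cleanup does
# not give up (or get wiped after a timeout) while containerd restarts.
wait_for() {
    # wait_for <tries> <command...>: retry the command once per second
    # until it succeeds or <tries> attempts have failed.
    tries="$1"; shift
    n=0
    until "$@" >/dev/null 2>&1; do
        n=$((n + 1))
        if [ "$n" -ge "$tries" ]; then
            return 1
        fi
        sleep 1
    done
    return 0
}

# Hypothetical barrier before touching pods again after restarting containerd:
# wait_for 60 ctr --address /run/containerd/containerd.sock version
```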
Hello @fidencio, I went ahead and tried to make sense of this, and IIUC the main problem here is that the operator reuses the kata-deploy node label to drive both the install and the cleanup. To confirm my assumption I tried a modified kata-deploy which sets the label differently. As for the proper fix, I'm very new to this area, but I think the operator should introduce its own label to be used as the node selector for the cleanup daemonset.
Btw, I think it'd be safer to introduce a dedicated label for this.
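The label-based proposal above could look roughly like the flow below. The `cc-operator/cleanup` label name and the `kubectl` stub are my own invention, purely to illustrate the idea of a separate trigger label while the kata label remains the "cleanup done" signal.

```shell
#!/bin/sh
# Sketch of the proposed two-label flow (label names are assumptions).
kubectl() { echo "kubectl $*"; }  # stub so the sketch runs without a cluster

start_cleanup() {
    # Operator marks the node; the cleanup daemonset selects on this label,
    # so cleanup no longer depends on the kata label being present.
    kubectl label node "$1" cc-operator/cleanup=true
}

finish_cleanup() {
    # Cleanup removes the kata label (the completion signal) and the
    # operator then drops its own trigger label.
    kubectl label node "$1" katacontainers.io/kata-runtime-
    kubectl label node "$1" cc-operator/cleanup-
}
```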
the install pod can fail and be re-scheduled. Modify the containerd setup to allow re-execution without breaking the setup.

Fixes: confidential-containers#481

Signed-off-by: Lukáš Doktor <[email protected]>
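The fix described in the commit message might be sketched like this: make the config edit a no-op when it has already been applied, so a rescheduled install pod cannot duplicate or clobber the `imports` entry. The paths and the naive single-line `sed` are assumptions for illustration, not the actual patch.

```shell
#!/bin/sh
# Sketch of an idempotent containerd config edit.
CONF="${CONF:-/etc/containerd/config.toml}"
ENTRY="/etc/containerd/config.toml.d/nydus-snapshotter.toml"

add_import() {
    # Already listed? Then re-execution changes nothing.
    if grep -q "$ENTRY" "$CONF" 2>/dev/null; then
        return 0
    fi
    if grep -q '^imports' "$CONF" 2>/dev/null; then
        # Prepend to an existing single-line imports list (naive sed).
        sed -i "s|^imports = \[|imports = [\"$ENTRY\", |" "$CONF"
    else
        # No imports line yet: create one.
        printf 'imports = ["%s"]\n' "$ENTRY" >>"$CONF"
    fi
}
```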
@fidencio I went ahead and removed the sleep. It seems stable, but the solution is ugly. Please let me know if you think my assumptions were correct and whether I should spend the cycles to introduce a proper solution, which in my eyes is to create a new operator label and use that one to start the uninstall, while keeping the kata label as the signal that the cleanup is done. A demonstration which contains all the fixes is here: #483

With this PR my setup/uninstall speeds up to about 200s (from about 300-600s; note I haven't used wait_for_stabilization, as I'm not getting restarts on my system, so the install finishes when the kata-qemu runtime is ready). Note I'm still running a loop of tests to see whether it really resolves the issue, but so far it looks fine.
Without waiting for stabilization it survived 20 iterations and then hung with the "cc-operator-daemon-install-XXXXX" pod running, so I re-added the wait-for-stabilization and am on the 57th iteration and still going (using #483).
So I interrupted it in the morning after 199 successful iterations, which never happened since I joined this team (the max I had seen was 52 successful iterations before a hang). Anyway, again I only used the hacky #483; not sure whether #482 is similarly stable (I briefly tested it for about 10 iterations only).
Describe the bug

During coco/operator installation the `/etc/containerd/config.toml` is modified (at least) 2 times. When I try running a loop of install/stabilize/uninstall I always quite quickly (2-6 iterations) end up with a NotReady node. After evaluation I noticed the `containerd` service is not running, always complaining about a broken config file. Once it was `imports = [ , ]`, next time it was `imports = ["/etc/containerd/config.toml.d/nydus-snapshotter.toml"]` while the `nydus-snapshotter.toml` file was missing.

To Reproduce
Steps to reproduce the behavior:
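A minimal sketch of such an install/stabilize/uninstall loop; the actual operator commands are left as placeholder comments, since they are assumptions rather than the exact steps used:

```shell
#!/bin/sh
# Hedged sketch of the reproduction loop described above.
run_iteration() {
    i="$1"
    echo "iteration $i: install"
    # kubectl apply -k <operator manifests>   (placeholder, not the exact command)
    echo "iteration $i: wait for stabilization"
    # ... wait until all operator pods are Ready and restart counts settle ...
    echo "iteration $i: uninstall"
    # kubectl delete -k <operator manifests>  (placeholder, not the exact command)
}

for i in $(seq 1 3); do
    run_iteration "$i"
done
```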
Describe the results you expected
It should run forever with no errors.
Describe the results you received:

After 2-6 iterations I'm getting timeout errors, `kubectl get nodes` reports the node as NotReady, and `systemctl status containerd` shows the service as down.

Additional context

Using upstream coco/operator e7d4946 on a fresh (VM) Ubuntu 22.04.