image pull errors #60
I have run into this problem too. I have also tried the
When I try to pull redis again, the error is
In addition, pulling via the client built with `cargo build --example async-client` gives:
Client connect to tcp://127.0.0.1:7788
Green Thread 1 - image.pull_image() started: 64.422µs
Green Thread 1 - pull_image -> Ok(image_ref: "docker.io/huaijin20191223/scratch-base:v1.8") ended: 4.964946274s
pull_image - docker.io/huaijin20191223/scratch-base:v1.8
@haosanzi Can you help capture the Occlum log when pulling redis? I can triage from the Occlum side.
@qzheng527 my logs (last 100 lines):
@mythi @haosanzi What is the sequence of the three below?
1,2,3 or 2,1,3? @mythi It would be better to capture the log with the "trace" log level.
2,1,3 as far as I can see from the code
there's not much more info (I cut some lines)
@mythi It is weird. The mount should have nothing to do with the application docker image. Why is it successful for hello-world but fails for nginx? Can you check whether they are all empty folders: target, upper, and lower? And
the trace log file is huge (1M lines ~140MB). Some potential errors:
All errors from the last 3000 lines:
For the first log info, I have two comments:
The
For the second log info, I have two comments:
Finally, it seems the enclave agent sometimes cannot recognize an image with a version tag such as
Hope these comments can help you @mythi.
@HaokunX-intel Thank you for your comment.
Thanks all! I will give these a try later today.
Good suggestions. I plan to draft a guideline and introduce some debug tools later.
I have these created but I can still see errors:
They remain empty.
We have this set in enclave-cc
Where are these stored? I've deleted the agent-enclave runtime dirs created by
This is not correct. Images with different names can still have layers with the same digest, e.g., the base layer.
OK, I will submit an issue about it so it gets fixed. We cannot ask users to change their deployments.
The problem is usually caused by a previous pull failure in the same enclave agent bundle. We need to "rebuild" the enclave agent bundle. The "rebuild" means running
Deleting only some of the dirs in the agent bundle may not solve the problem. Can you specify the names of these runtime dirs? I want to reproduce the problem locally.
@HaokunX-intel it looks
@mythi @HaokunX-intel The agent enclave is stateful. The previous pulling/unpacking stored the content in the Occlum unionfs. Instead of the `occlum new` operation (which I think is not applicable in a practical environment) to start a fresh agent enclave, why not copy the template agent enclave to a working dir every time (does the shim do this?) as a new, clean start?
The application (image-rs?) running in the agent enclave can (or should) cover these failures. Before doing a new pull/unpack, delete the content left by the previous attempt.
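As a rough illustration of that cleanup idea (this is not the actual enclave-agent or image-rs code; the paths and the helper name are hypothetical), the agent could wipe whatever a failed attempt left behind before starting a new pull:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Remove anything left behind by an earlier, failed pull attempt so the next
/// pull starts from a clean state. The directory names are hypothetical; the
/// real layout depends on how the agent stores data inside the Occlum unionfs.
fn clean_previous_attempt(bundle_dir: &Path, layer_dir: &Path) -> io::Result<()> {
    for dir in [bundle_dir, layer_dir] {
        if dir.exists() {
            // A leftover directory means the previous pull/unpack never finished.
            fs::remove_dir_all(dir)?;
        }
        fs::create_dir_all(dir)?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // Hypothetical per-image locations, used only for this sketch.
    let bundle_dir = Path::new("/run/enclave-agent/bundles/redis");
    let layer_dir = Path::new("/run/enclave-agent/layers/redis");
    clean_previous_attempt(bundle_dir, layer_dir)?;
    // ... the real pull/unpack (e.g. via image-rs) would run here ...
    Ok(())
}
```

The exact layout will differ, but the point is that a new pull should never start on top of a half-unpacked bundle.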
I totally agree it needs to be made more robust, but we should explore the options more closely. Based on this thread, my summary is that there are a few areas that need fixing:
@arronwy fyi in case this is an error propagation condition in image-rs |
The first issue is related to image/layer sharing; we need to check further with the image-rs community.
When the enclave agent gets an image pull request, it checks the container id (cid) first. Typically, if the request has a cid, the cid's pattern is
One solution is to replace the
We propose a PR to fix the bug.
Hi, a related error has been fixed in image-rs: confidential-containers/guest-components#76. Please try the newest rev of image-rs. Hope this helps!
It seems this issue is related to other problems.
And we pull the image
According to the code, we guess the image layer is cached regardless of whether the image is mounted successfully. We want to know whether image-rs plans to provide some mechanism like
We think these would make image-rs more user-friendly.
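The specific mechanism being asked about is elided above; purely as an illustration of the kind of cache control meant here (not an existing image-rs API), a sketch could look like:

```rust
use std::path::PathBuf;

/// Hypothetical layer-cache control surface, for illustration only;
/// image-rs does not necessarily expose anything like this today.
trait LayerCache {
    /// List the digests of layers currently kept in the cache.
    fn cached_layers(&self) -> Vec<String>;
    /// Drop a single cached layer, e.g. after a failed unpack.
    fn evict(&mut self, digest: &str) -> std::io::Result<()>;
    /// Drop every cached layer not referenced by a mounted bundle.
    fn prune_unreferenced(&mut self, mounted_bundles: &[PathBuf]) -> std::io::Result<usize>;
}
```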
cc @arronwy
Hi @HaokunX-intel, can you have a try with the latest
We rebuilt the agent.
The PR confidential-containers/guest-components#79 actually skips already-pulled layers.
However, it also skips the mount operation.
Currently, it looks like
The bundle is empty. Maybe we can log an info message on existing layers instead of erroring on them, so that the mount operation can still be executed. Or we can pass control to the application by providing some pull options.
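A minimal sketch of the "tolerate existing layers" option, using hypothetical helper names rather than the real image-rs internals: if a layer's unpack destination already exists, log it and reuse the directory so the later mount step still runs.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical record of one unpacked layer.
struct UnpackedLayer {
    digest: String,
    dir: PathBuf,
}

/// Sketch only: if the destination for this layer already exists (left over
/// from a previous pull, or shared with another image), reuse it and keep
/// going so the mount step can still assemble the bundle, instead of erroring.
fn unpack_or_reuse(digest: &str, dest: &Path) -> std::io::Result<UnpackedLayer> {
    if dest.exists() {
        println!("layer {digest} already present at {}, reusing it", dest.display());
    } else {
        std::fs::create_dir_all(dest)?;
        // ... unpack the layer tarball into `dest` here ...
    }
    Ok(UnpackedLayer {
        digest: digest.to_string(),
        dir: dest.to_path_buf(),
    })
}
```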
@arronwy helped me summarize the current problems.
I misunderstood the image-rs workflow. image-rs allows different images to share the same layers, and a cached layer won't be removed after the mount operation. But there is one thing we need to pay attention to: when we pull two images with the same layers in parallel, the unpack function will raise errors, because it cannot unpack layers sharing the same digest at the same time. We need to guarantee that the images are pulled one by one, as the kata agent does here.
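As a minimal sketch of that "one pull at a time" rule (assuming the tokio crate and a placeholder pull_image function, not the real kata-agent or image-rs code):

```rust
use std::sync::Arc;
use tokio::sync::Mutex;

/// Placeholder standing in for the real pull/unpack call into image-rs.
async fn pull_image(image: &str) -> Result<(), String> {
    println!("pulling {image}");
    Ok(())
}

#[tokio::main]
async fn main() {
    // One guard shared by all request handlers: concurrent PullImage requests
    // are serialized, so two images never unpack a shared layer (same digest)
    // at the same time.
    let pull_lock = Arc::new(Mutex::new(()));

    let mut tasks = Vec::new();
    for image in ["docker.io/library/redis:latest", "docker.io/library/nginx:latest"] {
        let lock = Arc::clone(&pull_lock);
        tasks.push(tokio::spawn(async move {
            let _guard = lock.lock().await; // held across the whole pull/unpack
            pull_image(image).await
        }));
    }
    for t in tasks {
        t.await.unwrap().unwrap();
    }
}
```

Holding one async mutex across the whole pull/unpack is the simplest fix; a per-digest lock would allow more parallelism but is harder to get right.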
confidential-containers/guest-components#79 fixes the second problem in most scenarios. When the agent runs continuously, previous failures on image pulling will not result in a deadlock. The exception is when we terminate the agent after a failed image pull and re-launch it to pull the same image. It is a strange case, but normally the agent won't be re-launched.
The third problem is fixed by #61. In conclusion, we still need to decide whether we accept the following features:
Isn't this a very likely scenario (e.g., with some Kubernetes controllers that retry pod re-creation on errors)?
If I understand correctly, the shim will create a brand-new agent bundle by mounting here upon pod re-creation. /cc @haosanzi
@HaokunX-intel but the current implementation seems to use the same stateful Occlum instance for all pods (via |
Traditionally, in the
Right, yes. How about the case where a pod restarts with the same image (and
Actually, each pod has an independent agent bundle. The runtime dir (
And the agent is bound to the sandbox in the same pod; we cannot restart the agent on an existing bundle, but we can start a new agent by starting a new sandbox. The mentioned case cannot happen. As @hairongchen mentioned in #18, part of the status and actions on the agent is undefined. I think it is out of the scope of this issue.
@HaokunX-intel sounds good, thanks. Can you update
The update is to remove the
And I find that issue #48 is importing the
We want to update to known
To fix this issue, the
The reason why the
The crate
We also need to leave open the possibility to use
Right. That is our final target: customers can choose different combinations. But currently the dependency chain only supports the combination
On the other side, it raises another problem: is our bundle an omniscient one or a tailored one? The omniscient bundle contains all dependencies and supports all combinations; customers can choose any combination without reinstalling the bundle. The tailored bundle only supports the features the customer chose previously, and if they want to select other combinations, they have to reinstall the bundle. If our bundle is omniscient, importing
I'm not yet ready with #48, but I want this issue verified and closed, so I'd propose we decouple
I agree with Haokun's analysis, and I'd suggest the following:
Thanks!
CoCo quickstart documentation uses the `bitnami/nginx:1.22.0` image as an example and I gave it a try. I'm seeing different image pull errors:
To debug this in more detail, I ran enclave-agent (`sudo runc run 123`) from the bundle with `OCCLUM_LOG_LEVEL=debug` and used the "async-client" to debug. This time I'm getting different errors:
The latter blocks me from investigating the former error in detail.