High memory usage for verify --delete due to deletion occurring at end #1283

I had a PyPI mirror that had never had a run of `verify --delete`, so it had grown to around 25 TB. Initially, trying to run `verify --delete` exhausted all of my machine's memory. It only had 8 GB of RAM, but still, the algorithm should be able to delete during the run (and therefore use a relatively constant amount of memory regardless of the number of deletions needed) rather than building a list in memory and deleting everything at the end.

I was able to get `verify --delete` to finish with 64 GB of RAM, but I don't know how much memory it actually needed. Now the PyPI mirror is somewhere under 9.5 TB.

Comments
This is known; maybe we should document it more. 25 TB is pretty crazy; there must be a lot of nightlies or something that add up, as PyPI reports ~13.5 TB these days (https://pypi.org/stats/). We have to keep state of all the files found on the file system in order to work out which files to delete. That said, I would be open to adding a different way to save this state, or to trying to avoid it altogether. Some ideas I can think of (in order of preference):

All open to other ideas, and if we agree on something, PRs are totally welcome to make this better ...
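To make the trade-off concrete, here is a minimal sketch (not bandersnatch's actual code) of the two approaches discussed in this thread: collecting every on-disk path and diffing against metadata at the end, versus deleting during the walk. The `referenced` set of metadata-backed paths and both function names are illustrative assumptions.

```python
from pathlib import Path


def find_unreferenced(mirror_root: Path, referenced: set[Path]) -> set[Path]:
    # Collect-then-delete: every file path under the mirror is held in
    # memory at once, so peak RAM scales with the total on-disk file count.
    on_disk = {path for path in mirror_root.rglob("*") if path.is_file()}
    return on_disk - referenced


def delete_unreferenced_streaming(mirror_root: Path, referenced: set[Path]) -> None:
    # Streaming variant: delete as the walk proceeds, so only the
    # `referenced` set (needed for the membership test) stays resident,
    # no matter how many files end up being deleted.
    for path in mirror_root.rglob("*"):
        if path.is_file() and path not in referenced:
            path.unlink()
```

Note that even the streaming variant still keeps every referenced path in memory, which is the state the comment above refers to; it only avoids additionally accumulating the full on-disk listing and the pending-delete list.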
Thanks for the reply. I suppose it should not be a high-priority issue if you always run `verify --delete`.
I don't think that helps, from memory (I haven't looked at the code in a long time). But I believe we have to map the whole file system in order to see files that are there and no longer belong to any metadata ... It's a horrible algorithm, but it was the safest way to be accurate.

With the size of the mirror (both file count and bytes) these days, I think it's time we look into maybe adding deleted releases into metadata, plus look at whether we can slightly improve this using the yanked PEP(s). I wrote this before they existed.

The main complexity is that we follow the PyPI blob storage pathing. We don't "need" to do this, and we could move to just sharding via package name, similar to simple, which would then make deletes able to just walk the "projects" blob area and delete anything that is no longer referenced by metadata.

There are many ways to make this better; I'm just not sure which is best. Something like changing how we store the blobs would need to go into a 7.0 release ... It's a big change. We would probably need a tool to go and do the 100000000000 mv's to reorganize existing mirrors too. That would not be a cheap operation.
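A rough sketch of the sharding idea floated above, assuming a hypothetical per-project layout (`projects/<name>/...`) rather than bandersnatch's current PyPI blob pathing. The layout and the `load_referenced` helper, which would read one project's metadata on demand, are assumptions for illustration only.

```python
from pathlib import Path
from typing import Callable


def delete_per_project(
    projects_root: Path,
    load_referenced: Callable[[str], set[Path]],
) -> None:
    # Walk one project directory at a time. Each project's referenced-file
    # set is loaded on demand and discarded before the next project, so
    # peak memory is bounded by the largest single project rather than by
    # the whole mirror.
    for project_dir in sorted(projects_root.iterdir()):
        if not project_dir.is_dir():
            continue
        referenced = load_referenced(project_dir.name)
        for blob in project_dir.rglob("*"):
            if blob.is_file() and blob not in referenced:
                blob.unlink()
```

Under this scheme, a project with no remaining metadata would simply map to an empty set, so its entire blob directory gets cleaned up on the same pass.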