bandersnatch mirror completeness #1622
Comments
Howdy. Sorry to hear about your troubles. You've taken the brute-force approach to fixing your errors, but that is dedication (a 17-day verify ...). I haven't run a verify since PyPI was around 1 TB and have wondered if it's even sane to do anymore. I think step one is to see what error(s) you're hitting and work through them. Let's change the stop-on-error config option and do runs that report the actual errors you're hitting. stop-on-error = true Did your To get a |
Here are the bandersnatch operations that we have run lately:
The verify --delete --json-update log has 2296109 lines and 7313 ERROR: lines. 7283 are of the form:
The remaining 30 are of the form:
or of the form
Here are snippets of all of the reported errors that resulted in tracebacks during the verify op. Is this helpful?
# 2023-11-18_08:49:20 bandersnatch verify --delete --json-update
2023-11-18 08:49:21,344 INFO: Starting verify for /mirror/sites/PyPI with 3 workers (verify.py:252)
/SNIP/
2023-11-26 18:09:56,588 INFO: Fetching https://files.pythonhosted.org/packages/09/06/896687cc1c5098dc5bc6beaaf679a5f7564cb2afc2523f8c06d61e9b874f/fsleyes-1.4.1-py2.py3-none-any.whl (master.py:149)
/SNIP/
2023-11-26 18:43:32,980 INFO: Fetching https://files.pythonhosted.org/packages/e5/e1/254288af765910269ec6f9ea39e222c3d67de84617f79b1e63c4ba6a75c1/MeUtils-2023.11.20.13.42.41-py3-none-any.whl (master.py:149)
/SNIP/
2023-11-26 19:29:47,436 INFO: Fetching https://files.pythonhosted.org/packages/e8/e0/6b7668c4a41e2d129514321ad1343e99347771a6278085fd2e4ee4b5ff81/deepforest-1.2.2-py3-none-any.whl (master.py:149)
/SNIP/
2023-11-26 19:57:14,817 INFO: Fetching https://files.pythonhosted.org/packages/2f/f4/97bd5e9d29f404b1ebbf33877b90a20f42a33554e2aa277922432395b397/unitem-1.2.6-py2.py3-none-any.whl (master.py:149)
/SNIP/
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 155, in url_fetch
/SNIP/
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client.py", line 544, in _request
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
/SNIP/
2023-11-30 13:38:14,897 INFO: Fetching https://pypi.org/pypi/datafilter/json (master.py:149)
/SNIP/
2023-11-30 13:38:28,729 ERROR: Error syncing package: setuptools-cython (verify.py:38)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
|
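For a log with millions of lines, tallying the ERROR lines by kind can make it easier to see which failures dominate. Below is a minimal Python sketch. It assumes every log line follows the "date time,ms LEVEL: message (file.py:line)" shape seen in the snippets above; the function name tally_errors is hypothetical and is not part of bandersnatch.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed line shape, taken from the snippets in this thread, e.g.:
# 2023-11-30 13:38:28,729 ERROR: Error syncing package: setuptools-cython (verify.py:38)
LOG_LINE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} (?P<level>[A-Z]+): (?P<message>.*)$"
)


def tally_errors(lines: Iterable[str]) -> Counter:
    """Count ERROR lines, grouped by the message text before its first colon."""
    counts: Counter = Counter()
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match is None or match.group("level") != "ERROR":
            continue  # tracebacks, INFO lines, and anything unparseable are ignored
        kind = match.group("message").split(":", 1)[0].strip()
        counts[kind] += 1
    return counts
```

Passing an open log file to tally_errors and printing counts.most_common() would list the error kinds from most to least frequent.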
So the verify seems to be getting a lot of connection errors and timeouts. What kind of internet connection are you running bandersnatch on? I think it would be good to go slower and reduce these timeouts and errors before we worry about your consistency ... Have you tried 2 or 1 workers to see if you get fewer timeouts? workers = 2 Maybe the default timeout of 10 seconds isn't enough either? This all depends on the connection you're on, but it should be timeout = 10
If you could try and sync with that + enable stop on error (as suggested above) and do a run I'd be interested to see what you hit.
|
I started a bandersnatch mirror job last night at 8 pm. It finished at 1 pm today with a zero exit status. The mirror grew by 59.9G. There were 2314 Fetching metadata lines in the 15788 line logfile and no ERROR lines. There are 11142 Downloading lines in the log. Our package listing increased by 974 to a total of 372688. The last-modified date is 20231205T03:43:56. Our internet connection throttles down to 48 Mbps after initial bursts of 200+Mbps. Since there were no errors or timeouts, why did it complete with only 372688 total packages present on our mirror? Will --debug mirror be helpful when there are no timeouts or errors? First Fetching: First Downloading: Last Fetching: Last Downloading (and last lines in logfile): |
I've run bandersnatch mirror several times without error, but it only seems to fetch a few hundred projects on each run. I now have 374680 out of the 500,508 projects listed on pypi.org. I just started up a new run and the todo file only had 7276 entries. Since I'm not getting errors or timeouts at this point, what can I do to address the consistency? Thanks! |
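To see which projects are missing, and not only how many, the project names can be compared directly. The sketch below assumes you have already collected two lists of names yourself, for example from the upstream simple index and from the directory names under web/simple on the mirror. The function names are hypothetical and not part of bandersnatch. Names are normalized as described in PEP 503 before they are compared.

```python
import re
from typing import Iterable, List


def normalize(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def missing_projects(upstream: Iterable[str], local: Iterable[str]) -> List[str]:
    """Return the normalized upstream names that have no local counterpart."""
    local_names = {normalize(name) for name in local}
    upstream_names = {normalize(name) for name in upstream}
    return sorted(upstream_names - local_names)
```

Individual names from the result could then be retried one at a time with bandersnatch sync, as was done for poetry in the original report.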
Sadly, the only options now are very expensive. They are:
|
Thanks for that clarification! I'll do the force-check and if I start getting errors or timeouts, I'll start the debug process you outlined above. |
One thing that I noticed is that web/simple/index.html is not updated as packages are synced with Here are some statistics with
I have high hopes that if bandersnatch finishes gracefully, that web/simple/index.html will have a lot more hrefs. And if not, I can write a tool to regenerate it. |
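A regeneration tool along those lines might look like the sketch below. It rests on several assumptions: the mirror uses the flat web/simple/<project>/ layout (not a hashed index), each project directory name is already the name to link to, and a minimal PEP 503 style page is acceptable. The output is not guaranteed to match what bandersnatch itself writes, and both function names are hypothetical.

```python
import html
from pathlib import Path


def build_simple_index(simple_dir: Path) -> str:
    """Build a minimal simple-index page from the project directories on disk."""
    names = sorted(entry.name for entry in simple_dir.iterdir() if entry.is_dir())
    links = "\n".join(
        f'    <a href="{html.escape(name, quote=True)}/">{html.escape(name)}</a><br/>'
        for name in names
    )
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        "  <head>\n"
        "    <title>Simple Index</title>\n"
        "  </head>\n"
        "  <body>\n"
        f"{links}\n"
        "  </body>\n"
        "</html>\n"
    )


def write_simple_index(simple_dir: Path) -> Path:
    """Write index.html via a temporary file so readers never see a partial page."""
    target = simple_dir / "index.html"
    temporary = simple_dir / "index.html.tmp"
    temporary.write_text(build_simple_index(simple_dir), encoding="utf-8")
    temporary.replace(target)  # rename over the old file in one step
    return target
```

Calling write_simple_index(Path("/mirror/sites/PyPI/web/simple")) would rewrite the index for the mirror path that appears in the logs above. Keep a backup of the existing file first, since a later bandersnatch run will overwrite it anyway.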
Yeah, sadly, index.html is generated at the end of the run. Since the mirror is getting so big these days, I'd happily take a PR to periodically write out the global index.html during a run ... But it would have to be enabled by a config var with the default off I feel. |
The Here are the stats from the todo and logfile. I'll try doing a normal TODO=14812 ERROR=4, Fetching=501089, Storing=486277, Download=2059037 The final log entries are:
The two additional errors are filename too long errors:
|
Ahh, the long-name problem. We've discussed this in #1228, and I feel we should maybe soft error: report that we're skipping this package due to file system limitations, then skip it. But I also get that this is not explicit and evil. Maybe it should be a config option that the owner(s) of this bandersnatch instance can choose. As stated elsewhere, I'd accept this PR. Ideally we need PyPI to not allow package names this long. |
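The "report and skip" idea can be sketched as follows. This is not bandersnatch code, and the function name and the limit constant are hypothetical. The 255-byte default is an assumption based on the file name limit of common Linux file systems such as ext4 and XFS; other file systems may differ, so the limit is a parameter.

```python
import logging
from typing import Iterable, List, Tuple

logger = logging.getLogger(__name__)

# Assumption: 255 bytes per file name, the usual limit on ext4 and XFS.
NAME_MAX_BYTES = 255


def split_by_name_length(
    filenames: Iterable[str], limit: int = NAME_MAX_BYTES
) -> Tuple[List[str], List[str]]:
    """Split file names into (storable, skipped) by their encoded byte length."""
    storable: List[str] = []
    skipped: List[str] = []
    for name in filenames:
        size = len(name.encode("utf-8"))  # limits apply to bytes, not characters
        if size > limit:
            logger.warning("Skipping %r: name is %d bytes, limit is %d", name, size, limit)
            skipped.append(name)
        else:
            storable.append(name)
    return storable, skipped
```

Whether skipped files should fail the run or only be reported is the open design question in the comment above; this sketch only reports them.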
Another run of
That's pretty close to the 503,186 projects reported on pypi.org. I'm happy. |
Thanks so much for providing a means to mirror the PyPI repository!
After our latest run of bandersnatch mirror followed by bandersnatch verify --delete --json-update, our mirror is 13.3 TB in size. It was 17.7 TB before we ran the verify --delete operation. We found that some packages were not being updated after many runs of bandersnatch mirror. One such package was poetry. We got it to update with bandersnatch sync poetry before we ran the verify --delete operation.
We are running bandersnatch 6.3.0 and python 3.5.8 and the latest verify operation took 17 days to complete and had a zero exit code. Our mirror appears incomplete compared to the stats reported on pypi.org. How can we assess the completeness of our mirror?
On our local mirror, web/simple/index.html has 371694 . web/simple has 372444 directories and web/json has 357233 directories. The bandersnatch log reports that 1,049,164 files were fetched. https://pypi.org reports 498,484 projects and https://pypi.org/stats reports the total mirror size of 18.2 TB.
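Counts like these can be collected with a short script. The sketch below is one rough way to do it. It assumes the index page lists each project as one anchor with an href, and that the upstream project count is typed in by hand from pypi.org. All names are hypothetical and nothing here comes from bandersnatch itself.

```python
from html.parser import HTMLParser
from pathlib import Path


class _LinkCounter(HTMLParser):
    """Count <a> tags that carry an href attribute."""

    def __init__(self) -> None:
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(key == "href" for key, _ in attrs):
            self.count += 1


def count_index_links(index_html: str) -> int:
    """Return the number of project links in a simple-index page."""
    parser = _LinkCounter()
    parser.feed(index_html)
    parser.close()
    return parser.count


def count_subdirectories(directory: Path) -> int:
    """Return the number of immediate subdirectories of a directory."""
    return sum(1 for entry in directory.iterdir() if entry.is_dir())


def coverage(local_count: int, upstream_count: int) -> float:
    """Return the fraction of upstream projects that are present locally."""
    return local_count / upstream_count
```

With the figures above, coverage(371694, 498484) comes out at roughly three quarters. That compares counts only, so it says nothing about whether the files inside each project are complete.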
/etc/bandersnatch.conf: