Upgrade to latest musl 1.2.5 (via 1.1.24) #2251
Conversation
Awesome! Great to get musl unpinned and up to date, thanks!
I'm pretty sure the TCP test failure is unrelated - it fails during heap initialization, before libc is initialized. Incidentally, changing the musl patch to make
You can bring this in if you'd like, or we can discuss it in a separate PR. I think this is a great opportunity to maybe find an old ghost; I'll try to find some time to investigate.
Here's a backtrace, for reference:
I've opened a bug for this here: #2252 and will try to chase it down from there.
Would you like to keep this open while you find the bug, or should we merge? If you're (relatively) confident it's unrelated to the updated musl version, I think we could merge this. It would be nice to track it down before any new release, though.
We don't need the TCP test to work, but I do think we need the SMP test to pass with all CPUs actually active, since there might be changes to how TLS is initialized in the new libc. I hadn't noticed before, but when running the SMP test with the command I posted above it detects 4 CPUs but doesn't actually bring them online. So we need to:
As a point of reference, the last time I saw all APs coming online (although I didn't test them) was in the
Note that this is after a couple of reboots, so that particular image has problems - possibly related to #2252. I'm not able to reproduce that with
Steps to reproduce the last known working SMP setup:
libc needs to enable locking for SMP to work properly, and in musl that only happens when threads > 0. See: https://github.com/includeos/IncludeOS/blob/master/src/kernel/threads.cpp#L19-L23 and https://github.com/includeos/IncludeOS/blob/master/src/kernel/threads.cpp#L60-L70 This means supporting an arbitrary musl version is not straightforward: we have to inspect the internal struct for each version in order to set the current thread count, and also to enable locking.
Yes, there's definitely an SMP issue here. I've added a few more commits to make it easier to debug:
It's now easy to reproduce with
With
And with
I've experimented a bit more with this, but no solution yet. I'll include some notes below. @fwsGonzo do you think changes are required for TLS/threads setup in addition to enabling locks in the internal __libc struct? The code you linked uses the new threads code, which is not (yet) available in v15/16. In 57bdaf7 I updated the
With
If I add more cores (e.g.
I see a similar crash with
Testing with v0.15.x would require a working conan setup, which unfortunately I don't have at the moment.
I rebased on
step 5.9: nix-shell --argstr unikernel ./test/kernel/integration/smp/ --run ./test.py
With
step 5.10: nix-shell --argstr unikernel ./test/kernel/integration/term/ --run ./test.py

It seems that these failures are related to the new malloc implementation in musl. Compiling with

With
step 5.9: nix-shell --argstr unikernel ./test/kernel/integration/smp/ --run ./test.py

Source for

I've pushed a new commit to use
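The exact build flags used in the commit are elided above; for reference, upstream musl (since 1.2.1) selects the allocator at configure time, so switching back to the legacy allocator would look roughly like this (a sketch of musl's own build option, not necessarily how this PR wires it into the IncludeOS build):

```shell
# musl >= 1.2.1: "mallocng" is the default allocator;
# "oldmalloc" restores the pre-1.2.1 legacy allocator.
./configure --with-malloc=oldmalloc
make && make install
```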
That's interesting! I'm noticing that they have different locking mechanics, at least on the surface. oldmalloc calls
It also calls
The new mallocng calls
So these look very different, apart from the lock implementation being separated out in the new malloc, while the old one had them defined in
Going back to the callstack I posted above, we see that
This is before libc is initialized, when we're setting things up to call into it for the first time. So it's possible that the new malloc has a dependency on some state that the old one didn't have, and that is not ready before libc is initialized, causing one of the asserts in
The bufstore test only fails when spinlocks are used - both scoped_spinlock and regular lock/unlock spinlocks will crash. Removing the alignment here fixes the issue: https://github.com/MagnusS/IncludeOS/blob/musl1-2-5/api/smp_utils#L28

If I add the alignment in the BufferStore class' private section instead with

Should we remove the alignment from the spinlock_t typedef, or could there be other ways to fix this? How much do we gain by always aligning spinlock_t? From the discussion here I'm not sure if
Modern Linux has some new system calls that may not be implemented. I don't know if that's related here, but a few that I can think of: clone3
I've never really understood how to implement futex properly, and I suspect it's just not working right. Any code that leads to futex should be assumed to be problematic with newer libcs.
- Copy patches from github.com/IncludeOS/musl, commit 4f331b7.
- Copy src/musl/nanosleep.cpp with clock_nanosleep from commit 9ceff78 in this repo.

Original author of both patches is fwsGonzo.
adds getrandom + stubs for fchdir, getrusage, sched_setaffinity, sched_setscheduler, sigaltstack
- Stub faccessat, faccessat2, fchmodat2, pwritev2, preadv2, statx.
- Include time64 versions of calls to time functions. As it looks like IncludeOS is already using 64-bit time on 32-bit, the *_time64 calls just redirect to the existing calls. On 64-bit systems there should be no change, as the regular calls are already 64-bit.
This makes crash loop detection easier
Missing RDSEED/RDRAND will only emit a warning, not terminate
Enable SMP by default.
I rebased on master and fixed the stack trace test to work with newer musl. With the latest fixes for SMP and the early calls into libc, the tests are now passing. I tested both with SMP enabled and disabled.
Fantastic news! I'm adding cores to a few more integration tests to get a run where they not only have SMP enabled but also many cores to wake up. Will get back with a review tomorrow.
Couldn't wait 😀 The extended test set from #2276 passes without issues. Everything is indeed awesome 💪 The only lingering question for me is why we need to use the old malloc implementation - I couldn't find any good reason for it when reading the code, but we can get to that later. I've spawned an issue for it here: #2277.
This upgrades to musl 1.2.5, the latest release.
The update has been tested by building the chainloader and booting the example.
The upgrade is done via 1.1.24, using partial patches from the pthread branch (originally written by @fwsGonzo). See commit messages for more details.
musl has introduced _time64 versions of several functions to provide 64-bit time for 32-bit builds without breaking the existing API. It looks like IncludeOS is using 64-bit values already, so I have just kept the existing versions in IncludeOS. A read-through by someone more familiar with the IncludeOS time code would probably be useful.