Improve JSON Loading Performance #16410
Comments
The problem is somewhat twofold. First is the parser speed, as you mention, but second is the GC impact. The end result of the JSON parse doesn't just add a few thousand objects; it has a footprint of hundreds of thousands of objects. This makes subsequent major GC sweeps quite slow. We can tweak the GC a bit to reduce the frequency of sweeps, but ultimately having that large an object footprint is not ideal.
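As a rough illustration of how that footprint can be measured (a sketch, not Homebrew code; the file path is just an example), comparing GC.stat before and after a parse shows both the slot growth and how often major sweeps have run:

require "json"

text = File.read("formula.jws.json") # example path, any large API JSON
before = GC.stat
parsed = JSON.parse(text)
after = GC.stat

puts "live heap slots added: #{after[:heap_live_slots] - before[:heap_live_slots]}"
puts "major GC runs so far:  #{after[:major_gc_count]}"

# Tuning happens via environment variables read at interpreter boot, e.g.
# RUBY_GC_HEAP_GROWTH_FACTOR or RUBY_GC_HEAP_FREE_SLOTS, trading memory
# headroom for fewer sweeps.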
I've looked at alternative parsers. We do have the problem of native gems here. Realistically it's difficult to enforce native gem installs as there's a million things that can go wrong, as we saw when trying to enable Bootsnap for all and as I continue to sometimes see from an occasional discussion post (e.g. some people have OOM issues). The thing with alternative parsers however, particularly since the default JSON gem isn't that bad, is that they can be optional: we can bundle one with Portable Ruby, which 99% of users use, use it if available and fall back if not. This is trivial to do.
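A minimal sketch of the optional-parser idea (oj is used here only as an example of a vendored native parser; the constant and method names are illustrative, not existing Homebrew API):

begin
  require "oj"
  FAST_JSON_AVAILABLE = true
rescue LoadError
  require "json"
  FAST_JSON_AVAILABLE = false
end

def parse_api_json(text)
  if FAST_JSON_AVAILABLE
    # :strict mode behaves like JSON.parse: plain Hashes/Arrays, string keys.
    Oj.load(text, mode: :strict)
  else
    JSON.parse(text)
  end
end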
There is definitely a difference in speed between the two passes. The first pass is much cheaper than the second in both time taken and objects created. With that said, I do acknowledge it's not ideal. The payload being a string was intentional given we were following the JWS standard, which defines the payload to be a string. Technically we could roll our own format and not follow the standard that closely. I rolled out the feature very quickly (developed largely in one day), so it was useful at the time to point to the standard in PRs as proof of something that is used and tested in real-world applications rather than being something I pulled out of a hat. It can definitely be compressed to remove spaces however. At the time that was awkward due to Jekyll, but it has been made easier since we now control the build more and use v4.

A bigger change is potentially even something not-JSON - in particular something that doesn't require parsing the whole file, by allowing us to seek to the relevant section for a particular formula. Such a format may solve most problems as we'd be parsing only a part of the data, which has the potential to be cheaper in both time taken and in the number of objects created. This isn't necessarily easy, and there are several other factors such as whether it's possible to do so with reasonable speed in pure Ruby or if a native gem is necessary. But it's an option at least that I've reflected on a bit. |
I hadn't realized this but in hindsight it seems obvious. That reminds me that it'd be nice to have memory profiling added to
That makes sense to me though I'm not sure how complex that bundling would be. At least
Very true. This shows up clearly in the profiling output.
Do you have an existing serialization format in mind here? I can't think of any off-the-shelf format that fits those characteristics, though I agree it would obviously be nice to be able to load just one formula or cask without having to load the entire file into memory. |
I did try this at some point but never sent a PR with it. Might take a look again. The garbage collector bar in
Ruby does have its own gem bundling system that we could piggyback off of by adding a line to https://github.com/ruby/ruby/blob/master/gems/bundled_gems. There's also the
Not currently, because I've not researched it much - I was talking conceptually. In theory we could have a TOC listing all formula names and their offsets that can be read very quickly, followed by the actual payload. I know stuff like protobuf has lazy properties, but I'm not suggesting that's necessarily the perfect fit. It's an option, but there's maybe not too much appetite for change this soon. |
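To make the TOC-and-offsets idea concrete, one hypothetical layout (purely a sketch, not an agreed format) is a length-prefixed JSON table of contents followed by the per-formula payloads, so a single formula can be read without parsing the rest:

require "json"

# Hypothetical layout: [4-byte TOC length][TOC JSON][payload bytes...]
def read_one_formula(path, name)
  File.open(path, "rb") do |f|
    toc_len = f.read(4).unpack1("N")
    toc = JSON.parse(f.read(toc_len)) # e.g. { "wget" => [offset, length], ... }
    offset, length = toc.fetch(name)
    f.seek(4 + toc_len + offset)
    JSON.parse(f.read(length)) # only this formula's JSON is ever parsed
  end
end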
We should do this regardless. There are a lot of keys that are literally never used by us yet are downloaded by everyone.
This seems like a good idea.
If possible: this seems like a good/nice idea.
We can't/shouldn't really use native gems for anything except developer commands.
I wonder about making the payload e.g. Gzipped data?
I don't really have an appetite for this until we try the above ideas and find them insufficient. I still think the biggest impact will come from reducing the number of keys we actually include in the JSON payload that we use. |
It can be, though the JSON parse of the actual payload is the slower bit and this won't really help there - only the initial parse. |
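For reference, the gzipped-payload idea would look roughly like this (a sketch; all_formulae_hash stands in for the data, and the base64 wrapping is only there to keep the envelope text-safe):

require "zlib"
require "base64"
require "json"

# Writing side: gzip the serialized payload before embedding it in the envelope.
payload = JSON.generate(all_formulae_hash) # hypothetical data
compressed = Base64.strict_encode64(Zlib.gzip(payload))

# Reading side: decode and gunzip, then do the (still expensive) JSON parse.
restored = JSON.parse(Zlib.gunzip(Base64.strict_decode64(compressed)))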
I assumed that the reason we don't include native gems outside of dev commands was to reduce install time when they need to be updated and avoid problems where native gems refuse to build on some systems. If we essentially roll those gems into the Ruby 3.1 bottle that we're shipping to everyone already, that should solve most of those problems, right? I agree that it should probably not be the first option though since it adds more complexity. |
Thoughts on Reducing JSON Size

Are we okay with having two different JSON representations of casks/formulae? One would be for loading packages internally and the other for

We could probably remove all top-level hash keys that point to blank data. This would be more automatic. At a glance, it looks like most hash keys that we would expect not to be used would be blank anyway like

Maybe we should store the package data as a hash instead of an array. Internally, we turn the array of package info hashes into a hash of package name to package info every time we load it into memory. I doubt that's slow, but it would require less fiddling. |
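For the array-vs-hash point, the conversion currently done on every load is roughly the following, so shipping the keyed shape directly would skip that step (illustrative data only):

# Shipped today: an array of package hashes.
packages = [{ "name" => "wget", "desc" => "Internet file retriever" }]

# Rebuilt on every load: a lookup hash keyed by package name.
by_name = packages.to_h { |pkg| [pkg["name"], pkg] }
by_name["wget"] # => { "name" => "wget", "desc" => "Internet file retriever" }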
It would but that will likely dramatically increase how often we need to update this bottle (and users need to download it).
Not just OK but: this seems actively desirable.
Seems like a great idea 👍🏻
Also a good idea 👍🏻 |
That's what I hadn't considered. Good point! To add a new slimmer JSON representation we'd need to...
|
It doesn't have to. The update cadences for

We already don't automatically ship new Portable Rubies when a dependency updates, and arguably things like OpenSSL are more important than any gem update (though we only use OpenSSL at a very basic level, so most vulnerabilities have largely not been a concern).

(On a related note: locally I have put together autobumping for the Portable Ruby repository, so expect to see a PR for that this month. That will mean the whole repository is 100% automated - except for updates between minor versions, e.g. 3.1 -> 3.2, and hitting the release button.) |
Makes sense to me 👍🏻
Ok, that's fair 👍🏻
🎉 looking forward to it. Would be nice for this to be a trivial, automated process. |
I don't have much to add here but:
|
From quick tests now, while Vendoring |
To demonstrate the object thing. A
(Exact values may vary from run to run.) So JSON data accounts for about 72.5% of all Ruby objects created in a

Forcing a GC (several times to make sure) frees up about 8800 slots after a formula JSON load and 14500 after a cask JSON load. Disabling the GC on its own (for demonstration purposes) saves ~0.1s, though creating the objects themselves probably has overhead too. Again not night and day, but these things quickly add up and pair everything together (even

There's not much to take from this comment - just following up on earlier theories. It will be interesting to check back and see which areas improved from trying the various ideas (e.g. JSON slimming), to give an indication of the type of things that are helping more than others. For example, I suspect the whitespace stripping is probably not significant based on the data so far, but we should do it anyway as it's good practice. |
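For anyone wanting to reproduce these numbers, the per-parse allocation count can be grabbed with something like this (a sketch; the path is illustrative):

require "json"

text = File.read("formula.jws.json")
GC.start
before = GC.stat(:total_allocated_objects)
parsed = JSON.parse(text)
after = GC.stat(:total_allocated_objects)
puts "objects allocated by this parse: #{after - before}"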
Why not split the JSON up into one file per formula or cask? (This may be something to consider further later on down the line, though, depending on how invasive of a refactor or not that might be right now.) |
It's a bit of a trade-off. It would certainly be slower to download, but if you mean instead a single download that is split afterwards then maybe, though the extra write I/O cost may be notable. |
(Nods.) One obvious option would be a tarball. |
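As a rough sketch of the "download once, split afterwards" variant (file layout and helper names are hypothetical; a tarball would just change the container), loading a single formula then only touches its own small file:

require "json"
require "fileutils"

# Split one big downloaded JSON payload into per-formula files in a cache dir.
def split_api_json(all_formulae, dir)
  FileUtils.mkdir_p(dir)
  all_formulae.each do |name, hash|
    File.write(File.join(dir, "#{name}.json"), JSON.generate(hash))
  end
end

# Loading a single formula only reads and parses its own small file.
def load_one_formula(dir, name)
  JSON.parse(File.read(File.join(dir, "#{name}.json")))
end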
Yeah, that should definitely be an option. I took a quick stab at seeing how much smaller the formula JSON would get by just removing blank values and unused keys and it made a big difference. This, of course, hasn't been tested yet though so it's possible that some of these are still needed.

IGNORABLE_KEYS = %w[
  oldname
  installed
  linked_keg
  pinned
  outdated
  keg_only
  disabled
  deprecated
].freeze

original = Homebrew::API::Formula.all_formulae
File.write "original.json", JSON.generate(original)

slim = original.transform_values do |hash|
  hash = hash
    .reject { |key, _| IGNORABLE_KEYS.include?(key) }
    .reject { |_, value| value.blank? }

  files = hash.dig("bottle", "stable", "files").to_h do |key, hash|
    [key, hash.slice("cellar", "sha256")] # Ignore unused "url"
  end

  hash["bottle"]["stable"]["files"] = files
  hash
end

File.write "slim.json", JSON.generate(slim)
I was surprised at how much space was taken up by the bottle urls which aren't used at all. That was around 6 megabytes by itself.
Edit: We could also potentially save another half a megabyte by only including the tap info at the top level of the JSON instead of inside each individual formula definition.

Edit 2: I found that recursively removing blank elements from the formula hashes didn't result in a significantly smaller file. |
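The recursive blank-removal mentioned in Edit 2 would look something like this (a sketch with a hypothetical helper name; it deliberately avoids ActiveSupport's blank? so it stays self-contained):

def deep_compact_blank(value)
  case value
  when Hash
    value.each_with_object({}) do |(k, v), out|
      v = deep_compact_blank(v)
      out[k] = v unless v.nil? || v == "" || v == [] || v == {}
    end
  when Array
    value.map { |v| deep_compact_blank(v) }
  else
    value
  end
end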
Assuming 7 bottles per formula that's 7 * ~7000 ≈ 50k string objects (probably more from removing the key too) so yeah it might be noticeable on the Ruby side too |
It may not slim the file but it will likely reduce the number of required allocations. |
I was afraid of that.
I think that sounds like a good idea. |
Looks like that's about the absolute limit. Just for fun, I spent a couple of hours experimenting to see what a best-case scenario could look like: https://github.com/Bo98/lazyjson-ruby (intentionally doesn't include

What we have now: ~988,000 objects, ~0.72s

Also note that

So if we can get anything close to those numbers from what we're doing here we will have done well. |
Very impressive! I don't think we'll get close to that since I'm not aware of any Ruby gems that offer lazy loading for JSON (though to be honest I haven't looked that hard). Of course, we could use your repo but I doubt that's in the cards either. |
Yeah, there's very likely none as it's sort of a "fake" binding: it doesn't return native Ruby Hashes and Arrays but instead custom ones that don't expose children to Ruby. The idea was to make the GC only see one object (+ any values you extract from it) rather than hundreds of thousands. The custom classes mean it isn't a drop-in replacement for all use cases of JSON parsing in the wild (but code adapted to work with it would remain compatible with existing JSON parsing). The lazy part refers to the Ruby exposure rather than the parsing itself, which is not lazy (but really fast given it uses SIMD instruction sets like NEON, AVX and SSE). Lazy parsing is possible but only really for streaming and not random access, and it's close to being a micro-optimisation when we're in the 2-3% range.

Exploring all this was a nice break from the norm too as I hadn't really tried working with Ruby extensions much before. |
This allows me to sanity check the changes to the API JSON from a loading perspective. Related to: Homebrew/brew#16410
The main difference with this new approach is that it means rewriting a lot of the API code which isn't necessary to just minify the JSON hashes. |
Yeah, a clean break would make this very doable and I agree what you describe here is the approach I would take - generate both old and new on formulae.brew.sh.

Regardless of what we do here, this will be the first time we properly do something like this, so it's very uncertain how many issues it may cause. We did make a minor compatibility break once before: #15168 wasn't in the original API JSON (it shipped in 4.0.12 instead) and I accidentally broke compatibility with pre-4.0.12 JSONs in 4.1.0: #15676. I believe this was the cause of reports like https://github.com/orgs/Homebrew/discussions/4701 and https://github.com/orgs/Homebrew/discussions/4750, though using an old JSON for so long is also a concern - there seems to be some permission issue going on somewhere.

The reverse (new JSON incompatible with old brew) has never been tested, and the API minifying approach alone would be the first example of doing that given it needs the other PRs that ignore nil values on the loader/reader side. So the minifying approach isn't necessarily risk-free - it's a different, unexplored type of risk. Frozen CI images (common in CircleCI etc.) without the API pre-cached may be a good example of such a concern, along with multi-user systems. It may work better, it may work worse - I don't really know.

I've leaned more towards the clean break approach because we can add any necessary compatibility code, while the minifying approach makes it impossible to handle compatibility should that turn out to be necessary. It is however of course more effort. |
Would we need to wait for the next major release here? It's an internal API change but it is breaking. At the very least, it should probably wait for a minor release.
Yeah, I could very well be underestimating the complexity of the minimized approach. Frozen CI images never even crossed my mind, for example.
This is why I was advocating for the other approach. I'm just lazy and this looks like a whole lot more work. I also think that any additional performance benefits will be marginal at best. |
Agreed 👍🏻
Yeh, would be good to migrate these two (and leave "room" in the format to potentially add more tap-specific data if needed in future).
Yup, that's fair.
Yeh, worth seeing if this makes things any smaller/faster. If it doesn't meaningfully: can leave it.
Yes, this is non-ideal but hopefully this logic can be separated enough that it's just a matter of deleting a few
Good idea, agreed 👍🏻
Could do this before we remove the generation of the old format.
Lazy is a virtue! This sort of scope creep can be bad, I appreciate you pushing back on it ❤️
This will be v3, FYI. I think this also makes this a good candidate to be a |
Yeah, I see 10+ functions that will need different logic for the different API versions. We could add that branching logic into each method or just have different mixins that we use based on whether the v3 flag is set.
I wasn't planning on changing the public API. We could decide to do that later on depending on how things go though. |
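A sketch of the mixin approach (module/class names and the flag are hypothetical, and the methods are simplified stand-ins for the real loader logic):

module APIHashV2
  def dependency_names(hash)
    hash["dependencies"].to_a + hash["build_dependencies"].to_a
  end
end

module APIHashV3
  def dependency_names(hash)
    hash["dependencies"].to_a.map { |dep| dep["name"] }
  end
end

class FormulaAPILoader
  # Pick the mixin once, based on which JSON format is in use.
  include ENV["HOMEBREW_INTERNAL_JSON_V3"] ? APIHashV3 : APIHashV2
end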
Another option is separate classes using duck-typing.
Yeh, I don't think we need that in the short-term but I think the current v2 (at least for formulae) API is so terrible that it is worth learning from this "v3-internal" and using it there.
TODO: Check to see if we can improve the JSON format for listing dependencies. CC: @MikeMcQuaid @Bo98 |
For the improved JSON format for formula dependencies, what do you think of the following alternative?

{"dependencies"=>
{"autoconf"=>{:tags=>[:build]},
"automake"=>{:tags=>[:build]},
"[email protected]"=>{:tags=>[:build]},
"sphinx-doc"=>{:tags=>[:build]},
"xz"=>{:tags=>[:build]},
"gnu-sed"=>{:tags=>[:build]},
"m4"=>{:tags=>[:build], :uses_from_macos=>nil},
"ncurses"=>{:uses_from_macos=>nil}},
"head_dependencies"=>
{"autoconf"=>{:tags=>[:build]},
"automake"=>{:tags=>[:build]},
"[email protected]"=>{:tags=>[:build]},
"sphinx-doc"=>{:tags=>[:build]},
"xz"=>{:tags=>[:build]},
"gnu-sed"=>{:tags=>[:build]},
"m4"=>{:tags=>[:build], :uses_from_macos=>nil},
"ncurses"=>{:uses_from_macos=>nil}}} {
"dependencies": {
"bdw-gc": null,
"gmp": null,
"libevent": null,
"libyaml": null,
"llvm": null,
"openssl@3": null,
"pcre2": null,
"pkg-config": null
},
"head_dependencies": {
"bdw-gc": null,
"gmp": null,
"libevent": null,
"libyaml": null,
"llvm": null,
"openssl@3": null,
"pcre2": null,
"pkg-config": null,
"libffi": {
"uses_from_macos": null
}
}
} |
@apainintheneck Looks a lot nicer! I'm thinking parsing may be a bit nicer/easier/cleaner if dependencies are either e.g. Similarly, |
Okay, how about something like this? The first result is what we currently have and the second is the new format. It surprisingly ends up being even a bit smaller too.
brew(main):002:0> :ruby.f.dependencies_hash
=>
{"build_dependencies"=>["autoconf", "pkg-config", "rust"],
"dependencies"=>["libyaml", "openssl@3"],
"test_dependencies"=>[],
"recommended_dependencies"=>[],
"optional_dependencies"=>[],
"uses_from_macos"=>["gperf", "libffi", "libxcrypt", "zlib"],
"uses_from_macos_bounds"=>[{}, {}, {}, {}]}
brew(main):003:0> :ruby.f.to_internal_api_hash.slice("dependencies", "head_dependencies")
=>
{"dependencies"=>
[{:name=>"libyaml"},
{:name=>"openssl@3"},
{:name=>"autoconf", :tags=>[:build]},
{:name=>"pkg-config", :tags=>[:build]},
{:name=>"rust", :tags=>[:build]},
{:name=>"gperf", :uses_from_macos=>true},
{:name=>"libffi", :uses_from_macos=>true},
{:name=>"libxcrypt", :uses_from_macos=>true},
{:name=>"zlib", :uses_from_macos=>true}],
"head_dependencies"=>
[{:name=>"libyaml"},
{:name=>"openssl@3"},
{:name=>"autoconf", :tags=>[:build]},
{:name=>"pkg-config", :tags=>[:build]},
{:name=>"rust", :tags=>[:build]},
{:name=>"gperf", :uses_from_macos=>true},
{:name=>"libffi", :uses_from_macos=>true},
{:name=>"libxcrypt", :uses_from_macos=>true},
{:name=>"zlib", :uses_from_macos=>true}]} |
Actually, I was comparing it against the current public json v2 instead of the current internal json v3 which is much slimmer. This would increase the size significantly.
I'd prefer to find a format that doesn't require adding to the file size. |
What do dependencies look like in the current internal JSON v3? Sorry, I'm a bit muddled at this point. No idea whether it actually ends up adding to the size but I wonder if having dependencies as a single array (for both stable and head) would make more sense. Each array item could either be a |
Sorry, I forgot to reply to this. It's the same as what we have now but without fields that have blank values. So it would roughly look like this if you convert the above example.

{"build_dependencies"=>["autoconf", "pkg-config", "rust"],
"dependencies"=>["libyaml", "openssl@3"],
"uses_from_macos"=>["gperf", "libffi", "libxcrypt", "zlib"],
"uses_from_macos_bounds"=>[{}, {}, {}, {}]} I don't remember how I calculated the format size before so I redid by just generating the entire JSON file for the core tap. Unfortunately I ended up getting basically with the same results. brew(main):001:0> ENV["HOMBREW_NO_INSTALL_FROM_API"] = "1"
=> "1"
brew(main):002:0> Formula.generating_hash!
=> true
brew(main):003:0> File.write("new.json", JSON.generate(CoreTap.instance.to_internal_api_hash))

$ ll *.json
-rw-r--r-- 1 kevinrobell admin 12M Mar 26 00:14 new.json
-rw-r--r-- 1 kevinrobell admin 11M Mar 26 00:11 old.json
$ du *.json
23736 new.json
22160 old.json |
Ok, that seems near enough to not worry about. I think it's good to minimise the size but readability and being DRY are more important than that, IMO (but I'm open to disagreement there). |
I think I'll go with this one: #16410 (comment) It's a bit smaller than the other new format. I'll also keep the empty values as null since that should make unpacking things a bit simpler. @Bo98 any thoughts here before I open a PR? |
Provide a detailed description of the proposed feature
We should try to improve the loading times for JSON data. This is especially relevant because the JSON API has large files that get read into memory whenever a formula or cask is loaded from the API.
Off the top of my head, I see a few ways that we could potentially improve performance.
After @Bo98's excellent work, we're living in the future and everybody's running Ruby 3.1. A side effect of this is that we could potentially include native libraries and ship them with the Ruby 3.1 bottle. This should allow us to explore using libraries like
oj
which have better JSON parsing performance.

It's easier to parse smaller files, and smaller files mean that fewer objects get allocated when things are parsed. Currently, we don't purge nil values in hashes before creating the JSON files. This leads to a large number of extra hash keys, which are all strings that have to be allocated. It also increases the size of the files, which slows down parsing and means that more data has to be shipped across the network.
We're allocating an extra 140k hash keys by not eliminating null values from hashes.
Note: I looked into this a few months ago to see if it would make a big difference from a network file size point of view but it only seemed to change the file size by 100s of kilobytes after getting compressed so I don't expect to see a large benefit there.
Currently we essentially parse both of the API JSON files twice. The first pass is used to validate the JSON web signature and the second parses the content. Beyond removing duplicate work, combining these two passes somehow would mean not having to allocate an additional large string that ends up being around the same size as the JSON file itself, since the signature header takes up only a small part of the file anyway.
It would also potentially make the JSON a bit smaller since we'd be able to avoid escaping all of the quotes in the stringified JSON.
JSON API payload string:
Escaped quotes:
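To make the two-pass point concrete, the flow is roughly as follows (a simplified sketch: verify_signature! is a hypothetical helper and the envelope keys follow the generic JWS JSON serialization, not necessarily Homebrew's exact field names):

require "json"

raw = File.read("formula.jws.json")

# Pass 1: parse the JWS envelope. The payload is one large string of
# stringified JSON, which is where all the escaped quotes come from.
envelope = JSON.parse(raw)
payload_string = envelope["payload"]
verify_signature!(payload_string, envelope["signatures"]) # hypothetical helper

# Pass 2: parse the payload string itself - roughly the same size again.
formulae = JSON.parse(payload_string)

# Combining the passes would mean e.g. signing a canonical serialization of
# the payload and embedding it as real JSON, so only one parse is needed.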
What is the motivation for the feature?
Currently there is a fixed cost whenever the API JSON gets loaded into memory that affects all commands that load formulae or casks from the API. On my M1 MacBook it takes around 3/10ths of a second to load the formulae API JSON by itself.
How will the feature be relevant to at least 90% of Homebrew users?
It will make most commands snappier because most users use the JSON API. This is an especially big deal with the work to move
brew desc
to use the JSON API in #16237.

What alternatives to the feature have been considered?
Doing nothing...