Read Formula/Cask descriptions without evaluating the Ruby code #16237
Thanks @apainintheneck!
Yeh, we should do this and it should be the default behaviour.
We should definitely not do this. We could consider using [...]. I think it's a much simpler conceptual model, now we have the API JSON, to have homebrew/core and homebrew/cask descriptions read from that. Formula/cask files in those taps are trusted by default, and anything else from any other tap (note: this is a good reason to get e.g. cask-versions cleaned up and archived) requires either an explicit formula name invocation or [...]
I guess the thinking on parsing the remaining non-core package files is that it would be better than the status quo, both from a user and a security perspective, and in this case should be relatively straightforward. There are no descriptions in core with string interpolation and I guess I can't see why anyone would want to do that anyway. The advantage of using grep + regexes over [...]
Unfortunately it seems likely in the long tail of the formula DSL: unless we've made it impossible, people will do it and we will break it. I think ultimately, even if descriptions are handled this way, non-core formulae are still going to have to execute "untrusted" (by us) formula code to do anything anyway, so parsing them to access the descriptions seems somewhat reasonable, too. Would love some input from @woodruffw here, as we talked about a potential adjustment of the security model around taps that feels like it would inform this work.
Does it make sense to allow taps to generate their own formula/cask JSON data which can be read in lieu of Ruby code whenever it exists? The JSON files can live in the tap alongside the other Ruby files, perhaps in a separate directory.
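A minimal sketch of that idea, assuming a hypothetical layout (the `json/` subdirectory, file naming, and `desc` key are illustrative assumptions, not an existing Homebrew convention):

```ruby
require "json"

# Hypothetical tap-local JSON lookup: if the tap ships pre-generated
# JSON for a package, read its description from there and skip
# evaluating the package's Ruby file entirely.
def description_for(tap_dir, name)
  json_path = File.join(tap_dir, "json", "#{name}.json")
  return nil unless File.exist?(json_path)

  JSON.parse(File.read(json_path))["desc"]
end
```

A caller would fall back to evaluating the Ruby formula only when this returns `nil`, i.e. when the tap hasn't published JSON for that package.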
I haven't said this before because it might end up nowhere, but a third-party JSON API was something I noted down to potentially experiment with next time I have the chance to do a hackathon-like day/week (AGM week, maybe?). That, and experimenting with JSON parsing speed improvements (it's currently quite slow and creates half a million Ruby objects, which fill the majority of the GC pool). I've done that a few times: some things end up as something, others end up as nothing. But experimentation like that is how the current iteration of the formula JSON API spawned (#12936 - initially drafted in late 2021, shelved for a bit as all these experiments are, and then turned into a full project from summer 2022 onwards). And it's fun to do something different like that from time to time, where you try to tear drilled-down components apart and see what possibilities there are. I did some of that this summer as well, the most interesting one being an auto-generated Ruby DSL for the GitHub GraphQL API (though that's not strictly Homebrew-related). Maybe I should have a dumping ground for things like that.
Ideally it would probably be something that could also be signed etc., and potentially even not-Git. Probably a different JSON format where you have tap information etc. The improved security model is IMO one of the best parts of the JSON API we have, and was a huge reason why I pushed to discourage any manual setting of [...]. Ultimately though, traditional Ruby-based taps won't be going away, so we will always have to deal with making compromises there, including whatever we decide to do here for this particular issue.
Given this and the fact that we will never (and I genuinely mean never) deprecate user consumption of [...]
Yeah fair, will keep it as a fun thing and just shelve it.
I agree with this to an extent. More specifically: I think that setting up signing and separate hosting infrastructure in order to handle API data distribution is a bit too much for tap authors. However: generating JSON formula/cask data and publishing it alongside the Ruby files in the tap could easily(?) be added to the steps done in the default workflow generated by [...]. Plus, if those tap authors were using our workflows, then they're publishing bottles. Which means their users would not need to execute any untrusted Ruby code at all on their machines in order to install from the taps that do this, if we checked for JSON data in the tap first. Given the security implications, I'd think it's worth doing.
I think the difficult thing would be making sure that the JSON data in each tap is up to date, especially when they don't use our workflows to publish bottles.
I've noticed this too. It's interesting how this can end up being the majority of time spent in certain commands.
I'm fine with the idea of doing the change to use the descriptions from the JSON API as the default for this command and then deciding on what we want to do with third-party taps later on. It seems like we're in agreement on the first point at the very least.
If they don't use our workflow to publish bottles but nevertheless have JSON data, then they have their own process for updating the JSON data and we shouldn't worry about whether the JSON data is up to date. I think the much more likely scenario for a tap having bottles that don't use our workflows is that they won't have JSON data either.
If we want to do this: I'd like to have CI for [...]
Yes. Let's avoid blowing the scope up too big here 👍🏻
We definitely should if it's being pitched as a security feature. Otherwise, we risk making the security profile worse rather than better.
In my experience: this is the case for the majority of third-party taps using bottles, not the minority. Many taps just use binary packages in the [...]
@apainintheneck For those that host on GitHub, perhaps a GitHub Action could be created for this; or at the very least, some docs on how to set it up plus a command within Homebrew that does the 'heavy lifting' for them, which they then just need to trigger from whatever CI they use. We could even have a brew command that writes out a basic GitHub Actions workflow in the tap for them (designed to be extensible for other hosting platforms as well?); that way they don't even really need to understand how to write/set it up to benefit from it.
We already have commands and workflows to do this internally for the core taps (see [...]).
I think we already have some stuff like this in the [...]
@0xdevalias Since the [...]
@apainintheneck nods yeah, the 'simple' 'install from JSON for core taps' described above sounds like it would solve the issues I was having there. And as someone mentioned earlier in this thread, I think landing that as a standalone task would be a net improvement separate from figuring out specifics for non-core taps/etc.
I definitely don't think the JSON API generation as-is is a good fit for taps, as mentioned above. The primary motivations for the JSON API (in descending order) were: [...]
Neither of these is a problem for third-party taps. It's also unclear, on the latter, whether fetching lots of JSON files will actually be faster rather than slower. Some accidental but positive side-effects: [...]
I think if we want to have something like the latter for taps: I don't think it must or should look the same as it does in homebrew/core or homebrew/cask. Something that may be worth exploring for taps, for the security benefits, would be having a [...]. I could even see a world where we move to this in homebrew/core or homebrew/cask (but continue to serve a single big JSON blob like we do currently).
Out of curiosity, I looked at parsing this information with the [...]
We have a few examples of strings that don't meet those requirements in core; they usually show up in tests and are valid examples of Ruby code. We could potentially work around this with monkey-patching but I think it's hardly worth the effort. Beyond that, this library doesn't really work for us because it ends up being a bit slow when you have to parse thousands of files (the worst-case scenario if we're parsing core). There are other parser options like [...]
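One stdlib option in this family is Ripper. A rough sketch (my illustration, not Homebrew's implementation, and it only handles the common unparenthesized `desc "..."` form) of extracting the description without evaluating the file:

```ruby
require "ripper"

# Sketch: statically extract the string passed to `desc` by walking the
# s-expression tree from Ripper (Ruby stdlib), never evaluating the file.
def extract_desc(source)
  find_desc(Ripper.sexp(source))
end

def find_desc(node)
  return nil unless node.is_a?(Array)

  # `desc "..."` parses to [:command, [:@ident, "desc", pos], args].
  if node[0] == :command && node[1].is_a?(Array) &&
     node[1][0] == :@ident && node[1][1] == "desc"
    return string_content(node[2])
  end

  node.each do |child|
    found = find_desc(child)
    return found if found
  end
  nil
end

# Dig the literal text out of the argument subtree.
def string_content(node)
  return nil unless node.is_a?(Array)
  return node[1] if node[0] == :@tstring_content

  node.each do |child|
    found = string_content(child)
    return found if found
  end
  nil
end

formula = <<~RUBY
  class Wget < Formula
    desc "Internet file retriever"
    homepage "https://www.gnu.org/software/wget/"
  end
RUBY

puts extract_desc(formula) # => Internet file retriever
```

Since `Ripper.sexp` returns `nil` on a syntax error, malformed files simply yield no description rather than raising.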
Agreed. I think it's worth just focusing on the homebrew/core and homebrew/cask cases here.
After looking at the code a bit more, I don't think we need to change things to use the API JSON unless we want to get rid of or change how we use the descriptions cache store. Currently we use the cache store to cache a pre-calculated hash of package name to package description. This cache gets updated in `Library/Homebrew/cmd/update-report.rb` (lines 241 to 248 and lines 364 to 371 at 4793677).

If the user doesn't have the descriptions cache set up already for some reason, it will only build the descriptions cache if [...] (`Library/Homebrew/description_cache_store.rb`, lines 99 to 104 at 4793677).

It seems like we could very easily remove the blanket warning to pass in [...] (`Library/Homebrew/cmd/desc.rb`, lines 45 to 47, and `Library/Homebrew/cmd/search.rb`, lines 77 to 79, at 4793677).

Edit: 🤦 I can't believe I didn't realize this sooner.
Yeh, I guess my thinking was we could perhaps get rid of it entirely for core usage so that e.g. [...]
Would this mean deprecating non-core packages showing up in [...]?
Sorry: I'm not entirely sure 😅. Things I think, in case it helps answer this: [...]

And a bit more unrelated: [...]
I agree with most of your points, especially removing [...]. There is also the use case where someone uses [...]
Just to be explicit: or having some sort of explicit "yes I trust this third-party tap" step that allows us to evaluate that tap's code in future with [...]
Yeh, I think at this point: it can either just be very slow, it can download the core JSON blob to work, or perhaps we can avoid implementing it entirely.
The main performance difference is that loading the JSON for formulae and casks is kind of slow. Loading just the core formula hash from JSON takes around 3/10ths of a second on my new M1 MacBook and 6/10ths of a second on my old iMac. Without the cache [...]
Loading core formulae from their Ruby representation is even slower. Loading just core formulae takes at least 5-10 seconds depending on the machine.
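The kind of JSON-parse cost being discussed can be illustrated with a synthetic payload (the 5,000-entry shape here is an arbitrary assumption; the real formula API blob is larger and the numbers quoted above will differ by machine):

```ruby
require "benchmark"
require "json"

# Illustration only: time parsing a synthetic JSON payload shaped
# roughly like the formula API blob (many small objects with a name
# and a description).
entries = Array.new(5_000) { |i| { "name" => "pkg#{i}", "desc" => "Package number #{i}" } }
payload = JSON.generate(entries)
seconds = Benchmark.realtime { JSON.parse(payload) }
puts format("parsed %d entries in %.3fs", entries.length, seconds)
```

Note that the parse allocates one Ruby object per JSON object, array, and string, which is the GC-pressure effect mentioned earlier in the thread.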
Yeah, JSON loading is something I've been looking into now that we've got more tricks we can do with Portable Ruby, though in terms of command speed the one to care about more is [...]
They should both use the same code internally for searching descriptions. I'll start a thread about the JSON loading to get a discussion going. The current performance is not the end of the world but I am a bit annoyed that it's always the same regardless of whether we're loading one package or a thousand.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
I looked at this a while ago and found the cache code (esp. the cache update logic) the biggest deterrent to attempting a PR. Is eliminating the cache altogether a viable option here (i.e. from a performance standpoint)? In that case the solution is simple: eval and search external taps if [...]
If there is any room to improve the JSON parse speed, it seems like that would be a better place to put effort (to help eliminate the need for the cache) than getting the cache logic to deal with every possible scenario.
At this point: yes, I think so.
Maybe? PRs for this would be welcome but I don't think they are a blocker on improving this.
Performance considerations are best discussed with a PR and some benchmarks, but in general loading casks and formulae from the taps is an order of magnitude slower than loading them from the API JSON, which is in turn slower than loading the cached cask and formula descriptions (though not an order of magnitude slower). I think the consensus is that it should be fast enough without the cache. Realistically all third-party taps have a tiny fraction of the formulae and casks that are in the core taps anyway. We did explore speeding up JSON parsing a bit but didn't make much meaningful progress and things have stalled out recently. It seems to be only tangentially related to what you proposed here though.
PRs are definitely welcome here.
Verification

[...] `brew install wget`. If they do, open an issue at https://github.com/Homebrew/homebrew-core/issues/new/choose instead.

Provide a detailed description of the proposed feature
The way the `brew desc` command currently works is that it reads all formulae and casks once and builds a cache of descriptions. It then reads from this cache of descriptions when searching for string/regex matches.

Instead of gathering the description information by evaluating each package, we should source it from two places.
The format of the description method in both the formula and cask DSLs is very straightforward. It's just a method called `desc` that takes a string, which is easily parseable. The `name` method for casks would also need to be parsed, but it's similarly straightforward.

For example, the following seems to be decently accurate. It'd need to be polished a bit, though, to ignore directories with non-package Ruby files.
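A rough sketch of the kind of scan described (not the author's original snippet; it handles only plain, single-line `desc` strings with no interpolation, mapping each file's basename to its description):

```ruby
# Regex for a plain desc line, e.g.:  desc "Internet file retriever"
# Interpolated or multi-line strings are deliberately not matched.
DESC_RE = /^\s*desc\s+["']([^"'#]*)["']/

# Map each formula/cask file's basename to the first `desc` string found.
def scan_descriptions(paths)
  paths.each_with_object({}) do |path, out|
    match = File.read(path).match(DESC_RE)
    out[File.basename(path, ".rb")] = match[1] if match
  end
end
```

Usage would be something like `scan_descriptions(Dir["Formula/*.rb"])`; files with no matching line (including non-package Ruby files) are simply skipped.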
What is the motivation for the feature?
The problem is that to build the cache the first time we need to evaluate the Ruby code for each package, requiring `--eval-all`. This is a potential security problem since arbitrary Ruby code can be run when a package is first loaded. This change would remove the need to pass this flag or evaluate Ruby code at all, increasing application security and making the experience smoother for users.

How will the feature be relevant to at least 90% of Homebrew users?
It is relevant to all users that use either the `brew search --desc` or `brew desc` commands.

What alternatives to the feature have been considered?
Keeping things as they are...