Read Formula/Cask descriptions without evaluating the Ruby code #16237

Closed

apainintheneck opened this issue Nov 20, 2023 · 36 comments · Fixed by #17582
Labels
features New features help wanted We want help addressing this

Comments

@apainintheneck
Contributor

Provide a detailed description of the proposed feature

The way the brew desc command currently works is that it reads all formulae and casks once and builds a cache of descriptions. It then reads from this cache of descriptions when searching for string/regex matches.

Instead of gathering the description information by evaluating each package, we should source it from two places.

  1. The core tap JSON for formulae in homebrew/core and casks in homebrew/cask if the API is enabled.
  2. Parse the formula and cask descriptions from other taps manually (think grep + regexes).

The format of the description method in both the formula and cask DSLs is very straightforward. It's just a method called desc that takes a string which is easily parseable. The name method for casks would also need to be parsed but it's similarly straightforward.

For example, the following seems to be decently accurate. It'd need to be polished a bit though to ignore directories with non-package Ruby files.

~ $ cd "$(brew --repo)/Library/Taps"
/u/l/H/L/Taps (master|✔) $ git grep --no-index -E '  (desc|name) "' -- '*.rb'

What is the motivation for the feature?

The problem is that building the cache the first time requires evaluating the Ruby code for each package, which in turn requires --eval-all. This is a potential security problem since arbitrary Ruby code can be run when a package is first loaded. This change would remove the need to pass this flag or evaluate Ruby code at all, increasing application security and making the experience smoother for users.

How will the feature be relevant to at least 90% of Homebrew users?

It is relevant to all users that use either the brew search --desc or brew desc commands.

What alternatives to the feature have been considered?

Keeping things as they are...

@apainintheneck apainintheneck added the features New features label Nov 20, 2023
@MikeMcQuaid
Member

Thanks @apainintheneck!

  1. The core tap JSON for formulae in homebrew/core and casks in homebrew/cask if the API is enabled.

Yeh, we should do this and it should be the default behaviour.

2. Parse the formula and cask descriptions from other taps manually (think grep + regexes).

We should definitely not do this. We could consider using parser, rubocop-ast etc. but even that is going to end up going down a rabbit hole if e.g. interpolation is used.

I think it's a much simpler conceptual model, now we have the API JSON, to have homebrew/core and homebrew/cask descriptions read from that: formula/cask files in those taps are trusted by default, and anything else from any other tap (note: this is a good reason to get e.g. cask-versions cleaned up and archived) requires either an explicit formula name invocation or HOMEBREW_EVAL_ALL/--eval-all to be passed.

@apainintheneck
Contributor Author

I guess the thinking on parsing the remaining non-core package files is that it would be better than the status quo both from a user and a security perspective and in this case should be relatively straightforward. There are no descriptions in core with string interpolation and I guess I can't see why anyone would want to do that anyway.

The advantage of using grep + regexes over parser, ripper or rubocop-ast is purely performance-related. I'd have to do some benchmarking, but there is a part of me that wonders if we'd be able to search all files on the fly every time instead of having to cache descriptions after they get parsed the first time. That would help simplify the code a lot. It is admittedly kind of hacky though.
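The grep + regexes idea above could be sketched roughly as follows. This is only an illustration, not Homebrew code: `scan_descriptions` is a hypothetical helper, and the pattern mirrors the `git grep` example earlier in the thread, so it will miss anything unusual (string interpolation, heredocs, multi-line strings).

```ruby
# Rough sketch: scan a package file's source for two-space-indented
# `desc`/`name` lines with a regex instead of evaluating the Ruby.
# Hypothetical helper name; only handles plain double-quoted strings.
DESC_OR_NAME = /\A\s{2}(desc|name)\s+"([^"]*)"/.freeze

def scan_descriptions(ruby_source)
  ruby_source.each_line.filter_map do |line|
    match = DESC_OR_NAME.match(line)
    [match[1], match[2]] if match
  end
end
```

In practice this would be run over each `*.rb` file in a tap, e.g. `scan_descriptions(File.read(path))`.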

@MikeMcQuaid
Member

There are no descriptions in core with string interpolation and I guess I can't see why anyone would want to do that anyway.

Unfortunately it seems likely on the long tail of the formula DSL: unless we've made it impossible, people will do it and we will break it.

I think ultimately even if descriptions are handled this way: non-core formulae are still going to have to execute "untrusted" (by us) formula code to do anything anyway so parsing them to access the descriptions seems somewhat reasonable, too.

Would love some input from @woodruffw here as we talked about a potential adjustment of the security model around taps that feels like it would inform this work.

@carlocab
Member

carlocab commented Nov 24, 2023

Does it make sense to allow taps to generate their own formula/cask JSON data which can be read in lieu of Ruby code whenever it exists?

The JSON files can live in the tap alongside the other Ruby files, perhaps in a separate directory.

@Bo98
Member

Bo98 commented Nov 24, 2023

Does it make sense to allow taps to generate their own formula/cask JSON data which can be read in lieu of Ruby code whenever it exists?

I haven't said before because it might end up nowhere, but a third-party JSON API was something I noted down to potentially experiment with next time I have the chance to do a hackathon-like day/week (AGM week maybe?). That and experimenting with JSON parsing speed improvements (it's currently quite slow and creates half a million Ruby objects that fill the majority of the GC pool).

I've done that a few times: some things end up as something, others end up as nothing. But experimentation like that is how the current iteration of the formula JSON API was spawned (#12936 - initially drafted in late 2021, shelved for a bit as all these experiments are, and then turned into a full project from summer 2022 onwards). And it's fun to do something different like that from time to time, where you try to tear drilled-down components apart and see what possibilities there are.

Did some of that this summer as well, the most interesting one being an auto-generated Ruby DSL for the GitHub GraphQL API (though that's not strictly Homebrew-related). Maybe I should have a dumping ground for things like that.

The JSON files can live in the tap alongside the other Ruby files, perhaps in a separate directory.

Ideally it would probably be something that could also be signed etc., and potentially even not-Git. Probably a different JSON format where you have tap information etc.

The improved security model is IMO one of the best parts of the JSON API we have, and was a huge reason why I pushed to discourage any manual setting of HOMEBREW_NO_INSTALL_FROM_API (in favour of per-command scoping: #15564). Everything you install is done from information that's signed by us (including build-from-source Ruby code), whereas before we trusted local files (which may or may not have been tampered with) from arbitrary mirrors (which also may or may not have been tampered with). So ideally any third-party tap solution would retain that: you accept a signing key when initially tapping and that stays constant forever (or requires manual approval if it ever changes).

Ultimately though, traditional Ruby-based taps won't be going away so we will always have to deal with making compromises there, including whatever we decide to do here for this particular issue.

@MikeMcQuaid
Member

Ultimately though, traditional Ruby-based taps won't be going away so we will always have to deal with making compromises there, including whatever we decide to do here for this particular issue.

Given this and the fact that we will never (and I genuinely mean never) deprecate user consumption of Formula files: I really would rather we didn't add Yet Another Way To Do Things For Tap Authors.

@Bo98
Member

Bo98 commented Nov 24, 2023

Yeah fair, will keep it as a fun thing and just shelve it.

@carlocab
Member

I really would rather we didn't add Yet Another Way To Do Things For Tap Authors.

I agree with this to an extent. More specifically: I think that setting up signing and separate hosting infrastructure in order to handle API data distribution is a bit too much for tap authors.

However: generating JSON formula/cask data and publishing them alongside Ruby files in the tap can easily(?) be added to the steps done in the default workflow generated by brew tap-new, so I view this as a relatively low friction addition to the things to do for tap authors.

Plus, if those tap authors were using our workflows, then they're publishing bottles. Which means that, if we checked for JSON data in the tap first, their users would not need to execute any untrusted Ruby code at all on their machines in order to install from the taps that do this.

Given the security implications, I'd think it's worth doing.

@apainintheneck
Contributor Author

I think the difficult thing would be making sure that the JSON data in each tap is up-to-date especially when they don't use our workflows to publish bottles.

@apainintheneck
Contributor Author

That and experimenting with JSON parsing speed improvements (currently quite slow and creates half a million Ruby objects that fills the majority of the GC pool).

I've noticed this too. It's interesting how this can end up being the majority of time spent in certain commands.

@apainintheneck
Contributor Author

I'm fine with the idea of doing the change to use the descriptions from the JSON API as the default for this command and then deciding on what we want to do with third-party taps later on. It seems like we're in agreement on the first point at the very least.

@carlocab
Member

I think the difficult thing would be making sure that the JSON data in each tap is up-to-date especially when they don't use our workflows to publish bottles.

If they don't use our workflow to publish bottles but nevertheless have JSON data, then they have their own process for updating the JSON data and we shouldn't worry about whether the JSON data is up-to-date.

I think the much more likely scenario for a tap having bottles that don't use our workflows is that they won't have JSON data either.

@MikeMcQuaid
Member

However: generating JSON formula/cask data and publishing them alongside Ruby files in the tap can easily(?) be added to the steps done in the default workflow generated by brew tap-new, so I view this as a relatively low friction addition to the things to do for tap authors.

If we want to do this: I'd like to have CI for brew tap-new that actually verifies this works end-to-end. Even the current brew tap-new code suffers pretty badly from bitrot and being sporadically broken for folks because it's essentially untested by us.

I'm fine with the idea of doing the change to use the descriptions from the JSON API as the default for this command and then deciding on what we want to do with third-party taps later on. It seems like we're in agreement on the first point at the very least.

Yes. Let's avoid blowing the scope up too big here 👍🏻

we shouldn't worry about whether the JSON data is up-to-date.

We definitely should if it's being pitched as a security feature. Otherwise, we risk making the security profile worse rather than better.

I think the much more likely scenario for a tap having bottles that don't use our workflows

In my experience: this is the case for the majority of third-party taps using bottles, not the minority.

Many taps just use binary packages in the url e.g. anything using goreleaser (last time I checked, at least). Our whole approach to bottling has really not caught on in the wider community; even with GitHub Actions it's too involved and confusing for most folks except us to be able to handle it.

@0xdevalias
Contributor

0xdevalias commented Dec 13, 2023

I think the difficult thing would be making sure that the JSON data in each tap is up-to-date especially when they don't use our workflows to publish bottles.

@apainintheneck For those that host on GitHub, perhaps a GitHub action could be created for this; or at the very least, some docs on how to set it up + a command within homebrew that does the 'heavy lifting' for them; that they then just need to trigger from whatever CI they use.

Could even have a brew command that writes out a basic GitHub action workflow in the tap for them (designed to be extensible for other hosting platforms as well?); that way they don't even really need to understand how to write/set it up to benefit from it?

@apainintheneck
Contributor Author

@apainintheneck For those that host on GitHub, perhaps a GitHub action could be created for this; or at the very least, some docs on how to set it up + a command within homebrew that does the 'heavy lifting' for them; that they then just need to trigger from whatever CI they use.

We already have commands and workflows to do this internally for core taps (see brew generate-formula-api for example); we'd just have to generalize that code. The question is whether that's good enough or maintainable. There are a lot of abandoned taps out there. Creating a disconnect between the Ruby files and the JSON representation means that the tap can now be outdated in multiple ways which could potentially cause problems. Of course, if we only use this to improve search and descriptions, it's not the end of the world if those results are outdated temporarily. To be honest though, I don't think such a complex change is worth it to just improve search results; there are probably simpler ways to handle that issue. If we want to start loading everything from JSON representations for security/reproducibility/simplicity reasons, then maybe it'd be worth it but it's a big ask.

Could even have a brew command that writes out a basic GitHub action workflow in the tap for them (designed to be extensible for other hosting platforms as well?); that way they don't even really need to understand how to write/set it up to benefit from it?

I think we already have some stuff like this in the brew tap-new command for setting up the bottle publishing workflow. We could potentially build upon that.

@apainintheneck
Contributor Author

@0xdevalias Since the 1kc-razer cask you were working on earlier is in the core cask tap, it would benefit from the first suggestion above: loading this info from the JSON representation. Including the cask names in the search results as suggested by you and @bevanjkay should be possible after this change for core casks. That sounds like a nice improvement.

@0xdevalias
Contributor

0xdevalias commented Dec 13, 2023

Including the cask names in the search results as suggested by you and @bevanjkay should be possible after this change for core casks.

@apainintheneck nods yeah, the 'simple' 'install from JSON for core taps' described above sounds like it would solve the issues I was having there.

And as someone mentioned earlier in this thread, I think landing that as a standalone task would be a net improvement separate from figuring out the specifics for non-core taps/etc.

@MikeMcQuaid
Member

I definitely don't think the JSON API generation as-is is a good fit for taps, as mentioned above.

The primary motivations for the JSON API (in descending order) were:

  • reducing the impact on GitHub's git infrastructure: we had been asked for years to not "use Git as a CDN"
  • increasing the performance of brew update for users using homebrew/core and homebrew/cask

Neither of these are problems for third-party taps. It's also unclear on the latter whether fetching lots of JSON files will actually be faster rather than slower.

Some accidental but positive side-effects:

  • improved security for users
  • some improved consistency on formulae.brew.sh (not wholly utilised yet)
  • a more comprehensive API on formulae.brew.sh

If we want to have something like the latter for taps: I don't think it must or should look the same as it does for homebrew/core or homebrew/cask. Something that may be worth exploring for taps, for the security benefits, would be having a Formula/foo.json or JSONFormula/foo.json file that's actually committed into the tap itself. This will not and should not allow literally everything that Ruby formulae do, but for most formulae and casks it'll be sufficient and provide a much better security profile than formulae alone do.

I could even see a world where we move to this in homebrew/core or homebrew/cask (but continue to serve a single big JSON blob like we do currently).
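To make the committed-JSON idea above concrete, a `Formula/foo.json` in a tap might look something like the following. This is purely an illustrative shape, not a proposed or existing format; the field names simply mirror the most common parts of the existing formula DSL, and the values are placeholders.

```json
{
  "name": "foo",
  "desc": "An example command-line tool",
  "homepage": "https://example.com/foo",
  "url": "https://example.com/foo-1.0.0.tar.gz",
  "sha256": "0000000000000000000000000000000000000000000000000000000000000000"
}
```

A declarative file like this could be read (and signed) without evaluating any Ruby, at the cost of not supporting the dynamic parts of the DSL.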

@apainintheneck
Contributor Author

Out of curiosity, I looked at parsing this information with the parser gem and ran into the problem that this gem only supports valid UTF-8 strings.

Invalid characters inside comments and literals

Ruby MRI permits arbitrary non-7-bit byte sequences to appear in comments, as well as in string or symbol literals in form of escape sequences, regardless of source encoding. Parser requires all source code, including the expanded escape sequences, to consist of valid byte sequences in the source encoding that are convertible to UTF-8.

As of 2013-07-25, there are about 180 affected gems.

We have a few examples of strings that don't meet those requirements in core; they usually show up in tests and are valid examples of Ruby code.

We could potentially work around this with monkey-patching but I think it's hardly worth the effort. Beyond that, this library doesn't really work for us because it ends up being a bit slow when you have to parse thousands of files (the worst-case scenario if we're parsing core).

There are other parser options like ripper, and prism is a new addition to the standard library in Ruby 3.3.
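For comparison with the regex approach, here is a minimal sketch of extracting a `desc` string with ripper, which ships in the standard library. The helper names (`extract_desc` etc.) are hypothetical, and this only handles the bare `desc "..."` command form, not parenthesised calls or interpolation.

```ruby
require "ripper"

# Walk the Ripper s-expression for a formula/cask file and return the
# first string literal passed to a bare `desc` call, without ever
# evaluating the Ruby. Interpolated strings would only yield their
# leading literal fragment.
def extract_desc(source)
  find_desc_call(Ripper.sexp(source))
end

def find_desc_call(node)
  return nil unless node.is_a?(Array)

  # A bare `desc "..."` parses as [:command, [:@ident, "desc", ...], args].
  if node[0] == :command && node.dig(1, 0) == :@ident && node.dig(1, 1) == "desc"
    return first_string_content(node[2])
  end

  node.each do |child|
    found = find_desc_call(child)
    return found if found
  end
  nil
end

def first_string_content(node)
  return nil unless node.is_a?(Array)
  return node[1] if node[0] == :@tstring_content

  node.each do |child|
    found = first_string_content(child)
    return found if found
  end
  nil
end
```

Ripper.sexp returns nil for files it cannot parse, which the nil guard handles, but as noted above it shares parser's slowness problem when applied to thousands of files.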


@MikeMcQuaid
Member

We could potentially work around this with monkey-patching but I think it's hardly worth the effort. Beyond that this library doesn't really work for us because it ends up being a bit slow when you have to parse thousands of files (the worst case scenario if we're parsing core).

Agreed. I think it's worth just focusing on the core/homebrew-cask cases here.

@apainintheneck
Contributor Author

apainintheneck commented Dec 17, 2023

After looking at the code a bit more, I don't think we need to change things to use the API JSON unless we want to get rid of or change how we use the descriptions cache store.

Currently we use the cache store to cache a pre-calculated hash of package name to package description. This cache gets updated in brew update-report and brew tap automatically so most users should not have to run --eval-all since no evaluation is actually taking place when the cache has already been built.

CacheStoreDatabase.use(:descriptions) do |db|
  DescriptionCacheStore.new(db)
                       .update_from_report!(hub)
end
CacheStoreDatabase.use(:cask_descriptions) do |db|
  CaskDescriptionCacheStore.new(db)
                           .update_from_report!(hub)
end

CacheStoreDatabase.use(:descriptions) do |db|
  DescriptionCacheStore.new(db)
                       .update_from_formula_names!(formula_names)
end
CacheStoreDatabase.use(:cask_descriptions) do |db|
  CaskDescriptionCacheStore.new(db)
                           .update_from_cask_tokens!(cask_tokens)
end

If the user doesn't have the descriptions cache set up already for some reason, it will only build the descriptions cache if --eval-all is passed.

def populate_if_empty!(eval_all: Homebrew::EnvConfig.eval_all?)
  return unless eval_all
  return unless database.empty?

  Cask::Cask.all.each { |c| update!(c.full_name, [c.name.join(", "), c.desc.presence]) }
end

It seems like we could very easily remove the blanket warning to pass in --eval-all to search descriptions and instead display a warning/error message only when the cache store is empty and --eval-all hasn't been passed in. The docs for --eval-all with this command would have to be updated for brew search and brew desc as well.

if !args.eval_all? && !Homebrew::EnvConfig.eval_all?
  raise UsageError, "`brew desc` needs `--eval-all` passed or `HOMEBREW_EVAL_ALL` set!"
end

if !args.eval_all? && !Homebrew::EnvConfig.eval_all?
  raise UsageError, "`brew search --desc` needs `--eval-all` passed or `HOMEBREW_EVAL_ALL` set!"
end

Edit: 🤦 I can't believe I didn't realize this sooner.
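The suggested change could be restated as a tiny predicate: only require --eval-all when the cache is actually empty. This is a stand-alone sketch, not Homebrew code; in Homebrew proper the two flag values would come from `args.eval_all?` and `Homebrew::EnvConfig.eval_all?`, and `eval_all_required?` is a hypothetical helper name.

```ruby
# Only demand --eval-all when building the description cache would
# actually require evaluating Ruby code, i.e. when the cache is empty
# and neither the CLI flag nor the env var is set.
def eval_all_required?(cache_empty:, eval_all_flag:, eval_all_env:)
  cache_empty && !eval_all_flag && !eval_all_env
end
```

With a populated cache the predicate is always false, so the blanket UsageError above would no longer fire for most users.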

@MikeMcQuaid
Member

I don't think we need to change things to use the API JSON unless we want to get rid of or change how we use the descriptions cache store.

Yeh, I guess my thinking was we could perhaps get rid of it entirely for core usage so that e.g. brew desc related commands work without a cache needing to be created based on the JSON instead. If the performance is too awful, the cache could be created on first read based on just reading core and no non-official taps.

@apainintheneck
Contributor Author

I don't think we need to change things to use the API JSON unless we want to get rid of or change how we use the descriptions cache store.

Yeh, I guess my thinking was we could perhaps get rid of it entirely for core usage so that e.g. brew desc related commands work without a cache needing to be created based on the JSON instead. If the performance is too awful, the cache could be created on first read based on just reading core and no non-official taps.

Would this mean deprecating non-core packages showing up in brew desc entirely, or having a separate, default code path for querying core packages that doesn't involve the description cache? I'm all for removing the cache if possible, and we wouldn't even have to load the packages before searching; we'd be able to pull the descriptions straight from the JSON (I haven't benchmarked this though). That said, if we're going to keep the descriptions cache around in some form for non-core packages, we might as well use it for all of them since it's already recording core package changes at the moment.

@MikeMcQuaid
Member

Would this mean deprecating non-core packages showing up in brew desc entirely, or having a separate, default code path for querying core packages that doesn't involve the description cache?

Sorry: I'm not entirely sure 😅.

Things I think, in case it helps answer this:

  • homebrew-core/homebrew-cask packages do not need to require --eval-all as they are entirely trusted and evaluating them to get the description does not require evaluating the Ruby code
  • other taps should require --eval-all because it's not obvious that e.g. running brew desc will evaluate all formulae in all taps without a sandbox and a FileUtils.rm_rf "/Users/mike/stuff_that_isn't_backed_up" will happily remove files at this time
  • homebrew-core/homebrew-cask packages may not need the description cache at all if querying the JSON directly is performant enough (e.g. no more than 2x slower)
  • if we can figure out some other not-too-hacky way of reading third-party tap descriptions without needing an --eval-all that would be preferable
  • we still want to be able to read/use descriptions from third-party taps

And a bit more unrelated:

  • at some point we should probably think about a better trust model for unofficial taps than --eval-all and remove --eval-all everywhere
  • at some point allowing taps to just include committed JSON rather than a formula file might be a nice idea
  • perhaps literally every time we evaluate a formula or cask it should happen in a macOS sandbox/Linux chroot (currently implemented) and separate process for safety. alternatively, perhaps we should locally generate something that looks like the current API JSON for all "trusted" third-party taps on every brew update or something and use that instead of the description cache or anywhere --eval-all is currently required.

@apainintheneck
Contributor Author

I agree with most of your points, especially removing --eval-all when only evaluating core packages from the API JSON. It looks like any removal/deprecation of the descriptions cache depends on the performance of loading descriptions from the API JSON and extracting that information from third-party taps without evaluating the Ruby code. I'll come up with a proof of concept so that we can benchmark things to see if it seems reasonable.

There is also the use case where someone uses HOMEBREW_NO_INSTALL_FROM_API and doesn't have the core JSON blob downloaded locally. That would likely be much much slower without the descriptions cache.

@MikeMcQuaid
Member

extracting that information from third-party taps without evaluating the Ruby code

Just to be explicit: or having some sort of explicit "yes I trust this third party tap" step that allows us to evaluate that tap's code in future with --eval-all.

There is also the use case where someone uses HOMEBREW_NO_INSTALL_FROM_API and doesn't have the core JSON blob downloaded locally.

Yeh, I think at this point: it can either just be very slow, download the core JSON blob to work, or perhaps we can avoid implementing it entirely.

@apainintheneck
Contributor Author

The main performance difference is that loading the JSON for formulas and casks is kind of slow. To load just the core formula hash from JSON it takes around 3/10ths of a second on my new M1 MacBook and 6/10ths of a second on my old iMac. Without the cache brew desc ends up taking around 0.5-1 seconds longer than before. And this is after only loading core formulas and casks from the JSON representation; no other files were loaded at all.
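The "loading the JSON is kind of slow" observation can be reproduced without the real API blob by timing JSON.parse on a synthetic payload of similar shape. The entry count and field names below are made up for illustration; absolute numbers will vary by machine, and the point is simply that parse time is paid for the whole blob regardless of how many packages are actually queried.

```ruby
require "benchmark"
require "json"

# Build a synthetic stand-in for the core formula JSON: thousands of
# small hashes, which is roughly the object-allocation pattern that
# fills the GC pool as described earlier in the thread.
blob = JSON.generate(Array.new(7000) { |i| { "name" => "pkg#{i}", "desc" => "description #{i}" } })

seconds = Benchmark.realtime { JSON.parse(blob) }
puts format("parsed %d entries in %.3fs", JSON.parse(blob).length, seconds)
```

Swapping in the real cached API file (wherever brew update has downloaded it locally) gives numbers comparable to the fractions of a second quoted above.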

@apainintheneck
Contributor Author

Loading core formulae from their Ruby representation is even slower. Loading just core formulae takes at least 5-10 seconds depending on the machine.

@Bo98
Member

Bo98 commented Dec 28, 2023

Yeah JSON loading is something I've been looking into now that we've got more tricks we can do with Portable Ruby, though in terms of command speed the one to care about more is brew search --desc as I imagine that's far more used than brew desc.

@apainintheneck
Contributor Author

Yeah JSON loading is something I've been looking into now that we've got more tricks we can do with Portable Ruby, though in terms of command speed the one to care about more is brew search --desc as I imagine that's far more used than brew desc.

They should both use the same code internally for searching descriptions. I'll start a thread about the JSON loading to get a discussion going. The current performance is not the end of the world but I am a bit annoyed that it's always the same regardless of whether we're loading one package or a thousand.


This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@github-actions github-actions bot added the stale No recent activity label Jan 21, 2024
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jan 29, 2024
@apainintheneck apainintheneck added help wanted We want help addressing this and removed stale No recent activity labels Feb 23, 2024
@rrotter
Contributor

rrotter commented Jun 25, 2024

I looked at this a while ago and found the cache code (esp. the cache update logic) the biggest deterrent to attempting a PR.

Is eliminating the cache altogether a viable option here (i.e. from a performance standpoint)? In that case the solution is simple: eval and search external taps if --eval-all is set and don't search them if it's not. Any other way out of this IMO leads to arbitrarily complex logic around caching:

  • Should external taps be searched if --eval-all is set but they are already cached?
  • Does the cache need to be invalidated between searches with and without --eval-all, or should there be a separate non-internal-tap cache?
  • Additional branches if the user is allowed to mark another tap as "trusted" (maybe a good option, but it would really make things complex in addition to caching)
  • etc

If there is any room to improve the JSON parse speed, it seems like that would be a better place to put effort (to help eliminate the need for the cache) than getting the cache logic to deal with every possible scenario.

@MikeMcQuaid
Member

Is eliminating the cache altogether a viable option here (i.e. from a performance standpoint)?

At this point: yes, I think so.

If there is any room to improve the json parse speed it seems like that would be a better place to put effort (to help eliminate the need for the cache)

Maybe? PRs for this would be welcome but I don't think they are a blocker on improving this.

@apainintheneck
Contributor Author

Performance considerations are best discussed with a PR and some benchmarks, but in general loading casks and formulae from the taps is an order of magnitude slower than loading them from the API JSON, which is in turn slower than loading the cached cask and formula descriptions (though not by an order of magnitude). I think the consensus is that it should be fast enough without the cache. Realistically, all third-party taps have a tiny fraction of the formulae and casks that are in the core taps anyway.

We did explore speeding up the JSON parse speed a bit but didn't make much meaningful progress and things have stalled out recently. It seems to be only tangentially related to what you proposed here though.

@apainintheneck
Contributor Author

PRs are definitely welcome here.
