Improve traceability of accountable memory usage for expensive queries #130

Open
hammerhead opened this issue Nov 8, 2024 · 1 comment

Problem Statement

If a cluster is under high query load, users can run into CircuitBreakingException. When the total circuit breaker trips (indices.breaker.total.limit), users see certain queries fail, but the failing queries are not necessarily the most expensive ones. Other (long-running) heavy queries may occupy large portions of memory while still staying below the (total) circuit breaker limits. As a result, relatively cheap queries can start to fail.

As a user, I want an easy way to diagnose which queries currently occupy (or have recently occupied) how much of the memory that counts toward the total circuit breaker.

Besides troubleshooting circuit breaker exceptions, such a metric can also help to proactively review expensive queries before running into total circuit breaker exceptions.

Possible Solutions

There could be different solutions.

  1. Use existing sys.operations(_log) tables
    There is already a used_bytes column in sys.operations(_log). Would SELECT job_id, SUM(used_bytes) FROM sys.operations GROUP BY job_id be an accurate representation of peak total accountable memory usage per query?

    If yes, it could be documented in Diagnostics with System Tables.

  2. Add a new field to sys.jobs(_log)
    If there is no metric exposing the peak accountable memory usage of a query, a new field could be added to sys.jobs(_log).
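For option 1, a minimal sketch of what such a diagnostic query could look like. It assumes that the job_id and used_bytes columns of sys.operations and the id and stmt columns of sys.jobs behave as described in the system tables documentation; the alias total_used_bytes and the LIMIT of 10 are arbitrary choices for illustration. Note that this sums what running operations account for at the time of the query, which is not necessarily the peak.

```sql
-- Sketch: memory currently accounted for by each running query.
-- sys.operations holds one row per operation of a job, so the
-- used_bytes values are summed per job and joined to the statement text.
SELECT j.id   AS job_id,
       j.stmt AS statement,
       SUM(o.used_bytes) AS total_used_bytes  -- illustrative alias
FROM sys.jobs j
JOIN sys.operations o ON o.job_id = j.id
GROUP BY j.id, j.stmt
ORDER BY total_used_bytes DESC
LIMIT 10;  -- arbitrary cut-off for readability
```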

Considered Alternatives

No response

seut commented Nov 8, 2024

Would SELECT job_id, SUM(used_bytes) FROM sys.operations GROUP BY job_id be an accurate representation of peak total accountable memory usage per query?

Yes, that is what it is designed for. If it is not working as expected, it would be a bug.

If yes, it could be documented in Diagnostics with System Tables.

Good point, would you be up for adding this?
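For the "recently occupied" case, a documentation entry could use the log tables in the same way. The following is only a sketch: it assumes that statistics collection is enabled so that sys.jobs_log and sys.operations_log are populated, and that the columns used here (id, stmt, ended in sys.jobs_log; job_id, used_bytes in sys.operations_log) match the system tables documentation. The alias and the LIMIT are illustrative.

```sql
-- Sketch: memory accounted for by recently finished queries.
-- The log tables only retain a bounded number of entries, so this
-- covers recent history rather than everything that has ever run.
SELECT jl.id    AS job_id,
       jl.stmt  AS statement,
       jl.ended AS ended,
       SUM(ol.used_bytes) AS total_used_bytes  -- illustrative alias
FROM sys.jobs_log jl
JOIN sys.operations_log ol ON ol.job_id = jl.id
GROUP BY jl.id, jl.stmt, jl.ended
ORDER BY total_used_bytes DESC
LIMIT 10;  -- arbitrary cut-off for readability
```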

@hammerhead hammerhead transferred this issue from crate/crate Nov 8, 2024