Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support dynamic GROUP BY #407

Closed
derekperkins opened this issue May 29, 2024 · 4 comments
Closed

Support dynamic GROUP BY #407

derekperkins opened this issue May 29, 2024 · 4 comments
Labels
question Further information is requested

Comments

@derekperkins
Copy link

For an application like Google Analytics, you would have your raw data stream, and you might want to change the group by at runtime to look at domain, subdomain, url, or even url subfolder / prefix performance. In the normalized stock price example, it uses a raw parquet file for the data, then pre-generates labels into a temp table using a raw query string. Would you have to pre-generate all those GROUP BY permutations at startup, or is there a better way to approach that?

cc @tannerlinsley

@derekperkins
Copy link
Author

To give a concrete example, we're rolling up raw data like this

CREATE TEMP TABLE domain_agg IF NOT EXISTS AS
SELECT
  requested,
  device_code,
  domain,

  AVG(rank) AS rank,
  SUM(estimated_traffic) AS estimated_traffic,
  CAST(AVG(pixels_from_top) AS INT64) AS pixels_from_top,
  AVG(above_the_fold_percentage) AS above_the_fold_percentage,
  AVG(serp_percentage) AS serp_percentage,

FROM 'http://localhost:3002/mosaic-data.parquet'
GROUP BY requested, device_code, domain
ORDER BY requested, device_code, domain

Given a UI toggle with these three fields available in the parquet file:

  • domain
  • subdomain
  • url

Do we have to make separate subdomain_agg and url_agg temp tables? Is there a way to do that lazily?

@jheer
Copy link
Member

jheer commented Jun 6, 2024

Not yet sure if this is best viewed as a Mosaic question or a DuckDB question... at the DuckDB level, maybe take a look at GROUPING_SETS to see if you can build the table you want in a single go and then simply issue filtering queries depending on the desired groupby dimensions?

There is not currently support for dynamic generation of tables in Mosaic itself, but you could do it externally. Other than data loading operations (which are there for convenience and decoupled from all downstream operations) and internally generated data cube indexes, Mosaic doesn't currently generate its own tables.

@jheer jheer added the question Further information is requested label Jun 6, 2024
@derekperkins
Copy link
Author

Thanks. We're definitely able to handle that dynamic generation from a workflow perspective. Given the issue about adding multi-table support, I wasn't sure if there was a lifecycle requirement we needed to tie into. We're still figuring out where the coordinator ends and DuckDB begins.

@derekperkins
Copy link
Author

I'll close this issue for now, and once we get a bit deeper, we can contribute to docs and/or examples showing something like this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants