Support dynamic GROUP BY #407

derekperkins · 2024-05-29T00:26:50Z

For an application like Google Analytics, you would have your raw data stream, and you might want to change the group by at runtime to look at domain, subdomain, url, or even url subfolder / prefix performance. In the normalized stock price example, it uses a raw parquet file for the data, then pre-generates labels into a temp table using a raw query string. Would you have to pre-generate all those GROUP BY permutations at startup, or is there a better way to approach that?

cc @tannerlinsley

derekperkins · 2024-06-06T17:37:42Z

To give a concrete example, we're rolling up raw data like this

CREATE TEMP TABLE domain_agg IF NOT EXISTS AS
SELECT
  requested,
  device_code,
  domain,

  AVG(rank) AS rank,
  SUM(estimated_traffic) AS estimated_traffic,
  CAST(AVG(pixels_from_top) AS INT64) AS pixels_from_top,
  AVG(above_the_fold_percentage) AS above_the_fold_percentage,
  AVG(serp_percentage) AS serp_percentage,

FROM 'http://localhost:3002/mosaic-data.parquet'
GROUP BY requested, device_code, domain
ORDER BY requested, device_code, domain

Given a UI toggle with these three fields available in the parquet file:

domain
subdomain
url

Do we have to make separate subdomain_agg and url_agg temp tables? Is there a way to do that lazily?

jheer · 2024-06-06T18:29:19Z

Not yet sure if this is best viewed as a Mosaic question or a DuckDB question... at the DuckDB level, maybe take a look at GROUPING_SETS to see if you can build the table you want in a single go and then simply issue filtering queries depending on the desired groupby dimensions?

There is not currently support for dynamic generation of tables in Mosaic itself, but you could do it externally. Other than data loading operations (which are there for convenience and decoupled from all downstream operations) and internally generated data cube indexes, Mosaic doesn't currently generate its own tables.

derekperkins · 2024-06-06T18:55:25Z

Thanks. We're definitely able to handle that dynamic generation from a workflow perspective. Given the issue about adding multi-table support, I wasn't sure if there was a lifecycle requirement we needed to tie into. We're still figuring out where the coordinator ends and DuckDB begins.

derekperkins · 2024-06-06T18:57:06Z

I'll close this issue for now, and once we get a bit deeper, we can contribute to docs and/or examples showing something like this.

jheer added the question Further information is requested label Jun 6, 2024

derekperkins closed this as completed Jun 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support dynamic GROUP BY #407

Support dynamic GROUP BY #407

derekperkins commented May 29, 2024

derekperkins commented Jun 6, 2024

jheer commented Jun 6, 2024

derekperkins commented Jun 6, 2024

derekperkins commented Jun 6, 2024

Support dynamic GROUP BY #407

Support dynamic GROUP BY #407

Comments

derekperkins commented May 29, 2024

derekperkins commented Jun 6, 2024

jheer commented Jun 6, 2024

derekperkins commented Jun 6, 2024

derekperkins commented Jun 6, 2024