-
Notifications
You must be signed in to change notification settings - Fork 62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support dynamic GROUP BY #407
Comments
To give a concrete example, we're rolling up raw data like this CREATE TEMP TABLE domain_agg IF NOT EXISTS AS
SELECT
requested,
device_code,
domain,
AVG(rank) AS rank,
SUM(estimated_traffic) AS estimated_traffic,
CAST(AVG(pixels_from_top) AS INT64) AS pixels_from_top,
AVG(above_the_fold_percentage) AS above_the_fold_percentage,
AVG(serp_percentage) AS serp_percentage,
FROM 'http://localhost:3002/mosaic-data.parquet'
GROUP BY requested, device_code, domain
ORDER BY requested, device_code, domain Given a UI toggle with these three fields available in the parquet file:
Do we have to make separate |
Not yet sure if this is best viewed as a Mosaic question or a DuckDB question... at the DuckDB level, maybe take a look at There is not currently support for dynamic generation of tables in Mosaic itself, but you could do it externally. Other than data loading operations (which are there for convenience and decoupled from all downstream operations) and internally generated data cube indexes, Mosaic doesn't currently generate its own tables. |
Thanks. We're definitely able to handle that dynamic generation from a workflow perspective. Given the issue about adding multi-table support, I wasn't sure if there was a lifecycle requirement we needed to tie into. We're still figuring out where the coordinator ends and DuckDB begins. |
I'll close this issue for now, and once we get a bit deeper, we can contribute to docs and/or examples showing something like this. |
For an application like Google Analytics, you would have your raw data stream, and you might want to change the group by at runtime to look at domain, subdomain, url, or even url subfolder / prefix performance. In the normalized stock price example, it uses a raw parquet file for the data, then pre-generates labels into a temp table using a raw query string. Would you have to pre-generate all those GROUP BY permutations at startup, or is there a better way to approach that?
cc @tannerlinsley
The text was updated successfully, but these errors were encountered: