Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using AS on an existing column name has value overridden if value exists on field resulting from another column in the query if having same name from another table #42

Open
notactuallytreyanastasio opened this issue Jul 18, 2022 · 3 comments
Labels
bug Something isn't working For Discussion 💡 ❔ A proposal that is open for discussion.

Comments

@notactuallytreyanastasio
Copy link
Contributor

notactuallytreyanastasio commented Jul 18, 2022

I am not 100% sure if this is a 'snowflex' issue or if its isolated enough that its just a me-problem.

An illustration:

Let's say I have a query like this

SELECT
  UPPER(product_manifest.PRODUCT_CODE_TYPE) AS PRODUCT_CODE_TYPE,
  product_manifest.PRODUCT_CODE,
  product_manifest.PRODUCT_CODE AS GTIN,
  catalog_product.*
FROM
  STUFF catalog_product
INNER JOIN
  ITEM product_manifest
ON
  catalog_product.OUR_KEY = product_manifest.OUR_KEY
WHERE
  UPPER(product_manifest.PRODUCT_CODE_TYPE) = 'GTIN'
QUALIFY
  ROW_NUMBER() OVER (PARTITION BY catalog_product.OUR_KEY ORDER BY catalog_product.LOADED_AT DESC) = 1

Let's assume that both STUFF and ITEM have a similar field, GTIN and PRODUCT_CODE that are met to be the same, but GTIN is empty in the ITEM table and we want to override that but still get everything from it so we use our AS statement.

If we run this in snowflake via a console, we get back a result that looks (roughly) like this:

PRODUCT_CODE_TYPE PRODUCT_CODE GTIN WHATEVER GTIN_2
GTIN 1234 1234 thing
GTIN 1235 1235 stuff

However, when we example the results coming out of Snowflex, we will see a setup like this:

[
  {"product_code_type", 0},
  {"product_code", 1},
  {"gtin", 2},
  {"whatever", 3},
  {"gtin", 4},
]

as the value for bin_headers, thus when we do our reduction, since the empty GTIN value from catalogue_product.* overrides the value of the one we set using our AS clause.

Now, if we simply moved the AS clause to be after catalog_products.* we could do this, but its similarly misleading as a whole and just further obfuscates the underlying issue of receiving the truth from the query. It is treating a symptom rather than a cause, and allows us to bubble the fix we're using internally in an ETL process from our app to the library, but is debatably more deceptive when it comes to figuring out why something broke later down the line (which, these things being computers being programmed by people, will for sure happen)

It seems that the solution here may be to add an option to use a not-null-or-empty value for like-keys in query results when deriving this map, but that probably has some possibly surprising behavior if its enabled by default.

I'm not 100% sure this is something that should be handled, but wanted to document it here in the event someone else ran into a similar issue, or if we deem it an API worth adding. I dont think making it just do this by default is the answer, and since maps cant have like keys, the path to the best answer seems murky.

@notactuallytreyanastasio
Copy link
Contributor Author

notactuallytreyanastasio commented Jul 19, 2022

I believe the best call here considering we have real users outside us internally is the following:

  1. Add an application level config to return the 'truth' -- which in this case is a fixed-width list of tuples that are headers and rows, and allow the user to choose their own journey

  2. Get feedback from people whom may not have encountered this issue, but think of their concerns outside ours. Because if you are using this library to query at any reasonable level in your application, this changes a huge amount of flows and expected data structures. There is no way to map against it super reliably and not have to consider a world where you cant derive the 'truth' as we are able to with our specific querying methodology here. Because while we have a reasonable workaround, it is completely unreasonable to assume that a user of this library would be OK with a whole field of their SQL query just not being returned and having whatever the latter value was overriding the first because of a name collision and a data structure not being a perfect fit.

I dont really have strong feelings here, but there is something that feels off about it working as it sits right now, and I'd just like to gather others' thoughts and let that stew for a while and hopefully make a good decision when someone has time to seriously consider this.

It seems like maybe the best short-to-mid-term decision is to keep it as maps, but CLEARLY document the shortcomings and highlight the config. But speaking on the scale of years of use, I can only imagine this creating more footguns that lead one to thinking 'do I have a bug' but really at the core its a problem in library code.

@notactuallytreyanastasio notactuallytreyanastasio changed the title Using AS on an existing column name has value overridden if value exists on field resulting from query Using AS on an existing column name has value overridden if value exists on field resulting from another column in the query if having same name from another table Jul 19, 2022
@hexchung
Copy link

hexchung commented Aug 1, 2022

If I'm understanding this issue correctly, I believe I am experiencing a similar issue.

I have a query like:

SELECT table1.requested_at,
       CONVERT_TIMEZONE('UTC', 'America/Los_Angeles', table1.requested_at) AS requested_at
       requested_at
  FROM table1

where the results look like

----------------------------------------------------------------------
|         requested_at |       requested_at_2 |       requested_at_3 |
----------------------------------------------------------------------
| 2022-06-09T22:41:32Z | 2022-06-09T18:41:32Z | 2022-06-09T22:41:32Z |

Notice the reference to requested_at with no identifier is overridden by the original table value. The solution I am going with is to just change the AS name to something like requested_at_tz to differentiate.

@notactuallytreyanastasio
Copy link
Contributor Author

@hexchung yes this is definitely the same problem.

The core of this will be quite a doozy to fix, as we are essentially choosing the wrong data structure to do this (maps).

Here is a canned example that shows it:

Query:

SELECT
  upc.foo AS foo
  upc.bar AS bar
  upc.bizzle AS foo
FROM
  MY_NAMESPACE.whatever

We will get this in the snowflake UI

| foo | bar | foo_2|
+-------------------+
|  1  |  2  |  0   |

Which, when we do the reduction we see here will ultimately overwrite the original value of foo with 0 because while the UI shows it as foo_2 it actually is returned with the original foo key.

The solution here will potentially be to make an application-level configuration for the library which allows us to return lists of tuples, which can have this duplication, but maps as a data structure inherently will require people to work around this in ways that are similar to what you are dealing with.

Ultimately, when doing joins or selects that will result in like-column-names and you only want one, specifically noting that by being explicit with all columns and want you want vs *, or doing a different AS casting will be the best answer to get past these issues.

I am not sure when we will have bandwidth to make this change, cut a new release, etc on the PepsiCo side of things, but all pull requests are welcome. The impact on performance and massive API change in data structure makes it so that it seems we definitely cant just default back to it, but making it a configurable option seems reasonable IMO.

I have also discussed this with @jeremyowensboggs -- feel free to chime in if you have any further thoughts about this per our previous Slack discussions, Jeremy.

@notactuallytreyanastasio notactuallytreyanastasio added bug Something isn't working For Discussion 💡 ❔ A proposal that is open for discussion. labels Aug 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working For Discussion 💡 ❔ A proposal that is open for discussion.
Projects
None yet
Development

No branches or pull requests

2 participants