Using `AS` on an existing column name has value overridden if value exists on field resulting from another column in the query if having same name from another table #42

notactuallytreyanastasio · 2022-07-18T20:15:11Z

I am not 100% sure if this is a 'snowflex' issue or if its isolated enough that its just a me-problem.

An illustration:

Let's say I have a query like this

SELECT
  UPPER(product_manifest.PRODUCT_CODE_TYPE) AS PRODUCT_CODE_TYPE,
  product_manifest.PRODUCT_CODE,
  product_manifest.PRODUCT_CODE AS GTIN,
  catalog_product.*
FROM
  STUFF catalog_product
INNER JOIN
  ITEM product_manifest
ON
  catalog_product.OUR_KEY = product_manifest.OUR_KEY
WHERE
  UPPER(product_manifest.PRODUCT_CODE_TYPE) = 'GTIN'
QUALIFY
  ROW_NUMBER() OVER (PARTITION BY catalog_product.OUR_KEY ORDER BY catalog_product.LOADED_AT DESC) = 1

Let's assume that both STUFF and ITEM have a similar field, GTIN and PRODUCT_CODE that are met to be the same, but GTIN is empty in the ITEM table and we want to override that but still get everything from it so we use our AS statement.

If we run this in snowflake via a console, we get back a result that looks (roughly) like this:

PRODUCT_CODE_TYPE	PRODUCT_CODE	GTIN	WHATEVER	GTIN_2
GTIN	1234	1234	thing
GTIN	1235	1235	stuff

However, when we example the results coming out of Snowflex, we will see a setup like this:

[
  {"product_code_type", 0},
  {"product_code", 1},
  {"gtin", 2},
  {"whatever", 3},
  {"gtin", 4},
]

as the value for bin_headers, thus when we do our reduction, since the empty GTIN value from catalogue_product.* overrides the value of the one we set using our AS clause.

Now, if we simply moved the AS clause to be after catalog_products.* we could do this, but its similarly misleading as a whole and just further obfuscates the underlying issue of receiving the truth from the query. It is treating a symptom rather than a cause, and allows us to bubble the fix we're using internally in an ETL process from our app to the library, but is debatably more deceptive when it comes to figuring out why something broke later down the line (which, these things being computers being programmed by people, will for sure happen)

It seems that the solution here may be to add an option to use a not-null-or-empty value for like-keys in query results when deriving this map, but that probably has some possibly surprising behavior if its enabled by default.

I'm not 100% sure this is something that should be handled, but wanted to document it here in the event someone else ran into a similar issue, or if we deem it an API worth adding. I dont think making it just do this by default is the answer, and since maps cant have like keys, the path to the best answer seems murky.

The text was updated successfully, but these errors were encountered:

notactuallytreyanastasio · 2022-07-19T02:54:11Z

I believe the best call here considering we have real users outside us internally is the following:

Add an application level config to return the 'truth' -- which in this case is a fixed-width list of tuples that are headers and rows, and allow the user to choose their own journey
Get feedback from people whom may not have encountered this issue, but think of their concerns outside ours. Because if you are using this library to query at any reasonable level in your application, this changes a huge amount of flows and expected data structures. There is no way to map against it super reliably and not have to consider a world where you cant derive the 'truth' as we are able to with our specific querying methodology here. Because while we have a reasonable workaround, it is completely unreasonable to assume that a user of this library would be OK with a whole field of their SQL query just not being returned and having whatever the latter value was overriding the first because of a name collision and a data structure not being a perfect fit.

I dont really have strong feelings here, but there is something that feels off about it working as it sits right now, and I'd just like to gather others' thoughts and let that stew for a while and hopefully make a good decision when someone has time to seriously consider this.

It seems like maybe the best short-to-mid-term decision is to keep it as maps, but CLEARLY document the shortcomings and highlight the config. But speaking on the scale of years of use, I can only imagine this creating more footguns that lead one to thinking 'do I have a bug' but really at the core its a problem in library code.

hexchung · 2022-08-01T20:22:11Z

If I'm understanding this issue correctly, I believe I am experiencing a similar issue.

I have a query like:

SELECT table1.requested_at,
       CONVERT_TIMEZONE('UTC', 'America/Los_Angeles', table1.requested_at) AS requested_at
       requested_at
  FROM table1

where the results look like

----------------------------------------------------------------------
|         requested_at |       requested_at_2 |       requested_at_3 |
----------------------------------------------------------------------
| 2022-06-09T22:41:32Z | 2022-06-09T18:41:32Z | 2022-06-09T22:41:32Z |

Notice the reference to requested_at with no identifier is overridden by the original table value. The solution I am going with is to just change the AS name to something like requested_at_tz to differentiate.

notactuallytreyanastasio · 2022-08-01T22:16:51Z

@hexchung yes this is definitely the same problem.

The core of this will be quite a doozy to fix, as we are essentially choosing the wrong data structure to do this (maps).

Here is a canned example that shows it:

Query:

SELECT
  upc.foo AS foo
  upc.bar AS bar
  upc.bizzle AS foo
FROM
  MY_NAMESPACE.whatever

We will get this in the snowflake UI

| foo | bar | foo_2|
+-------------------+
|  1  |  2  |  0   |

Which, when we do the reduction we see here will ultimately overwrite the original value of foo with 0 because while the UI shows it as foo_2 it actually is returned with the original foo key.

The solution here will potentially be to make an application-level configuration for the library which allows us to return lists of tuples, which can have this duplication, but maps as a data structure inherently will require people to work around this in ways that are similar to what you are dealing with.

Ultimately, when doing joins or selects that will result in like-column-names and you only want one, specifically noting that by being explicit with all columns and want you want vs *, or doing a different AS casting will be the best answer to get past these issues.

I am not sure when we will have bandwidth to make this change, cut a new release, etc on the PepsiCo side of things, but all pull requests are welcome. The impact on performance and massive API change in data structure makes it so that it seems we definitely cant just default back to it, but making it a configurable option seems reasonable IMO.

I have also discussed this with @jeremyowensboggs -- feel free to chime in if you have any further thoughts about this per our previous Slack discussions, Jeremy.

notactuallytreyanastasio added bug Something isn't working For Discussion 💡 ❔ A proposal that is open for discussion. labels Aug 2, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Using `AS` on an existing column name has value overridden if value exists on field resulting from another column in the query if having same name from another table #42

Using `AS` on an existing column name has value overridden if value exists on field resulting from another column in the query if having same name from another table #42

notactuallytreyanastasio commented Jul 18, 2022 •

edited

Loading

notactuallytreyanastasio commented Jul 19, 2022 •

edited

Loading

hexchung commented Aug 1, 2022 •

edited

Loading

notactuallytreyanastasio commented Aug 1, 2022

Using AS on an existing column name has value overridden if value exists on field resulting from another column in the query if having same name from another table #42

Using AS on an existing column name has value overridden if value exists on field resulting from another column in the query if having same name from another table #42

Comments

notactuallytreyanastasio commented Jul 18, 2022 • edited Loading

notactuallytreyanastasio commented Jul 19, 2022 • edited Loading

hexchung commented Aug 1, 2022 • edited Loading

notactuallytreyanastasio commented Aug 1, 2022

Using `AS` on an existing column name has value overridden if value exists on field resulting from another column in the query if having same name from another table #42

Using `AS` on an existing column name has value overridden if value exists on field resulting from another column in the query if having same name from another table #42

notactuallytreyanastasio commented Jul 18, 2022 •

edited

Loading

notactuallytreyanastasio commented Jul 19, 2022 •

edited

Loading

hexchung commented Aug 1, 2022 •

edited

Loading