
Upstream queries sometimes are not killed when downstream process crashes #44

Open
notactuallytreyanastasio opened this issue Aug 2, 2022 · 2 comments
Labels
bug (Something isn't working) · For Discussion 💡 ❔ (A proposal that is open for discussion) · help wanted (Extra attention is needed)

Comments

notactuallytreyanastasio (Contributor) commented Aug 2, 2022

I noticed this today while I was working on a new query to improve some stuff on an internal application.

A simple version of my query is like so:

WITH padded_upcs AS (
    SELECT
        item.TITLE,
        LPAD(guide_to_things.CODE, 13, '0') AS UPC
    FROM
        NAMESPACE.SUBNAMESPACE.ITEM item
    INNER JOIN
        NAMESPACE.SUBNAMESPACE.GUIDE guide_to_things
    ON
        item.FOO = guide_to_things.FOO
    WHERE
        UPPER(guide_to_things.CODE) = 'UPC'
    QUALIFY
        ROW_NUMBER()
    OVER
        (PARTITION BY item.ITEM ORDER BY item.LOAD_DTS DESC) = 1
),

gtins AS (
    SELECT
        item.TITLE,
        LPAD(guide_to_things.CODE, 13, '0') AS GTIN
    FROM
        NAMESPACE.SUBNAMESPACE.ITEM item
    INNER JOIN
        NAMESPACE.SUBNAMESPACE.GUIDE guide_to_things
    ON
        item.FOO = guide_to_things.FOO
    WHERE
        UPPER(guide_to_things.CODE) = 'GTIN'
    QUALIFY
        ROW_NUMBER()
    OVER
        (PARTITION BY item.ITEM ORDER BY item.LOAD_DTS DESC) = 1
)

SELECT
    p.TITLE,
    g.GTIN
FROM
    padded_upcs p
JOIN
    gtins g
ON
    CHARINDEX(p.UPC, g.GTIN) > 0

In this case, we are gathering a group of about 200,000 items and a group of about 90,000 items on the two sides of the join.

In the Snowflake UI, this query takes about 8 minutes to run. If you have a similarly slow-ish query, anything in that ballpark should suffice. If you are a PepsiCo e-comm engineering employee and want an easy, reusable example, I can provide the full, real version of this query to use for debugging.

The long and short of it, to demonstrate the problem:

  1. Open 6 iex -S mix shells
  2. In each one, use snowflex to run the above query

After about 10-15 minutes (accounting for some extra serialization time through ODBC), you will see a couple of them come back with results, but if you check the Snowflake activity panel, you can eye the queued/running queries.

Our work configuration allows 2 concurrent queries per developer account, so with 6 shells the queries begin to stack.

So, after about 20 minutes (our configured timeout; use whatever timeout your application has configured to hit this check point), the stacking and queueing make it infeasible for some of the queries to complete in under 20 minutes.

Now, we will see something like this

[screenshot: six iex shells, with the bottom-right connection timed out]

Note the bottom right: my connection that timed out. The other 5 are either complete or have had iex exit, so the process is dead.

However, if we go and look at Snowflake:

[screenshot: Snowflake activity panel still showing 2 running queries]

We still see 2 queries running, both for over 20 minutes, and both certainly sent from Elixir-land.
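
If you'd rather not eyeball the activity panel, the same check can be run from a worksheet. This is a sketch on my part using Snowflake's standard INFORMATION_SCHEMA.QUERY_HISTORY table function (the RESULT_LIMIT value is arbitrary):

-- List recent queries that Snowflake still considers queued or running
SELECT QUERY_ID, USER_NAME, EXECUTION_STATUS, START_TIME, QUERY_TEXT
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY(RESULT_LIMIT => 200))
WHERE EXECUTION_STATUS IN ('RUNNING', 'QUEUED')
ORDER BY START_TIME;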

This leads me to a couple of theories:

  1. Somehow :odbc is not correctly handling process checkout from poolboy and sending the appropriate kill signal
  2. We are doing something silly and swallowing the failure while still raising the exception, so a process raises a poolboy exception but is never killed (seems less likely, since I ctrl-c'd the top left terminal)
  3. It's something else deeper down, and I haven't the slightest idea

I'm not sure how to go about fixing this, but I wanted to document it while it was fresh and I had screenshots verifying I am not silly and that I could in fact replicate this after it bit me earlier in the day. I initially thought there was some serious serialization overhead causing the timeout, but once I realized I had ~20 queries queued with only 1 open terminal process, it was pretty clear something is off in the library itself.
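
As an aside (not a fix): if you want to clean up the orphaned queries while debugging, one option is Snowflake's SYSTEM$CANCEL_QUERY system function. The query ID below is a placeholder you'd take from the activity panel or from QUERY_HISTORY:

-- Cancel one orphaned query by its query ID
SELECT SYSTEM$CANCEL_QUERY('<query_id>');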

notactuallytreyanastasio added the bug, help wanted, and For Discussion 💡 ❔ labels on Aug 2, 2022
notactuallytreyanastasio (Contributor, Author) commented:

Upon digging through some of the source here, my gut says the problem could be with send_heartbeat/1 and its SELECT 1. It seems like, if there is a process-signaling error toward the upstream server (Snowflake), it may be because some, or possibly all, of the time this SELECT 1 keeps it alive.

I don't know, though; I'm just poking around a little while I'm bored this evening.

dustinfarris (Member) commented:

check out ABORT_DETACHED_QUERY
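
For anyone else landing here: ABORT_DETACHED_QUERY is a Snowflake parameter (settable at session, user, or account level). When it is TRUE, Snowflake aborts in-progress queries whose client connection has been lost instead of letting them run to completion, which sounds like exactly the symptom above. A minimal sketch of turning it on (the scope is up to you, and the account-level form needs the right privileges):

-- Abort in-flight queries when the client connection drops
ALTER SESSION SET ABORT_DETACHED_QUERY = TRUE;

-- or account-wide
ALTER ACCOUNT SET ABORT_DETACHED_QUERY = TRUE;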
