Prevent on_cancellation_job & on_completion_job deserialization failure blocking cleanup #24

will89 · 2021-11-16T15:26:50Z

This addresses #22.
This skips enqueueing the on_cancellation_job or on_completion_job if deserializing it raises an error. It also adds an optional proc, error_reporter, which is called with one argument (the raised error) when the job group encounters a deserialization error.
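
As a rough illustration of the behavior described above, here is a minimal sketch in plain Ruby. `SketchJobGroup`, its `complete` method, and the return symbols are hypothetical stand-ins, not the gem's actual classes or API; only the idea of an optional `error_reporter` proc that receives the raised error comes from this PR.

```ruby
# Hypothetical stand-in for the job group; not the gem's actual class.
class SketchJobGroup
  attr_reader :enqueued

  # error_reporter: optional proc taking the raised error as its only argument.
  # The block plays the role of deserializing the stored job.
  def initialize(error_reporter: nil, &job_loader)
    @error_reporter = error_reporter
    @job_loader = job_loader
    @enqueued = []
  end

  def complete
    begin
      job = @job_loader.call
    rescue ArgumentError => e
      # Report the failure if a reporter was configured, then skip enqueueing.
      @error_reporter&.call(e)
      return :skipped
    end
    @enqueued << job
    :enqueued
  end
end

reported = []
group = SketchJobGroup.new(error_reporter: ->(error) { reported << error }) do
  raise ArgumentError, 'undefined class/module MissingJob'
end
result = group.complete
```

With no reporter configured, the safe-navigation call is a no-op and the job is still skipped.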

prime: @jturkel

jturkel · 2021-11-16T15:59:09Z

lib/delayed/job_groups/job_group.rb

+        begin
+          job = on_cancellation_job
+          job_options = on_cancellation_job_options
+        rescue StandardError => e


Should we only catch Delayed::DeserializationError so we'll crash and retry on other types of errors?

In my scenario, I got an ArgumentError. Checking out https://github.com/collectiveidea/delayed_job/blob/master/lib/delayed/backend/base.rb#L73, it looks like quite a long list of errors can be raised by YAML.load_dj(handler). I'm happy to copy all of those errors in here if that's preferred.

Yuck. I was hoping there was just a single exception to catch. It's not ideal, but I think copying that list of errors will avoid mistakenly rescuing unrelated classes of errors.
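
A minimal sketch of what rescuing an explicit list could look like, in plain Ruby. The contents of `DESERIALIZATION_ERRORS` are an assumption based on the linked delayed_job source and should be checked against it; `load_or_skip` is a hypothetical helper name. Note that LoadError and SyntaxError descend from ScriptError rather than StandardError, so a bare `rescue StandardError` would not have caught them anyway.

```ruby
require 'yaml' # makes Psych::SyntaxError available

# Assumed list; verify against delayed_job's base.rb before relying on it.
DESERIALIZATION_ERRORS = [
  TypeError, LoadError, NameError, ArgumentError, SyntaxError, Psych::SyntaxError
].freeze

# Hypothetical helper: returns the block's value, or nil when the block
# raises one of the listed errors. Anything else propagates to the caller.
def load_or_skip
  yield
rescue *DESERIALIZATION_ERRORS => e
  warn "Skipping job: #{e.class}: #{e.message}"
  nil
end

skipped = load_or_skip { raise ArgumentError, 'undefined class/module MissingJob' }
loaded = load_or_skip { :job }
```

Errors outside the list, such as a plain RuntimeError from a database hiccup, still crash the worker and get retried, which is the behavior asked for above.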

jturkel · 2021-11-16T16:01:07Z

lib/delayed/job_groups/job_group.rb

+        begin
+          job = on_completion_job
+          job_options = on_completion_job_options
+        rescue StandardError => e


Same comment about only catching Delayed::DeserializationError.

jturkel · 2021-11-16T16:06:26Z

lib/delayed/job_groups/job_group.rb

+        rescue StandardError => e
+          Delayed::Worker.logger.info('Failed to deserialize the on_completion_job or on_completion_job_options for ' \
+                                      "job_group_id=#{id}. Skipping on_completion_job to clean up job group.")
+          error_reporter.call(e) if error_reporter


Swallowing these errors when a job group is trying to complete seems dangerous since the job group really hasn't completed. Do you think we need to introduce the notion of a job group being in a failed state, so we don't crash the worker by continually retrying but still capture the fact that the job group didn't complete? We might need something similar for cancels too, since the job group really hasn't been completely canceled.

I suppose an alternative here is that the job group could mark itself as blocked instead of destroying itself.

Wouldn't we still need to manage some state to indicate why a job group was blocked, e.g. so you could query the DB for the list of job groups that are blocked due to failure and unblock them after a fix has been deployed?

It would certainly be more convenient if we did that. I think we could get away with using a single failed_at timestamp on the on_cancellation_job or the on_completion_job that would be set if deserialization failed. Blocking the job group still makes sense, right?
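
To make the proposal concrete, here is a minimal in-memory sketch of "block and timestamp instead of destroy". `SketchGroup` and its `complete` method are hypothetical; only the `blocked` flag and `failed_at` timestamp come from the discussion above, and a real implementation would persist them on the job group record.

```ruby
# Hypothetical in-memory stand-in for a job group row.
SketchGroup = Struct.new(:id, :blocked, :failed_at, :destroyed, keyword_init: true) do
  # The block plays the role of deserializing and enqueueing the completion job.
  def complete(now: Time.now)
    yield
    self.destroyed = true
  rescue ArgumentError
    # Keep the group around, blocked, and record when it failed.
    self.blocked = true
    self.failed_at = now
  end
end

groups = [
  SketchGroup.new(id: 1, blocked: false, destroyed: false),
  SketchGroup.new(id: 2, blocked: false, destroyed: false)
]
groups[0].complete { :enqueued }
groups[1].complete { raise ArgumentError, 'undefined class/module MissingJob' }

# Stand-in for querying the DB for groups to unblock after a fix is deployed.
failed_ids = groups.reject { |group| group.failed_at.nil? }.map(&:id)
```

Because the failed group is never destroyed, nothing is lost: once the missing class is deployed, the groups with a failed_at value can be found, unblocked, and completed.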

jturkel · 2021-11-16T16:08:33Z

spec/delayed/job_groups/job_group_spec.rb

+    context "on_completion_job refers to missing class" do
+      # The on_completion_job needs the class to be defined this way in order to serialize it
+      # rubocop:disable RSpec/LeakyConstantDeclaration,Style/ClassAndModuleChildren,Lint/ConstantDefinitionInBlock
+      module Delayed::JobGroups::JobGroupTestHelper


Should we define this constant in a before so we're sure it will be present when the example starts running? The current approach won't work if we ever add multiple examples to this example group.
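
The lifecycle being asked for (define the constant, serialize, remove the constant, then observe the failure) can be sketched in plain Ruby without RSpec. `SketchHelper`, `with_temporary_class`, and `deserialize` are hypothetical names for illustration; in the spec itself the definition and removal would live in before and after hooks.

```ruby
require 'yaml'

# Hypothetical namespace, for illustration only.
module SketchHelper; end

# Defines a constant for the duration of the block and always removes it,
# similar in spirit to a before hook paired with an after hook.
def with_temporary_class(name)
  klass = SketchHelper.const_set(name, Class.new)
  yield klass
ensure
  SketchHelper.send(:remove_const, name) if SketchHelper.const_defined?(name, false)
end

def deserialize(yaml)
  # Psych 4 made YAML.load safe by default; unsafe_load restores object loading.
  YAML.respond_to?(:unsafe_load) ? YAML.unsafe_load(yaml) : YAML.load(yaml)
end

# Serialize while the class exists.
serialized = with_temporary_class(:OnCompletionJob) { |klass| YAML.dump(klass.new) }

# The class is gone now, so loading the payload fails (typically ArgumentError).
error = begin
  deserialize(serialized)
  nil
rescue ArgumentError, NameError => e
  e
end
```

Scoping the constant this way means each example starts from a known state, so adding more examples to the group does not depend on definition order.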


will89 added 5 commits November 16, 2021 09:23

Initial implementation

7843212

Add configurable error_reporter

844e506

Fix some rubocop failures

630af06

Fix rubocop

593cdfa

Clean up

a4dd750

will89 requested a review from jturkel November 16, 2021 15:26

will89 assigned jturkel Nov 16, 2021

Add changelog & version bump

fd11003

jturkel reviewed Nov 16, 2021

will89 added 3 commits November 17, 2021 15:22

PR Feedback

08bd18d

* Explicit error classes rescued
* Test begin block
* Add failed_at column to delayed_job_groups

Clean up

26f58af

Fix rubocop

8903a2b
