Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent on_cancellation_job & on_completion_job deserialization failure blocking cleanup #24

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

will89
Copy link
Contributor

@will89 will89 commented Nov 16, 2021

This addresses #22.
This skips enqueueing the on_cancellation_job or on_completion_job if there is an error deserializing it. Adds an optional proc, error_reporter, that accepts one argument (the raised error) when the job group encounters a deserialization error.

prime: @jturkel

begin
job = on_cancellation_job
job_options = on_cancellation_job_options
rescue StandardError => e
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we only catch Delayed::DeserializationErrors so we'll crash and retry for other types of errors?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my scenario, I got an ArgumentError. Checking out https://github.com/collectiveidea/delayed_job/blob/master/lib/delayed/backend/base.rb#L73, it looks like quite a long list of errors that can be generated by YAML.load_dj(handler). I'm happy to copy paste all of those errors in here if that's preferred.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yuck. I was hoping it was just a single exception to catch. It's not ideal but I think copying that list of errors will avoid mistakenly rescuing some classes of errors.

begin
job = on_completion_job
job_options = on_completion_job_options
rescue StandardError => e
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about only catching Delayed::DeserializationError

rescue StandardError => e
Delayed::Worker.logger.info('Failed to deserialize the on_completion_job or on_completion_job_options for ' \
"job_group_id=#{id}. Skipping on_completion_job to clean up job group.")
error_reporter.call(e) if error_reporter
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Swallowing these errors when a job group is trying to complete seems dangerous since the job group really hasn't completed. Do you think we need to introduce the notion of a job group being in a failed state so we don't crash the worker by continually retrying but still capture the fact that the job group didn't complete. We might need something similar for cancels too since the job group really hasn't been completely canceled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose an alternative here is that the job group could mark itself as blocked instead of destroying itself.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't we still need to manage some state to indicate why a job group was blocked e.g. so you could query the DB for the list of job groups that are blocked due to failure and unblock them after a fix has been deployed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would certainly be more convenient if we did that. I think we could get away with using a single failed_at timestamp on the on_cancellation_job or the on_completion_job that would be set if deserialization failed. Blocking the job group still makes sense right?

context "on_completion_job refers to missing class" do
# The on_completion_job needs the class to be defined this way in order to serialize it
# rubocop:disable RSpec/LeakyConstantDeclaration,Style/ClassAndModuleChildren,Lint/ConstantDefinitionInBlock
module Delayed::JobGroups::JobGroupTestHelper
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we define this constant in a before so we're sure it will be present when the example starts running? The current approach won't work if we ever added multiple examples to this example group.

* Explicit error classes rescued
* Test begin block
* Add failed_at column to delayed_job_groups
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants