
Don't broadcast Checkable#next_check updates made just not to check twice #10093

Open

Al2Klimov wants to merge 2 commits into master from overdue-state-doesn-t-honor-set-time-periods-10082

Conversation

@Al2Klimov (Member) commented Jun 20, 2024:

The checker sorts Checkables by next_check while picking the next due one, so we (already) have to advance next_check while starting a check. But the second master doesn't need this info, as it's not responsible.
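
A minimal, self-contained sketch (not Icinga 2 code; the names are made up) of the scheduling constraint described above: the checker picks the entry with the smallest next_check, so next_check has to move forward before the check is launched, otherwise the same entry would be due again on the very next loop iteration.

#include <iostream>
#include <set>
#include <string>

// Simplified stand-in for a Checkable: the checker keeps these ordered by next_check.
struct Item {
    std::string name;
    double nextCheck;

    bool operator<(const Item& other) const {
        return nextCheck < other.nextCheck;
    }
};

int main()
{
    std::multiset<Item> queue {{"host-a", 10.0}, {"host-b", 12.0}};
    double now = 11.0, checkInterval = 60.0;

    // Pick the next due entry: the one with the smallest next_check.
    auto due = queue.begin();
    if (due->nextCheck <= now) {
        Item started = *due;
        queue.erase(due);

        // Advance next_check *before* the (slow) check actually runs; otherwise
        // the same entry would still be at the front on the next iteration.
        started.nextCheck = now + checkInterval;
        queue.insert(started);

        std::cout << "started check for " << started.name << '\n';
    }
}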

refs #10082

TODO

@Al2Klimov (Member Author) commented:

@yhabteab (Member) commented:

No, I'm not! I haven't even checked all of them in detail, but aren't they just like the other superfluous ones from Checkable::ExecuteCheck(), used to force the scheduler to re-index its idle checkables queue?

@Al2Klimov (Member Author) commented:

I don't think so.

if (recovery) {
    for (auto& child : children) {
        if (child->GetProblem() && child->GetEnableActiveChecks()) {
            auto nextCheck (now + Utility::Random() % 60);
            ObjectLock oLock (child);
            if (nextCheck < child->GetNextCheck()) {
                child->SetNextCheck(nextCheck);
            }
        }
    }
}

if (stateChange) {
    /* reschedule direct parents */
    for (const Checkable::Ptr& parent : GetParents()) {
        if (parent.get() == this)
            continue;
        if (!parent->GetEnableActiveChecks())
            continue;
        if (parent->GetNextCheck() >= now + parent->GetRetryInterval()) {
            ObjectLock olock(parent);
            parent->SetNextCheck(now);
        }
    }
}

Especially these two reschedule the next check of other checkables because something happened with the current one. The other HA node may be responsible for them, so of course this matters for the cluster. But IMO this doesn't matter for our backends. I think we can stop our anti-SetNextCheck witch-hunt here.

@yhabteab (Member) commented:

The other HA node may be responsible for them, so of course this matters for the cluster.

We literally trigger an OnNewCheckResult event just two lines below the highlighted code, which is synced and goes into the same code path on the other node.

Commit: Don't broadcast Checkable#next_check updates made just not to check twice

The checker sorts Checkables by next_check while picking the next due one,
so we (already) have to advance next_check while starting a check.
But the second master doesn't need this info, as it's not responsible.
@Al2Klimov (Member Author) commented:

In addition, CheckerComponent::NextCheckChangedHandler needs these two (not suppressed!) events too, so that these next_check updates are effective at all.
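
To illustrate why the handler still has to see the event locally, here is a simplified sketch (not the actual CheckerComponent; the container and names are assumptions): the idle queue is indexed by next_check, so a SetNextCheck() that nobody observes would leave the entry sitting at its old position.

#include <iostream>
#include <map>
#include <string>

// Toy scheduler whose idle queue is indexed by next_check.
class Scheduler {
public:
    void Insert(const std::string& name, double nextCheck) {
        m_IdleQueue.emplace(nextCheck, name);
    }

    // A local "next_check changed" event lets the scheduler drop the stale
    // index entry and re-insert the checkable at its new position.
    void NextCheckChangedHandler(const std::string& name, double oldNext, double newNext) {
        auto range = m_IdleQueue.equal_range(oldNext);
        for (auto it = range.first; it != range.second; ++it) {
            if (it->second == name) {
                m_IdleQueue.erase(it);
                break;
            }
        }
        m_IdleQueue.emplace(newNext, name);
    }

    void Dump() const {
        for (const auto& [next, name] : m_IdleQueue)
            std::cout << next << " -> " << name << '\n';
    }

private:
    std::multimap<double, std::string> m_IdleQueue;
};

int main()
{
    Scheduler s;
    s.Insert("host-a", 10.0);
    s.Insert("host-b", 12.0);

    // Without this event firing, "host-a" would still be indexed at 10.0
    // even though its next_check was pushed to 70.0.
    s.NextCheckChangedHandler("host-a", 10.0, 70.0);
    s.Dump();
}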

@yhabteab (Member) commented:

Honestly, I don't understand why you're turning down all the suggestions just to stick to the first idea that came to your mind. Let's be real, no one said this should work out right away with either suggestion. However, compared to adding yet another useless cluster event, this could be a much better solution. We already have enough problems with the countless/unpredictable RPC messages to deal with. So, why add yet another one when there's a better alternative?

I'm not saying these SetNextCheck calls are useless, but you should first minimise the places where this method is called. After that, we can discuss how the remaining 2-3 calls should be handled. If you are only concerned about these two calls, then two additional Icinga DB events are nothing to me if we can reduce the load on the cluster as a result.

@Al2Klimov (Member Author) commented:

Actually, I'm totally fine with both (#10082 (comment)): a new event or a flag in the existing one. Also, I just said (#10093 (comment)) that these events are needed locally. For the latter, we could add flag(s) to setters and event handlers, so that a handler can say: this event is not for broadcasting, so I won't send it as a cluster message. Ok?
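
For illustration, a rough sketch of what such a flag could look like (names like broadcast and NextCheckEvent are made up, not the PR's actual API): the setter tags the change, every listener still runs locally, and only the cluster listener decides, based on the flag, whether to relay a message.

#include <functional>
#include <iostream>
#include <vector>

// The setter passes this to all listeners; only the cluster listener cares
// about the broadcast flag.
struct NextCheckEvent {
    double nextCheck;
    bool broadcast; // false when next_check was advanced just to avoid checking twice
};

std::vector<std::function<void(const NextCheckEvent&)>> onNextCheckChanged;

void SetNextCheck(double nextCheck, bool broadcast = true)
{
    for (const auto& handler : onNextCheckChanged)
        handler({nextCheck, broadcast});
}

int main()
{
    // Checker listener: always re-indexes, regardless of the flag.
    onNextCheckChanged.push_back([](const NextCheckEvent& e) {
        std::cout << "checker: re-index at " << e.nextCheck << '\n';
    });

    // Cluster listener: relays the update only if it is meant to be broadcast.
    onNextCheckChanged.push_back([](const NextCheckEvent& e) {
        if (e.broadcast)
            std::cout << "cluster: send SetNextCheck message\n";
        else
            std::cout << "cluster: suppressed, local-only update\n";
    });

    SetNextCheck(1700000000.0, /*broadcast=*/false);
}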

@Al2Klimov changed the title from "WIP" to "Checkable#UpdateNextCheck(): allow to suppress next_check listeners" on Oct 1, 2024
@Al2Klimov force-pushed the overdue-state-doesn-t-honor-set-time-periods-10082 branch from 0a14846 to 6533a50 on October 1, 2024 08:24
@Al2Klimov changed the title from "Checkable#UpdateNextCheck(): allow to suppress next_check listeners" to "Don't broadcast Checkable#next_check updates made just not to check twice" on Oct 1, 2024
@Al2Klimov marked this pull request as ready for review on October 1, 2024 08:49
@julianbrost (Contributor) commented:

refs #10082

How does this relate to that issue now? If this PR was merged as-is, how would you address that bug?

Comment on lines -566 to +570
- /* This calls SetNextCheck() which updates the CheckerComponent's idle/pending
+ /* This calls SetNextCheck() for a later update of the CheckerComponent's idle/pending
   * queues and ensures that checks are not fired multiple times. ProcessCheckResult()
   * is called too late. See #6421.
   */
- UpdateNextCheck();
+ UpdateNextCheck(nullptr, true);
@julianbrost (Contributor) commented:

for a later update of the CheckerComponent's idle/pending

When will this later update happen?

ProcessCheckResult() is called too late.

In particular, is it earlier than this?

@Al2Klimov (Member Author) commented:

for a later update of the CheckerComponent's idle/pending

When will this later update happen?

  1. First, CheckerComponent::ExecuteCheckHelper calls checkable->ExecuteCheck (https://github.com/Icinga/icinga2/blob/v2.14.2/lib/checker/checkercomponent.cpp#L233)
  2. That calls UpdateNextCheck (https://github.com/Icinga/icinga2/blob/v2.14.2/lib/icinga/checkable-check.cpp#L563)
  3. Finally, CheckerComponent::ExecuteCheckHelper updates the index (https://github.com/Icinga/icinga2/blob/v2.14.2/lib/checker/checkercomponent.cpp#L266)

ProcessCheckResult() is called too late.

In particular, is it earlier than this?

Imagine you have a simple Python plugin doing some basic network I/O. I know the Python/C++ meme is a bit silly, but actually, compared to how quickly CheckerComponent returns to its loop, which gets the next item from this index, your Python plugin (and ProcessCheckResult) takes centuries.

@julianbrost (Contributor) commented:

Imagine you have a simple Python plugin doing some basic network I/O. I know the Python/C++ meme is a bit silly, but actually, compared to how quickly CheckerComponent returns to its loop, which gets the next item from this index, your Python plugin (and ProcessCheckResult) takes centuries.

I don't get what this is trying to say, but it sounds like you're describing a race condition. Are you trying to say it's not a problem because check plugins will be slow enough? But that sounds like the opposite of the comment: the slower the plugin, the later ProcessCheckResult() will be called. On the other hand, you can also get quickly failing checks, for example by specifying a non-existent path so that execution fails immediately. That shouldn't break Icinga 2 either.

@Al2Klimov (Member Author) commented:

No, no race condition. Checkable::ExecuteCheck first calls UpdateNextCheck, then GetCheckCommand()->Execute. The plugin can't fail earlier than UpdateNextCheck is called.

But here's what I was getting at: because plugins are in general rather slow, ProcessCheckResult() comes with a latency. The checker index, however, needs SetNextCheck now(!), which is why UpdateNextCheck is called. It's that simple, IIRC.
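
A compact sketch of that ordering (simplified, not the real Checkable::ExecuteCheck; LaunchPluginAsync is a made-up stand-in for GetCheckCommand()->Execute): next_check is advanced synchronously before the plugin starts, while ProcessCheckResult only arrives once the slow plugin has finished.

#include <chrono>
#include <functional>
#include <iostream>
#include <thread>

// Stand-in for launching the plugin asynchronously; the callback plays the
// role of ProcessCheckResult() and runs only after the plugin has finished.
void LaunchPluginAsync(const std::function<void()>& processCheckResult)
{
    std::thread([processCheckResult] {
        std::this_thread::sleep_for(std::chrono::milliseconds(100)); // "slow" plugin
        processCheckResult();
    }).detach();
}

void ExecuteCheck(double& nextCheck, double now, double interval)
{
    // UpdateNextCheck(): happens immediately, before the plugin is even started,
    // so the checker's index is up to date right away.
    nextCheck = now + interval;

    LaunchPluginAsync([] {
        std::cout << "ProcessCheckResult(): much later\n";
    });
}

int main()
{
    double nextCheck = 0;
    ExecuteCheck(nextCheck, 1000.0, 60.0);
    std::cout << "next_check already advanced to " << nextCheck << '\n';

    std::this_thread::sleep_for(std::chrono::milliseconds(200)); // wait for the "plugin"
}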

