DAOS-18355 chk: check leader waits all check engines before exited #17315

Nasf-Fan · 2025-12-25T15:46:35Z

In old implementation, when the PS leader notifies the check leader that related pool has been checked, the check leader will mark such pool as 'done'. If all required pools have been marked as 'done', then the check leader will exit. But at that time, the check engine on related PS leader may not complete yet. There are something to be processed (such as restart pool server) after the checking the pool. The check engine will notify the check leader via CHK IV when exit. But the check leader does not wait such notification. Under such case, if someone tries to trigger new check instance, it will create new IV namespace. That will cause some check engines and the check leader to use different IV namespace, as to the CHK IV logic cannot recognize the leadership correctly.

The patch adjust the leader exit logic: the leader scheduler needs to wait all check engines' notification before exit.

Test-tag: recovery

Steps for the author:

Commit message follows the guidelines.
Appropriate Features or Test-tag pragmas were used.
Appropriate Functional Test Stages were run.
At least two positive code reviews including at least one code owner from each category referenced in the PR.
Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

Gatekeeper requested (daos-gatekeeper added as a reviewer).

github-actions · 2025-12-25T15:46:52Z

Ticket title is 'recovery/check_start_corner_case.py:DMGCheckStartCornerCaseTest.test_start_back_to_back - dmg check start pool_2 failed after 40 sec'
Status is 'In Progress'
Labels: 'ci_master_weekly,weekly_test'
https://daosio.atlassian.net/browse/DAOS-18355

daosbuild3 · 2025-12-25T17:47:40Z

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17315/3/testReport/

daosbuild3 · 2025-12-26T01:18:19Z

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/3/execution/node/1345/log

daosbuild3 · 2025-12-26T01:18:45Z

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/3/execution/node/1355/log

daosbuild3 · 2025-12-26T08:04:17Z

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/5/execution/node/301/log

daosbuild3 · 2025-12-26T08:04:44Z

Test stage Build on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/5/execution/node/317/log

daosbuild3 · 2025-12-26T08:04:46Z

Test stage Build on EL 9.6 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/5/execution/node/309/log

daosbuild3 · 2025-12-26T08:12:26Z

Test stage Build on Leap 15.5 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/5/execution/node/405/log

daosbuild3 · 2025-12-27T05:21:08Z

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/6/execution/node/217/log

daosbuild3 · 2025-12-27T05:21:34Z

Test stage Build on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/6/execution/node/233/log

daosbuild3 · 2025-12-27T05:21:36Z

Test stage Build on EL 9.6 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/6/execution/node/225/log

daosbuild3 · 2025-12-28T11:18:39Z

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17315/8/testReport/

daosbuild3 · 2025-12-29T11:52:39Z

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/10/execution/node/1176/log

daosbuild3 · 2025-12-29T13:56:03Z

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/10/execution/node/1156/log

daosbuild3 · 2025-12-31T22:41:45Z

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/12/execution/node/1317/log

Nasf-Fan · 2026-01-01T03:00:29Z

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/12/execution/node/1317/log

failed for DAOS-18388, not related with the patch, to be retested.

daosbuild3 · 2026-01-04T05:46:12Z

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/14/execution/node/470/log

daosbuild3 · 2026-01-04T06:24:54Z

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/14/execution/node/429/log

In old implementation, when the PS leader notifies the check leader that related pool has been checked, the check leader will mark such pool as 'done'. If all required pools have been marked as 'done', then the check leader will exit. But at that time, the check engine on related PS leader may not complete yet. There are something to be processed (such as restart pool server) after the checking the pool. The check engine will notify the check leader via CHK IV when exit. But the check leader does not wait such notification. Under such case, if someone tries to trigger new check instance, it will create new IV namespace. That will cause some check engines and the check leader to use different IV namespace, as to the CHK IV logic cannot recognize the leadership correctly. The patch adjust the leader exit logic: the leader scheduler needs to wait all check engines' notification before exit. Test-tag: recovery Signed-off-by: Fan Yong <[email protected]>

Nasf-Fan · 2026-01-05T06:24:06Z

All required CI tests passed.

wangshilong

I am not familiar with all logic, but this looks reasonable for me.

kjacque

Overall makes sense to me. Just one comment.

kjacque · 2026-01-05T22:09:13Z

src/chk/chk_internal.h

+	if (ins->ci_orphan_done && *done > 0 && !d_list_empty(&ins->ci_rank_list)) {
+		D_ASSERT(ins->ci_is_leader);
+
+		phase = CHK_INVAL_PHASE;


Instead of overloading CHK_INVAL_PHASE, is it possible to add a new phase to represent this state? CHK_LEADER_WAIT_PHASE or something like that. Seems like we would never want to be in an invalid phase.

That is not impossible, but unnecessary. Because CHK_INVAL_PHASE in CR logic means any other phase that
is not regular scan phase. Related CR users (not only for this case) can handle CHK_INVAL_PHASE properly. If we introduce new phase, then means the state machine is extended, as to the logic will be more complex and the patch to be bigger.
So I would suggest to use CHK_INVAL_PHASE.

If @gnailzenh thinks it's fine for now, I'm OK with that. I am mostly concerned with ensuring the code is easy to maintain in the future. Is there any other way to get to CHK_INVAL_PHASE? Maybe that phase could remain the same but have a different name, like CHK_WAIT_PHASE or CHK_NOOP_PHASE, if those names would make sense with its other uses. Then it would be clear that it's a valid phase where we expect to do nothing until we're told to move on.

gnailzenh

I agree with Kris defining a new status could make it easier to understand, but not a big deal for now.

daosbuild3 · 2026-01-07T18:34:14Z

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17315/17/testReport/

Nasf-Fan force-pushed the Nasf-Fan/DAOS-18355 branch 2 times, most recently from 6f9ab6d to 268abfb Compare December 25, 2025 16:01

Nasf-Fan force-pushed the Nasf-Fan/DAOS-18355 branch from 268abfb to d7ad13f Compare December 26, 2025 01:15

Nasf-Fan force-pushed the Nasf-Fan/DAOS-18355 branch from d7ad13f to af1b78e Compare December 26, 2025 07:59

Nasf-Fan force-pushed the Nasf-Fan/DAOS-18355 branch 2 times, most recently from 4953689 to bc0ea4d Compare December 28, 2025 09:36

Nasf-Fan force-pushed the Nasf-Fan/DAOS-18355 branch from bc0ea4d to 59b3671 Compare December 28, 2025 14:14

Nasf-Fan force-pushed the Nasf-Fan/DAOS-18355 branch 2 times, most recently from edc05cf to a17fc07 Compare December 31, 2025 07:37

Nasf-Fan force-pushed the Nasf-Fan/DAOS-18355 branch from a17fc07 to fc52e02 Compare January 1, 2026 03:13

Nasf-Fan force-pushed the Nasf-Fan/DAOS-18355 branch from fc52e02 to 0bafed8 Compare January 4, 2026 09:47

Nasf-Fan marked this pull request as ready for review January 5, 2026 06:20

Nasf-Fan requested review from gnailzenh, kjacque and wangshilong January 5, 2026 06:24

wangshilong approved these changes Jan 5, 2026

View reviewed changes

kjacque reviewed Jan 5, 2026

View reviewed changes

Nasf-Fan requested a review from kjacque January 6, 2026 03:25

gnailzenh approved these changes Jan 6, 2026

View reviewed changes

kjacque approved these changes Jan 7, 2026

View reviewed changes

Nasf-Fan requested a review from a team January 7, 2026 03:30

daltonbohning added the forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed. label Jan 7, 2026

daltonbohning merged commit cd97fbb into master Jan 7, 2026
40 checks passed

daltonbohning deleted the Nasf-Fan/DAOS-18355 branch January 7, 2026 17:07

DAOS-18355 chk: check leader waits all check engines before exited #17315

DAOS-18355 chk: check leader waits all check engines before exited #17315

Conversation

Nasf-Fan commented Dec 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Steps for the author:

After all prior steps are complete:

Uh oh!

github-actions bot commented Dec 25, 2025

Uh oh!

daosbuild3 commented Dec 25, 2025

Uh oh!

daosbuild3 commented Dec 26, 2025

Uh oh!

daosbuild3 commented Dec 26, 2025

Uh oh!

daosbuild3 commented Dec 26, 2025

Uh oh!

daosbuild3 commented Dec 26, 2025

Uh oh!

daosbuild3 commented Dec 26, 2025

Uh oh!

daosbuild3 commented Dec 26, 2025

Uh oh!

daosbuild3 commented Dec 27, 2025

Uh oh!

daosbuild3 commented Dec 27, 2025

Uh oh!

daosbuild3 commented Dec 27, 2025

Uh oh!

daosbuild3 commented Dec 28, 2025

Uh oh!

daosbuild3 commented Dec 29, 2025

Uh oh!

daosbuild3 commented Dec 29, 2025

Uh oh!

daosbuild3 commented Dec 31, 2025

Uh oh!

Nasf-Fan commented Jan 1, 2026

Uh oh!

daosbuild3 commented Jan 4, 2026

Uh oh!

daosbuild3 commented Jan 4, 2026

Uh oh!

Nasf-Fan commented Jan 5, 2026

Uh oh!

wangshilong left a comment

Choose a reason for hiding this comment

Uh oh!

kjacque left a comment

Choose a reason for hiding this comment

Uh oh!

kjacque Jan 5, 2026

Choose a reason for hiding this comment

Uh oh!

Nasf-Fan Jan 6, 2026

Choose a reason for hiding this comment

Uh oh!

kjacque Jan 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gnailzenh left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

daosbuild3 commented Jan 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

7 participants

Nasf-Fan commented Dec 25, 2025 •

edited

Loading

kjacque Jan 7, 2026 •

edited

Loading