Skip to content

Conversation

@Nasf-Fan
Copy link
Contributor

@Nasf-Fan Nasf-Fan commented Dec 25, 2025

In old implementation, when the PS leader notifies the check leader that related pool has been checked, the check leader will mark such pool as 'done'. If all required pools have been marked as 'done', then the check leader will exit. But at that time, the check engine on related PS leader may not complete yet. There are something to be processed (such as restart pool server) after the checking the pool. The check engine will notify the check leader via CHK IV when exit. But the check leader does not wait such notification. Under such case, if someone tries to trigger new check instance, it will create new IV namespace. That will cause some check engines and the check leader to use different IV namespace, as to the CHK IV logic cannot recognize the leadership correctly.

The patch adjust the leader exit logic: the leader scheduler needs to wait all check engines' notification before exit.

Test-tag: recovery

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@github-actions
Copy link

Ticket title is 'recovery/check_start_corner_case.py:DMGCheckStartCornerCaseTest.test_start_back_to_back - dmg check start pool_2 failed after 40 sec'
Status is 'In Progress'
Labels: 'ci_master_weekly,weekly_test'
https://daosio.atlassian.net/browse/DAOS-18355

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18355 branch 2 times, most recently from 6f9ab6d to 268abfb Compare December 25, 2025 16:01
@daosbuild3
Copy link
Collaborator

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18355 branch from 268abfb to d7ad13f Compare December 26, 2025 01:15
@daosbuild3
Copy link
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/3/execution/node/1345/log

@daosbuild3
Copy link
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/3/execution/node/1355/log

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18355 branch from d7ad13f to af1b78e Compare December 26, 2025 07:59
@daosbuild3
Copy link
Collaborator

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/5/execution/node/301/log

@daosbuild3
Copy link
Collaborator

@daosbuild3
Copy link
Collaborator

@daosbuild3
Copy link
Collaborator

@daosbuild3
Copy link
Collaborator

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/6/execution/node/217/log

@daosbuild3
Copy link
Collaborator

@daosbuild3
Copy link
Collaborator

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18355 branch 2 times, most recently from 4953689 to bc0ea4d Compare December 28, 2025 09:36
@daosbuild3
Copy link
Collaborator

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18355 branch from bc0ea4d to 59b3671 Compare December 28, 2025 14:14
@daosbuild3
Copy link
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/10/execution/node/1176/log

@daosbuild3
Copy link
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/10/execution/node/1156/log

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18355 branch 2 times, most recently from edc05cf to a17fc07 Compare December 31, 2025 07:37
@daosbuild3
Copy link
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/12/execution/node/1317/log

@Nasf-Fan
Copy link
Contributor Author

Nasf-Fan commented Jan 1, 2026

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/12/execution/node/1317/log

failed for DAOS-18388, not related with the patch, to be retested.

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18355 branch from a17fc07 to fc52e02 Compare January 1, 2026 03:13
@daosbuild3
Copy link
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/14/execution/node/470/log

@daosbuild3
Copy link
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/14/execution/node/429/log

In old implementation, when the PS leader notifies the check leader
that related pool has been checked, the check leader will mark such
pool as 'done'. If all required pools have been marked as 'done',
then the check leader will exit. But at that time, the check engine
on related PS leader may not complete yet. There are something to
be processed (such as restart pool server) after the checking the
pool. The check engine will notify the check leader via CHK IV when
exit. But the check leader does not wait such notification. Under
such case, if someone tries to trigger new check instance, it will
create new IV namespace. That will cause some check engines and
the check leader to use different IV namespace, as to the CHK IV
logic cannot recognize the leadership correctly.

The patch adjust the leader exit logic: the leader scheduler needs
to wait all check engines' notification before exit.

Test-tag: recovery

Signed-off-by: Fan Yong <[email protected]>
@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18355 branch from fc52e02 to 0bafed8 Compare January 4, 2026 09:47
@Nasf-Fan Nasf-Fan marked this pull request as ready for review January 5, 2026 06:20
@Nasf-Fan
Copy link
Contributor Author

Nasf-Fan commented Jan 5, 2026

All required CI tests passed.

Copy link
Contributor

@wangshilong wangshilong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not familiar with all logic, but this looks reasonable for me.

Copy link
Contributor

@kjacque kjacque left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall makes sense to me. Just one comment.

if (ins->ci_orphan_done && *done > 0 && !d_list_empty(&ins->ci_rank_list)) {
D_ASSERT(ins->ci_is_leader);

phase = CHK_INVAL_PHASE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of overloading CHK_INVAL_PHASE, is it possible to add a new phase to represent this state? CHK_LEADER_WAIT_PHASE or something like that. Seems like we would never want to be in an invalid phase.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is not impossible, but unnecessary. Because CHK_INVAL_PHASE in CR logic means any other phase that
is not regular scan phase. Related CR users (not only for this case) can handle CHK_INVAL_PHASE properly. If we introduce new phase, then means the state machine is extended, as to the logic will be more complex and the patch to be bigger.
So I would suggest to use CHK_INVAL_PHASE.

Copy link
Contributor

@kjacque kjacque Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If @gnailzenh thinks it's fine for now, I'm OK with that. I am mostly concerned with ensuring the code is easy to maintain in the future. Is there any other way to get to CHK_INVAL_PHASE? Maybe that phase could remain the same but have a different name, like CHK_WAIT_PHASE or CHK_NOOP_PHASE, if those names would make sense with its other uses. Then it would be clear that it's a valid phase where we expect to do nothing until we're told to move on.

@Nasf-Fan Nasf-Fan requested a review from kjacque January 6, 2026 03:25
Copy link
Contributor

@gnailzenh gnailzenh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Kris defining a new status could make it easier to understand, but not a big deal for now.

@Nasf-Fan Nasf-Fan requested a review from a team January 7, 2026 03:30
@daltonbohning daltonbohning added the forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed. label Jan 7, 2026
@daltonbohning daltonbohning merged commit cd97fbb into master Jan 7, 2026
40 checks passed
@daltonbohning daltonbohning deleted the Nasf-Fan/DAOS-18355 branch January 7, 2026 17:07
@daosbuild3
Copy link
Collaborator

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed.

Development

Successfully merging this pull request may close these issues.

7 participants