-
Notifications
You must be signed in to change notification settings - Fork 338
DAOS-18355 chk: check leader waits all check engines before exited #17315
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Ticket title is 'recovery/check_start_corner_case.py:DMGCheckStartCornerCaseTest.test_start_back_to_back - dmg check start pool_2 failed after 40 sec' |
6f9ab6d to
268abfb
Compare
|
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17315/3/testReport/ |
268abfb to
d7ad13f
Compare
|
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/3/execution/node/1345/log |
|
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/3/execution/node/1355/log |
d7ad13f to
af1b78e
Compare
|
Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/5/execution/node/301/log |
|
Test stage Build on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/5/execution/node/317/log |
|
Test stage Build on EL 9.6 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/5/execution/node/309/log |
|
Test stage Build on Leap 15.5 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/5/execution/node/405/log |
|
Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/6/execution/node/217/log |
|
Test stage Build on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/6/execution/node/233/log |
|
Test stage Build on EL 9.6 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/6/execution/node/225/log |
4953689 to
bc0ea4d
Compare
|
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17315/8/testReport/ |
bc0ea4d to
59b3671
Compare
|
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/10/execution/node/1176/log |
|
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/10/execution/node/1156/log |
edc05cf to
a17fc07
Compare
|
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/12/execution/node/1317/log |
failed for DAOS-18388, not related with the patch, to be retested. |
a17fc07 to
fc52e02
Compare
|
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/14/execution/node/470/log |
|
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17315/14/execution/node/429/log |
In old implementation, when the PS leader notifies the check leader that related pool has been checked, the check leader will mark such pool as 'done'. If all required pools have been marked as 'done', then the check leader will exit. But at that time, the check engine on related PS leader may not complete yet. There are something to be processed (such as restart pool server) after the checking the pool. The check engine will notify the check leader via CHK IV when exit. But the check leader does not wait such notification. Under such case, if someone tries to trigger new check instance, it will create new IV namespace. That will cause some check engines and the check leader to use different IV namespace, as to the CHK IV logic cannot recognize the leadership correctly. The patch adjust the leader exit logic: the leader scheduler needs to wait all check engines' notification before exit. Test-tag: recovery Signed-off-by: Fan Yong <[email protected]>
fc52e02 to
0bafed8
Compare
|
All required CI tests passed. |
wangshilong
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not familiar with all logic, but this looks reasonable for me.
kjacque
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall makes sense to me. Just one comment.
| if (ins->ci_orphan_done && *done > 0 && !d_list_empty(&ins->ci_rank_list)) { | ||
| D_ASSERT(ins->ci_is_leader); | ||
|
|
||
| phase = CHK_INVAL_PHASE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of overloading CHK_INVAL_PHASE, is it possible to add a new phase to represent this state? CHK_LEADER_WAIT_PHASE or something like that. Seems like we would never want to be in an invalid phase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is not impossible, but unnecessary. Because CHK_INVAL_PHASE in CR logic means any other phase that
is not regular scan phase. Related CR users (not only for this case) can handle CHK_INVAL_PHASE properly. If we introduce new phase, then means the state machine is extended, as to the logic will be more complex and the patch to be bigger.
So I would suggest to use CHK_INVAL_PHASE.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If @gnailzenh thinks it's fine for now, I'm OK with that. I am mostly concerned with ensuring the code is easy to maintain in the future. Is there any other way to get to CHK_INVAL_PHASE? Maybe that phase could remain the same but have a different name, like CHK_WAIT_PHASE or CHK_NOOP_PHASE, if those names would make sense with its other uses. Then it would be clear that it's a valid phase where we expect to do nothing until we're told to move on.
gnailzenh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with Kris defining a new status could make it easier to understand, but not a big deal for now.
|
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17315/17/testReport/ |
In old implementation, when the PS leader notifies the check leader that related pool has been checked, the check leader will mark such pool as 'done'. If all required pools have been marked as 'done', then the check leader will exit. But at that time, the check engine on related PS leader may not complete yet. There are something to be processed (such as restart pool server) after the checking the pool. The check engine will notify the check leader via CHK IV when exit. But the check leader does not wait such notification. Under such case, if someone tries to trigger new check instance, it will create new IV namespace. That will cause some check engines and the check leader to use different IV namespace, as to the CHK IV logic cannot recognize the leadership correctly.
The patch adjust the leader exit logic: the leader scheduler needs to wait all check engines' notification before exit.
Test-tag: recovery
Steps for the author:
After all prior steps are complete: