@Nasf-Fan Nasf-Fan commented Dec 26, 2025

Currently, if a transaction fails for some reason, the cleanup logic
tries to evict the related vos objects from the cache to avoid leaving
stale information there. This works well on systems with PMEM.
Under md-on-ssd mode, however, it may cause trouble if the cleanup logic
evicts an object that was not created by the failed transaction:
one vos modification may hold the same object multiple times,
and the CPU may yield between these object holds, which creates
race windows for other concurrent operations against the same object.

This patch changes the logic: when a transaction creates new
object(s), it records the related oid(s). If the transaction fails
later on, cleanup evicts only these newly created object(s).
For newly created dkeys or lower-level components under existing objects,
the related object cache is not affected during transaction cleanup.
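
The bookkeeping described above can be sketched as follows. This is a minimal, self-contained illustration and not the code in this patch: `sketch_oid_t`, `struct sketch_cache`, `struct sketch_tx`, and all function names are hypothetical stand-ins, and the toy array cache only models "is this oid cached or not".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CACHED	8
#define MAX_CREATED	4

/* Hypothetical stand-in for an object ID; not the real DAOS oid type. */
typedef uint64_t sketch_oid_t;

/* Toy object cache: one slot per cached object. */
struct sketch_cache {
	sketch_oid_t	oids[MAX_CACHED];
	bool		valid[MAX_CACHED];
};

/* Toy transaction handle: remembers only the objects it created itself. */
struct sketch_tx {
	sketch_oid_t	created[MAX_CREATED];
	size_t		created_nr;
};

static bool
cache_contains(const struct sketch_cache *c, sketch_oid_t oid)
{
	for (size_t i = 0; i < MAX_CACHED; i++)
		if (c->valid[i] && c->oids[i] == oid)
			return true;
	return false;
}

static int
cache_insert(struct sketch_cache *c, sketch_oid_t oid)
{
	for (size_t i = 0; i < MAX_CACHED; i++) {
		if (!c->valid[i]) {
			c->oids[i] = oid;
			c->valid[i] = true;
			return 0;
		}
	}
	return -1; /* cache full */
}

static void
cache_evict(struct sketch_cache *c, sketch_oid_t oid)
{
	for (size_t i = 0; i < MAX_CACHED; i++)
		if (c->valid[i] && c->oids[i] == oid)
			c->valid[i] = false;
}

/* Hold an object. Only an object created by this call is recorded in tx. */
static int
tx_hold_object(struct sketch_tx *tx, struct sketch_cache *c, sketch_oid_t oid)
{
	if (cache_contains(c, oid))
		return 0; /* existing object: nothing to record */
	if (tx->created_nr == MAX_CREATED)
		return -1;
	if (cache_insert(c, oid) != 0)
		return -1;
	tx->created[tx->created_nr++] = oid;
	return 0;
}

/* On failure, evict only the objects this transaction created. */
static void
tx_cleanup_on_failure(struct sketch_tx *tx, struct sketch_cache *c)
{
	for (size_t i = 0; i < tx->created_nr; i++)
		cache_evict(c, tx->created[i]);
	tx->created_nr = 0;
}
```

In this sketch, an object that was already cached before the transaction touched it survives `tx_cleanup_on_failure()`, so concurrent operations holding it are not disturbed.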

In addition, under md-on-ssd mode the CPU may yield during backend
TX start, and the object held by the current modification may be marked
as evicted in that race window. So add logic to check whether the related
object has been evicted after the backend TX starts; if so, restart
the current transaction.
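
The check-and-restart flow can be sketched as below. Again, this is only an illustration under stated assumptions: `struct sketch_obj`, `sketch_hold()`, `sketch_modify()`, the callback type, and the `SKETCH_ERR_RESTART` code are all hypothetical, and the callback merely simulates a backend TX start that yields and lets another operation evict the object.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical return code meaning "caller should restart"; not a DAOS code. */
#define SKETCH_ERR_RESTART	(-2)

/* Toy cached object with an eviction flag. */
struct sketch_obj {
	bool	evicted;
};

/* Models an object hold that returns a fresh, non-evicted cache entry. */
static void
sketch_hold(struct sketch_obj *obj)
{
	obj->evicted = false;
}

/*
 * Hypothetical callback modelling backend TX start. It may "yield", during
 * which a concurrent operation can mark the held object as evicted.
 */
typedef int (*sketch_tx_begin_fn)(struct sketch_obj *obj, int attempt);

/*
 * Hold the object, start the backend TX, then re-check the eviction flag.
 * If the object was evicted while the TX was starting, retry from the hold.
 */
static int
sketch_modify(struct sketch_obj *obj, sketch_tx_begin_fn tx_begin,
	      int max_attempts, int *attempts_out)
{
	for (int attempt = 1; attempt <= max_attempts; attempt++) {
		int rc;

		sketch_hold(obj);
		rc = tx_begin(obj, attempt);
		*attempts_out = attempt;
		if (rc != 0)
			return rc;
		if (!obj->evicted)
			return 0; /* object still valid: safe to proceed */
		/* evicted during the yield: drop this attempt and restart */
	}
	return SKETCH_ERR_RESTART;
}

/* Simulates a concurrent eviction during the first TX start only. */
static int
tx_begin_evict_first(struct sketch_obj *obj, int attempt)
{
	if (attempt == 1)
		obj->evicted = true;
	return 0;
}
```

With `tx_begin_evict_first`, the first attempt observes the eviction and the second attempt succeeds; with a budget of a single attempt the sketch reports `SKETCH_ERR_RESTART` to the caller instead.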

Signed-off-by: Fan Yong [email protected]

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@github-actions

Ticket title is 'Enhance dtx_act_ent_cleanup() to only evict self-created object when transaction failure'
Status is 'In Progress'
Labels: 'scrubbed_2.8'
https://daosio.atlassian.net/browse/DAOS-18367

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18367 branch 5 times, most recently from 18f7850 to 9536e36 Compare December 27, 2025 05:16
@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18367 branch from 9536e36 to 065d404 Compare December 28, 2025 09:42
@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18367 branch from 065d404 to 7f5575c Compare December 28, 2025 14:32
@daosbuild3
Collaborator

Test stage Unit Test bdev on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17320/8/testReport/

@daosbuild3
Collaborator

Test stage Unit Test with memcheck on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17320/8/testReport/

@daosbuild3
Collaborator

Test stage Unit Test bdev with memcheck on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17320/8/testReport/

@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17320/8/testReport/

@daosbuild3
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17320/8/execution/node/1362/log

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18367 branch from 7f5575c to a7f09c4 Compare December 29, 2025 03:08
@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18367 branch from a7f09c4 to 657d99b Compare December 29, 2025 05:13
@daosbuild3
Collaborator

Test stage Unit Test with memcheck on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17320/10/testReport/

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-18367 branch from 657d99b to 248715a Compare December 29, 2025 10:16
@daosbuild3
Collaborator

Test stage Unit Test bdev with memcheck on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17320/11/testReport/

@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17320/11/execution/node/1364/log

@daosbuild3
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17320/11/execution/node/1405/log
