
Conversation

@itaigilo (Contributor) commented Dec 22, 2025

Closes #9872.

Change Description

To support and monitor long-running async tasks, this adds a heartbeat goroutine that updates a task's updated_at while the task is still running.
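
A minimal sketch of the pattern, using illustrative names only (not the actual lakeFS identifiers): a ticker-driven goroutine touches the task record on a fixed interval until the work finishes and its context is cancelled.

package heartbeat

import (
	"context"
	"time"
)

// runWithHeartbeat is an illustrative sketch, not the PR's code: it starts a
// background goroutine that calls touch (e.g. "bump the task's updated_at in
// the KV store") every interval, runs the task, and stops the heartbeat once
// the task returns.
func runWithHeartbeat(ctx context.Context, interval time.Duration, touch, task func(context.Context) error) error {
	heartbeatCtx, cancel := context.WithCancel(ctx)
	defer cancel() // stop the heartbeat when the task is done

	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-heartbeatCtx.Done():
				return
			case <-ticker.C:
				_ = touch(heartbeatCtx) // best-effort: a missed heartbeat is not fatal
			}
		}
	}()

	return task(ctx)
}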

Testing Details

Tested locally (by manually increasing the commit time and validating that the heartbeat indeed updates the updated_at and stops when the task stops).

Added unit tests.

In addition, once the feature's Esti tests are added, they will also validate that this doesn't cause a regression.

@itaigilo added the include-changelog and minor-change labels on Dec 22, 2025
@github-actions bot added the area/cataloger label on Dec 22, 2025
@itaigilo removed the minor-change label on Dec 22, 2025
@github-actions bot added the area/testing label on Dec 22, 2025
@itaigilo marked this pull request as ready for review on December 22, 2025 17:16
@itaigilo requested review from a team, Annaseli and nopcoder on December 22, 2025 17:16
@N-o-Z (Member) commented Dec 22, 2025

I don't think this is the way we should go with this. There's no need to reinvent the wheel.
We already have a solid logic with Import.

  1. Server should not control task timeout - only client.
  2. We should implement a cancel operation.
  3. Client should call cancel when it decides to abort the operation (whether due to timeout or otherwise)

@itaigilo (Contributor Author)

I don't think this is the way we should go with this. There's no need to reinvent the wheel. We already have a solid logic with Import.

  1. Server should not control task timeout - only client.
  2. We should implement a cancel operation.
  3. Client should call cancel when it decides to abort the operation (whether due to timeout or otherwise)

@N-o-Z the requirements (and pain points) are described in the Enterprise PR - according to these, I don't think there's a current requirement for the client to abort the operation, hence it should be controlled by the server.

Having said that, can you please point out the part of the import code that implements something similar?

(and maybe @nopcoder has something to contribute to this discussion.)

@N-o-Z (Member) commented Dec 22, 2025

We already had this discussion and the PRD needs to be updated.
You can look at the Import endpoints in swagger

@itaigilo (Contributor Author)

We already had this discussion and the PRD needs to be updated. You can look at the Import endpoints in swagger

@N-o-Z Who took part in the (apparently undocumented) discussion you've mentioned, and can shed light on the details?

@N-o-Z (Member) commented Dec 22, 2025

We already had this discussion and the PRD needs to be updated. You can look at the Import endpoints in swagger

@N-o-Z Who took part in the (apparently undocumented) discussion you've mentioned, and can shed light on the details?

@nopcoder, @Annaseli, and I took part in the undocumented discussion. You can discuss this with @nopcoder as the product owner.

@nopcoder (Contributor)

  • Server should not control task timeout - only client.
  • We should implement a cancel operation.
  • Client should call cancel when it decides to abort the operation (whether due to timeout or otherwise)

The heartbeat mechanism is in place because, at this point, we do not provide a way to report progress or cancellation as part of commit/merge.

The async client has a hard timeout limit configured, but if the server reports status updates and that lets us reduce the timeout the client is configured with, it is better than the current state.
I agree that once each async operation works like import, this mechanism can be removed.

@N-o-Z (Member) commented Dec 22, 2025

  • Server should not control task timeout - only client.
  • We should implement a cancel operation.
  • Client should call cancel when it decides to abort the operation (whether due to timeout or otherwise)

The heartbeat mechanism is in place because, at this point, we do not provide a way to report progress or cancellation as part of commit/merge.

The async client has a hard timeout limit configured, but if the server reports status updates and that lets us reduce the timeout the client is configured with, it is better than the current state. I agree that once each async operation works like import, this mechanism can be removed.

So why replace one temporary mechanism with another? Implementing the cancel mechanism is relatively quick and will take the same effort as introducing a new temporary solution

@nopcoder (Contributor)

  • Server should not control task timeout - only client.
  • We should implement a cancel operation.
  • Client should call cancel when it decides to abort the operation (whether due to timeout or otherwise)

The heartbeat mechanism is in place because, at this point, we do not provide a way to report progress or cancellation as part of commit/merge.
The async client has a hard timeout limit configured, but if the server reports status updates and that lets us reduce the timeout the client is configured with, it is better than the current state. I agree that once each async operation works like import, this mechanism can be removed.

So why replace one temporary mechanism with another? Implementing the cancel mechanism is relatively quick and will take the same effort as introducing a new temporary solution

I don't think that progress and cancellation in the context of commit and merge is a small task.

// start heartbeat: a background goroutine to update the task status in the kv store every 5 seconds, until the task is done
cancelCtx, cancel := context.WithCancel(ctx)
currTaskStatus := proto.Clone(taskStatus) // deep copy of the task status, to avoid race conditions
go func() {
Contributor

Does this goroutine run in parallel with the goroutines that c.workPool.Submit starts? If so, maybe instead of writing this as a separate goroutine we can send it as an additional task in the steps that c.workPool.Submit gets?

Contributor Author

It's here because it needs the same context (it uses cancelCtx), so as far as I can tell, it can't be moved from here.

Anyway, this seems like a known Go pattern.

Contributor

I didn’t mean to move it elsewhere, but rather to include it as one of the steps that c.workPool.Submit runs. I was thinking that instead of launching this work in a separate go func(), we could add it as an additional task in the steps list.

The idea would be to extract the logic currently inside that go func() into a helper function, and then add it to steps before calling c.workPool.Submit, which already iterates over all the steps.

If you think it won't be a good idea to do this, I think we should at least move the go func() { ... } into a helper function, because RunBackgroundTaskSteps has become quite large.
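
As a rough illustration of that extraction (the helper name and the updateStatus callback below are hypothetical, not the PR's actual code; it uses only "context" and "time" from the standard library), the body of the go func() could become a named helper that RunBackgroundTaskSteps launches with cancelCtx:

// heartbeatLoop is a hypothetical helper holding the body of the go func():
// it refreshes the task status every interval until ctx is cancelled.
func heartbeatLoop(ctx context.Context, interval time.Duration, updateStatus func(context.Context) error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return // RunBackgroundTaskSteps cancels this context when the task finishes
		case <-ticker.C:
			_ = updateStatus(ctx) // best-effort heartbeat update
		}
	}
}

The call site would then shrink to go heartbeatLoop(cancelCtx, TaskHeartbeatInterval, ...), keeping the shared cancelCtx while moving most of the logic out of the already-large function.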

_, err = GetTaskStatus(ctx, kvStore, repository, taskID, &status)
require.NoError(t, err)
require.True(t, status.Task.Done)
require.NotEmpty(t, status.Task.ErrorMsg)
Contributor

If we use the actual ErrorToStatusCodeAndMsg (api.ErrorToStatusAndMsg) in the catalog initialization, we can check that the status_code updates correctly here as well.

I think it would be good to do a similar test but with a longer task that eventually returns an error, to check how the heartbeat mechanism deals with it - that it stopped updating after that.

@itaigilo (Contributor Author) left a comment

Thanks @Annaseli for your review,
Comments were addressed / fixed,
PTAL again.

// RunBackgroundTaskSteps updates the provided task status after filling the 'Task' field, and updates it for each step provided.
// The task status is updated after each step, and the task is marked as completed once the last step finishes.
// An initial update is made if the task is done before running the steps.
func (c *Catalog) RunBackgroundTaskSteps(repository *graveler.RepositoryRecord, taskID string, steps []TaskStep, taskStatus protoreflect.ProtoMessage) error {
Contributor Author

This is true,
But I prefer to keep this PR contained.
I can create an Issue and handle this right afterwards.


@Annaseli (Contributor) left a comment

Thanks! I have non-blocking comments.


// Verify timestamp was updated during execution
require.True(t, completionTime.After(timestampDuringExecution) || completionTime.Equal(timestampDuringExecution),
"completion timestamp (%v) should be after or equal to timestamp during execution (%v)",
Contributor

If this one is the case: || completionTime.Equal(timestampDuringExecution), we didn't actually check that the timestamp was updated during execution, right? Because in this case completionTime just equals timestampDuringExecution, which equals the first Task.UpdatedAt, since TaskHeartbeatInterval is 5 seconds.
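
A sketch of a stricter check, assuming the test also captures the task's initial updated_at before the task starts (initialUpdatedAt below is an illustrative name) and the task runs longer than TaskHeartbeatInterval:

// Require that the heartbeat bumped the timestamp while the task was running,
// instead of allowing equality with the completion time.
require.True(t, timestampDuringExecution.After(initialUpdatedAt),
	"timestamp during execution (%v) should be strictly after the initial updated_at (%v)",
	timestampDuringExecution, initialUpdatedAt)
require.False(t, completionTime.Before(timestampDuringExecution),
	"completion timestamp (%v) should not be before the timestamp during execution (%v)",
	completionTime, timestampDuringExecution)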

@itaigilo merged commit fc134f6 into master on Jan 5, 2026
42 checks passed
@itaigilo deleted the feature/add-heartbeat-for-catalog-async-tasks branch on January 5, 2026 10:35