Skip to content

Automatically taint servers when receiving critical bmc events#692

Open
stefanhipfel wants to merge 50 commits intomainfrom
events_taints
Open

Automatically taint servers when receiving critical bmc events#692
stefanhipfel wants to merge 50 commits intomainfrom
events_taints

Conversation

@stefanhipfel
Copy link
Copy Markdown
Contributor

@stefanhipfel stefanhipfel commented Feb 19, 2026

Proposed Changes

Automatically taint servers when receiving critical bmc events

Summary by CodeRabbit

  • New Features

    • Event subscription management for BMCs (create/delete subscriptions for metrics and alerts).
    • Built-in event server with HTTP endpoints to receive metrics reports and alerts.
    • Prometheus-compatible collector exposing sensor readings and alert counts.
    • Critical-event handling to mark affected servers.
    • CLI/config options to enable and configure the event server.
  • Documentation

    • API reference and CRD updated with new subscription link fields in BMC status/spec.
  • Tests

    • Tests added/updated for subscription lifecycle, metrics collection, and HTTP handlers.

… documentation for SubscribeMetricsReport and SubscribeEvents functions.
@stefanhipfel stefanhipfel requested a review from a team as a code owner February 19, 2026 12:05
@github-actions github-actions Bot added size/XL api-change documentation Improvements or additions to documentation labels Feb 19, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Feb 19, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds end-to-end Redfish event subscription support: CRD/status fields, BMC interface methods and Redfish client subscription create/delete, mock server collection handling, an HTTP event server with Prometheus-backed collector and critical-event tainting, and controller wiring to manage subscriptions and status.

Changes

Cohort / File(s) Summary
API types & CRD
api/v1alpha1/bmc_types.go, config/crd/bases/metal.ironcore.dev_bmcs.yaml, docs/api-reference/api.md
Added metricsReportSubscriptionLink and eventsSubscriptionLink fields to BMC spec/status and updated API docs/CRD schema.
BMC interface & Redfish client
bmc/bmc.go, bmc/redfish.go
Added CreateEventSubscription and DeleteEventSubscription to BMC interface; implemented subscription payloads, POST/delete flows, Location header normalization, vendor-specific payload handling, and error handling.
Mock server
bmc/mock/server/server.go, bmc/mock/server/data/Managers/BMC/index.json
Introduced Collection/Member types and POST/DELETE collection-member handling; added Manufacturer field to mock Manager index JSON.
Controller integration & tests
internal/controller/bmc_controller.go, internal/controller/bmc_controller_test.go, internal/controller/suite_test.go
Wired EventURL into BMCReconciler; added subscribe/unsubscribe flows that patch status with subscription links and tests asserting new status fields and cleanup on deletion.
Event server package
internal/serverevents/server.go, internal/serverevents/metrics.go, internal/serverevents/subscription.go, internal/serverevents/taint_handler.go
New HTTP server for /serverevents endpoints, RedfishEventCollector with Prometheus integration, subscription helper functions, and a CriticalEventHandler that marks Servers on critical events.
CLI / main / tests
cmd/main.go, test/serverevents/main.go, internal/controller/server_controller.go
Added CLI flags and manager runnable to run the event server, passed EventURL into reconciler setup, and included a test binary to run the event server.
Tests & tooling
internal/serverevents/metrics_test.go, internal/serverevents/server_test.go, internal/controller/bmc_controller_test.go
Added unit tests for collector cleanup and event updates, handler path validation tests, and controller tests validating subscription lifecycle.
Deps & generated
go.mod, api/v1alpha1/zz_generated.deepcopy.go
Added Prometheus client_model dependency and minor deepcopy import style change.

Sequence Diagram(s)

sequenceDiagram
    participant Controller as BMC Controller
    participant BMC as Redfish BMC
    participant EventSvc as Redfish Event Service
    participant EventServer as Event Server
    participant Collector as RedfishEventCollector

    Controller->>BMC: CreateEventSubscription(destination)
    BMC->>EventSvc: POST /redfish/v1/EventService/Subscriptions
    EventSvc-->>BMC: 201 Created + Location: /redfish/v1/EventService/Subscriptions/5
    BMC-->>Controller: "/redfish/v1/EventService/Subscriptions/5"
    Controller->>Controller: Patch BMC.Status with link

    EventSvc->>EventServer: POST /serverevents/alerts/{host} (alert)
    EventServer->>Collector: UpdateFromEvent(host, event)
    Collector->>Collector: increment alert count / invoke CriticalEventHandler
    EventServer-->>EventSvc: 200 OK
Loading
sequenceDiagram
    participant RedfishBMC as Redfish BMC
    participant EventServer as Event Server
    participant Collector as RedfishEventCollector
    participant Prometheus as Prometheus

    RedfishBMC->>EventServer: POST /serverevents/metricsreport/{host} {MetricsValues}
    EventServer->>Collector: UpdateFromMetricsReport(host, report)
    Collector->>Collector: parse/store latest readings
    Prometheus->>Collector: scrape /metrics
    Collector-->>Prometheus: sensor + alert metrics
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

enhancement

Suggested reviewers

  • nagadeesh-nagaraja
  • afritzler
  • davidgrun
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ❓ Inconclusive The description is incomplete. While it mentions the main feature, it lacks structured details about changes, rationale, and issue references required by the template. Expand the description to include details about what changes were made, why they were necessary, and link the related issue number (Fixes #...).
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main purpose of the changeset: automatically tainting servers when critical BMC events are received.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch events_taints
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

CodeRabbit can use Trivy to scan for security misconfigurations and secrets in Infrastructure as Code files.

Add a .trivyignore file to your project to customize which findings Trivy reports.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 14

🧹 Nitpick comments (8)
bmc/mock/server/server.go (1)

195-201: Redundant make call and inconsistent Location/OdataID format between branches.

  • Line 195 base.Members = make([]Member, 0) is dead code — it is unconditionally overwritten on line 197.
  • r.URL.JoinPath("1").String() is called twice; location should be reused in the struct literal.
  • The if branch (line 183) uses path.Join(r.URL.Path, newID) — a path-only string — for the OdataID and Location header. The else branch uses r.URL.JoinPath("1").String() — a full absolute URL. These formats are inconsistent and any client code that parses the Location header or @odata.id values uniformly may misbehave for one case.
♻️ Proposed cleanup (align to path-only format, matching the `if` branch)
 	} else {
-		base.Members = make([]Member, 0)
-		location := r.URL.JoinPath("1").String()
+		location := path.Join(r.URL.Path, "1")
 		base.Members = []Member{
 			{
-				OdataID: r.URL.JoinPath("1").String(),
+				OdataID: location,
 			},
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bmc/mock/server/server.go` around lines 195 - 201, Remove the dead allocation
of base.Members and stop calling r.URL.JoinPath("1").String() twice by assigning
a single location string and reusing it when building base.Members (update the
struct literal to use the location variable for OdataID). Also make the two
branches consistent: use the same path-only format as the if branch (e.g., build
location with path.Join(r.URL.Path, newID or "1") rather than r.URL.JoinPath),
and ensure the Location response header and the Member's OdataID both use that
same path-only location string.
internal/controller/bmc_controller_test.go (1)

64-65: Subscription link assertions not applied uniformly across test scenarios.

The label-patch test (lines 220–227) and the DNSRecord test (lines 310–330) do not assert MetricsReportSubscriptionLink/EventsSubscriptionLink, even though they use the same EventURL configuration and mock server. If subscription creation is part of every successful reconciliation, these two tests under-assert the new behaviour. Consider adding the subscription link checks there as well, or adding a comment explaining why they're intentionally omitted.

Additionally, there is no test coverage for subscription cleanup (i.e., DeleteEventSubscription being called when a BMC is deleted or subscriptions are torn down).

Also applies to: 148-149

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/bmc_controller_test.go` around lines 64 - 65, The
label-patch and DNSRecord test cases in
internal/controller/bmc_controller_test.go are missing assertions for
Status.MetricsReportSubscriptionLink and Status.EventsSubscriptionLink even
though they use the same EventURL and mock event server; update those test cases
(the label-patch test and the DNSRecord test) to assert
HaveField("Status.MetricsReportSubscriptionLink", Equal(...)) and
HaveField("Status.EventsSubscriptionLink", Equal(...)) with the same expected
subscription paths used elsewhere, or add a brief comment explaining why they're
intentionally omitted; additionally, add a new unit test that deletes a BMC (or
simulates controller teardown) and verifies DeleteEventSubscription is called on
the mock event client to cover subscription cleanup behavior (use the same mock
client helpers and the controller reconcile/delete helper functions to locate
where DeleteEventSubscription is invoked).
test/serverevents/main.go (1)

34-34: Misleading error log message.

"problem running telemetry server" should reference the event server, not a telemetry server, to avoid confusing log analysis. This matches no other name in the codebase.

🔧 Proposed fix
-		setupLog.Error(err, "problem running telemetry server")
+		setupLog.Error(err, "Problem running server event server")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/serverevents/main.go` at line 34, The log call using setupLog.Error
currently says "problem running telemetry server" which is misleading; update
the error message in the setupLog.Error invocation (the call that currently
reads setupLog.Error(err, "problem running telemetry server") in main.go) to
reference the event server (e.g., "problem running event server" or "problem
running events server") so logs accurately reflect the failing component.
internal/controller/bmc_controller.go (2)

113-119: Consider logging when BMC client is unavailable during deletion.

When GetBMCClientFromBMC fails, event subscriptions are silently skipped. A log line would help operators understand that orphaned subscriptions may remain on the BMC.

Suggested improvement
 	bmcClient, err := bmcutils.GetBMCClientFromBMC(ctx, r.Client, bmcObj, r.Insecure, r.BMCOptions)
 	if err == nil {
 		defer bmcClient.Logout()
 		if err := r.deleteEventSubscription(ctx, bmcClient, bmcObj); err != nil {
 			return ctrl.Result{}, fmt.Errorf("failed to delete event subscriptions: %w", err)
 		}
+	} else {
+		log.V(1).Info("Could not acquire BMC client during deletion, skipping event subscription cleanup", "error", err)
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/bmc_controller.go` around lines 113 - 119, When
GetBMCClientFromBMC(ctx, r.Client, bmcObj, r.Insecure, r.BMCOptions) returns an
error, add a log statement before skipping deletion so operators see that event
subscriptions were not removed; specifically, log the BMC identifier (from
bmcObj) and the error, then continue to skip calling deleteEventSubscription;
keep the existing defer bmcClient.Logout() behavior when bmcClient is present.

549-582: Consider combining both subscription status patches into a single patch.

Currently, two separate status patches are issued (lines 565 and 577). If both subscriptions need creation, this doubles the API server round-trips. You could batch them into one patch when both links change.

Suggested refactor
 func (r *BMCReconciler) handleEventSubscriptions(ctx context.Context, bmcClient bmc.BMC, bmcObj *metalv1alpha1.BMC) (bool, error) {
 	log := ctrl.LoggerFrom(ctx)
 	if r.EventURL == "" {
 		return false, nil
 	}
 	log.V(1).Info("Handling event subscriptions for BMC")
 	modified := false
+	bmcBase := bmcObj.DeepCopy()
 
 	if bmcObj.Status.MetricsReportSubscriptionLink == "" {
 		link, err := serverevents.SubscribeMetricsReport(ctx, r.EventURL, bmcObj.Name, bmcClient)
 		if err != nil {
 			return false, fmt.Errorf("failed to subscribe to server metrics report: %w", err)
 		}
-		bmcBase := bmcObj.DeepCopy()
 		bmcObj.Status.MetricsReportSubscriptionLink = link
 		modified = true
-		if err := r.Status().Patch(ctx, bmcObj, client.MergeFrom(bmcBase)); err != nil {
-			return false, fmt.Errorf("failed to patch server status with subscription links: %w", err)
-		}
 	}
 	if bmcObj.Status.EventsSubscriptionLink == "" {
 		link, err := serverevents.SubscribeEvents(ctx, r.EventURL, bmcObj.Name, bmcClient)
 		if err != nil {
 			return false, fmt.Errorf("failed to subscribe to server alerts: %w", err)
 		}
-		bmcBase := bmcObj.DeepCopy()
 		bmcObj.Status.EventsSubscriptionLink = link
 		modified = true
-		if err := r.Status().Patch(ctx, bmcObj, client.MergeFrom(bmcBase)); err != nil {
-			return false, fmt.Errorf("failed to patch server status with subscription links: %w", err)
-		}
+	}
+	if modified {
+		if err := r.Status().Patch(ctx, bmcObj, client.MergeFrom(bmcBase)); err != nil {
+			return false, fmt.Errorf("failed to patch server status with subscription links: %w", err)
+		}
 	}
 	return modified, nil
 }

Note that this trades atomicity for efficiency — if the first subscription succeeds but the second fails, neither link is persisted (the error returns before the patch). On retry, the first subscription would be re-created, causing an orphan. The current approach has the opposite tradeoff: the first link is persisted even if the second fails. Choose based on which orphan scenario is more acceptable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/bmc_controller.go` around lines 549 - 582, The handler
handleEventSubscriptions currently patches status twice; change it to collect
both subscription links first and then issue a single status patch: call
serverevents.SubscribeMetricsReport and serverevents.SubscribeEvents as needed
(checking bmcObj.Status.MetricsReportSubscriptionLink and
EventsSubscriptionLink), set bmcObj.Status.MetricsReportSubscriptionLink and
bmcObj.Status.EventsSubscriptionLink on a single DeepCopy of bmcObj, set
modified if any link changed, and then call r.Status().Patch(ctx, bmcObj,
client.MergeFrom(bmcBase)) once; keep existing error returns from
SubscribeMetricsReport/SubscribeEvents and only patch when at least one link was
updated.
internal/serverevents/server.go (1)

82-97: Alert handler doesn't validate hostname extraction.

path.Base(r.URL.Path) on /serverevents/alerts/ (without hostname) returns "alerts", which would be treated as a hostname. Consider validating that the extracted hostname is meaningful.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/serverevents/server.go` around lines 82 - 97, The alertHandler
currently uses hostname := path.Base(r.URL.Path) which will return "alerts" when
the request URL is /serverevents/alerts/ and treat that as a hostname; update
alertHandler to validate the extracted hostname before calling
s.collector.UpdateFromEvent: parse or split r.URL.Path (or use path.Clean +
strings.Split) to ensure the path has at least three segments and the last
segment is non-empty and not equal to "alerts", return http.StatusBadRequest
(and log) if invalid, and only call s.collector.UpdateFromEvent(hostname,
eventData) when hostname passes validation.
internal/serverevents/metrics.go (1)

173-198: Unbounded growth of lastReadings and alertCounts maps — potential memory leak.

Stale entries in lastReadings (older than 10 min) are skipped during Collect but never evicted. Similarly, alertCounts grows indefinitely with each unique EventKey. For a long-running instance receiving diverse events, both maps grow without bound.

Consider periodic cleanup of stale lastReadings entries and, for alertCounts, either periodic resets or capping cardinality.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/serverevents/metrics.go` around lines 173 - 198, The Collect method
currently skips stale entries but never removes them, causing unbounded growth
in c.lastReadings and c.alertCounts; inside the Collect method (the loop that
iterates c.lastReadings) remove entries older than 10 minutes (delete from
c.lastReadings) while holding the collector's mutex (e.g., c.mu) to avoid races,
and for c.alertCounts implement a bounded policy: either periodically reset the
map on a ticker started when the collector is created (e.g., start a cleanup
goroutine in the constructor/initializer that clears or trims c.alertCounts
every N minutes) or attach TTL metadata to each EventKey and prune expired
alertCounts entries during the same cleanup—ensure any deletion/trimming uses
the same synchronization as reads/writes to avoid race conditions.
internal/serverevents/taint_handler.go (1)

59-63: Missing re-fetch of server before patching.

This function is called asynchronously (via goroutine in metrics.go Line 156). Between when the server was listed and when taintServer patches it, the object may have been modified by the reconciler. While MergeFrom patches are relatively safe against conflicts, the coding guidelines recommend re-fetching objects before updating to avoid conflicts.

As per coding guidelines: "Always re-fetch objects before updating them in reconciliation to avoid conflicts."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/serverevents/taint_handler.go` around lines 59 - 63, taintServer
currently patches the passed-in server object which may be stale; before
computing a MergeFrom patch, re-fetch the latest Server from the API (use
k8sClient.Get into a new variable, e.g. latest := &metalv1alpha1.Server{} with
NamespacedName from server.Name/Namespace), make your modifications (add
condition/taint) on latest, create the patch with patch :=
client.MergeFrom(latest.DeepCopy()), then call k8sClient.Patch(ctx, latest,
patch); return any Get/Patch error. This ensures taintServer (the function)
operates on the freshest Server and avoids reconciliation conflicts.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bmc/bmc.go`:
- Line 118: The CreateEventSubscription method's third parameter is misnamed: it
is declared as protocol but typed redfish.DeliveryRetryPolicy; rename that
parameter to a clear name such as deliveryRetryPolicy or retryPolicy in the
CreateEventSubscription signature and update all implementations and call sites
to use the new name, ensuring the type remains redfish.DeliveryRetryPolicy; also
update any related documentation/comments to reflect that this parameter
controls the delivery retry policy (not the EventDestinationProtocol).

In `@bmc/mock/server/server.go`:
- Around line 181-183: The current new ID generation using fmt.Sprintf("%d",
len(base.Members)+1) can collide after deletions; replace this with logic that
scans base.Members, parses each member ID to an int (use strconv.Atoi), tracks
the maximum numeric ID, then set newID to fmt.Sprintf("%d", maxID+1) (falling
back to "1" if none parse), and keep using path.Join(r.URL.Path, newID) for the
location; update the block around base.Members, newID and location to use this
max-ID approach.
- Around line 259-296: The code has a race and wrong key lookup: stop releasing
the mutex before reading/updating s.overrides and instead acquire the lock once
and defer its Unlock to cover both the initial lookup and the later write
(replace the separate s.mu.Lock()/s.mu.Unlock() pairs with a single s.mu.Lock();
defer s.mu.Unlock()). Also derive the collection override key from the request
URL using the same resolver as handlePost (use resolvePath(r.URL.Path) or the
equivalent to produce the collection index path) rather than using
path.Dir(filePath), so s.overrides lookup uses the same key format as
handlePost.
- Around line 194-206: When creating the first member in the else branch where
base.Members is initialized, the new member's resource (the variable update) is
not persisted to s.overrides, so subsequent GETs 404; modify the else branch to
also store the created member under its resource path by setting
s.overrides[newMemberPath] = update (the same way the if branch does), ensuring
you compute newMemberPath the same way used elsewhere (e.g., from
r.URL.JoinPath("1").String() or urlPath + "/1") and keep the existing
s.overrides[urlPath] = base and Location header logic.

In `@bmc/redfish.go`:
- Around line 1239-1244: The Context string in the Redfish subscription payload
is incorrect; update the Context field on the subscriptionPayload (the payload
variable in the Redfish subscription creation) from "metal3-operator" to the
correct project identifier "metal-operator" so the payload.Context value uses
the proper string when creating subscriptions.

In `@cmd/main.go`:
- Around line 216-225: The log for missing optional event URL should be
informational, not an error: in the block that computes eventURL (variables
eventURL, eventAddr, eventProtocol, eventPort) replace setupLog.Error(nil,
"...") with setupLog.Info(...) and keep the existing message/context and no
process exit; ensure the branch that constructs eventURL (fmt.Sprintf(...))
remains unchanged so optional behavior is preserved.

In `@internal/serverevents/metrics.go`:
- Around line 150-158: The goroutine for critical events currently uses
context.Background() which provides no cancellation, timeout, or lifecycle
linkage; change the call site in the handler (where criticalEventHandler is
invoked) to create a derived context from the server/component lifecycle context
(e.g., c.ctx or c.shutdownCtx) and wrap it with a timeout using
context.WithTimeout (or context.WithCancel tied to shutdown) before launching
the goroutine, and ensure criticalEventHandler is invoked with that cancellable
context; additionally, to avoid unbounded goroutines on bursts, protect the
spawn with a bounded worker mechanism (e.g., semaphore/channel or a fixed worker
pool) referenced from the same struct so excessive concurrent invocations are
limited.
- Around line 163-166: Describe on RedfishEventCollector currently only emits
c.sensorDesc but Collect also uses c.alertDesc; update Describe (method Describe
on type RedfishEventCollector) to send both c.sensorDesc and c.alertDesc on the
ch channel so all descriptors emitted by Collect are advertised by the
collector, preventing Prometheus client mismatches or panics.
- Around line 69-71: The code in NewRedfishEventCollector currently calls
metrics.Registry.MustRegister(c) which can panic on duplicate registration;
change this to use prometheus.Register(metrics.Registry.Register) with error
handling (i.e., call prometheus.Register(c) and handle the returned error), and
if the error is a prometheus.AlreadyRegisteredError, recover the existing
collector from that error (or ignore the error) and return it instead of
panicking; ensure you reference the collector variable c and Registry to locate
the change and return the appropriate collector from NewRedfishEventCollector.

In `@internal/serverevents/server.go`:
- Around line 77-80: The route patterns registered in Server.routes
(s.mux.HandleFunc("/serverevents/alerts", s.alertHandler) and
"/serverevents/metricsreport") are exact matches and won't match subscription
callback URLs that include a hostname suffix (constructed in subscription.go as
/serverevents/alerts/{hostname} and /serverevents/metricsreport/{hostname});
change the patterns to subtree patterns (e.g., "/serverevents/alerts/" and
"/serverevents/metricsreport/") so s.mux will match any path under those
prefixes (or alternatively use Go 1.22+ wildcard patterns) and keep the same
handler functions (alertHandler and metricsreportHandler).
- Around line 116-145: The log/error strings in Server.Start reference "registry
server" (e.g. the ListenAndServe error message "HTTP registry server
ListenAndServe: %w", s.log.Info("Starting registry server", ...),
s.log.Info("Shutting down registry server..."), the shutdown error return "HTTP
server Shutdown: %w" context, s.log.Info("Registry server graciously stopped"),
and s.log.Error(..., "Error shutting down registry server")) — update all of
those messages to say "event server" consistently (e.g. "HTTP event server
ListenAndServe", "Starting event server", "Shutting down event server...", "HTTP
event server Shutdown: %w", "Event server graciously stopped", "Error shutting
down event server") inside the Server.Start implementation so logs reflect the
correct component.

In `@internal/serverevents/subscription.go`:
- Line 11: The build fails because the import
"github.com/stmcginnis/gofish/redfish" used in
internal/serverevents/subscription.go is not present in go.mod; add the module
dependency by running `go get github.com/stmcginnis/gofish` (or manually add the
module to go.mod) and then run `go mod tidy` to update and vendor dependencies
so the package redfish is available for the subscription.go import.

In `@internal/serverevents/taint_handler.go`:
- Around line 102-131: The top-of-block comment is stale: either remove or
update the "Uncomment the following code after PR `#672` is merged:" line since
the taint-patching logic (serverBase, taintKey, taint, taintExists, the for-loop
checking existingTaint, and the k8sClient.Patch call) is already active; change
the comment to a concise, accurate note (e.g., explain why tainting runs now) or
revert the block to commented code if it should remain gated behind PR `#672`,
ensuring any updated comment references server.Spec.Taints and the purpose of
the patch operation (k8sClient.Patch) for clarity.
- Around line 67-73: The Reason field on criticalEventCondition is being built
as fmt.Sprintf("CriticalEvent%s", event.EventID) which can include invalid
characters from event.EventID; update the code that constructs
criticalEventCondition (the metav1.Condition with Type
CriticalEventConditionType) to use a valid CamelCase reason (e.g.
"CriticalEventReceived") and move the raw EventID into the Message (or sanitize
event.EventID to remove non-alphanumeric characters and ensure it begins with a
letter before concatenation) so that Reason always matches the Kubernetes
pattern while preserving EventID details in Message.

---

Nitpick comments:
In `@bmc/mock/server/server.go`:
- Around line 195-201: Remove the dead allocation of base.Members and stop
calling r.URL.JoinPath("1").String() twice by assigning a single location string
and reusing it when building base.Members (update the struct literal to use the
location variable for OdataID). Also make the two branches consistent: use the
same path-only format as the if branch (e.g., build location with
path.Join(r.URL.Path, newID or "1") rather than r.URL.JoinPath), and ensure the
Location response header and the Member's OdataID both use that same path-only
location string.

In `@internal/controller/bmc_controller_test.go`:
- Around line 64-65: The label-patch and DNSRecord test cases in
internal/controller/bmc_controller_test.go are missing assertions for
Status.MetricsReportSubscriptionLink and Status.EventsSubscriptionLink even
though they use the same EventURL and mock event server; update those test cases
(the label-patch test and the DNSRecord test) to assert
HaveField("Status.MetricsReportSubscriptionLink", Equal(...)) and
HaveField("Status.EventsSubscriptionLink", Equal(...)) with the same expected
subscription paths used elsewhere, or add a brief comment explaining why they're
intentionally omitted; additionally, add a new unit test that deletes a BMC (or
simulates controller teardown) and verifies DeleteEventSubscription is called on
the mock event client to cover subscription cleanup behavior (use the same mock
client helpers and the controller reconcile/delete helper functions to locate
where DeleteEventSubscription is invoked).

In `@internal/controller/bmc_controller.go`:
- Around line 113-119: When GetBMCClientFromBMC(ctx, r.Client, bmcObj,
r.Insecure, r.BMCOptions) returns an error, add a log statement before skipping
deletion so operators see that event subscriptions were not removed;
specifically, log the BMC identifier (from bmcObj) and the error, then continue
to skip calling deleteEventSubscription; keep the existing defer
bmcClient.Logout() behavior when bmcClient is present.
- Around line 549-582: The handler handleEventSubscriptions currently patches
status twice; change it to collect both subscription links first and then issue
a single status patch: call serverevents.SubscribeMetricsReport and
serverevents.SubscribeEvents as needed (checking
bmcObj.Status.MetricsReportSubscriptionLink and EventsSubscriptionLink), set
bmcObj.Status.MetricsReportSubscriptionLink and
bmcObj.Status.EventsSubscriptionLink on a single DeepCopy of bmcObj, set
modified if any link changed, and then call r.Status().Patch(ctx, bmcObj,
client.MergeFrom(bmcBase)) once; keep existing error returns from
SubscribeMetricsReport/SubscribeEvents and only patch when at least one link was
updated.

In `@internal/serverevents/metrics.go`:
- Around line 173-198: The Collect method currently skips stale entries but
never removes them, causing unbounded growth in c.lastReadings and
c.alertCounts; inside the Collect method (the loop that iterates c.lastReadings)
remove entries older than 10 minutes (delete from c.lastReadings) while holding
the collector's mutex (e.g., c.mu) to avoid races, and for c.alertCounts
implement a bounded policy: either periodically reset the map on a ticker
started when the collector is created (e.g., start a cleanup goroutine in the
constructor/initializer that clears or trims c.alertCounts every N minutes) or
attach TTL metadata to each EventKey and prune expired alertCounts entries
during the same cleanup—ensure any deletion/trimming uses the same
synchronization as reads/writes to avoid race conditions.

In `@internal/serverevents/server.go`:
- Around line 82-97: The alertHandler currently uses hostname :=
path.Base(r.URL.Path) which will return "alerts" when the request URL is
/serverevents/alerts/ and treat that as a hostname; update alertHandler to
validate the extracted hostname before calling s.collector.UpdateFromEvent:
parse or split r.URL.Path (or use path.Clean + strings.Split) to ensure the path
has at least three segments and the last segment is non-empty and not equal to
"alerts", return http.StatusBadRequest (and log) if invalid, and only call
s.collector.UpdateFromEvent(hostname, eventData) when hostname passes
validation.

In `@internal/serverevents/taint_handler.go`:
- Around line 59-63: taintServer currently patches the passed-in server object
which may be stale; before computing a MergeFrom patch, re-fetch the latest
Server from the API (use k8sClient.Get into a new variable, e.g. latest :=
&metalv1alpha1.Server{} with NamespacedName from server.Name/Namespace), make
your modifications (add condition/taint) on latest, create the patch with patch
:= client.MergeFrom(latest.DeepCopy()), then call k8sClient.Patch(ctx, latest,
patch); return any Get/Patch error. This ensures taintServer (the function)
operates on the freshest Server and avoids reconciliation conflicts.

In `@test/serverevents/main.go`:
- Line 34: The log call using setupLog.Error currently says "problem running
telemetry server" which is misleading; update the error message in the
setupLog.Error invocation (the call that currently reads setupLog.Error(err,
"problem running telemetry server") in main.go) to reference the event server
(e.g., "problem running event server" or "problem running events server") so
logs accurately reflect the failing component.

Comment thread bmc/bmc.go Outdated
Comment thread bmc/mock/server/server.go
Comment thread bmc/mock/server/server.go
Comment thread bmc/mock/server/server.go Outdated
Comment thread bmc/redfish.go Outdated
Comment thread internal/serverevents/server.go
Comment thread internal/serverevents/server.go
Comment thread internal/serverevents/subscription.go Outdated
Comment thread internal/serverevents/taint_handler.go
Comment thread internal/serverevents/taint_handler.go Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/serverevents/taint_handler.go`:
- Around line 104-132: Replace the commented taint creation so it uses the
project-specific taint effect and a sanitized key: instead of
corev1.TaintEffectNoSchedule use the custom effect string "NoClaim" (replace
references to corev1.TaintEffectNoSchedule in the taint construction), and build
taintKey from event.EventID after normalizing it (replace/strip characters not
in [A-Za-z0-9_.-], collapse sequences to '-', trim any leading/trailing
non-alphanumerics, and truncate so the full key including the prefix
"metal.ironcore.dev/critical-event-" stays <=63 chars) before creating the
corev1.Taint and appending to server.Spec.Taints; leave the existing existence
check loop (iterating server.Spec.Taints) and the Patch via
k8sClient.Patch(server, client.MergeFrom(serverBase)) intact.
- Around line 66-88: The code unconditionally sets LastTransitionTime on the
metav1.Condition when adding/updating CriticalEventConditionType, violating
condition semantics; instead construct a metav1.Condition (set Type:
CriticalEventConditionType, Status: metav1.ConditionTrue, ObservedGeneration:
server.Generation, Reason, Message) but do NOT manually set LastTransitionTime,
and replace the manual loop/appending logic that touches
server.Status.Conditions with a single call to
apimeta.SetStatusCondition(&server.Status.Conditions, criticalEventCondition) so
Kubernetes semantics are preserved and LastTransitionTime is only updated when
the condition status actually changes.

---

Duplicate comments:
In `@internal/serverevents/taint_handler.go`:
- Around line 71-72: The Reason field is being generated with
fmt.Sprintf("CriticalEvent%s", event.EventID) which can produce invalid
characters for Kubernetes; change the Reason to a fixed safe string (e.g.,
"CriticalEventReceived") and ensure the raw event.EventID is included only in
the Message text (e.g., append or include event.EventID in the Message built
from event.Message and event.OriginOfCondition) so Reason always matches the
required pattern while Message contains the full EventID and details.

Comment thread internal/serverevents/taint_handler.go Outdated
Comment thread internal/serverevents/taint_handler.go Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
internal/controller/bmc_controller.go (2)

112-118: Silent BMC client failure during deletion may obscure orphaned subscriptions.

If GetBMCClientFromBMC fails (e.g., bad credentials, network issue), the deletion proceeds without deleting event subscriptions and without logging the error. This silently leaves orphaned subscriptions on the BMC hardware. Consider logging the error at a warning level so operators have visibility into the cleanup gap.

Suggested improvement
 	bmcClient, err := bmcutils.GetBMCClientFromBMC(ctx, r.Client, bmcObj, r.Insecure, r.BMCOptions)
 	if err == nil {
 		defer bmcClient.Logout()
 		if err := r.deleteEventSubscription(ctx, bmcClient, bmcObj); err != nil {
 			return ctrl.Result{}, fmt.Errorf("failed to delete event subscriptions: %w", err)
 		}
+	} else {
+		log.V(1).Info("Could not obtain BMC client during deletion, skipping event subscription cleanup", "error", err)
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/bmc_controller.go` around lines 112 - 118, The code
currently skips logging when bmcutils.GetBMCClientFromBMC fails, causing silent
failures that leave orphaned subscriptions; update the block around
GetBMCClientFromBMC (the call to GetBMCClientFromBMC and subsequent use of
bmcClient and r.deleteEventSubscription) to detect when err != nil and emit a
warning or error log (via the controller logger) that includes the error, BMC
identity (e.g., bmcObj.Name or ID) and context so operators can see the cleanup
gap, while preserving the existing successful-path behavior (defer
bmcClient.Logout() and call r.deleteEventSubscription when err == nil).

548-581: Two sequential status patches without re-fetching the object.

handleEventSubscriptions patches the BMC status twice (once for MetricsReport, once for Events) without re-fetching the object between patches. While Patch is more forgiving than Update, if a concurrent reconciliation modifies the status between the two patches, the second MergeFrom base will be stale. Consider consolidating into a single status patch, or re-fetching the object between patches.

Suggested consolidation into a single patch
 func (r *BMCReconciler) handleEventSubscriptions(ctx context.Context, bmcClient bmc.BMC, bmcObj *metalv1alpha1.BMC) (bool, error) {
 	log := ctrl.LoggerFrom(ctx)
 	if r.EventURL == "" {
 		return false, nil
 	}
 	log.V(1).Info("Handling event subscriptions for BMC")
 	modified := false
+	bmcBase := bmcObj.DeepCopy()
 
 	if bmcObj.Status.MetricsReportSubscriptionLink == "" {
 		link, err := serverevents.SubscribeMetricsReport(ctx, r.EventURL, bmcObj.Name, bmcClient)
 		if err != nil {
 			return false, fmt.Errorf("failed to subscribe to server metrics report: %w", err)
 		}
-		bmcBase := bmcObj.DeepCopy()
 		bmcObj.Status.MetricsReportSubscriptionLink = link
 		modified = true
-		if err := r.Status().Patch(ctx, bmcObj, client.MergeFrom(bmcBase)); err != nil {
-			return false, fmt.Errorf("failed to patch server status with subscription links: %w", err)
-		}
 	}
 	if bmcObj.Status.EventsSubscriptionLink == "" {
 		link, err := serverevents.SubscribeEvents(ctx, r.EventURL, bmcObj.Name, bmcClient)
 		if err != nil {
 			return false, fmt.Errorf("failed to subscribe to server alerts: %w", err)
 		}
-		bmcBase := bmcObj.DeepCopy()
 		bmcObj.Status.EventsSubscriptionLink = link
 		modified = true
+	}
+	if modified {
 		if err := r.Status().Patch(ctx, bmcObj, client.MergeFrom(bmcBase)); err != nil {
 			return false, fmt.Errorf("failed to patch server status with subscription links: %w", err)
 		}
 	}
 	return modified, nil
 }

Note: the current approach has a secondary concern — if the MetricsReport subscription succeeds (and is created on the BMC) but the Events subscription fails, the next reconciliation will skip the MetricsReport subscription (link is already stored) and retry only the Events subscription. This is correct and idempotent. However, if the MetricsReport subscription succeeds on the BMC but the first status patch fails, the subscription exists on the BMC but the link is lost, leaving an orphaned subscription. The consolidated single-patch approach has the same issue but at a wider window. Consider whether this is acceptable for your use case.

As per coding guidelines, "Always re-fetch objects before updating them in reconciliation to avoid conflicts."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/bmc_controller.go` around lines 548 - 581,
handleEventSubscriptions currently issues two sequential status.Patch calls (one
after setting MetricsReportSubscriptionLink and one after setting
EventsSubscriptionLink) which can cause MergeFrom conflicts; consolidate into a
single status patch or re-fetch the BMC before the second patch. Fix by in
handleEventSubscriptions: call serverevents.SubscribeMetricsReport and
SubscribeEvents as needed, set both bmcObj.Status.MetricsReportSubscriptionLink
and bmcObj.Status.EventsSubscriptionLink on the same bmcObj copy (use bmcBase :=
bmcObj.DeepCopy() before modifying), then perform a single r.Status().Patch(ctx,
bmcObj, client.MergeFrom(bmcBase)); alternatively, if you prefer two patches,
re-fetch the fresh BMC object from API (using r.Get) before creating the second
MergeFrom base to avoid stale-base conflicts. Ensure you reference the existing
symbols: handleEventSubscriptions, serverevents.SubscribeMetricsReport,
serverevents.SubscribeEvents, MetricsReportSubscriptionLink,
EventsSubscriptionLink, and r.Status().Patch.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bmc/redfish.go`:
- Around line 1210-1221: The file references types from the redfish package
(e.g., subscriptionPayload struct fields using redfish.EventType,
redfish.EventFormatType, redfish.DeliveryRetryPolicy,
redfish.EventDestinationProtocol and redfish.RedfishEventDestinationProtocol)
but the package is not imported; add the import
"github.com/stmcginnis/gofish/redfish" to the import block so those symbols
resolve and the code compiles, ensuring any existing import grouping/aliasing
style is preserved.
- Around line 1246-1251: The code currently assigns payload.RegistryPrefixes =
[]string{""} and payload.ResourceTypes = []string{""} in bmc/redfish.go which
will serialize as single-empty-string arrays (not omitted) and can produce
vendor-specific errors; change this so that when you intend to omit filters you
set those fields to nil or an empty slice (so the json "omitempty" drops them),
or alternatively populate them with explicit known registry/resource names (e.g.
"Base","ResourceEvent","iLOEvents" and "Chassis","ComputerSystem","Power")
derived from ev.ResourceTypes; update the branch that sets
payload.RegistryPrefixes and payload.ResourceTypes to either leave them unset
(nil/empty) or fill with meaningful values accordingly.

---

Duplicate comments:
In `@bmc/redfish.go`:
- Line 1242: The Context field in the Redfish BMC struct initialization is
misspelled as "metal3-operator"; update the Context value to the correct
"metal-operator". Locate the struct literal that sets Context (search for
Context: "metal3-operator" in bmc/redfish.go) and replace the string with
"metal-operator" so the operator name is correct.

---

Nitpick comments:
In `@internal/controller/bmc_controller.go`:
- Around line 112-118: The code currently skips logging when
bmcutils.GetBMCClientFromBMC fails, causing silent failures that leave orphaned
subscriptions; update the block around GetBMCClientFromBMC (the call to
GetBMCClientFromBMC and subsequent use of bmcClient and
r.deleteEventSubscription) to detect when err != nil and emit a warning or error
log (via the controller logger) that includes the error, BMC identity (e.g.,
bmcObj.Name or ID) and context so operators can see the cleanup gap, while
preserving the existing successful-path behavior (defer bmcClient.Logout() and
call r.deleteEventSubscription when err == nil).
- Around line 548-581: handleEventSubscriptions currently issues two sequential
status.Patch calls (one after setting MetricsReportSubscriptionLink and one
after setting EventsSubscriptionLink) which can cause MergeFrom conflicts;
consolidate into a single status patch or re-fetch the BMC before the second
patch. Fix by in handleEventSubscriptions: call
serverevents.SubscribeMetricsReport and SubscribeEvents as needed, set both
bmcObj.Status.MetricsReportSubscriptionLink and
bmcObj.Status.EventsSubscriptionLink on the same bmcObj copy (use bmcBase :=
bmcObj.DeepCopy() before modifying), then perform a single r.Status().Patch(ctx,
bmcObj, client.MergeFrom(bmcBase)); alternatively, if you prefer two patches,
re-fetch the fresh BMC object from API (using r.Get) before creating the second
MergeFrom base to avoid stale-base conflicts. Ensure you reference the existing
symbols: handleEventSubscriptions, serverevents.SubscribeMetricsReport,
serverevents.SubscribeEvents, MetricsReportSubscriptionLink,
EventsSubscriptionLink, and r.Status().Patch.

Comment thread bmc/redfish.go
Comment thread bmc/redfish.go
@github-actions github-actions Bot added size/XXL and removed size/XL labels Mar 11, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (2)
internal/controller/bmc_controller_test.go (2)

402-408: Server may already be deleted via cascade.

Since Server has BMC as its controller owner (with BlockOwnerDeletion: ptr.To(true)), deleting the BMC should cascade delete the Server. The explicit k8sClient.Delete(ctx, server) on line 408 may return a "not found" error if garbage collection has already removed it.

Consider using client.IgnoreNotFound() or Eventually(Get(server)).Should(Satisfy(apierrors.IsNotFound)) to handle this race condition gracefully.

🧹 Suggested cleanup fix
-		Expect(k8sClient.Delete(ctx, bmcSecret)).To(Succeed())
-		Expect(k8sClient.Delete(ctx, server)).To(Succeed())
+		Expect(k8sClient.Delete(ctx, bmcSecret)).To(Succeed())
+		// Server may already be deleted via cascade from BMC deletion
+		_ = k8sClient.Delete(ctx, server)
+		Eventually(Get(server)).Should(Satisfy(apierrors.IsNotFound))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/bmc_controller_test.go` around lines 402 - 408, The test
currently creates a Server object named via
bmcutils.GetServerNameFromBMCandIndex and unconditionally calls
k8sClient.Delete(ctx, server), which can race with garbage-collection because
the BMC is the owner; change the cleanup to ignore "not found" errors or assert
eventual deletion: replace the direct Expect(k8sClient.Delete(ctx,
server)).To(Succeed()) with either client.IgnoreNotFound(k8sClient.Delete(ctx,
server)) or an Eventually assertion that fetching the Server returns
apierrors.IsNotFound (e.g., Eventually(func() error { return k8sClient.Get(ctx,
server) }).Should(Satisfy(apierrors.IsNotFound))) so the test tolerates the
cascade delete race with the BMC/owner garbage collector.

398-400: Test does not verify subscription cleanup.

The assertions on lines 399-400 only verify that metricsLink and eventsLink were non-empty before deletion. This doesn't verify that the subscriptions were actually deleted from the BMC during cleanup.

Consider enhancing the test to verify cleanup by either:

  1. Checking mock server state to confirm subscriptions were deleted
  2. Adding a mock that tracks DeleteEventSubscription calls
  3. Verifying the BMC status fields are cleared (if that's the expected behavior)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/bmc_controller_test.go` around lines 398 - 400, The test
currently only asserts that metricsLink and eventsLink are non-empty before
deletion but does not confirm they were removed; update the test in
bmc_controller_test.go to assert cleanup by (a) extending the mock BMC server or
mock client to expose its subscription state and assert the entries for
metricsLink/eventsLink are absent after deletion, or (b) add a mock expectation
that DeleteEventSubscription (or the equivalent method on the mock client) was
called with the subscription IDs, or (c) assert that the BMC resource status
fields that held those links are cleared after the controller runs; locate the
assertions referencing metricsLink and eventsLink and add one of these checks
immediately after the deletion/cleanup step.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/controller/bmc_controller.go`:
- Around line 112-120: The delete path currently returns an error if
deleteEventSubscription(ctx, bmcClient, bmcObj) fails, which blocks finalizer
removal; change the logic in the reconcile deletion block that calls
GetBMCClientFromBMC, bmcClient and defer bmcClient.Logout so that after a
successful bmcClient creation you attempt deleteEventSubscription but treat any
error as non-fatal: log the error (including err.Error()) with context and
continue without returning the error so finalizer removal can proceed; ensure
the defer bmcClient.Logout remains only when bmcClient was created.
- Around line 550-585: The handleEventSubscriptions function can lose a
successfully created EventsSubscriptionLink if a subsequent
SubscribeMetricsReport call fails; update the logic so each successful
subscription is immediately persisted: after serverevents.SubscribeEvents
succeeds set bmcObj.Status.EventsSubscriptionLink and call r.Status().Patch(ctx,
bmcObj, client.MergeFrom(bmcBase)) (refreshing bmcBase to the post-patch state
or re-deep-copying before the next change), then proceed to
SubscribeMetricsReport; likewise, if SubscribeMetricsReport succeeds patch the
status right away; ensure any patch errors are returned but do not discard
previously persisted links so partial successes are retained.

In `@internal/serverevents/metrics.go`:
- Around line 37-39: RedfishEventCollector's lastReadings map currently keys by
MetricID+MetricProperty causing different hosts to clobber each other; update
the map key to include the source host (e.g., replace the string concat key with
a structured key type or a safe delimited string) and adjust all uses in Collect
and related code paths (including the logic referenced around lines 136-145) to
build/lookup using that new key; specifically modify the declaration of
lastReadings, the key construction where MetricEntry is written/read, and any
references that assumed the old key so the hostname label from Collect is
preserved in the cache lookup.
- Around line 173-189: The current logic spawns a goroutine for every Critical
event and drops events if c.eventSem is full; instead ensure we never silently
discard accepted critical events by acquiring capacity before
acknowledging/returning: either block on sending to c.eventSem (move the
select/send out of the goroutine) so the caller or handler goroutine waits for
capacity, or enqueue events into a bounded channel (e.g., a dedicated
criticalEventQueue) and have a fixed worker pool consume from that queue and
call c.criticalEventHandler; update the code paths around c.eventSem, the
anonymous goroutine, and the place that logs "Critical event handler pool
exhausted" to remove the default/drop behavior so events are queued/blocked
until a worker can process them.

In `@internal/serverevents/taint_handler.go`:
- Around line 58-62: The handler currently only sets a status condition and
leaves the ServerSpec.Taints patch commented out, so taintServer never persists
a taint; restore and implement the spec mutation: construct the appropriate
taint object (matching the project’s ServerSpec.Taints element type) from the
Event, append or upsert it into server.Spec.Taints in taintServer, and persist
the change with the Kubernetes API (use k8sClient.Update or a strategic Patch on
the server object) rather than only updating status; ensure you handle merge
conflicts/retries and preserve other spec fields when calling k8sClient.Update/
Patch.

---

Nitpick comments:
In `@internal/controller/bmc_controller_test.go`:
- Around line 402-408: The test currently creates a Server object named via
bmcutils.GetServerNameFromBMCandIndex and unconditionally calls
k8sClient.Delete(ctx, server), which can race with garbage-collection because
the BMC is the owner; change the cleanup to ignore "not found" errors or assert
eventual deletion: replace the direct Expect(k8sClient.Delete(ctx,
server)).To(Succeed()) with either client.IgnoreNotFound(k8sClient.Delete(ctx,
server)) or an Eventually assertion that fetching the Server returns
apierrors.IsNotFound (e.g., Eventually(func() error { return k8sClient.Get(ctx,
server) }).Should(Satisfy(apierrors.IsNotFound))) so the test tolerates the
cascade delete race with the BMC/owner garbage collector.
- Around line 398-400: The test currently only asserts that metricsLink and
eventsLink are non-empty before deletion but does not confirm they were removed;
update the test in bmc_controller_test.go to assert cleanup by (a) extending the
mock BMC server or mock client to expose its subscription state and assert the
entries for metricsLink/eventsLink are absent after deletion, or (b) add a mock
expectation that DeleteEventSubscription (or the equivalent method on the mock
client) was called with the subscription IDs, or (c) assert that the BMC
resource status fields that held those links are cleared after the controller
runs; locate the assertions referencing metricsLink and eventsLink and add one
of these checks immediately after the deletion/cleanup step.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ce61ea82-9e0f-40f8-af4f-ad2759382590

📥 Commits

Reviewing files that changed from the base of the PR and between 9fe9af0 and dbc8edd.

📒 Files selected for processing (14)
  • api/v1alpha1/zz_generated.deepcopy.go
  • bmc/bmc.go
  • bmc/mock/server/server.go
  • bmc/redfish.go
  • go.mod
  • internal/controller/bmc_controller.go
  • internal/controller/bmc_controller_test.go
  • internal/serverevents/metrics.go
  • internal/serverevents/metrics_test.go
  • internal/serverevents/server.go
  • internal/serverevents/server_test.go
  • internal/serverevents/subscription.go
  • internal/serverevents/taint_handler.go
  • test/serverevents/main.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • bmc/bmc.go
  • bmc/mock/server/server.go
  • bmc/redfish.go

Comment thread internal/controller/bmc_controller.go
Comment thread internal/controller/bmc_controller.go
Comment thread internal/serverevents/metrics.go
Comment thread internal/serverevents/metrics.go
Comment thread internal/serverevents/taint_handler.go Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (4)
cmd/main.go (1)

221-230: ⚠️ Potential issue | 🟡 Minor

Treat missing optional event URL as informational.

This branch only disables the event subsystem, but setupLog.Error(nil, ...) makes a normal configuration look broken. Info is a better fit here.

Suggested fix
 		eventAddr = os.Getenv("EVENT_ADDRESS")
 		if eventAddr == "" {
-			setupLog.Error(nil, "failed to set the event URL as no address is provided")
+			setupLog.Info("Event URL not configured, event subscriptions will be disabled")
 		} else {
 			eventURL = fmt.Sprintf("%s://%s:%d", eventProtocol, eventAddr, eventPort)
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/main.go` around lines 221 - 230, The code logs a missing optional
EVENT_ADDRESS as an error; change the logging level to Info to reflect that this
simply disables the event subsystem. In the branch where eventURL == "" and
eventAddr == "" (variables eventURL and eventAddr), replace the
setupLog.Error(nil, ...) call with setupLog.Info(...) and keep the same
message/context so it becomes informational rather than an error.
internal/controller/bmc_controller.go (2)

112-120: ⚠️ Potential issue | 🟠 Major

Make subscription cleanup best-effort in the finalizer.

A transient Redfish error or an already-removed subscription currently aborts deletion and leaves the BMC stuck in Terminating. Log the cleanup failure and continue removing the finalizer.

Suggested fix
 	bmcClient, err := bmcutils.GetBMCClientFromBMC(ctx, r.Client, bmcObj, r.Insecure, r.BMCOptions)
 	if err == nil {
 		defer bmcClient.Logout()
 		if err := r.deleteEventSubscription(ctx, bmcClient, bmcObj); err != nil {
-			return ctrl.Result{}, fmt.Errorf("failed to delete event subscriptions: %w", err)
+			log.Info("Failed to delete event subscriptions, continuing with BMC deletion", "error", err.Error())
 		}
 	} else {
 		log.Info("Cannot create BMC client during deletion, subscription cleanup will be skipped", "error", err.Error())
 	}

As per coding guidelines, "Implement finalizers to clean up external resources (buckets, VMs, DNS entries) before resource deletion".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/bmc_controller.go` around lines 112 - 120, The finalizer
currently aborts deletion if deleting Redfish subscriptions fails: in the block
using bmcutils.GetBMCClientFromBMC, change the behavior so subscription cleanup
is best-effort — do not return an error from r.deleteEventSubscription; instead
log the failure and continue removing the finalizer. Specifically, keep creating
bmcClient via GetBMCClientFromBMC and defer bmcClient.Logout() only when err ==
nil, but replace the `if err := r.deleteEventSubscription(...); err != nil {
return ..., fmt.Errorf(...) }` with a log statement (e.g., log.Info or
log.Error) that records the error and proceeds; ensure any client logout still
runs when created and no early return occurs on subscription cleanup failure.

549-579: ⚠️ Potential issue | 🟠 Major

Persist each subscription result before creating the next one.

If SubscribeEvents succeeds and SubscribeMetricsReport fails, the events link is never stored in status. The next reconcile will create another events subscription, so this path stops being idempotent.

As per coding guidelines, "Implement idempotent reconciliation logic - safe to run the same reconciliation multiple times without side effects".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/bmc_controller.go` around lines 549 - 579,
handleEventSubscriptions currently defers persisting subscription links until
both SubscribeEvents and SubscribeMetricsReport have succeeded, which can lose
the first result if the second fails; change the flow so after a successful
SubscribeEvents call you immediately set bmcObj.Status.EventsSubscriptionLink
and call r.Status().Patch(ctx, bmcObj, client.MergeFrom(bmcBase)) (handling its
error) before proceeding to call serverevents.SubscribeMetricsReport, and
likewise persist the MetricsReport link immediately after its success; reference
the function handleEventSubscriptions, the calls to serverevents.SubscribeEvents
and serverevents.SubscribeMetricsReport, and the use of r.Status().Patch and
bmcObj.Status.EventsSubscriptionLink / MetricsReportSubscriptionLink when making
this change.
bmc/redfish.go (1)

1015-1021: ⚠️ Potential issue | 🟠 Major

Do not send placeholder filters as [""].

These assignments send empty-string registry/resource filters to the BMC instead of omitting the filters. That can fail subscription creation or produce vendor-specific matching; leave the fields unset for “all”, or populate them with real values.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bmc/redfish.go` around lines 1015 - 1021, The code is assigning placeholder
empty-string filters (payload.RegistryPrefixes = []string{""} and
payload.ResourceTypes = []string{""}) which sends invalid filters to the BMC;
instead, only set payload.RegistryPrefixes and payload.ResourceTypes when you
have real values (e.g., from ev.ResourceTypes or a registry list) and leave them
nil/unset to mean “all” otherwise—update the branch around ev.ResourceTypes and
any code that sets payload.RegistryPrefixes/payload.ResourceTypes so they are
omitted (or set to nil) when there are no real filter values.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/controller/bmc_controller.go`:
- Around line 586-590: The deletion logic in deleteEventSubscription currently
returns early if r.EventURL == "", which prevents cleanup of external Redfish
subscriptions referenced in the resource status and can leak subscriptions;
update deleteEventSubscription (BMCReconciler) to not gate on r.EventURL, and
instead read the stored subscription URIs from the BMC object's Status (e.g.,
Status.*SubscriptionLink fields) and attempt to delete those external
subscriptions via the bmcClient when present; ensure the function still handles
an empty EventURL gracefully (use the stored link(s) for deletion) and clears
the Status subscription entries on successful removal so finalizers can
complete.

---

Duplicate comments:
In `@bmc/redfish.go`:
- Around line 1015-1021: The code is assigning placeholder empty-string filters
(payload.RegistryPrefixes = []string{""} and payload.ResourceTypes =
[]string{""}) which sends invalid filters to the BMC; instead, only set
payload.RegistryPrefixes and payload.ResourceTypes when you have real values
(e.g., from ev.ResourceTypes or a registry list) and leave them nil/unset to
mean “all” otherwise—update the branch around ev.ResourceTypes and any code that
sets payload.RegistryPrefixes/payload.ResourceTypes so they are omitted (or set
to nil) when there are no real filter values.

In `@cmd/main.go`:
- Around line 221-230: The code logs a missing optional EVENT_ADDRESS as an
error; change the logging level to Info to reflect that this simply disables the
event subsystem. In the branch where eventURL == "" and eventAddr == ""
(variables eventURL and eventAddr), replace the setupLog.Error(nil, ...) call
with setupLog.Info(...) and keep the same message/context so it becomes
informational rather than an error.

In `@internal/controller/bmc_controller.go`:
- Around line 112-120: The finalizer currently aborts deletion if deleting
Redfish subscriptions fails: in the block using bmcutils.GetBMCClientFromBMC,
change the behavior so subscription cleanup is best-effort — do not return an
error from r.deleteEventSubscription; instead log the failure and continue
removing the finalizer. Specifically, keep creating bmcClient via
GetBMCClientFromBMC and defer bmcClient.Logout() only when err == nil, but
replace the `if err := r.deleteEventSubscription(...); err != nil { return ...,
fmt.Errorf(...) }` with a log statement (e.g., log.Info or log.Error) that
records the error and proceeds; ensure any client logout still runs when created
and no early return occurs on subscription cleanup failure.
- Around line 549-579: handleEventSubscriptions currently defers persisting
subscription links until both SubscribeEvents and SubscribeMetricsReport have
succeeded, which can lose the first result if the second fails; change the flow
so after a successful SubscribeEvents call you immediately set
bmcObj.Status.EventsSubscriptionLink and call r.Status().Patch(ctx, bmcObj,
client.MergeFrom(bmcBase)) (handling its error) before proceeding to call
serverevents.SubscribeMetricsReport, and likewise persist the MetricsReport link
immediately after its success; reference the function handleEventSubscriptions,
the calls to serverevents.SubscribeEvents and
serverevents.SubscribeMetricsReport, and the use of r.Status().Patch and
bmcObj.Status.EventsSubscriptionLink / MetricsReportSubscriptionLink when making
this change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6e5fb79e-9d94-45d6-bf53-2bed96eb3313

📥 Commits

Reviewing files that changed from the base of the PR and between dbc8edd and adca1ef.

📒 Files selected for processing (8)
  • bmc/bmc.go
  • bmc/redfish.go
  • cmd/main.go
  • docs/api-reference/api.md
  • go.mod
  • internal/controller/bmc_controller.go
  • internal/controller/server_controller.go
  • internal/controller/suite_test.go
💤 Files with no reviewable changes (1)
  • internal/controller/server_controller.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • go.mod
  • docs/api-reference/api.md

Comment thread internal/controller/bmc_controller.go
@github-actions github-actions Bot added size/XL and removed size/XXL labels Mar 12, 2026
Changes:
- Remove non-blocking select that would drop events when pool is full
- Block on semaphore acquisition to ensure all critical events are processed
- Add comprehensive tests verifying no events are dropped under load
Changes:
- Change deleteEventSubscription errors from blocking to logged warnings
- BMC deletion now proceeds even if subscriptions cannot be cleaned up
- Add test verifying deletion succeeds despite subscription cleanup failure

This addresses CodeRabbit feedback: subscription cleanup failures (due to
already-deleted subscriptions, network errors, or powered-off BMCs) should
not prevent BMC resource deletion and leave resources stuck in Terminating
state.

Scenarios now handled gracefully:
- Subscription already deleted directly on BMC (404 error)
- BMC factory reset or firmware upgrade (subscription lost)
- Network errors during cleanup
- BMC powered off or unreachable

The behavior is now consistent with BMC client creation failure handling
(line 119), where errors are logged but don't block deletion.
Changes:
- Patch subscription links immediately after each successful creation
- Move bmcBase deep copy inside each subscription block
- Each subscription now persisted independently

This fixes a data loss bug where EventsSubscriptionLink would be lost if
SubscribeMetricsReport failed after SubscribeEvents succeeded. The previous
code batched both subscriptions and only patched at the end, causing the
first subscription link to be discarded on error.

New behavior:
- EventsSubscription created and persisted → then MetricsReport attempted
- If MetricsReport fails, EventsSubscription link is already saved
- On next reconciliation, only MetricsReport subscription is retried
- Prevents duplicate subscription creation on BMC

This ensures partial success is retained and subscription state is always
consistent with what exists on the BMC.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api-change area/metal-automation documentation Improvements or additions to documentation size/XL

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

3 participants