Add Azure Monitor alerting for cloud-level resources #143

Draft
ian-flores wants to merge 3 commits into main from azure-monitor-alerting

@ian-flores
Contributor

Summary

Add Azure Monitor-based alerting for Azure cloud resources, equivalent to PR #139 for AWS CloudWatch.

Closes: ptd-config#2779

  • Add prometheus.exporter.azure config blocks to Alloy for Azure workloads (PostgreSQL, NetApp Files, Load Balancer, Storage, NAT Gateway)
  • Create Monitoring Reader RBAC role assignment and workload identity for Alloy managed identity
  • Create Azure-specific Grafana alert rule YAML files (azure_postgres, azure_netapp, azure_loadbalancer, azure_storage)
  • Deploy alert ConfigMaps to Azure clusters, covering both cloud-agnostic and Azure-specific alerts (previously no alerts were deployed to Azure at all)
  • Enable Grafana sidecar for alert provisioning on Azure
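
For reviewers unfamiliar with the exporter, here is a minimal sketch of how one `prometheus.exporter.azure` block might be rendered with the subscription and resource group interpolated. The function name `render_azure_exporter`, the resource type string, and the metric names are illustrative assumptions, not taken from this PR's diff.

```python
from __future__ import annotations


def render_azure_exporter(
    name: str,
    subscription_id: str,
    resource_group: str,
    resource_type: str,
    metrics: list[str],
) -> str:
    """Render one prometheus.exporter.azure block as Alloy config text.

    Hypothetical helper: the PR's real generator may be structured differently.
    """
    metric_list = ", ".join(f'"{m}"' for m in metrics)
    lines = [
        f'prometheus.exporter.azure "{name}" {{',
        f'  subscriptions = ["{subscription_id}"]',
        f'  resource_type = "{resource_type}"',
        # Restrict discovery to the workload's resource group (assumed filter).
        f"  resource_graph_query_filter = \"where resourceGroup =~ '{resource_group}'\"",
        f"  metrics = [{metric_list}]",
        "}",
    ]
    return "\n".join(lines) + "\n"


# Example values are placeholders, not real identifiers.
postgres_block = render_azure_exporter(
    name="postgres",
    subscription_id="00000000-0000-0000-0000-000000000000",
    resource_group="example-rg",
    resource_type="Microsoft.DBforPostgreSQL/flexibleServers",
    metrics=["cpu_percent", "memory_percent", "storage_percent"],
)
print(postgres_block)
```

Generating the block from a function keeps the subscription and resource group in one place, which is presumably what the interpolation tests in this PR exercise.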

Alert rules

| Resource | Alert | Threshold | Duration |
|---|---|---|---|
| PostgreSQL | CPU High | >80% | 10m |
| PostgreSQL | Storage High | >80% | 5m |
| PostgreSQL | Memory High | >80% | 10m |
| PostgreSQL | Connections High | >500 | 5m |
| PostgreSQL | Failed Connections | >10 | 5m |
| PostgreSQL | Deadlocks | >0 | 5m |
| NetApp Files | Capacity High | >80% | 10m |
| NetApp Files | Read Latency High | >10ms | 10m |
| NetApp Files | Write Latency High | >10ms | 10m |
| Load Balancer | Health Probe Down | <100% | 5m |
| Load Balancer | Data Path Down | <100% | 5m |
| Load Balancer | SNAT Port Exhaustion | >80% | 5m |
| Storage | Availability Low | <99.9% | 5m |
| Storage | Latency High | >1000ms | 10m |
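
To make the threshold and duration columns concrete, the sketch below models three of the rules and a simplified "condition must hold for the whole duration" check. It assumes one sample per minute and is only an approximation of how Grafana's pending period behaves. `AlertRule` and `should_fire` are hypothetical names, not code from this PR.

```python
from __future__ import annotations

import operator
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    resource: str
    alert: str
    op: str            # ">" or "<", as written in the threshold column
    threshold: float
    for_minutes: int   # the duration column

    def breaches(self, value: float) -> bool:
        """True if a single sample violates the threshold."""
        compare = operator.gt if self.op == ">" else operator.lt
        return compare(value, self.threshold)


# Three rows copied from the alert table above.
CPU_HIGH = AlertRule("PostgreSQL", "CPU High", ">", 80, 10)
DEADLOCKS = AlertRule("PostgreSQL", "Deadlocks", ">", 0, 5)
PROBE_DOWN = AlertRule("Load Balancer", "Health Probe Down", "<", 100, 5)


def should_fire(rule: AlertRule, samples: list[float]) -> bool:
    """Fire only if every one of the last `for_minutes` samples breaches.

    Assumes exactly one sample per minute, oldest first.
    """
    window = samples[-rule.for_minutes:]
    return len(window) == rule.for_minutes and all(rule.breaches(v) for v in window)
```

Under these assumptions a single dip below the threshold inside the window resets the alert, which is why brief CPU spikes would not page anyone with a 10m duration.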

Bonus fix

Azure clusters previously received no alerts at all, not even the cloud-agnostic ones (pods, nodes, healthchecks, applications, mimir). This PR deploys all of them.

Test plan

  • All 190 tests pass (163 existing + 27 new)
  • Lint and format clean
  • Deploy to Azure test cluster (duplicado03-staging)
  • Verify metrics arrive in Mimir
  • Verify alert rules appear in Grafana Alerting

Add prometheus.exporter.azure config blocks to Alloy for Azure workloads
covering PostgreSQL, NetApp Files, Load Balancer, Storage, and NAT Gateway
(conditional on public_subnet_cidr). Create Monitoring Reader role
assignment and workload identity for Alloy service account.

Ref: ptd-config#2779
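
The NAT Gateway conditional described in this commit could look roughly like the following. This is a sketch under assumptions: the function name, the `cloud` argument, and the exact Azure resource type strings are illustrative and may not match the PR's code.

```python
from __future__ import annotations

from typing import Optional

# Resource types scraped for every Azure workload (assumed strings).
BASE_RESOURCE_TYPES = [
    "Microsoft.DBforPostgreSQL/flexibleServers",
    "Microsoft.NetApp/netAppAccounts/capacityPools/volumes",
    "Microsoft.Network/loadBalancers",
    "Microsoft.Storage/storageAccounts",
]
NAT_GATEWAY_TYPE = "Microsoft.Network/natGateways"


def azure_resource_types(cloud: str, public_subnet_cidr: Optional[str]) -> list[str]:
    """Return the Azure resource types to scrape for a workload.

    Non-Azure workloads get nothing; NAT Gateway is included only when
    the workload defines a public subnet.
    """
    if cloud != "azure":
        return []
    types = list(BASE_RESOURCE_TYPES)
    if public_subnet_cidr:
        types.append(NAT_GATEWAY_TYPE)
    return types
```

Treating an empty string the same as `None` is a deliberate choice in this sketch, so that an unset or blank `public_subnet_cidr` both skip the NAT Gateway block.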

Create Grafana provisioned alert YAML files for Azure cloud resources:
- azure_postgres.yaml: CPU, storage, memory, connections, deadlocks
- azure_netapp.yaml: capacity, read/write latency
- azure_loadbalancer.yaml: health probe, data path, SNAT exhaustion
- azure_storage.yaml: availability, E2E latency

Ref: ptd-config#2779

Add 27 tests covering:
- Azure Monitor Alloy config generation (metric blocks, NAT conditional,
  subscription/resource group interpolation, AWS returns empty)
- Alert YAML file validation (existence, structure, metric queries)
- Alloy monitoring identity method existence and signature

Ref: ptd-config#2779
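
As an illustration of the "structure" part of the alert YAML validation, the sketch below checks an already-parsed alert file (a plain dict) against the general shape of Grafana's file-provisioned alert rules. The required-key lists and the sample rule are assumptions for illustration and are not the PR's actual tests or alert definitions.

```python
from __future__ import annotations

GROUP_KEYS = ("name", "folder", "interval", "rules")
RULE_KEYS = ("uid", "title", "condition", "data", "for")


def validate_alert_file(doc: dict) -> list[str]:
    """Return a list of human-readable structural problems (empty if none)."""
    errors: list[str] = []
    if doc.get("apiVersion") != 1:
        errors.append("apiVersion must be 1")
    groups = doc.get("groups")
    if not isinstance(groups, list) or not groups:
        errors.append("groups must be a non-empty list")
        return errors
    for gi, group in enumerate(groups):
        for key in GROUP_KEYS:
            if key not in group:
                errors.append(f"groups[{gi}] is missing '{key}'")
        for ri, rule in enumerate(group.get("rules", [])):
            for key in RULE_KEYS:
                if key not in rule:
                    errors.append(f"groups[{gi}].rules[{ri}] is missing '{key}'")
    return errors


# Hypothetical parsed content, standing in for a file such as azure_postgres.yaml.
sample = {
    "apiVersion": 1,
    "groups": [
        {
            "name": "azure_postgres",
            "folder": "Azure",
            "interval": "1m",
            "rules": [
                {
                    "uid": "azure_postgres_cpu_high",
                    "title": "PostgreSQL CPU High",
                    "condition": "C",
                    "data": [],
                    "for": "10m",
                }
            ],
        }
    ],
}
```

Returning a list of problems rather than raising on the first one makes a failing test report every structural issue in a file at once.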