kube-opex-analytics


Overview

kube-opex-analytics is a Kubernetes usage accounting and analytics tool that helps organizations track CPU, Memory, and GPU resources consumed by their clusters over time (hourly, daily, monthly).

It provides insightful usage analytics metrics and charts that engineering and financial teams can use as key indicators for cost optimization decisions.

Tracked Resources

  • CPU - Core usage and requests per namespace
  • Memory - RAM consumption and requests per namespace
  • GPU - NVIDIA GPU utilization via DCGM integration (v26.01.0-beta1 or later)

kube-opex-analytics-overview

Multi-cluster Integration: kube-opex-analytics tracks usage for a single Kubernetes cluster. For centralized multi-cluster analytics, see Krossboard Kubernetes Operator (demo video).

Key Features

  • Hourly/Daily/Monthly Trends: Tracks actual usage and requested capacities per namespace, collected every 5 minutes and consolidated hourly
  • Non-allocatable Capacity Tracking: Highlights system overhead (OS, kubelets) vs. usable application capacity at node and cluster levels
  • Cluster Capacity Planning: Visualizes consumed capacity globally, instantly, and over time
  • Usage Efficiency Analysis: Compares resource requests against actual usage to identify over- and under-provisioning
  • Cost Allocation & Chargeback: Automatic resource usage accounting per namespace for billing and showback
  • Prometheus Integration: Native exporter at /metrics for Grafana dashboards and alerting

Quick Start

Prerequisites

  • Kubernetes cluster v1.19+ (or OpenShift 4.x+)
  • kubectl configured with cluster access
  • Helm 3.x (for a fine-tuned installation) or kubectl (for a basic, opinionated deployment)
  • Cluster permissions: read access to pods, nodes, and namespaces (an illustrative RBAC sketch follows this list)
  • Kubernetes Metrics Server deployed in your cluster (required for CPU and memory metrics)
  • NVIDIA DCGM Exporter deployed in your cluster (required for GPU metrics, optional if no GPUs)
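
The Helm chart and Kustomize manifests ship the required RBAC objects, so you normally do not need to create them yourself. For reference only, the read access they need is roughly equivalent to the following sketch (the name koa-readonly and the metrics.k8s.io rule are assumptions, not the objects used by the project's manifests):

# Illustrative ClusterRole only; the bundled manifests create their own RBAC objects
kubectl apply -f - <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: koa-readonly
rules:
  - apiGroups: [""]
    resources: ["pods", "nodes", "namespaces"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["metrics.k8s.io"]   # assumed, for Metrics Server queries
    resources: ["pods", "nodes"]
    verbs: ["get", "list"]
EOF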

Verify Metrics Server

Before installing, ensure metrics-server is running in your cluster:

# Check if metrics-server is deployed
kubectl -n kube-system get deploy | grep metrics-server

# Verify it's working
kubectl top nodes

# If not installed, deploy with kubectl
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

Verify DCGM Exporter (GPU metrics)

If your cluster has NVIDIA GPUs and you want GPU metrics, ensure DCGM Exporter is running:

# Check if DCGM Exporter is deployed
kubectl get daemonset -A | grep dcgm

# If not installed, deploy with Helm (requires NVIDIA GPU Operator or drivers)
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace gpu-operator \
  --create-namespace

Clone the Repository

git clone https://github.com/rchakode/kube-opex-analytics.git --depth=1
cd kube-opex-analytics

Install with Kustomize (Quick Start)

OpenShift users: Skip this section and use the Helm installation below with OpenShift-specific settings.

# Create namespace
kubectl create namespace kube-opex-analytics

# Deploy using Kustomize
kubectl apply -k ./manifests/kustomize -n kube-opex-analytics

# Watch pod status
kubectl get pods -n kube-opex-analytics -w

Install with Helm (Advanced)

For advanced customization (OpenShift, custom storage, etc.), edit manifests/helm/values.yaml or keep the settings in a separate override file (a sketch follows the list below):

  • OpenShift: Set securityContext.openshift: true
  • Custom storage: Set dataVolume.storageClass and dataVolume.capacity
  • DCGM Integration: Set dcgm.enabled: true and dcgm.endpoint
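
If you prefer not to modify the chart's values.yaml in place, the same settings can be kept in an override file. A minimal sketch, assuming the value keys listed above; the storage class and capacity values are placeholders:

# Sketch of a Helm override file; keys follow the settings listed above
cat > koa-values.yaml <<'EOF'
securityContext:
  openshift: true            # OpenShift clusters only
dataVolume:
  storageClass: standard     # replace with a storage class available in your cluster
  capacity: 4Gi              # adjust to your retention needs
dcgm:
  enabled: true              # enable GPU metrics collection
  endpoint: http://dcgm-exporter.gpu-operator:9400/metrics
EOF

Pass it to the Helm command below with an extra -f koa-values.yaml flag.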

Then run:

# Create namespace
kubectl create namespace kube-opex-analytics

# Install with Helm
helm upgrade --install kube-opex-analytics ./manifests/helm -n kube-opex-analytics

# Watch pod status
kubectl get pods -n kube-opex-analytics -w

Access the Dashboard

# Port-forward to access the UI
kubectl port-forward svc/kube-opex-analytics 5483:80 -n kube-opex-analytics

# Open http://localhost:5483 in your browser
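
While the port-forward is running, you can quickly check that the application responds, including the Prometheus endpoint mentioned earlier:

# Check that the web UI answers
curl -sI http://localhost:5483 | head -n 1

# Check that the Prometheus exporter serves metrics
curl -s http://localhost:5483/metrics | head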

Install with Docker

Requires kubectl proxy running locally to provide API access:

# Start kubectl proxy in background
kubectl proxy &

# Run kube-opex-analytics
docker run -d \
  --net="host" \
  --name kube-opex-analytics \
  -v /var/lib/kube-opex-analytics:/data \
  -e KOA_DB_LOCATION=/data/db \
  -e KOA_K8S_API_ENDPOINT=http://127.0.0.1:8001 \
  rchakode/kube-opex-analytics

# Access at http://localhost:5483
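
A couple of optional checks once the container is up:

# Follow the container logs
docker logs -f kube-opex-analytics

# Confirm the application responds (host networking exposes port 5483 directly)
curl -s http://localhost:5483/metrics | head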

Architecture

┌───────────────────┐
│  Metrics Server   │──┐
│  (CPU/Memory)     │  │    ┌──────────────────────────────────────┐
└───────────────────┘  ├───>│         kube-opex-analytics          │
┌───────────────────┐  │    │  ┌─────────┐  ┌────────┐  ┌─────────┐│
│  DCGM Exporter    │──┘    │  │ Poller  │─>│RRD DBs │─>│ API     ││
│  (GPU metrics)    │       │  │ (5 min) │  │        │  │         ││
└───────────────────┘       │  └─────────┘  └────────┘  └───┬─────┘│
                            └───────────────────────────────┼──────┘
                                                            │
                            ┌───────────────────────────────┼──────┐
                            │                               v      │
                            │  ┌────────────┐    ┌────────────┐    │
                            │  │  Web UI    │    │  /metrics  │    │
                            │  │  (D3.js)   │    │ (Prometheus│    │
                            │  └────────────┘    └────────────┘    │
                            └──────────────────────────────────────┘
                                     │                  │
                                     v                  v
                              Built-in Dashboards   Grafana/Alerting

Data Flow:

  1. Metrics polled every 5 minutes (configurable):
    • CPU/Memory from Kubernetes Metrics Server
    • GPU from NVIDIA DCGM Exporter
  2. Metrics are processed and stored in internal lightweight time-series databases (round-robin DBs)
  3. Data is consolidated into hourly, daily, and monthly aggregates
  4. API serves data to the built-in web UI and Prometheus scraper

Documentation

  • Installation on Kubernetes/OpenShift: docs/installation-on-kubernetes-and-openshift.md
  • Installation on Docker: docs/installation-on-docker.md
  • Built-in Dashboards: docs/built-in-dashboards-and-charts.md
  • Prometheus & Grafana: docs/prometheus-exporter-grafana-dashboard.md
  • Configuration Reference: docs/configuration-settings.md
  • Design Fundamentals: docs/design-fundamentals.md

Configuration

Key environment variables:

  • KOA_K8S_API_ENDPOINT: Kubernetes API server URL (required)
  • KOA_K8S_AUTH_TOKEN: Service account token (auto-detected in-cluster)
  • KOA_DB_LOCATION: Path for RRDtool databases (default: /data)
  • KOA_POLLING_INTERVAL_SEC: Metrics collection interval in seconds (default: 300)
  • KOA_COST_MODEL: Billing model, one of CUMULATIVE_RATIO, RATIO, CHARGE_BACK (default: CUMULATIVE_RATIO)
  • KOA_BILLING_HOURLY_RATE: Hourly cost for the chargeback model (default: -1.0)
  • KOA_BILLING_CURRENCY_SYMBOL: Currency symbol for cost display (default: $)
  • KOA_NVIDIA_DCGM_ENDPOINT: NVIDIA DCGM Exporter endpoint for GPU metrics (not set by default; GPU collection disabled)
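
For example, to switch from the default cumulative-ratio accounting to an hourly chargeback model, the relevant variables can be combined with the Docker deployment shown earlier (the rate and currency values below are placeholders):

# Chargeback configuration example; rate and currency are placeholder values
docker run -d \
  --net="host" \
  --name kube-opex-analytics \
  -v /var/lib/kube-opex-analytics:/data \
  -e KOA_DB_LOCATION=/data/db \
  -e KOA_K8S_API_ENDPOINT=http://127.0.0.1:8001 \
  -e KOA_COST_MODEL=CHARGE_BACK \
  -e KOA_BILLING_HOURLY_RATE=7.5 \
  -e KOA_BILLING_CURRENCY_SYMBOL='€' \
  rchakode/kube-opex-analytics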

GPU Metrics (NVIDIA DCGM)

To enable GPU metrics collection, set the DCGM Exporter endpoint:

# Environment variable
export KOA_NVIDIA_DCGM_ENDPOINT=http://dcgm-exporter.gpu-operator:9400/metrics

# Or with Helm
helm upgrade --install kube-opex-analytics ./manifests/helm \
  --set dcgm.enabled=true \
  --set dcgm.endpoint=http://dcgm-exporter.gpu-operator:9400/metrics
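
To verify that the configured endpoint is reachable from inside the cluster, a throwaway curl pod can help (DCGM_FI_DEV_GPU_UTIL is one of the standard DCGM Exporter metrics):

# Probe the DCGM Exporter endpoint from inside the cluster
kubectl run dcgm-probe --rm -i --restart=Never --image=curlimages/curl -- \
  curl -s http://dcgm-exporter.gpu-operator:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL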

See Configuration Settings for the complete reference.

Troubleshooting

Common Issues

Pod stuck in CrashLoopBackOff

  • Check logs: kubectl logs -f deployment/kube-opex-analytics -n kube-opex-analytics
  • Verify RBAC permissions are correctly applied
  • Ensure the service account has read access to pods and nodes
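
The kubectl auth can-i command can confirm the permissions directly; the service account name below assumes the defaults used by this repository's manifests and may differ in your setup:

# Verify the service account's read access (account name/namespace are assumptions)
kubectl auth can-i list pods --all-namespaces \
  --as=system:serviceaccount:kube-opex-analytics:kube-opex-analytics
kubectl auth can-i list nodes \
  --as=system:serviceaccount:kube-opex-analytics:kube-opex-analytics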

No data appearing in dashboard

  • Wait at least 5-10 minutes for initial data collection
  • Verify the pod can reach the Kubernetes API: check for connection errors in logs
  • Confirm KOA_K8S_API_ENDPOINT is correctly set

Metrics not appearing in Prometheus

  • Ensure the /metrics endpoint is accessible
  • Check ServiceMonitor/PodMonitor configuration if using the Prometheus Operator (a minimal ServiceMonitor sketch follows this list)
  • Verify network policies allow Prometheus to scrape the pod
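
If you use the Prometheus Operator, a ServiceMonitor along these lines is a reasonable starting point; the label selector and port name are assumptions and must match the Service created by your installation:

# ServiceMonitor sketch; adjust the selector and port name to match your Service
kubectl apply -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kube-opex-analytics
  namespace: kube-opex-analytics
spec:
  selector:
    matchLabels:
      app: kube-opex-analytics   # assumption: check the labels on the Service
  endpoints:
    - port: http                 # assumption: the name of the Service port
      path: /metrics
EOF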

Polling interval

  • By default, the polling interval for collecting raw metrics from the Kubernetes API or the NVIDIA DCGM Exporter is 300 seconds (5 minutes).
  • You can change this interval with the KOA_POLLING_INTERVAL_SEC variable. Always use a multiple of 300 seconds, as the backend RRD databases use a 5-minute resolution.
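
For example, to collect raw metrics every 10 minutes on a Kubernetes deployment (note that a later helm upgrade may reset a value changed this way):

# Poll every 600 seconds (a multiple of 300)
kubectl -n kube-opex-analytics set env deployment/kube-opex-analytics \
  KOA_POLLING_INTERVAL_SEC=600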

License

kube-opex-analytics is licensed under Apache License 2.0.

Third-party library licenses are documented in NOTICE.

Support & Contributions

We welcome feedback and contributions!

All contributions must be released under the terms of the Apache 2.0 License.