fix(infra): fix DR gaps discovered during Stage cluster recovery test #3452
manamana32321 wants to merge 10 commits into main
- Rename AWS secret from Codedang-Sealed-Secrets-Prod to
Codedang-Sealed-Secrets-Production to match bootstrap script's
${ENVIRONMENT^} convention (fixes bootstrap failure on production DR)
- Add SKIP_ARGOCD option to bootstrap script for stage clusters
managed by production ArgoCD
- Fix helm commands missing --kube-context when CLUSTER_CONTEXT is set
- Add DR test plan document
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
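The naming convention and the context handling described in this commit can be sketched as follows. This is a minimal illustration, not the actual bootstrap script: the function names `secret_name` and `helm_base` are hypothetical, and the capitalization is written in a POSIX-portable way rather than with bash's `${ENVIRONMENT^}`.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not taken from the bootstrap script.

# Capitalize the first letter of the environment name (what ${ENVIRONMENT^} does
# in bash 4+) and build the AWS secret name from it.
secret_name() {
  first=$(printf '%s' "$1" | cut -c1 | tr '[:lower:]' '[:upper:]')
  rest=$(printf '%s' "$1" | cut -c2-)
  printf 'Codedang-Sealed-Secrets-%s%s\n' "$first" "$rest"
}

# Emit the helm command prefix, adding --kube-context only when CLUSTER_CONTEXT is set.
helm_base() {
  if [ -n "${CLUSTER_CONTEXT:-}" ]; then
    printf 'helm --kube-context %s\n' "$CLUSTER_CONTEXT"
  else
    printf 'helm\n'
  fi
}
```

With `production` as the environment, `secret_name` yields `Codedang-Sealed-Secrets-Production`, which is why a secret named `Codedang-Sealed-Secrets-Prod` could not be found by the script.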
…ler type
- sealed-secrets Helm chart uses release name as deployment/service name (`sealed-secrets`), not `sealed-secrets-controller`
- ArgoCD application-controller is a StatefulSet, not a Deployment
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Application CRD doesn't exist until ArgoCD is installed, so kubectl apply -f argocd.yaml would fail on a fresh cluster. Now bootstraps ArgoCD via Helm first, then applies the self-management Application for GitOps takeover. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
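The ordering this commit describes could look roughly like the sketch below. It assumes the Helm repo alias `argo` and the `argocd` namespace that appear elsewhere in this PR; the `run` wrapper, the `DRY_RUN` variable, and the `bootstrap_argocd` function are hypothetical additions for illustration, and the manifest path is supplied by the caller.

```shell
#!/bin/sh
# Hypothetical sketch: install ArgoCD with Helm first, then hand over to GitOps.

# Print the command instead of executing it when DRY_RUN=1 (illustrative helper).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

bootstrap_argocd() {
  manifest=$1
  # 1. The Helm install creates the Application CRD along with ArgoCD itself.
  run helm upgrade --install argocd argo/argo-cd \
    --namespace argocd --create-namespace || return 1
  # 2. Wait until the CRD is established so the apply below cannot race it.
  run kubectl wait --for=condition=established \
    crd/applications.argoproj.io --timeout=120s || return 1
  # 3. Only now can the self-management Application be applied.
  run kubectl apply -f "$manifest"
}
```

Running it with `DRY_RUN=1` prints the three commands in order without touching a cluster, which is a cheap way to review the sequence before a real recovery.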
DR-TEST-PLAN.md is for local reference only, not for the repository. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Without automated sync, bootstrap requires manual sync trigger before ArgoCD creates child applications. This blocks full automated DR recovery. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add ArgoCD-managed ApplicationSets for operators that were previously installed manually: redis-operator, rabbitmq cluster/topology operators, minio-operator, otel-operator, and reflector
- Add ServerSideApply=true to ARC to handle large CRDs (>262KB)
- Add sync-wave '-1' to all CRD providers (operators, cert-manager, sealed-secrets) so they deploy before consumers
- Remove unused kubernetes-dashboard Application (replaced by headlamp)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Adjust sync-waves: sealed-secrets/cert-manager to -3 (highest priority), operators to -2, app services remain at 0 (default)
- Include github-app-secret SealedSecret in arc-runner-scale-set Application via multi-source directory include, so DR restores ARC runners without manual kubectl apply
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f8f0c8b573
Upgrade from v2.12.0 to v2.19.1 (latest stable). v2.12.0 is the official baseline upgrade version, so direct upgrade is supported. Note: this will cause a rolling update of RabbitMQ StatefulSets. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Prometheus was using emptyDir, losing all TSDB data on pod restart. With 90d retention configured but no PVC, metrics history was wiped on every DR or node restart. Add 50Gi local-path PVC for both stage and production. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
6ce02eb to bdb61f8
sync-wave was on ApplicationSet metadata, but ArgoCD reads it from the generated Application objects. Move annotations to spec.template.metadata.annotations so wave ordering actually works. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@codex do additional review
Codex Review: Didn't find any major issues. Chef's kiss.
helm repo update
$HELM upgrade --install argocd argo/argo-cd \
  --namespace argocd \
  --version 9.4.2 \
The version is pinned in infra/k8s/argocd/applications/argocd.yaml and pinned here as well. That could cause conflicts when the version is updated later. What do you think?
Good catch. Right now there is no guaranteed SSOT (single source of truth) for the version.
I think this is still better than using the latest tag, but it is not the right answer.
Is there a better approach? I'm not sure myself.
Since this script runs inside the cloned repo anyway, I think we could parse argocd.yaml from within the shell script.
ARGOCD_VERSION=$(grep "targetRevision:" \
"${SCRIPT_DIR}/k8s/argocd/applications/argocd.yaml" \
| head -1 | awk -F"'" '{print $2}')
$HELM upgrade --install argocd argo/argo-cd \
--namespace argocd \
--version "${ARGOCD_VERSION}" \
...something along these lines.
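A slightly more defensive variant of the suggestion above might look like this. It is only a sketch: the function name `extract_target_revision` is hypothetical, and it assumes the first `targetRevision:` line in the file is the chart version, optionally wrapped in single or double quotes and without a trailing comment.

```shell
#!/bin/sh
# Hypothetical helper: read the first targetRevision value from an Application manifest.
extract_target_revision() {
  # Strip the key, keep the first match, drop quotes and trailing whitespace.
  version=$(sed -n 's/^[[:space:]]*targetRevision:[[:space:]]*//p' "$1" \
    | head -n 1 | tr -d "\"'" | sed 's/[[:space:]]*$//')
  # Fail loudly rather than passing an empty --version to helm.
  if [ -z "$version" ]; then
    echo "targetRevision not found in $1" >&2
    return 1
  fi
  printf '%s\n' "$version"
}
```

In the bootstrap script it could be called as `ARGOCD_VERSION=$(extract_target_revision "${SCRIPT_DIR}/k8s/argocd/applications/argocd.yaml") || exit 1`, so that argocd.yaml stays the only place where the version is pinned.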
Description
Fixes infrastructure gaps found during the Stage cluster DR (Disaster Recovery) test.

Bootstrap script fixes
- …Prod→Production)
- Fix the `sealed-secrets` deployment/service name to follow the Helm chart release name
- Reference `application-controller` correctly as a StatefulSet
- Add `SKIP_ARGOCD` option

Move operators that were outside ArgoCD management to declarative management
Create ArgoCD ApplicationSets for the 6 operators that were previously installed only by manual `kubectl apply`:

Guarantee deployment order (sync-wave)
Layer sync-waves so that CRD providers are deployed before consumers:

Add Prometheus persistent storage
- Change `emptyDir` → 10Gi `local-path` PVC

Other
- Add `ServerSideApply=true` (CRDs exceeding 262KB)
- Include the `github-app-secret` SealedSecret in the arc-runner-scale-set multi-source
- Delete the `kubernetes-dashboard` Application (replaced by headlamp)

closes TAS-2598

Additional context
Operational procedure issues additionally found during the DR test (outside the code):
- `operationState` has to be removed with `kubectl patch`