AKS Without the Headaches: GitOps, Fleet Management, and Day-2 Operations

by G.R Badhon

Goal: Production-grade AKS lifecycle using GitOps and multi-cluster management, with pragmatic day-2 operations.

Architecture

 GitHub or Azure Repos
                             main     env/dev     env/prod
                               |          |            |
                               |   PR->merge gates     |
                               v          v            v
                        ┌──────────────────────────────────┐
                        │               Flux                │
                        │  GitRepository + Kustomization   │
                        │  Reconcile loop per cluster      │
                        └──────────────────────────────────┘
                                 ^       ^        ^
                                 |       |        |
                    ┌────────────┴───┐ ┌─┴─────────────┐ ┌───────────────┐
                    │  Cluster Dev   │ │  Cluster QA   │ │  Cluster Prod │
                    │  MI + OIDC     │ │  MI + OIDC    │ │  MI + OIDC    │
                    │  CNI + NP      │ │  CNI + NP     │ │  CNI + NP     │
                    └──────┬─────────┘ └─────┬─────────┘ └──────┬────────┘
                           |                 |                   |
                   flux bootstrap      flux bootstrap      flux bootstrap
                           |                 |                   |
              bootstrap repo path   bootstrap repo path   bootstrap repo path
                 clusters/dev          clusters/qa           clusters/prod 

Notes: OIDC issuer and Workload Identity connect pods to Azure resources. CNI is Azure CNI or Cilium. NP means NetworkPolicy.

Cluster IaC

az CLI quickstart

# Resource group and ACR
az group create -n rg-aks-prod -l uksouth
az acr create -n acrprod123 -g rg-aks-prod --sku Premium

# AKS with managed identity, OIDC, workload identity, autoscaler
az aks create \
  -g rg-aks-prod -n aks-prod \
  --enable-managed-identity \
  --enable-oidc-issuer --enable-workload-identity \
  --network-plugin azure \
  --generate-ssh-keys \
  --node-count 3 \
  --enable-cluster-autoscaler --min-count 3 --max-count 10 \
  --auto-upgrade-channel stable

# Attach ACR pull
az aks update -g rg-aks-prod -n aks-prod --attach-acr acrprod123

# Add a user node pool for workloads
az aks nodepool add -g rg-aks-prod --cluster-name aks-prod \
  -n userpool --mode User --node-count 3 \
  --enable-cluster-autoscaler --min-count 3 --max-count 20

# Flux bootstrap against repo
az aks get-credentials -g rg-aks-prod -n aks-prod
flux bootstrap github \
  --owner your-org --repository aks-gitops \
  --branch env/prod --path clusters/prod
# add --personal only when the owner is a personal account, not an org
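After bootstrap, confirm the controllers are healthy and the first reconciliation converged before moving on (standard flux CLI checks):

```shell
# Verify prerequisites and controller health
flux check

# Watch the Git source and Kustomizations converge
flux get sources git
flux get kustomizations --watch

# If something is stuck, pull controller error logs
flux logs --level=error --all-namespaces
```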

Bicep sketch

param location string = resourceGroup().location
param clusterName string = 'aks-prod'
param acrId string

resource aks 'Microsoft.ContainerService/managedClusters@2024-05-01' = {
  name: clusterName
  location: location
  identity: { type: 'SystemAssigned' }
  properties: {
    dnsPrefix: '${clusterName}-dns'
    oidcIssuerProfile: { enabled: true }
    securityProfile: { workloadIdentity: { enabled: true } }
    kubernetesVersion: '1.29.7' // pin intentionally
    apiServerAccessProfile: {
      enablePrivateCluster: true
    }
    autoScalerProfile: {
      maxEmptyBulkDelete: '10'
      scanInterval: '20s'
    }
    agentPoolProfiles: [
      {
        name: 'system'
        mode: 'System'
        count: 3
        vmSize: 'Standard_D4s_v5'
        osType: 'Linux'
        enableAutoScaling: true
        minCount: 3
        maxCount: 10
        orchestratorVersion: '1.29.7'
      }
    ]
    networkProfile: {
      networkPlugin: 'azure' // add networkDataplane: 'cilium' for Azure CNI powered by Cilium, or use 'none' for BYO CNI
    }
  }
}

@description('User pool defined as child resource for surge and taints')
resource userpool 'Microsoft.ContainerService/managedClusters/agentPools@2024-05-01' = {
  parent: aks
  name: 'userpool'
  properties: {
    mode: 'User'
    count: 3
    vmSize: 'Standard_D4s_v5'
    osType: 'Linux'
    enableAutoScaling: true
    minCount: 3
    maxCount: 20
    nodeTaints: [ 'workload=true:NoSchedule' ]
    upgradeSettings: { maxSurge: '33%' }
  }
}

// ACR pull permission for the kubelet identity (the identity that actually pulls images)
resource acr 'Microsoft.ContainerRegistry/registries@2023-07-01' existing = {
  name: split(acrId, '/')[8]
}

resource acrRole 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  name: guid(aks.id, acr.id, 'AcrPull')
  scope: acr
  properties: {
    roleDefinitionId: subscriptionResourceId('Microsoft.Authorization/roleDefinitions', '7f951dda-4ed3-4680-a7ca-43fe172d538d')
    principalId: aks.properties.identityProfile.kubeletidentity.objectId
    principalType: 'ServicePrincipal'
  }
}

Workload Identity for a pod

apiVersion: v1
kind: ServiceAccount
metadata:
  name: sa-blob
  namespace: app
  annotations:
    azure.workload.identity/client-id: <user-assigned-mi-client-id>
    azure.workload.identity/tenant-id: <tenant-id>
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: app
spec:
  replicas: 3
  selector: { matchLabels: { app: api } }
  template:
    metadata:
      labels:
        app: api
        azure.workload.identity/use: "true" # required: tells the webhook to inject credentials
    spec:
      serviceAccountName: sa-blob
      containers:
      - name: api
        image: acrprod123.azurecr.io/api:1.2.3
        # AZURE_CLIENT_ID, AZURE_TENANT_ID, and AZURE_FEDERATED_TOKEN_FILE
        # are injected automatically by the workload identity mutating webhook
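The annotation alone is not enough: the user-assigned identity must also carry a federated credential that trusts this service account via the cluster's OIDC issuer. A sketch with the az CLI (the identity name mi-app is an assumption):

```shell
# Look up the cluster's OIDC issuer URL
ISSUER=$(az aks show -g rg-aks-prod -n aks-prod \
  --query oidcIssuerProfile.issuerUrl -o tsv)

# Federate the user-assigned identity with the sa-blob ServiceAccount in namespace app
az identity federated-credential create \
  --name fic-sa-blob \
  --identity-name mi-app \
  --resource-group rg-aks-prod \
  --issuer "$ISSUER" \
  --subject "system:serviceaccount:app:sa-blob"
```

The subject must match `system:serviceaccount:<namespace>:<serviceaccount-name>` exactly, or token exchange fails.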

GitOps with Flux

Environment branches hold overlays. Flux reconciles per cluster.

apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: gitops
  namespace: flux-system
spec:
  interval: 1m
  url: https://github.com/your-org/aks-gitops
  ref: { branch: env/prod }
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: platform
  namespace: flux-system
spec:
  interval: 1m
  prune: true
  sourceRef: { kind: GitRepository, name: gitops }
  path: ./platform
  wait: true
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: workloads
  namespace: flux-system
spec:
  interval: 1m
  prune: true
  sourceRef: { kind: GitRepository, name: gitops }
  path: ./workloads/prod
  dependsOn:
  - name: platform 

Repo structure example:

/cluster-bootstrap
/clusters/dev
/clusters/prod
/platform/cert-manager
/platform/ingress
/workloads/dev
/workloads/prod 

Fleet and multi-cluster patterns

  • Namespaces vs clusters: use namespaces for teams with shared trust and soft isolation. Use dedicated clusters for tenant isolation, regulatory boundaries, or conflicting add-ons.
  • Placement: one repo, many clusters. Parameterize overlays by cluster labels. Flux supports Kustomize variable substitutions and JSON 6902 patches.
  • Multi tenancy: restrict RBAC to namespaces. Use network policies as a default deny.
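Per-cluster parameterization can use Flux's postBuild variable substitution, with values supplied by a ConfigMap that each cluster defines during bootstrap (names here are illustrative):

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: workloads
  namespace: flux-system
spec:
  interval: 1m
  prune: true
  sourceRef: { kind: GitRepository, name: gitops }
  path: ./workloads/base
  postBuild:
    substitute:
      cluster_name: prod       # inline defaults
    substituteFrom:
    - kind: ConfigMap
      name: cluster-vars       # per-cluster values, e.g. region or environment
```

Manifests under ./workloads/base then reference `${cluster_name}` and reconcile with cluster-specific values.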

NetworkPolicy baseline:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: app
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-and-egress-gw
  namespace: app
spec:
  podSelector: {}
  egress:
  - to:
    - namespaceSelector:
        matchLabels: { kubernetes.io/metadata.name: kube-system }
    ports: [ { port: 53, protocol: UDP }, { port: 53, protocol: TCP } ]
  - to:
    - ipBlock: { cidr: 0.0.0.0/0 } # replace with egress IP or firewall
    ports: [ { port: 443, protocol: TCP } ]
  policyTypes: [Egress] 

Day 2 operations

  • Upgrades: pin minor in Bicep. Rotate node images regularly.
# Control plane first
az aks upgrade -g rg-aks-prod -n aks-prod --control-plane-only --kubernetes-version 1.29.7
# Node images
az aks upgrade -g rg-aks-prod -n aks-prod --node-image-only
# Optional pool level
az aks nodepool upgrade -g rg-aks-prod --cluster-name aks-prod -n userpool 
  • Blue/green strategy: create a parallel user pool with taints, shift workloads by label, then cordon and drain the old pool. At the traffic layer, use Gateway API with weighted backends.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: shop
spec:
  parentRefs: [ { name: public-gw } ]
  rules:
  - backendRefs:
    - name: shop-blue
      port: 80
      weight: 90
    - name: shop-green
      port: 80
      weight: 10
  • Backup and restore: Velero is portable. Snapshot volumes via CSI. For Azure native, use Azure Backup for AKS to protect PVs and workloads.
  • Egress and Private Link: use a NAT Gateway or Azure Firewall for deterministic egress IPs. Prefer Private Link to ACR, Key Vault, Storage. Lock down API server with private cluster and authorized subnets.
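For the backup bullet, a minimal Velero schedule sketch (namespace, cadence, and TTL are assumptions; Azure Backup for AKS is configured through its own extension instead):

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-app
  namespace: velero
spec:
  schedule: "0 2 * * *"      # daily at 02:00
  template:
    includedNamespaces: [app]
    snapshotVolumes: true    # CSI volume snapshots for PVs
    ttl: 720h0m0s            # retain backups for 30 days
```

Pair every schedule with a periodic `velero restore` drill into a scratch namespace, which is the "restore playbook tested" item in the checklist below.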

Observability

  • Container Insights: enable via addon for node and control plane metrics and logs. Route to Log Analytics.
az aks enable-addons -g rg-aks-prod -n aks-prod -a monitoring 
  • Prometheus stack: kube-prometheus-stack via Flux. Record SLOs and alert on burn rates.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: slo-api
  namespace: monitoring
spec:
  groups:
  - name: api-latency
    rules:
    - record: slo:request_latency_seconds:p95_5m
      expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{job="api"}[5m])))
    - alert: APIHighErrorRate
      expr: sum(rate(http_requests_total{job="api",code=~"5.."}[5m])) / sum(rate(http_requests_total{job="api"}[5m])) > 0.05
      for: 10m
      labels: { severity: page }
      annotations:
        summary: API error rate above 5 percent 

Dashboards: surface SLO target, error budget remaining, and burn alerts. Grafana, Azure Managed Grafana, or vendor of choice are fine.

Security

  • Secrets: use Secrets Store CSI with Azure provider. Mount secrets from Key Vault with Workload Identity.
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: kv-secrets
  namespace: app
spec:
  provider: azure
  parameters:
    clientID: <user-assigned-mi-client-id> # setting clientID opts into workload identity
    keyvaultName: my-kv
    tenantId: <tenant-id>
    objects: |
      array:
        - |
          objectName: app-secret
          objectType: secret
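A pod consumes the class through a CSI volume; secrets only materialize on the pod filesystem once something mounts the volume. A fragment of a pod spec in namespace app, reusing names from earlier examples:

```yaml
spec:
  serviceAccountName: sa-blob
  containers:
  - name: api
    image: acrprod123.azurecr.io/api:1.2.3
    volumeMounts:
    - name: kv
      mountPath: /mnt/secrets
      readOnly: true
  volumes:
  - name: kv
    csi:
      driver: secrets-store.csi.k8s.io
      readOnly: true
      volumeAttributes:
        secretProviderClass: kv-secrets
```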
  • Image provenance: sign images in CI using Cosign. Store in ACR with OCI attestations. Enforce at admission.
  • Policy: Gatekeeper or Kyverno. Example Kyverno verifyImages.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-signed-images
spec:
  rules:
  - name: require-cosign
    match:
      any:
      - resources:
          kinds: [Pod, Deployment]
    verifyImages:
    - image: "acrprod123.azurecr.io/*"
      keyless:
        issuer: "https://token.actions.githubusercontent.com"
        subject: "repo:your-org/*:ref:refs/heads/main"
      attestations:
      - predicateType: "https://cosign.sigstore.dev/attestation/v1"
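The signing side of that policy, as it might look in CI with keyless signing via the GitHub Actions OIDC token (image name reused from earlier examples; predicate.json is a hypothetical attestation payload):

```shell
# Requires the workflow permission id-token: write
IMAGE=acrprod123.azurecr.io/api:1.2.3

# Keyless signature, recorded in the Rekor transparency log
cosign sign --yes "$IMAGE"

# Attach an attestation matching the policy's predicate type
cosign attest --yes \
  --predicate predicate.json \
  --type https://cosign.sigstore.dev/attestation/v1 \
  "$IMAGE"
```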

Gatekeeper example for required labels (assumes the K8sRequiredLabels ConstraintTemplate from the gatekeeper-library is installed):

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: req-team-label
spec:
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Namespace"]
  parameters:
    labels:
    - key: owner
    - key: costCenter

Ops checklist

  • Pull requests gate changes. Flux reconciles in minutes.
  • Autoscaler and max surge tuned. Drains scripted during upgrades.
  • Egress pinned to NAT IPs. Private Link for supply chain.
  • SLOs define your paging. Policy prevents bad images and missing labels.
  • Backups verified monthly. Restore playbook tested.
