Cluster Management
Pneuma provisions one GKE cluster per zone, consuming Corpus networking and Logos team data. Clusters are CIS-hardened, Fleet-enrolled, and configured for Workload Identity from the start.
- GKE clusters: Regional clusters (highly available control plane) with a single-zone node pool per cluster — one cluster per zone (e.g., `pt-pneuma-us-east1-b`). Zone-scoped node pools ensure Istio's locality-aware load balancing keeps traffic within a zone, eliminating cross-zone hot spots in the mesh. Clusters are CIS GKE Benchmark hardened and Fleet-enrolled for multi-cluster ingress.
- Workload Identity: Kubernetes service accounts are mapped to GCP service accounts, eliminating node-level credential access.
- Namespace onboarding: A dedicated onboarding workspace in the Pneuma pipeline creates Kubernetes namespaces and Workload Identity bindings per team (see the sketch after this list). It runs automatically within the same pipeline after the zonal cluster job completes — no separate trigger is needed once the pipeline starts. The Pneuma pipeline itself triggers on every merge to `main` in `pt-pneuma`, or a platform engineer can trigger it immediately via `workflow_dispatch`.
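Per-team onboarding pairs a Kubernetes namespace and service account with a GCP service account. The following is a minimal sketch, assuming the `google` and `kubernetes` Terraform providers; the team name, project variable, and account IDs are hypothetical placeholders rather than the actual onboarding module:

```hcl
# Hypothetical per-team onboarding: namespace plus a KSA mapped to a GSA via Workload Identity.
resource "kubernetes_namespace" "team" {
  metadata {
    name = "team-logos" # placeholder team namespace
  }
}

resource "google_service_account" "team" {
  account_id   = "logos-workload" # placeholder GSA
  display_name = "Logos team workload identity"
}

resource "kubernetes_service_account" "team" {
  metadata {
    name      = "logos-workload"
    namespace = kubernetes_namespace.team.metadata[0].name
    annotations = {
      # Tells GKE which GCP service account this KSA impersonates
      "iam.gke.io/gcp-service-account" = google_service_account.team.email
    }
  }
}

# Grants the KSA permission to act as the GSA; no node-level or static keys involved
resource "google_service_account_iam_member" "workload_identity" {
  service_account_id = google_service_account.team.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:${var.project_id}.svc.id.goog[team-logos/logos-workload]"
}
```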
Namespace Provisioning
Namespace provisioning is automated within the Pneuma pipeline but is not cross-triggered by a Logos or Corpus merge. Each repo deploys independently when its own code merges to `main`, or when a platform engineer triggers `workflow_dispatch`.
Corpus and Pneuma are triggered independently — there is no automatic cascade from a Logos PR merge. A platform engineer must either merge a pending PR to each repo or manually trigger its `workflow_dispatch` workflow. Once Pneuma's pipeline runs, the onboarding workspace executes automatically as part of the pipeline; no additional action is required.
This page includes Architecture Decision Records documenting the key design decisions.
Components
| Component | Description |
|---|---|
| `gke-cluster` | A regional GKE cluster (regional control plane with zone-scoped node pools, e.g., `pt-pneuma-us-east1-b`) with KMS encryption, Workload Identity, and CIS hardening — zone-scoped node pools keep Istio traffic zone-local, avoiding cross-zone hot spots in the mesh |
| `node-pool` | A managed node pool with node auto-provisioning, auto-repair, and auto-upgrade |
| `fleet` | A GKE Fleet registration enabling multi-cluster service discovery and ingress across zones |
| `workload-identity` | Kubernetes-to-GCP service account mapping, allowing pods to authenticate to GCP without keys |
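As a rough illustration of the `node-pool` and `fleet` components, the following sketch uses the `google` Terraform provider; the resource names, pool name, and membership ID are assumptions, and the real modules may expose different arguments:

```hcl
# Managed node pool: single-zone placement with auto-repair and auto-upgrade
resource "google_container_node_pool" "zonal" {
  name           = "pt-pneuma-us-east1-b-pool" # assumed name
  cluster        = google_container_cluster.zonal.id
  node_locations = ["us-east1-b"] # zone-scoped nodes under a regional control plane

  management {
    auto_repair  = true
    auto_upgrade = true
  }

  node_config {
    workload_metadata_config {
      mode = "GKE_METADATA" # Workload Identity on the node pool
    }
    shielded_instance_config {
      enable_secure_boot          = true
      enable_integrity_monitoring = true
    }
  }
}

# Fleet enrollment for multi-cluster service discovery and ingress
resource "google_gke_hub_membership" "cluster" {
  membership_id = "pt-pneuma-us-east1-b" # assumed membership ID
  endpoint {
    gke_cluster {
      resource_link = "//container.googleapis.com/${google_container_cluster.zonal.id}"
    }
  }
}
```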
Core Invariants
- etcd is KMS-encrypted at rest — `database_encryption` with `state = "ENCRYPTED"` is hardcoded in the GKE module (see the sketch after this list).
- Workload Identity is enabled on every node pool — no static credentials for pod-level GCP access.
- Shielded nodes with Secure Boot and integrity monitoring are enforced on every node — no unverified boot path.
- Client certificate authentication is permanently disabled — `issue_client_certificate = false` is hardcoded.
- Dataplane V2 (eBPF) is hardcoded as the network datapath — no legacy kube-proxy on any cluster.
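A sketch of roughly how these invariants look when hardcoded, using the `google` provider's `google_container_cluster` resource; the KMS key variable and the many omitted arguments (networking, release channel, node pools) are assumptions, not the module's full interface:

```hcl
resource "google_container_cluster" "zonal" {
  name           = "pt-pneuma-us-east1-b"
  location       = "us-east1"        # regional control plane
  node_locations = ["us-east1-b"]    # zone-scoped node placement

  # etcd encrypted at rest with a customer-managed KMS key
  database_encryption {
    state    = "ENCRYPTED"
    key_name = var.kms_key_name # assumed variable
  }

  # Workload Identity pool for KSA-to-GSA mapping
  workload_identity_config {
    workload_pool = "${var.project_id}.svc.id.goog"
  }

  # Dataplane V2 (eBPF): no legacy kube-proxy
  datapath_provider = "ADVANCED_DATAPATH"

  # Client certificate authentication permanently disabled
  master_auth {
    client_certificate_config {
      issue_client_certificate = false
    }
  }

  # Shielded nodes with Secure Boot and integrity monitoring
  enable_shielded_nodes = true
}
```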
Architecture Decision Records
One Cluster Per Zone with Five Add-on Layers
| Status | Date | Deciders |
|---|---|---|
| Accepted ✅ | April 2026 | Pneuma |
Context and Problem Statement
Kubernetes workload environments require more than just a cluster — they need a service mesh, certificate management, admission policy enforcement, and observability before they are ready to receive application teams. These concerns are operationally distinct: they have different upgrade cycles, failure modes, and ownership boundaries. Running them as one undifferentiated deployment creates coordination problems and makes individual components hard to reason about or replace.
The platform also needs a clear scaling model for clusters. Sizing a single large cluster per environment requires predicting peak load. Cluster failure affects all tenants. Regional failure affects all zones at once.
Decisions
- Regional clusters with zone-scoped node pools. Each cluster has a regional control plane (highly available across three zones) but a single-zone node pool — one cluster per zone (e.g., `pt-pneuma-us-east1-b`, `pt-pneuma-us-east4-a`). Keeping node pools zone-local ensures Istio's locality-aware load balancing routes traffic within the zone by default, preventing cross-zone hot spots in the mesh. Clusters scale by adding zones, not by growing individual cluster size. Each cluster is independently managed, independently upgradeable, and independently recoverable.
- Five add-on layers for the cluster. The workload runtime is decomposed into five areas of concern, each deployed and managed independently:
| Layer | Tool | Concern |
|---|---|---|
| Cluster Management | GKE | Compute, networking attachment, Workload Identity, Fleet enrollment |
| Service Mesh | Istio | mTLS, traffic management, ingress, Datadog AAP |
| Certificate Management | cert-manager | Istio CA, mTLS PKI, and workload certificate signing via istio-csr |
| Policy Enforcement | OPA Gatekeeper | Kubernetes admission control and audit |
| Observability | Datadog Operator | Metrics, logs, and traces from all workloads |
Each layer maps to a dedicated subdirectory workspace in `pt-pneuma`, deployed in the correct order via GitHub Actions `needs` dependencies.
Alternatives Considered
- One large cluster per environment — Rejected. Single point of failure per environment. Blast radius of a misconfiguration or upgrade failure is the entire environment. Cannot scale horizontally without redesign.
- One cluster per team — Rejected. Multiplies operational overhead without a commensurate benefit. The one-cluster-per-zone model provides tenant isolation via namespaces and OPA policies without per-team cluster sprawl.
- Bundle all add-ons into a single deployment step — Rejected. CRD ordering constraints (e.g., cert-manager CRDs must exist before Istio certificate resources) require a defined deployment order. Separate workspaces with explicit dependencies make this order visible and enforceable.
Consequences
- Zone failure is contained — other clusters continue serving traffic
- Add-on upgrades (e.g., Istio minor version) are applied per cluster without touching cluster infrastructure
- Adding a new zone requires claiming a CIDR slot from the Corpus IPAM plan and adding a zonal workspace to `pt-pneuma` (see the sketch below)
- Each add-on layer has its own subdirectory, workspace, and deployment job in the workflow
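For illustration only, the add-a-zone step could amount to one new entry in a per-zone map; the variable shape, CIDRs, and names below are assumptions rather than the actual `pt-pneuma` layout:

```hcl
# Hypothetical per-zone map: adding a zone means claiming the next IPAM CIDR slot
# and declaring one more entry, plus the matching zonal workspace/subdirectory.
locals {
  zonal_clusters = {
    "pt-pneuma-us-east1-b" = { region = "us-east1", zone = "us-east1-b", pods_cidr = "10.64.0.0/16" }
    "pt-pneuma-us-east4-a" = { region = "us-east4", zone = "us-east4-a", pods_cidr = "10.68.0.0/16" }
    # The next zone claims the next slot from the Corpus IPAM plan, e.g. 10.72.0.0/16
  }
}
```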