Pneuma
Pneuma is the breath of life animating the platform via Kubernetes — orchestrating dynamic, self-healing, and scalable services atop the Logos foundation. Where Corpus gives form, Pneuma gives life, transforming infrastructure into workload environments ready to receive application teams and run their services.
- Cluster Management: GKE clusters with autoscaling node pools, Workload Identity, and Fleet enrollment
- Service Mesh: Istio with mTLS, traffic management, and Datadog AAP-backed ingress
- Certificate Management: cert-manager with istio-csr as the mesh CA, issuing all workload mTLS certificates from a self-signed root
- Policy Enforcement: OPA Gatekeeper constraint templates and audit mode
- Observability: Datadog Operator for cluster metrics, traces, and log collection
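The certificate chain described above — a self-signed root feeding istio-csr as the mesh CA — can be sketched with standard cert-manager resources. Resource names and namespaces here are illustrative, not the actual configuration (which lives in the Arche Kubernetes modules):

```yaml
# Sketch: bootstrap issuer -> self-signed ECDSA root -> CA issuer for istio-csr.
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: selfsigned-bootstrap   # exists only to sign the root certificate below
  namespace: istio-system
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: mesh-root-ca
  namespace: istio-system
spec:
  isCA: true
  commonName: mesh-root-ca
  secretName: mesh-root-ca     # istio-csr's CA issuer reads this secret
  privateKey:
    algorithm: ECDSA
    size: 256
  issuerRef:
    name: selfsigned-bootstrap
    kind: Issuer
    group: cert-manager.io
---
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: mesh-ca                # istio-csr requests workload mTLS certs from this issuer
  namespace: istio-system
spec:
  ca:
    secretName: mesh-root-ca
```

istio-csr is then configured to point at the `mesh-ca` issuer, so every workload leaf certificate chains back to the self-signed root.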
Pneuma consumes Corpus networking and Logos team data to create fully operational Kubernetes environments.
This page includes Architecture Decision Records documenting the key design decisions.
Repositories
- pt-pneuma: OpenTofu configuration for GKE clusters and Kubernetes add-ons (cert-manager, Istio, OPA Gatekeeper, Datadog Operator)
- pt-pneuma-istio-test: Example Istio test application that displays GKE cluster information; deployed as a container image to Google Artifact Registry and run on GKE clusters managed by pt-pneuma
AI Context
- pt-pneuma-ai-context: Team-level Copilot instructions for pt-pneuma-* repositories
Context
Pneuma consumes from Corpus (networking and project infrastructure) and Arche (team data originating in Logos). It supplies Kubernetes clusters to all teams that need one — including Kryptos, which runs OpenBao on a Pneuma-managed cluster. See team dependencies.
Glossary
| Term | Meaning in this context |
|---|---|
| Certificate | An mTLS leaf certificate issued by cert-manager and signed by the mesh CA |
| Cluster | A GKE Kubernetes cluster deployed to one or more zones within a Corpus project |
| Constraint | An OPA Gatekeeper policy rule enforced at admission time against incoming Kubernetes resources |
| Fleet | A GCP construct grouping GKE clusters for unified management and policy |
| Operator | A Kubernetes controller deployed as an add-on (Datadog Operator, cert-manager) managing its resource lifecycle |
| Policy | A set of constraints defining the compliance posture for a cluster |
| Service mesh | The Istio control plane managing mTLS, traffic routing, and observability across all pods |
| Workload identity | See the Corpus glossary — Pneuma consumes Workload Identity bindings provisioned by Corpus |
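The service mesh entry above refers to mesh-wide mTLS. As a minimal sketch (standard Istio API, placement illustrative), STRICT mode applied in the Istio root namespace enforces mutual TLS for every pod in the mesh:

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # the root namespace, so the policy applies mesh-wide
spec:
  mtls:
    mode: STRICT            # plaintext traffic between sidecars is rejected
```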
Team Topologies
Cognitive Load
Pneuma is the most cognitively demanding platform team. It operates three domains of high inherent complexity simultaneously — Kubernetes clusters, an Istio service mesh, and a full PKI chain for workload certificates — alongside policy enforcement and observability. This is by design: these domains are inseparable at the cluster layer, and Arche modules carry the implementation weight so Pneuma engineers focus on orchestration and configuration rather than raw tooling.
| Working Domains (limit ~4) | High Intrinsic Domains (limit 3) |
|---|---|
| 🔴 5 | 🟠 3 |
Cognitive load by domain:
| Domain | Intrinsic | Extraneous Reduced By | Germane Expertise |
|---|---|---|---|
| Cluster Management | 🔴 High | Arche GKE module | GKE internals, Fleet enrollment |
| Service Mesh | 🔴 High | Arche Istio module | mTLS, traffic policy |
| Certificate Management | 🔴 High | Arche cert-manager module | PKI chains, issuers |
| Policy Enforcement | 🟡 Medium | Arche OPA module | Rego, constraint authoring |
| Observability | 🟡 Medium | Arche Datadog module | Cluster metrics & traces |
Capacity: 3 high-complexity domains — at the Team Topologies guideline of 2–3; team members hold 5 active domains — above the ~4 working-knowledge limit. Arche Kubernetes modules are the primary mitigation: all Helm-based add-on deployment is encapsulated, leaving Pneuma to own configuration and integration rather than implementation.
Extraneous load is minimized by:
- Arche Kubernetes modules (pt-arche-kubernetes-*) wrap Istio, cert-manager, OPA Gatekeeper, and the Datadog Operator — no raw Helm chart management
- Corpus handles all networking prerequisites; Pneuma consumes them via module.core_helpers
- Called workflows provide OpenTofu deployment pipelines — no CI/CD to build or maintain
Germane load is built through:
- Cloud-native orchestration: GKE internals, autoscaling, Workload Identity, and Fleet enrollment
- Zero-trust networking: Istio mTLS, traffic policy, and Datadog AAP integration
- Applied PKI: ECDSA root CA chains, cert-manager issuers, and istio-csr for mesh certificate signing
- Policy-as-code: Rego constraint authoring and audit-mode enforcement patterns
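As an illustration of the Rego constraint authoring and audit-mode patterns listed above, a minimal Gatekeeper ConstraintTemplate plus constraint (names and the required label are illustrative, not the platform's actual policy set):

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

        violation[{"msg": msg}] {
          provided := {l | input.review.object.metadata.labels[l]}
          required := {l | l := input.parameters.labels[_]}
          missing := required - provided
          count(missing) > 0
          msg := sprintf("missing required labels: %v", [missing])
        }
---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-team-label
spec:
  enforcementAction: dryrun   # audit mode: violations are reported, not blocked
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Namespace"]
  parameters:
    labels: ["team"]
```

Setting `enforcementAction: dryrun` is how a constraint runs in audit mode; flipping it to the default (`deny`) makes the same Rego block admission.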
Team Capacity
- Headcount: 1–2 platform engineers
- Scale signal: Add a second engineer when cluster count grows or multiple add-on upgrades run in parallel — the one team where headcount scales with the platform
Architecture Decision Records
Pneuma Cognitive Load Mitigation
| Status | Date | Deciders |
|---|---|---|
| Accepted ✅ | April 2026 | Pneuma, Platform Lead |
Context and Problem Statement
Pneuma operates 5 working domains against the Team Topologies recommended limit of 4, with 3 high-intrinsic domains at the guideline ceiling of 3. This places the team formally at 🔴 over limit in the platform cognitive load table. The structural risk is that an overloaded team becomes a bottleneck, accrues technical debt faster, and is more vulnerable to failure when any single domain demands sustained attention. Acknowledging the overload without a documented mitigation and re-evaluation commitment leaves the risk unmanaged organizationally.
The five domains — Cluster Management, Service Mesh, Certificate Management, Policy Enforcement, and Observability — cannot be separated without creating artificial coupling problems. cert-manager CRDs must exist before Istio certificate resources; OPA Gatekeeper runs against all workloads on the cluster. Splitting these concerns across teams would require tight coordination at every upgrade cycle and introduce more extraneous load than the split would remove.
Decisions
- Accept the 🔴 overload state as a managed risk. The five domains are operationally inseparable at the cluster layer. This is a structural reality of the platform, not a resourcing failure. The risk is acknowledged, documented, and mitigated — not ignored.
- Arche Kubernetes modules are the primary load mitigation. Each of the five domains has a corresponding pt-arche-kubernetes-* module that encapsulates all Helm chart management and complex resource orchestration. Pneuma engineers own configuration and integration, not implementation. This mitigation is load-bearing: if Arche module coverage degrades, Pneuma's effective cognitive load increases proportionally.
- Headcount of 1–2 engineers is an acknowledged trade-off. One engineer can operate the domain within current scope because Arche modules absorb implementation complexity. A second engineer is the first scaling response when cluster count grows or parallel add-on upgrades become routine. This is a deliberate trade-off, not an oversight — a third engineer is not warranted while the scope remains at five domains and Arche coverage is intact.
- Explicit trigger conditions govern when this decision must be re-evaluated:
  - A sixth domain is added to Pneuma's scope
  - Any pt-arche-kubernetes-* module loses coverage or is removed without a replacement
  - Incident rate or on-call burden increases in a pattern consistent with cognitive overload
  - The team drops below minimum headcount (fewer than 1 active engineer)
  - A stream-aligned team reports consistent delays in namespace provisioning or cluster support
Alternatives Considered
- Split Pneuma into two teams (Cluster Management + Mesh/Add-ons) — Rejected. The five domains are tightly coupled at deployment time: the `needs` dependency chain in the pipeline (cluster → onboarding → cert-manager → Istio → OPA → Datadog) requires a single owner who understands the full ordering. Splitting ownership would require cross-team coordination at every upgrade cycle, adding more extraneous load than the split removes.
- Reduce scope by removing Policy Enforcement or Observability — Rejected. Both domains are required for the platform's baseline readiness guarantee: every cluster must have mTLS enforced, policy enforcement active, and observability running before any workload is accepted. Removing either breaks that guarantee.
- Add a third engineer proactively — Deferred. Current headcount is sufficient while Arche module coverage is intact and cluster count is stable. Adding headcount ahead of a clear scaling signal increases coordination overhead without reducing cognitive load.
Consequences
- Arche module coverage is a load-bearing dependency for the Pneuma team's capacity. Any PR that removes or degrades an Arche module without a replacement must be evaluated for its impact on Pneuma's cognitive load before merge.
- Every new domain proposed for Pneuma must be accompanied by a corresponding Arche module that absorbs its implementation complexity, or by an explicit scope trade-off that removes an existing domain.
- The 🔴 overload state in the platform cognitive load table is a known, managed risk. It must be re-examined at every headcount change, every scope change, and every Arche coverage change.