
Pneuma

Pneuma is the breath of life animating the platform via Kubernetes — orchestrating dynamic, self-healing, and scalable services atop the Logos foundation. Where Corpus gives form, Pneuma gives life, transforming infrastructure into workload environments ready to receive application teams and run their workloads.

  • Cluster Management: GKE clusters with autoscaling node pools, Workload Identity, and Fleet enrollment
  • Service Mesh: Istio with mTLS, traffic management, and Datadog AAP-backed ingress
  • Certificate Management: cert-manager with istio-csr as the mesh CA, issuing all workload mTLS certificates from a self-signed root
  • Policy Enforcement: OPA Gatekeeper constraint templates and audit mode
  • Observability: Datadog Operator for cluster metrics, traces, and log collection
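
The certificate chain implied above (self-signed root → mesh CA → workload mTLS leaves via istio-csr) can be sketched as cert-manager resources. This is a minimal illustration, not the platform's actual configuration — all names and namespaces below are hypothetical:

```yaml
# Hypothetical sketch: bootstrap a self-signed root CA and expose it
# as the issuer that istio-csr draws workload certificates from.
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: selfsigned-bootstrap   # illustrative name
  namespace: istio-system
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: mesh-root-ca           # illustrative name
  namespace: istio-system
spec:
  isCA: true
  commonName: mesh-root-ca
  secretName: mesh-root-ca
  privateKey:
    algorithm: ECDSA           # matches the ECDSA root chain noted later
    size: 256
  issuerRef:
    name: selfsigned-bootstrap
    kind: Issuer
    group: cert-manager.io
---
# istio-csr is pointed at this CA issuer; Istio sidecars then receive
# mTLS leaf certificates signed by mesh-root-ca.
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: mesh-ca
  namespace: istio-system
spec:
  ca:
    secretName: mesh-root-ca
```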

Pneuma consumes Corpus networking and Logos team data to create fully operational Kubernetes environments.

Architecture Decision Records

This page includes Architecture Decision Records documenting the key design decisions.

Repositories

  • pt-pneuma: OpenTofu configuration for GKE clusters and Kubernetes add-ons (cert-manager, Istio, OPA Gatekeeper, Datadog Operator)
  • pt-pneuma-istio-test: Example Istio test application that displays GKE cluster information; deployed as a container image to Google Artifact Registry and run on GKE clusters managed by pt-pneuma

AI Context

Context

Pneuma consumes from Corpus (networking and project infrastructure) and Arche (team data originating in Logos). It supplies Kubernetes clusters to all teams that need one — including Kryptos, which runs OpenBao on a Pneuma-managed cluster. See team dependencies.

Glossary

| Term | Meaning in this context |
| --- | --- |
| Certificate | An mTLS leaf certificate issued by cert-manager and signed by the mesh CA |
| Cluster | A GKE Kubernetes cluster deployed to one or more zones within a Corpus project |
| Constraint | An OPA Gatekeeper policy rule enforced at admission time against incoming Kubernetes resources |
| Fleet | A GCP construct grouping GKE clusters for unified management and policy |
| Operator | A Kubernetes controller deployed as an add-on (Datadog Operator, cert-manager) managing its resource lifecycle |
| Policy | A set of constraints defining the compliance posture for a cluster |
| Service mesh | The Istio control plane managing mTLS, traffic routing, and observability across all pods |
| Workload identity | See the Corpus glossary — Pneuma consumes Workload Identity bindings provisioned by Corpus |

Team Topologies

Cognitive Load

Pneuma is the most cognitively demanding platform team. It operates three domains of high inherent complexity simultaneously — Kubernetes clusters, an Istio service mesh, and a full PKI chain for workload certificates — alongside policy enforcement and observability. This is by design: these domains are inseparable at the cluster layer, and Arche modules carry the implementation weight so Pneuma engineers focus on orchestration and configuration rather than raw tooling.

| Working Domains | High Intrinsic Domains |
| --- | --- |
| 🔴 5 / 4 | 🟠 3 / 3 |

Cognitive load by domain:

| Domain | Intrinsic | Extraneous Reduced By | Germane Expertise |
| --- | --- | --- | --- |
| Cluster Management | 🔴 High | Arche GKE module | GKE internals, Fleet enrollment |
| Service Mesh | 🔴 High | Arche Istio module | mTLS, traffic policy |
| Certificate Management | 🔴 High | Arche cert-manager module | PKI chains, issuers |
| Policy Enforcement | 🟡 Medium | Arche OPA module | Rego, constraint authoring |
| Observability | 🟡 Medium | Arche Datadog module | Cluster metrics & traces |

Capacity: 3 high-complexity domains — at the Team Topologies guideline of 2–3; team members hold 5 active domains — above the ~4 working-knowledge limit. Arche Kubernetes modules are the primary mitigation: all Helm-based add-on deployment is encapsulated, leaving Pneuma to own configuration and integration rather than implementation.

Extraneous load is minimized by:

  • Arche Kubernetes modules (pt-arche-kubernetes-*) wrap Istio, cert-manager, OPA Gatekeeper, and the Datadog Operator — no raw Helm chart management
  • Corpus handles all networking prerequisites; Pneuma consumes them via module.core_helpers
  • Called workflows provide OpenTofu deployment pipelines — no CI/CD to build or maintain

Germane load is built through:

  • Cloud-native orchestration: GKE internals, autoscaling, Workload Identity, and Fleet enrollment
  • Zero-trust networking: Istio mTLS, traffic policy, and Datadog AAP integration
  • Applied PKI: ECDSA root CA chains, cert-manager issuers, and istio-csr for mesh certificate signing
  • Policy-as-code: Rego constraint authoring and audit-mode enforcement patterns
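
As a concrete illustration of the constraint authoring and audit-mode patterns above, a minimal Gatekeeper ConstraintTemplate might look like the following. The template name, label key, and enforcement target are hypothetical examples, not policies Pneuma actually ships:

```yaml
# Hypothetical example: require a "team" label on namespaces,
# evaluated by OPA Gatekeeper at admission time.
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredteamlabel
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredTeamLabel
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredteamlabel

        violation[{"msg": msg}] {
          not input.review.object.metadata.labels.team
          msg := "namespace is missing required label: team"
        }
---
# The matching constraint, run in audit mode (dryrun) so violations
# are reported rather than blocked -- the audit-mode pattern above.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredTeamLabel
metadata:
  name: require-team-label
spec:
  enforcementAction: dryrun
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Namespace"]
```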

Team Capacity

  • Headcount: 1–2 platform engineers
  • Scale signal: Add a second engineer when cluster count grows or multiple add-on upgrades run in parallel — the one team where headcount scales with the platform

Architecture Decision Records

Pneuma Cognitive Load Mitigation

| Status | Date | Deciders |
| --- | --- | --- |
| Accepted ✅ | April 2026 | Pneuma, Platform Lead |

Context and Problem Statement

Pneuma operates 5 working domains against the Team Topologies recommended limit of 4, with 3 high-intrinsic domains at the guideline ceiling of 3. This places the team formally at 🔴 over limit in the platform cognitive load table. The structural risk is that an overloaded team becomes a bottleneck, accrues technical debt faster, and is more vulnerable to failure when any single domain demands sustained attention. Acknowledging the overload without a documented mitigation and re-evaluation commitment leaves the risk unmanaged organizationally.

The five domains — Cluster Management, Service Mesh, Certificate Management, Policy Enforcement, and Observability — cannot be separated without creating artificial coupling problems. cert-manager CRDs must exist before Istio certificate resources; OPA Gatekeeper runs against all workloads on the cluster. Splitting these concerns across teams would require tight coordination at every upgrade cycle and introduce more extraneous load than the split would remove.

Decisions

  1. Accept the 🔴 overload state as a managed risk. The five domains are operationally inseparable at the cluster layer. This is a structural reality of the platform, not a resourcing failure. The risk is acknowledged, documented, and mitigated — not ignored.

  2. Arche Kubernetes modules are the primary load mitigation. Each of the five domains has a corresponding pt-arche-kubernetes-* module that encapsulates all Helm chart management and complex resource orchestration. Pneuma engineers own configuration and integration, not implementation. This mitigation is load-bearing: if Arche module coverage degrades, Pneuma's effective cognitive load increases proportionally.

  3. Headcount of 1–2 engineers is an acknowledged trade-off. One engineer can operate the domain within current scope because Arche modules absorb implementation complexity. A second engineer is the first scaling response when cluster count grows or parallel add-on upgrades become routine. This is a deliberate trade-off, not an oversight — a third engineer is not warranted while the scope remains at five domains and Arche coverage is intact.

  4. Explicit trigger conditions govern when this decision must be re-evaluated:

    • A sixth domain is added to Pneuma's scope
    • Any pt-arche-kubernetes-* module loses coverage or is removed without a replacement
    • Incident rate or on-call burden increases in a pattern consistent with cognitive overload
    • The team drops below minimum headcount (fewer than 1 active engineer)
    • A stream-aligned team reports consistent delays in namespace provisioning or cluster support

Alternatives Considered

  • Split Pneuma into two teams (Cluster Management + Mesh/Add-ons) — Rejected. The five domains are tightly coupled at deployment time: the needs dependency chain in the pipeline (cluster → onboarding → cert-manager → Istio → OPA → Datadog) requires a single owner who understands the full ordering. Splitting ownership would require cross-team coordination at every upgrade cycle, adding more extraneous load than the split removes.
  • Reduce scope by removing Policy Enforcement or Observability — Rejected. Both domains are required for the platform's baseline readiness guarantee: every cluster must have mTLS enforced, policy enforcement active, and observability running before any workload is accepted. Removing either breaks that guarantee.
  • Add a third engineer proactively — Deferred. Current headcount is sufficient while Arche module coverage is intact and cluster count is stable. Adding headcount ahead of a clear scaling signal increases coordination overhead without reducing cognitive load.
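
The deployment-time coupling cited in the first rejected alternative can be sketched as a pipeline dependency chain. Assuming GitHub Actions-style syntax for the called workflows mentioned earlier, with purely illustrative job names and workflow paths:

```yaml
# Hypothetical sketch of the add-on ordering as a needs: chain.
# Job names and the workflow path are illustrative, not the real pipeline.
jobs:
  cluster:
    uses: ./.github/workflows/deploy.yaml   # hypothetical called workflow
  onboarding:
    needs: cluster
    uses: ./.github/workflows/deploy.yaml
  cert-manager:
    needs: onboarding                        # CRDs before Istio cert resources
    uses: ./.github/workflows/deploy.yaml
  istio:
    needs: cert-manager
    uses: ./.github/workflows/deploy.yaml
  opa-gatekeeper:
    needs: istio
    uses: ./.github/workflows/deploy.yaml
  datadog:
    needs: opa-gatekeeper
    uses: ./.github/workflows/deploy.yaml
```

A single owner who understands this full ordering is exactly what the chain demands; splitting it across two teams would turn every upgrade into a cross-team handoff.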

Consequences

  • Arche module coverage is a load-bearing dependency for the Pneuma team's capacity. Any PR that removes or degrades an Arche module without a replacement must be evaluated for its impact on Pneuma's cognitive load before merge.
  • Every new domain proposed for Pneuma must be accompanied by a corresponding Arche module that absorbs its implementation complexity, or by an explicit scope trade-off that removes an existing domain.
  • The 🔴 overload state in the platform cognitive load table is a known, managed risk. It must be re-examined at every headcount change, every scope change, and every Arche coverage change.