Secure AI Platform: Architectural Overview
Architectural definition of all modules, repos, and interconnections.
Platform Vision & Goals
The Secure AI Platform (SAI Platform) is a modular, production-style ecosystem designed to demonstrate secure-by-design AI engineering across the full lifecycle of machine learning and large language model systems.
Vision: To build a reference architecture that unifies AI, security, and operations, showing how a Staff/Principal-level AI Security Engineer would design, deploy, and defend an AI-driven platform.
Design Principles
- Security-first: Every component integrates identity, logging, secrets management, and compliance.
- Reproducibility: All pipelines, models, and deployments can be rebuilt deterministically.
- Transparency: Clear observability and auditability across systems.
- Modularity: Each function exists in its own repo, but integrates through shared infrastructure.
- Pragmatism: Built using open-source, self-hosted, and production-relevant technologies.
Platform Modules
| Module | Purpose | Key Tech | Repository |
|---|---|---|---|
| Documentation & Governance | Program specifications, architecture, standards, and policies | Markdown, GitHub Actions, MkDocs | sai-platform-meta |
| Platform Infrastructure | Unified infrastructure: Vault, Caddy, Loki, Grafana, MLflow | Docker Compose, Vault, Caddy, Grafana, Loki, OTel | sai-platform-infra |
| ML Foundations | Core machine learning pipelines and datasets | PyTorch, scikit-learn, MLflow | sai-ml-foundations |
| Inference API | Secure model-serving API with JWT + RBAC | FastAPI, Pydantic, PyTorch, MLflow SDK | sai-inference-api |
| Agent SecOps | Secure LangChain agents for SOC automation | LangChain, FastAPI, Vault, OTel | sai-agent-secops |
| Adversarial Lab | Adversarial ML, red-teaming, and security tests | Adversarial Robustness Toolbox, PyTorch, LLM Red Team tools | sai-adversarial-lab |
| MLOps Pipeline | CI/CD, scanning, SBOMs, and release governance | GitHub Actions, Trivy, Syft, OPA, Cosign | sai-mlops-pipeline |
Repository Definitions
sai-platform-meta
Purpose: Acts as the control plane and governance layer for the entire platform.
- Contains program specifications, templates, CI/CD baselines, and policies.
- Source of truth for architecture, security, and observability standards.
- Provides reusable .github/workflows and .dev/ scaffolds.
Key Contents:
- /docs/ -> specs, architecture, narratives
- .github/ -> CI/CD and security automation templates
- .dev/ -> pre-commit hooks, linting, make targets
sai-platform-infra
Purpose: Provides local infrastructure for all other services.
- Hosts Vault, Caddy, Loki, Grafana, and MLflow under Compose profiles.
- Exposes telemetry and secret-management endpoints for other repos.
Key Technologies:
- Docker Compose (profiles: observability, secrets, registry, proxy)
- Caddy (reverse proxy), Loki/Grafana (logging & metrics), Vault (secrets)
- OTel Collector (traces)
Deliverables:
- docker-compose.yml, Caddyfile, otel-collector.yaml, vault-bootstrap.sh
- /docs/tech/runbook.md (how to operate locally)
sai-ml-foundations
Purpose: Reproducible, secure ML training pipelines.
- Demonstrates secure model development lifecycle.
- Integrates MLflow for experiment tracking.
- Documents dataset provenance and lineage.
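One way to make dataset provenance checkable is a manifest that records a SHA-256 digest per file, so a later training run can confirm it sees the same bytes. The sketch below is illustrative only: the manifest layout and the function names (`build_manifest`, `verify_manifest`) are assumptions, not the repository's actual format.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir: Path, source: str) -> dict:
    """Record every file under data_dir with its size and digest (hypothetical layout)."""
    files = sorted(p for p in data_dir.rglob("*") if p.is_file())
    return {
        "source": source,
        "files": [
            {
                "path": p.relative_to(data_dir).as_posix(),
                "sha256": sha256_file(p),
                "bytes": p.stat().st_size,
            }
            for p in files
        ],
    }


def verify_manifest(data_dir: Path, manifest: dict) -> list:
    """Return the relative paths that are missing or whose digest changed."""
    mismatched = []
    for entry in manifest["files"]:
        target = data_dir / entry["path"]
        if not target.is_file() or sha256_file(target) != entry["sha256"]:
            mismatched.append(entry["path"])
    return mismatched
```

A training job could call `verify_manifest` before loading data and refuse to start if the returned list is non-empty.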
Key Technologies: PyTorch, scikit-learn, MLflow, pre-commit, ruff
Deliverables:
- /src/ -> model training & evaluation
- /data/ -> dataset cards & manifests
- /docs/ -> threat model, model cards, architecture
sai-inference-api
Purpose: Securely serve ML models as APIs.
- Implements JWT authentication and RBAC.
- Logs inferences with correlation IDs for traceability.
- Integrates with OTel and the observability stack.
Key Technologies: FastAPI, Uvicorn, PyTorch, Pydantic, JWT, OTel
Deliverables:
- /src/ -> API code
- /tests/ -> security and functional tests
- /docs/ -> OpenAPI spec, runbook, threat model
sai-agent-secops
Purpose: Build and secure LangChain agents for SOC automation.
- Enforces tool allowlists, RBAC, and policy-based prompt filtering.
- Integrates Vault for secret injection and audit logging.
- Provides auditability of prompt, tool, and result chains.
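The controls above (tool allowlists, role checks, prompt filtering, and an audit trail) can be combined in a single gate in front of every tool call. The following sketch is framework-agnostic and does not use LangChain; the role names, tool names, and deny patterns are hypothetical placeholders.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict

# Hypothetical deny patterns; a real filter would be driven by policy, not hard-coded.
DENY_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal .*system prompt", re.IGNORECASE),
]

# Hypothetical per-role tool allowlist.
ROLE_TOOLS = {
    "soc_analyst": {"search_logs", "lookup_ioc"},
    "soc_lead": {"search_logs", "lookup_ioc", "isolate_host"},
}


@dataclass
class AuditLog:
    """In-memory audit trail; a real deployment would ship these events to a log sink."""
    entries: list = field(default_factory=list)

    def record(self, **event) -> None:
        self.entries.append(event)


def prompt_allowed(prompt: str) -> bool:
    return not any(pattern.search(prompt) for pattern in DENY_PATTERNS)


def invoke_tool(
    role: str,
    tool: str,
    prompt: str,
    tools: Dict[str, Callable[[str], str]],
    audit: AuditLog,
) -> str:
    """Run a tool only if the prompt passes the filter and the role may use the tool."""
    if not prompt_allowed(prompt):
        audit.record(role=role, tool=tool, decision="deny", reason="prompt_filter")
        raise PermissionError("prompt rejected by filter")
    # Fail closed: unknown roles and unregistered tools are both denied.
    if tool not in ROLE_TOOLS.get(role, set()) or tool not in tools:
        audit.record(role=role, tool=tool, decision="deny", reason="tool_not_allowed")
        raise PermissionError(f"role {role!r} may not call {tool!r}")
    result = tools[tool](prompt)
    audit.record(role=role, tool=tool, decision="allow", prompt=prompt, result=result)
    return result
```

Because every path through `invoke_tool` writes an audit entry, the prompt, tool, and result chain can be reconstructed afterwards. Regex filtering alone is a weak defense against prompt injection and is shown here only as one layer.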
Key Technologies: LangChain, FastAPI, Vault, OTel, RegexGuard
Deliverables:
- /src/agent/ -> core agent logic
- /src/security/ -> guardrails & filters
- /docs/ -> security controls, chain visualization
sai-adversarial-lab
Purpose: Adversarial testing and AI red teaming.
- Tests model and agent resilience using adversarial attacks.
- Simulates prompt injections, jailbreaks, and data poisoning.
- Produces structured attack reports and metrics.
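To make the idea of an adversarial attack plus a structured report concrete, here is a dependency-free sketch of the fast gradient sign method (FGSM) against a tiny logistic-regression model. It stands in for what a library such as ART would do against a PyTorch model; the weights, inputs, and report fields are illustrative assumptions.

```python
import math
from typing import List


def predict_proba(weights: List[float], bias: float, x: List[float]) -> float:
    """Probability of class 1 under a logistic-regression model."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def fgsm(weights: List[float], bias: float, x: List[float], label: int, epsilon: float) -> List[float]:
    """Perturb x by epsilon in the direction that increases the cross-entropy loss.

    For logistic regression the loss gradient with respect to the input is (p - y) * w.
    """
    p = predict_proba(weights, bias, x)
    grad = [(p - label) * w for w in weights]
    sign = [(g > 0) - (g < 0) for g in grad]
    return [xi + epsilon * s for xi, s in zip(x, sign)]


def attack_report(weights: List[float], bias: float, x: List[float], label: int, epsilon: float) -> dict:
    """Summarize one attack as a structured record (hypothetical report fields)."""
    x_adv = fgsm(weights, bias, x, label, epsilon)
    clean = predict_proba(weights, bias, x)
    adv = predict_proba(weights, bias, x_adv)
    return {
        "attack": "fgsm",
        "epsilon": epsilon,
        "clean_prob": clean,
        "adversarial_prob": adv,
        "label_flipped": (clean >= 0.5) != (adv >= 0.5),
        "adversarial_input": x_adv,
    }
```

Collecting such records across many inputs and epsilon values yields the kind of metric (for example, the flip rate per epsilon) that a regression test can track over time.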
Key Technologies: Adversarial Robustness Toolbox (ART), PyTorch, LangChain Red Team tools
Deliverables:
- /attacks/ -> attack harness
- /tests/ -> regression and defense tests
- /reports/ -> structured output
- /docs/tech/threat-model.md -> red team findings
sai-mlops-pipeline
Purpose: Centralized CI/CD pipelines and policy enforcement.
- Provides reusable GitHub Actions workflows.
- Automates SBOM generation, container scanning, and artifact signing.
- Integrates OPA policies to enforce security gates.
Key Technologies: GitHub Actions, Trivy, Syft, Cosign, OPA/Conftest
Deliverables:
- .github/workflows/ -> reusable CI/CD templates
- /policy/ -> OPA policy bundles
- /docs/ -> CI/CD standards and release governance
Security & Compliance Foundations
Security Architecture Principles:
- Identity-first (JWT + RBAC on all APIs)
- Secrets in Vault only; short-lived tokens, no .env files committed
- SBOMs and signing enforced for every image (Syft + Cosign)
- CI/CD policies enforced by OPA (fail closed)
- Static and dependency scanning (CodeQL, Dependabot)
Threat Modeling: Each repo maintains /docs/tech/threat-model.md following a shared template.
Platform-level threat models aggregate into: /docs/architecture/THREAT_MODEL.md within sai-platform-meta.
Observability & Operations
Telemetry Standard: Every service is instrumented with OpenTelemetry (OTel).
| Signal | Collector | Sink |
|---|---|---|
| Logs | Loki | Grafana Loki UI |
| Metrics | Prometheus exporters | Grafana dashboards |
| Traces | OTel Collector | Grafana Tempo (planned; see Future Expansion) |
Key Dashboards:
- API latency and inference errors
- Model accuracy vs drift
- Agent tool usage statistics
- Adversarial lab attack outcomes
Runbook: /docs/tech/runbook.md in sai-platform-infra describes log/trace collection.
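For logs to be traceable across services, each record needs to carry the correlation ID of the request that produced it. The sketch below shows one way to do that with the standard `logging` and `contextvars` modules, emitting JSON lines that a log backend can parse. The logger name, field names, and `handle_request` function are assumptions for illustration, not the platform's actual logging schema.

```python
import contextvars
import json
import logging
import uuid
from typing import Optional

# Holds the correlation ID for the request currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps(
            {
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
                "correlation_id": getattr(record, "correlation_id", "-"),
            }
        )


def build_logger(name: str, stream=None) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(CorrelationFilter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger: logging.Logger, incoming_id: Optional[str] = None) -> str:
    """Reuse the caller's correlation ID if present, otherwise mint a new one."""
    token = correlation_id.set(incoming_id or uuid.uuid4().hex)
    try:
        logger.info("inference completed")
        return correlation_id.get()
    finally:
        # Restore the previous value so IDs never leak between requests.
        correlation_id.reset(token)
```

Passing the same ID downstream (for example in a request header) lets log lines from different services be joined on `correlation_id`.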
Integration Topology
Trust Zones:
- public -> API Gateway / Agent endpoints (JWT auth)
- internal -> MLflow, Vault, Observability, CI/CD
- secure -> Signing keys, OPA policies, SBOM registry
Secrets Flow:
- Vault issues short-lived tokens for APIs and CI/CD.
- Services retrieve secrets dynamically via OIDC or AppRole.
- No static keys or .env files under version control.
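The secrets flow above implies that services hold a secret only for the lifetime of its lease and fetch a fresh one shortly before expiry. A minimal sketch of that pattern follows. The `fetch` callable is a hypothetical stand-in for an authenticated Vault read (after an OIDC or AppRole login) that returns a value and its TTL in seconds; no real Vault client API is shown.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Lease:
    value: str
    expires_at: float


class LeasedSecretCache:
    """Cache secrets in memory and re-fetch them shortly before their lease expires."""

    def __init__(
        self,
        fetch: Callable[[str], Tuple[str, float]],
        clock: Callable[[], float] = time.monotonic,
        renew_margin: float = 5.0,
    ) -> None:
        self._fetch = fetch              # hypothetical: returns (secret_value, ttl_seconds)
        self._clock = clock              # injectable so expiry can be tested without sleeping
        self._renew_margin = renew_margin
        self._leases: Dict[str, Lease] = {}

    def get(self, path: str) -> str:
        now = self._clock()
        lease = self._leases.get(path)
        # Refresh when there is no lease yet or the lease is inside the renewal margin.
        if lease is None or now >= lease.expires_at - self._renew_margin:
            value, ttl = self._fetch(path)
            lease = Lease(value=value, expires_at=now + ttl)
            self._leases[path] = lease
        return lease.value
```

Keeping secrets only in process memory, with a bounded lifetime, is what makes committed .env files unnecessary in this design.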
Network Summary:
| Service | Port | Description |
|---|---|---|
| Grafana | 3000 | Observability UI |
| Loki | 3100 | Logs ingestion |
| OTel Collector | 4317 / 4318 | Telemetry input |
| Vault | 8200 | Secrets API |
| Caddy | 80 / 443 | Reverse proxy |
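After bringing the stack up, a quick TCP smoke check against the ports in the table can confirm that each service is listening. The sketch below copies the port numbers from the table; the default host `localhost` and the function names are assumptions, and a successful connection only shows that something accepts connections on the port, not that the service is healthy.

```python
import socket
from typing import Dict, List

# Ports taken from the Network Summary table above.
SERVICE_PORTS: Dict[str, List[int]] = {
    "grafana": [3000],
    "loki": [3100],
    "otel-collector": [4317, 4318],
    "vault": [8200],
    "caddy": [80, 443],
}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def smoke_check(host: str = "localhost", ports: Dict[str, List[int]] = SERVICE_PORTS) -> Dict[str, bool]:
    """Map each 'service:port' pair to whether it accepted a connection."""
    return {
        f"{name}:{port}": is_port_open(host, port)
        for name, port_list in ports.items()
        for port in port_list
    }


if __name__ == "__main__":
    for target, reachable in smoke_check().items():
        print(f"{target:<22} {'up' if reachable else 'DOWN'}")
```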
Development Lifecycle
Flow Overview:
Code -> Lint/Test -> SBOM -> Scan -> Sign -> Deploy -> Monitor -> Feedback -> Retrain
Stages:
- Code & Commit -> pre-commit hooks enforce lint, typing, and secret scan.
- CI Build -> SBOM generation (Syft) + scanning (Trivy).
- Policy Gate -> OPA checks; the build fails if any HIGH or CRITICAL finding lacks a waiver.
- Sign & Release -> Cosign signature, Git tag, changelog update.
- Deploy -> Compose or K8s.
- Monitor -> OTel + Grafana dashboards.
- Feedback Loop -> retraining, adversarial testing, or model updates.
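The policy-gate stage can be illustrated with a small decision function: block the release when any HIGH or CRITICAL finding is not covered by a waiver, and fail closed when scan results are missing. In the platform this logic would live in OPA policies; the Python below is only a stand-in for the decision rule, and the `Finding` fields and waiver format are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set

BLOCKING_SEVERITIES = {"HIGH", "CRITICAL"}


@dataclass(frozen=True)
class Finding:
    vuln_id: str      # scanner-assigned identifier (placeholder values in examples)
    severity: str
    package: str


def blocking_findings(findings: Iterable[Finding], waivers: Set[str]) -> List[Finding]:
    """Return HIGH/CRITICAL findings that are not explicitly waived."""
    return [
        f
        for f in findings
        if f.severity.upper() in BLOCKING_SEVERITIES and f.vuln_id not in waivers
    ]


def gate_passes(findings: Optional[Iterable[Finding]], waivers: Set[str]) -> bool:
    """True if the build may proceed.

    Missing scan results (None) fail the gate: no evidence means no release.
    """
    if findings is None:
        return False
    return not blocking_findings(findings, waivers)
```

A CI step could exit non-zero whenever `gate_passes` returns False, which keeps the Sign & Release stage from running on an unvetted image.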
Versioning: Semantic Versioning (v<major>.<minor>.<patch>)
Future Expansion
Planned Enhancements:
- Kubernetes (K8s) deployment packaged as Helm charts
- SOC Integration: alert triage bot and LLM-driven response workflows
- Compliance-as-code via Open Policy Agent extensions
- Distributed tracing with Grafana Tempo
- Public documentation site via MkDocs or Docusaurus
Long-term Goals:
- Demonstrate full AI security lifecycle: design -> defense -> detection -> governance
- Publish the architecture as a public reference for secure MLOps.