Building a Secrets Management Platform: Encryption, Key Hierarchy and Access Control

How to design a system that encrypts every secret with its own key, rotates credentials without downtime and gives developers a CLI they actually use.

GitGuardian’s State of Secrets Sprawl 2026 report found 28.65 million new hardcoded secrets in public GitHub commits during 2025. That is a 34% year-over-year increase. 64% of secrets leaked in 2022 were still active four years later. Nobody revoked them. Nobody rotated them. They just sat there.

IBM’s Cost of a Data Breach Report 2024 puts the average breach at $4.88 million. Stolen credentials remain the most common initial attack vector. Uber’s 2022 breach started with a hardcoded secret in a PowerShell script. CircleCI’s 2023 incident required every customer to rotate every secret stored on the platform.

Most teams eventually land on one of two options: a managed service like AWS Secrets Manager or a self-hosted tool like HashiCorp Vault. But what if you needed to build your own? Maybe you need multi-tenant isolation. Maybe you need per-environment access control that existing tools do not offer. Maybe you need a platform that fits your team’s workflow instead of forcing your team to fit the platform.

This article walks through the system design of a secrets management platform from scratch. Every architectural decision, every trade-off, every component explained.

Part 1: The Real Problem
Part 2: Requirements and Constraints
Part 3: Back-of-the-Envelope Estimation
Part 4: High-Level Architecture
Part 5: Encryption Deep Dive (DEK + KEK)
Part 6: Database Design
Part 7: Access Control Model
Part 8: API Design
Part 9: CLI Architecture
Part 10: Secret Lifecycle (Versioning, Rotation, Sharing)
Part 11: Failure Handling and Recovery
Part 12: Security and Compliance
Part 13: Scaling Strategies
Part 14: Cost Analysis
Part 15: Monitoring and Observability
Part 16: Trade-offs Discussed
Part 17: Key Takeaways
Part 18: Homework Assignment

Part 1: The Real Problem

Most teams go through the same progression with secrets. First, someone hardcodes a database password. A senior engineer catches it in code review and says “use environment variables.” The team moves to .env files. Everyone feels secure.

Except .env files are plain text on disk. They show up in Docker image layers when a Dockerfile copies the build context and no .dockerignore rule excludes them. They get committed to repositories when someone mistypes a .gitignore entry. They get pasted into Slack when a new developer needs access. They get captured in crash dumps and debug logs.

The real problem is not that developers are careless. The problem is that the insecure path (copy this file, paste this string) requires fewer steps than the secure path (authenticate, fetch from vault, inject at runtime). Any system you design has to invert that equation.

The tools that exist are good but imperfect. HashiCorp Vault is powerful and complex: it requires dedicated operational knowledge most small teams do not have. AWS Secrets Manager is simple but vendor-locked at $0.40 per secret per month. Doppler and Infisical (25,000+ GitHub stars) are strong but may not fit every multi-tenant or compliance model.

If you were designing a secrets management platform from zero, what would the architecture look like?

Part 2: Requirements and Constraints

Functional Requirements

Store secrets with typed values (text, JSON, YAML, boolean, integer, password) organized by organization, project and environment
Encrypt at rest using envelope encryption where each secret gets its own unique data encryption key
Version every change with full history, diff capability and rollback
Control access at the environment level with separate permissions for viewing metadata vs. decrypting values
Authenticate via OAuth, email/password with 2FA, scoped API tokens for CI/CD, and CLI tokens
Audit every access with immutable append-only logs
Sync secrets to third-party CI/CD providers (GitHub Actions, GitLab CI) on change
Inject at runtime through a CLI so .env files never need to touch disk

Non-Functional Requirements

Latency: Secret retrieval under 50ms p99 (dominated by decryption and key resolution time)
Availability: 99.9% uptime minimum. If the secrets platform is down, no application can cold-start
Consistency: A revoked secret must be unreadable within seconds, not minutes
Multi-tenancy: Complete data isolation between organizations at the database level
Key rotation: Rotate encryption keys without re-encrypting every secret immediately

Constraints

PostgreSQL as the primary datastore (mature, relational, well understood)
Must support both self-hosted and cloud-managed deployment
CLI must work offline for local config, online for secret operations

Core Challenges

Encryption at scale: Per-secret DEKs mean every read and write involves cryptographic operations
Key management chicken-and-egg: The system that manages secrets needs its own secret (the root key) to operate
Multi-tenant isolation: One organization must never see another’s data, even at the database level
CLI developer experience: If the CLI is harder than copying a .env file, nobody uses it
Zero-downtime key rotation: Rotating encryption keys without re-encrypting thousands of secrets simultaneously

Part 3: Back-of-the-Envelope Estimation

Before designing the architecture, let us estimate the scale this system needs to handle.

Traffic Estimation

Target: A mid-scale SaaS serving 500 organizations with 50 projects each, averaging 3 environments per project.

Organizations:           500
Projects per org:        50
Environments per project: 3
Total environments:      500 × 50 × 3 = 75,000

Secrets per environment:  30 (average)

Total secrets:           75,000 × 30 = 2,250,000

Secret reads per day:
  - CI/CD pipeline runs:  ~200,000 inject calls/day
  - Developer CLI pulls:  ~50,000 pulls/day
  - Dashboard views:      ~30,000 views/day
  Total reads:            ~280,000/day → ~3.2 reads/sec (average)
  Peak reads (deploy hour): ~10× average → 32 reads/sec

Secret writes per day:
  - Manual updates:       ~5,000/day
  - Sync pushes:          ~2,000/day
  Total writes:           ~7,000/day → ~0.08 writes/sec

This is not Twitter-scale traffic. Secrets management is read-heavy but low-throughput compared to consumer apps. The bottleneck is not requests per second. The bottleneck is cryptographic operations per request and KMS call latency.

Storage Estimation

Secret version row:
  - UUID (16 bytes) + secret_id (16 bytes) + type (10 bytes)
  - ciphertext (~200 bytes avg, base64-encoded encrypted value)
  - wrapped_dek (~120 bytes, base64-encoded wrapped key)
  - version int (4 bytes) + kms metadata (20 bytes)
  - timestamps (16 bytes)
  Total per version: ~400 bytes

Total secret versions (assume avg 5 versions per secret):
  2,250,000 × 5 = 11,250,000 versions
  11,250,000 × 400 bytes = ~4.5 GB

Audit log row: ~300 bytes
  280,000 reads/day × 365 days × 300 bytes = ~30 GB/year

Total storage (Year 1): ~35 GB

Total storage (3-year retention): ~100 GB

PostgreSQL handles this comfortably on a single node. No sharding required at this scale.
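The traffic and storage figures above can be reproduced with a few lines of arithmetic. This is only a sketch that restates the article's inputs as constants (the constant names are made up for illustration), so the numbers are easy to re-run with your own assumptions.

```python
# Inputs copied from the estimates above.
ORGS = 500
PROJECTS_PER_ORG = 50
ENVS_PER_PROJECT = 3
SECRETS_PER_ENV = 30
VERSIONS_PER_SECRET = 5
VERSION_ROW_BYTES = 400
AUDIT_ROW_BYTES = 300
READS_PER_DAY = 200_000 + 50_000 + 30_000  # CI/CD + CLI pulls + dashboard
WRITES_PER_DAY = 5_000 + 2_000             # manual updates + sync pushes
SECONDS_PER_DAY = 24 * 60 * 60

environments = ORGS * PROJECTS_PER_ORG * ENVS_PER_PROJECT
total_secrets = environments * SECRETS_PER_ENV
avg_reads_per_sec = READS_PER_DAY / SECONDS_PER_DAY
peak_reads_per_sec = avg_reads_per_sec * 10   # "deploy hour" multiplier
avg_writes_per_sec = WRITES_PER_DAY / SECONDS_PER_DAY

version_bytes = total_secrets * VERSIONS_PER_SECRET * VERSION_ROW_BYTES
audit_bytes_per_year = READS_PER_DAY * 365 * AUDIT_ROW_BYTES

print(f"environments:        {environments:,}")
print(f"secrets:             {total_secrets:,}")
print(f"avg reads/sec:       {avg_reads_per_sec:.1f}")
print(f"peak reads/sec:      {peak_reads_per_sec:.0f}")
print(f"avg writes/sec:      {avg_writes_per_sec:.2f}")
print(f"version storage GB:  {version_bytes / 1e9:.1f}")
print(f"audit GB per year:   {audit_bytes_per_year / 1e9:.1f}")
```

Changing SECRETS_PER_ENV or the read mix is a quick way to see how far the single-node conclusion stretches.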

Compute Estimation

Per secret read (inject call for 30 secrets):
  - Auth token validation:    ~1ms
  - Permission check:         ~2ms (cached)
  - KEK resolution:           ~1ms (cache hit) or ~50ms (cache miss + KMS call)
  - 30× DEK unwrap + decrypt: ~30 × 0.5ms = ~15ms
  - DB query (batch):         ~5ms
  Total: ~25ms (cache hit) or ~75ms (cache miss)

CPU per decrypt operation: ~0.05ms on modern hardware
Peak: 32 req/sec × 30 secrets × 0.05ms = 48ms CPU time/sec

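A single 4-core server handles peak load at under 2% CPU utilization. The platform is I/O-bound (database, KMS calls), not CPU-bound. Note that the ~75ms cache-miss path exceeds the 50ms p99 target from Part 2, so meeting that target depends on keeping KEK cache misses below 1% of requests.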

Part 4: High-Level Architecture

The platform needs three client surfaces: a web dashboard for humans, a CLI for developers, and an API for CI/CD service accounts.

┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Web UI     │    │     CLI      │    │  CI/CD API   │
│ (dashboard)  │    │   (binary)   │    │  (tokens)    │
└──────┬───────┘    └──────┬───────┘    └──────┬───────┘
       │                   │                   │
       └───────────────────┼───────────────────┘
                           │
              ┌────────────▼────────────┐
              │     Backend API         │
              │  (auth, encryption,     │
              │   RBAC, audit)          │
              └────────────┬────────────┘
                           │
         ┌─────────────────┼─────────────────┐
         │                 │                 │
┌────────▼──────┐  ┌──────▼───────┐  ┌─────▼──────┐
│  PostgreSQL   │  │    Redis     │  │  KMS Layer  │
│  (secrets,    │  │  (cache,     │  │  (local or  │
│   versions,   │  │   queues,    │  │  cloud HSM) │
│   audit logs) │  │   locks)     │  │             │
└───────────────┘  └──────────────┘  └─────────────┘

Why PostgreSQL and not a dedicated secrets backend? Secrets platforms are metadata-heavy: versions, audit logs, permissions, organizations, projects, environments. That is relational data. PostgreSQL handles it well. The encryption barrier means PostgreSQL never sees plaintext. It stores ciphertext. The security boundary lives in the application layer, not the database.

Why a standalone CLI binary? Developers need to install the CLI without runtime dependencies. A language like Go compiles to a single static binary, handles cross-platform builds and has standard library support for AES-GCM encryption. The CLI also needs to encrypt its own stored tokens at rest, which means running its own encryption independent of the server.

Why Redis? Two jobs. First, caching resolved encryption keys so the system does not call the KMS provider on every decryption. Second, distributed locks to prevent thundering herd problems when a cached key expires and 50 concurrent requests all try to resolve it at once.

Services Decomposition

At moderate scale, a monolithic backend serves well. As the platform grows, these are the natural service boundaries:

┌──────────────────────────────────────────────────┐
│                  API Gateway                     │
│         (rate limiting, routing, auth)           │
└────────┬──────────┬──────────┬──────────┬────────┘
         │          │          │          │
    ┌────▼───┐ ┌────▼───┐ ┌────▼───┐ ┌────▼────┐
    │ Auth   │ │ Secret │ │ Audit  │ │  Sync   │
    │Service │ │Service │ │Service │ │ Service │
    └────────┘ └────────┘ └────────┘ └─────────┘

Auth Service: Token issuance, OAuth flows, 2FA verification, permission resolution
Secret Service: CRUD operations, encryption/decryption, version management
Audit Service: Append-only log ingestion, search, compliance reporting
Sync Service: Background jobs pushing secrets to GitHub Actions, GitLab CI, etc.

The Secret Service is the only component that touches encryption keys. This minimizes the blast radius: a vulnerability in the Sync Service cannot access the KMS layer.

Part 5: Encryption Deep Dive (DEK + KEK)

This is the core of the system. Get this wrong and nothing else matters.

The Envelope Encryption Pattern

Every secret is encrypted with its own unique Data Encryption Key (DEK). The DEK is then encrypted by a Key Encryption Key (KEK). The database stores two blobs per secret version: the encrypted secret and the encrypted DEK.

Why not encrypt everything with one master key? Because a single key compromise exposes every secret in the system. With per-secret DEKs, a compromised DEK exposes one secret. The KEK is the crown jewel, but it never touches the database in plaintext.

The Encryption Flow

Plaintext: "postgres://user:pass@db.prod:5432/app"
                    │
                    ▼
    ┌───────────────────────────────┐
    │ 1. Generate random DEK        │
    │    32 bytes from CSPRNG       │
    └───────────────┬───────────────┘
                    │
                    ▼
    ┌───────────────────────────────┐
    │ 2. Encrypt plaintext with DEK │
    │    Algorithm: AES-256-GCM     │
    │    IV:  12 random bytes       │
    │    Tag: 16 bytes (integrity)  │
    └───────────────┬───────────────┘
                    │
           ciphertext = base64(IV ‖ Tag ‖ Ciphertext)
                    │
                    ▼
    ┌───────────────────────────────┐
    │ 3. Encrypt DEK with KEK       │
    │    AES-256-GCM (separate IV)  │
    │    IV2: 12 random bytes       │
    │    Tag2: 16 bytes             │
    └───────────────┬───────────────┘
                    │
           wrapped_dek = base64(IV2 ‖ Tag2 ‖ WrappedDEK)
                    │
                    ▼
    ┌───────────────────────────────┐
    │ 4. Store in database          │
    │    → ciphertext (secret data) │
    │    → wrapped_dek (encrypted   │
    │      DEK)                     │
    │    → kms_key_version          │
    │      (which KEK version)      │
    └───────────────────────────────┘

Implementation

function encrypt(plaintext, kmsKey):
    // Step 1: Unique DEK per secret
    dek = crypto.randomBytes(32)
    
    // Step 2: Encrypt plaintext with DEK
    iv1 = crypto.randomBytes(12)
    (ciphertext, tag1) = AES_256_GCM_encrypt(plaintext, dek, iv1)
    
    // Step 3: Resolve KEK from KMS, then wrap the DEK
    kek = kmsLayer.resolveKey(kmsKey)
    iv2 = crypto.randomBytes(12)
    (wrappedDek, tag2) = AES_256_GCM_encrypt(dek, kek, iv2)
    
    // Step 4: Pack for storage
    return {
        ciphertext:   base64(iv1 + tag1 + ciphertext),
        wrapped_dek:  base64(iv2 + tag2 + wrappedDek),
        key_version:  kmsKey.materialVersion
    }

Pseudocode. The actual implementation depends on your language’s crypto library (OpenSSL for PHP/Python, crypto/aes for Go, Web Crypto for Node). Verify IV sizes and tag handling against your specific library’s API.

Three things to notice:

Every secret gets a fresh DEK from a cryptographically secure random number generator. No key reuse.
The IV is packed with the ciphertext, not stored separately. One column, one blob, all the data needed for decryption.
GCM provides authenticated encryption. The auth tag means tampering with ciphertext in the database causes decryption to fail rather than producing corrupted output.
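The packed layout is what the decrypt path has to split apart before it can call the cipher. A minimal sketch of that packing and unpacking, assuming the 12-byte IV and 16-byte tag sizes from the flow above (the function names are illustrative, and no actual encryption happens here):

```python
import base64

IV_LEN = 12   # bytes, as in the encryption flow
TAG_LEN = 16  # bytes, GCM authentication tag

def pack(iv: bytes, tag: bytes, body: bytes) -> str:
    """Build the single-column blob: base64(IV || Tag || Ciphertext)."""
    if len(iv) != IV_LEN or len(tag) != TAG_LEN:
        raise ValueError("unexpected IV or tag length")
    return base64.b64encode(iv + tag + body).decode("ascii")

def unpack(blob: str):
    """Split a stored blob back into (iv, tag, ciphertext)."""
    raw = base64.b64decode(blob, validate=True)
    if len(raw) < IV_LEN + TAG_LEN:
        raise ValueError("blob too short to contain IV and tag")
    iv = raw[:IV_LEN]
    tag = raw[IV_LEN:IV_LEN + TAG_LEN]
    body = raw[IV_LEN + TAG_LEN:]
    return iv, tag, body
```

Rejecting short or malformed blobs before they reach the cipher gives a clearer error than a generic decryption failure.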

KEK Resolution and Caching

The KEK must be resolved from a KMS layer, which is the most latency-sensitive operation in the system. You need a caching strategy:

Check the in-memory cache for the KEK keyed by kms_key_id:version
If cache miss, acquire a distributed lock (Redis) to prevent thundering herd
After acquiring the lock, double-check the cache (another request may have populated it)
If still missing, call the KMS provider (local decryption or cloud API call)
Encrypt the KEK before it goes into the cache. If Redis is compromised, the attacker gets encrypted blobs
Store the encrypted result in cache with a short TTL (5 minutes is reasonable)

The double-check locking pattern is critical. Without it, a cache expiry under high concurrency causes N simultaneous KMS calls, which is both slow and expensive if you are using a cloud KMS billed per API call.
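The check, lock, re-check sequence can be sketched in a single process. This is an illustration under simplifying assumptions: a threading.Lock stands in for the Redis distributed lock, a plain dict stands in for the cache, the cached value is not encrypted, and fetch_from_kms is a hypothetical callable supplied by the caller.

```python
import threading
import time

class KekCache:
    def __init__(self, fetch_from_kms, ttl_seconds=300.0):
        self._fetch = fetch_from_kms      # hypothetical KMS call
        self._ttl = ttl_seconds
        self._entries = {}                # cache_key -> (kek, expires_at)
        self._locks = {}                  # cache_key -> per-key lock
        self._locks_guard = threading.Lock()

    def _fresh(self, cache_key):
        entry = self._entries.get(cache_key)
        if entry is not None and entry[1] > time.monotonic():
            return entry[0]
        return None

    def _lock_for(self, cache_key):
        with self._locks_guard:
            return self._locks.setdefault(cache_key, threading.Lock())

    def get(self, key_id, version):
        cache_key = f"{key_id}:{version}"
        kek = self._fresh(cache_key)              # first check, no lock held
        if kek is not None:
            return kek
        with self._lock_for(cache_key):
            kek = self._fresh(cache_key)          # second check, under the lock
            if kek is None:
                kek = self._fetch(key_id, version)
                self._entries[cache_key] = (kek, time.monotonic() + self._ttl)
            return kek
```

With many concurrent callers and a cold cache, only the first thread to take the lock reaches the KMS; the rest find the entry on their second check.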

Supported KMS Providers

Design the KMS layer as an abstraction with swappable backends:

Local KMS: The KEK is stored in the database encrypted by the application’s master key (an environment variable). This is a chicken-and-egg problem: the system that manages secrets has its own secret stored as an environment variable. For self-hosted deployments where the operator controls the server, this is acceptable.

Cloud KMS (AWS KMS, GCP Cloud KMS, Azure Key Vault): The KEK is wrapped by the cloud provider’s customer master key. The application calls the cloud API to unwrap it at runtime. The provider’s master key material never leaves its hardware security module; only the unwrapped KEK is returned to the application. This removes the chicken-and-egg problem by pushing the trust boundary to the cloud provider’s HSMs.

Key Rotation Without Downtime

When you rotate a KEK, new secrets use the new key version. Old secrets continue working because each stored secret version records which KEK version encrypted its DEK.

This is lazy rotation: old secrets are not re-encrypted immediately. They continue decrypting fine with the old KEK version (which remains in the key table but marked as retired). When a secret is next updated, the new version uses the current KEK. No bulk migration. No downtime. No “please wait while we re-encrypt 50,000 secrets.”
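The bookkeeping behind lazy rotation is small: writes always use the newest KEK version, and reads look up whichever version the stored row recorded. A sketch with hypothetical names (KeyRing, StoredVersion) and raw bytes standing in for real key material:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StoredVersion:
    wrapped_dek: str        # opaque blob, not interpreted here
    kms_key_version: int    # KEK version that wrapped the DEK

class KeyRing:
    def __init__(self):
        self._keks = {}     # version number -> key bytes (retired ones stay)
        self.current = 0

    def rotate(self, new_kek: bytes) -> int:
        """Activate a new KEK version; older versions remain readable."""
        self.current += 1
        self._keks[self.current] = new_kek
        return self.current

    def for_write(self):
        """New secret versions always use the current KEK."""
        return self.current, self._keks[self.current]

    def for_read(self, stored: StoredVersion) -> bytes:
        """Old rows keep decrypting with the version they recorded."""
        return self._keks[stored.kms_key_version]
```

A retired version can only be dropped from the ring once no stored row references it, which is why the version number travels with every row.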

Part 6: Database Design

Core Schema

The data model follows a hierarchical multi-tenant structure:

Organization
  └── Project
       └── Environment (dev, staging, production)
            └── Secret
                 └── SecretVersion (encrypted data + metadata)

The critical tables:

-- Secrets are key-value pairs scoped to an environment
CREATE TABLE secrets (
    id          UUID PRIMARY KEY,
    env_id      UUID REFERENCES environments(id) ON DELETE CASCADE,
    key         VARCHAR(255),
    description TEXT,
    sort_order  SMALLINT DEFAULT 1,
    is_deleted  BOOLEAN DEFAULT false,
    created_at  TIMESTAMP,
    updated_at  TIMESTAMP,
    UNIQUE(env_id, key)
);

-- KMS key material with versioning for rotation
-- (created before secret_versions, which references it)
CREATE TABLE kms_keys (
    id                UUID PRIMARY KEY,
    organization_id   UUID REFERENCES organizations(id),
    provider          VARCHAR(50),   -- 'local' or 'aws'
    algorithm         VARCHAR(50) DEFAULT 'aes-256-gcm',
    material_version  INTEGER DEFAULT 1,
    wrapped_material  TEXT,          -- encrypted key material
    activated_at      TIMESTAMP,
    retired_at        TIMESTAMP
);

-- Every mutation creates a new version (append-only)
CREATE TABLE secret_versions (
    id                UUID PRIMARY KEY,
    secret_id         UUID REFERENCES secrets(id) ON DELETE CASCADE,
    data_type         VARCHAR(50) DEFAULT 'text',
    ciphertext        TEXT,    -- base64(IV + Tag + EncryptedData)
    wrapped_dek       TEXT,    -- base64(IV + Tag + WrappedDEK)
    version           INTEGER DEFAULT 1,
    kms_key_id        UUID REFERENCES kms_keys(id),
    kms_key_version   INTEGER DEFAULT 1,
    created_at        TIMESTAMP,
    UNIQUE(secret_id, version)
);

-- Immutable audit log
CREATE TABLE audit_logs (
    id              UUID PRIMARY KEY,
    org_id          UUID REFERENCES organizations(id),
    actor_id        UUID,
    actor_type      VARCHAR(20),    -- 'user', 'service_account', 'system'
    action          VARCHAR(50),    -- 'secret.read', 'secret.reveal', 'secret.create'
    resource_type   VARCHAR(50),
    resource_id     UUID,
    metadata        JSONB,          -- request IP, user agent, etc.
    created_at      TIMESTAMP
);

Design Decisions

UUIDs over auto-increment. Random (version 4) UUIDs prevent enumeration attacks (an attacker cannot guess the next secret ID). They also allow distributed ID generation without sequence coordination if you ever shard.

Soft deletes on secrets. When someone accidentally deletes DB_PASSWORD at 3 AM, you want a restore command, not a post-mortem. Soft-deleted secrets are invisible to normal queries but recoverable through a restore endpoint. A separate permanent-delete operation exists for compliance where data must be fully purged.

Append-only versioning. Every secret change creates a new secret_version row. The current value is the highest version number. This gives you a full audit trail, instant rollback by pointing to a previous version, and the ability to diff between versions during an incident. The secret row is a container. The versions hold the encrypted data.

Two separate columns for ciphertext and wrapped DEK. You could combine them into one blob, but separating them makes key rotation operations cleaner. When re-encrypting with a new KEK, you only touch wrapped_dek. The ciphertext column (encrypted by the DEK, not the KEK) does not change.

Separate audit log table. Audit logs are append-only and never updated. They grow fast and are queried differently from operational data (time-range scans, not point lookups). Keeping them in a separate table allows independent indexing, partitioning by month, and archival to cold storage without touching the secrets tables.

Part 7: Access Control Model

The Permission Hierarchy

Access control operates at three levels, each narrowing the scope:

Organization (admin, billing, team management)
  └── Project (view, manage environments)
       └── Environment (read, reveal, create, update, delete, export)

The critical design decision is separating read from reveal. A developer with read permission can see that DB_PASSWORD exists, view its type, version number and last-updated timestamp. A developer with reveal permission can decrypt and view the actual value.

Why separate them? Because most operations do not need the plaintext. Listing secrets for a diff, checking version history, running audit reports: none of these require decryption. Separating the permissions reduces the number of users who trigger decryption, which shrinks both the attack surface and the audit noise.
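One way to picture the read/reveal split is a handler that always returns metadata but only calls the decrypt function when the caller both asks for the value and holds reveal. The names below (Perm, get_secret, the decrypt callable) are illustrative assumptions, not the platform's actual API.

```python
from enum import Flag, auto

class Perm(Flag):
    READ = auto()
    REVEAL = auto()
    CREATE = auto()
    UPDATE = auto()
    DELETE = auto()
    EXPORT = auto()

def get_secret(granted: Perm, metadata: dict, decrypt, reveal: bool = False) -> dict:
    """Return metadata; decrypt only for an explicit, permitted reveal."""
    if Perm.READ not in granted:
        raise PermissionError("read permission required")
    view = dict(metadata)
    if reveal:
        if Perm.REVEAL not in granted:
            raise PermissionError("reveal permission required")
        view["value"] = decrypt()   # the only place plaintext is produced
    return view
```

Because decrypt is passed in as a callable, listing and history views never touch key material at all, which is the attack-surface reduction the paragraph above describes.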

Identity Sources

Different consumers authenticate differently. Design for multiple auth methods:

Human users via OAuth (Google, GitHub) or email/password with 2FA
CI/CD pipelines via scoped service account tokens that grant access to specific projects and environments
CLI sessions via short-lived tokens with refresh capability, stored encrypted on the developer’s machine

A service account token for a deployment pipeline might have read, reveal and export on production but no create, update or delete. If the token leaks, the attacker can read secrets from one environment but cannot modify them or access other environments.

Protected Environments

Production environments can be marked as protected. Any secret change in a protected environment requires an approval workflow: a second team member reviews and approves the change before it takes effect. This prevents a single compromised account from silently modifying production credentials.

Part 8: API Design

Core Endpoints

GET    /secrets                  → List secrets (metadata only)
GET    /secrets/{key}            → Get single secret (with optional reveal)
POST   /secrets/{key}            → Create or update a secret
DELETE /secrets/{key}            → Soft delete
DELETE /secrets/{key}/permanent  → Irreversible purge
GET    /secrets/{key}/history    → Version history
POST   /secrets/{key}/rollback   → Restore a previous version
POST   /secrets/{key}/restore    → Undelete a soft-deleted secret

Sync Endpoints

POST   /sync/push    → Upload local .env/json/yaml to server
GET    /sync/pull    → Download secrets as .env/json/yaml
POST   /sync/diff    → Compare local file against remote state

The push endpoint parses the uploaded file, diffs against existing secrets and returns a summary: { created: 3, updated: 1, unchanged: 12, errors: 0 }. It does not blindly overwrite. The developer sees what will change before confirming.
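A rough sketch of how such a summary could be computed for a .env upload. It assumes a deliberately simple KEY=VALUE parser (no quoting or multi-line values) and that the server has already decrypted the current remote values into a dict; both function names are hypothetical.

```python
def parse_env(text: str):
    """Parse KEY=VALUE lines; return (values, number_of_malformed_lines)."""
    values, errors = {}, 0
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blanks and comments
        key, sep, value = line.partition("=")
        if not sep or not key.strip():
            errors += 1                   # no '=' or empty key
            continue
        values[key.strip()] = value.strip()
    return values, errors

def summarize_push(local_text: str, remote: dict) -> dict:
    """Classify each uploaded key against the current remote state."""
    local, errors = parse_env(local_text)
    summary = {"created": 0, "updated": 0, "unchanged": 0, "errors": errors}
    for key, value in local.items():
        if key not in remote:
            summary["created"] += 1
        elif remote[key] != value:
            summary["updated"] += 1
        else:
            summary["unchanged"] += 1
    return summary
```

Keys that exist remotely but are missing from the upload are left alone here, which matches the "does not blindly overwrite" behaviour; whether a push should ever delete is a separate design decision.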

Inject Endpoint

POST   /inject    → Returns all decrypted secrets as key-value pairs

This powers the CLI’s runtime injection. The CLI calls this endpoint, receives all secrets for the target environment and injects them as environment variables into a subprocess:

# Secrets injected into the process, never written to disk
secretscli inject -- npm start

# Or start a subshell with secrets loaded
secretscli inject --shell

The inject response should filter dangerous environment variables (LD_PRELOAD, PATH, LD_LIBRARY_PATH, DYLD_INSERT_LIBRARIES) that could be abused for privilege escalation if an attacker managed to write malicious values to the secrets store.
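The filtering and subprocess injection can be sketched as follows. The blocklist is the one named above; the function names are hypothetical, and the filter is shown on the CLI side purely for illustration (the same check could equally run on the server before the response is built).

```python
import os
import subprocess
import sys

BLOCKED = {"LD_PRELOAD", "PATH", "LD_LIBRARY_PATH", "DYLD_INSERT_LIBRARIES"}

def build_child_env(secrets: dict, base_env=None):
    """Merge secrets into a copy of the environment, skipping blocked names."""
    env = dict(os.environ if base_env is None else base_env)
    skipped = []
    for name, value in secrets.items():
        if name.upper() in BLOCKED:
            skipped.append(name)
            continue
        env[name] = value
    return env, skipped

def run_with_secrets(command: list, secrets: dict) -> int:
    """Run a command with secrets in its environment only, never on disk."""
    env, skipped = build_child_env(secrets)
    for name in skipped:
        print(f"warning: ignoring blocked variable {name}", file=sys.stderr)
    return subprocess.run(command, env=env).returncode
```

The parent process environment is never modified, so the secrets disappear when the child exits.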

Response Envelope

Every response follows a consistent envelope:

{
    "error": false,
    "message": "Secrets retrieved successfully",
    "data": { ... }
}

Consistent envelopes simplify CLI parsing and error handling. The CLI does not need to guess whether a 200 response contains data or an error.

Part 9: CLI Architecture

Token Encryption at Rest

The CLI stores authentication tokens in a local config file. Those tokens must be encrypted at rest because a compromised laptop should not give an attacker plaintext API tokens.

The pattern: generate a 32-byte AES key on first use, store it in a key file with restricted permissions (0600 on Unix), and use AES-256-GCM to encrypt each token before writing the config file. Encrypted tokens use a prefix like enc: so the CLI can distinguish them from plaintext during migration.

function encryptToken(token, keyPath):
    key = loadOrCreateKey(keyPath)  // 32-byte AES key, file perms 0600
    nonce = crypto.randomBytes(12)
    ciphertext = AES_256_GCM_seal(key, nonce, token)
    return "enc:" + base64(nonce + ciphertext)

function decryptToken(encryptedToken, keyPath):
    key = loadKey(keyPath)
    if key file permissions != 0600:
        ABORT("Insecure key file permissions")
    payload = base64.decode(encryptedToken.removePrefix("enc:"))
    nonce = payload[0:12]
    ciphertext = payload[12:]
    return AES_256_GCM_open(key, nonce, ciphertext)

Pseudocode. Go’s crypto/aes + cipher.NewGCM, Python's cryptography.hazmat, and Node's crypto.createCipheriv all support this pattern.

The permission check on the key file is important. If a backup tool or misconfigured deployment changes the permissions, the CLI should refuse to operate rather than silently using a key file that anyone on the system can read.
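A minimal sketch of the key-file handling on a POSIX system, using only the standard library. The function name is illustrative, and the encryption step itself is left out because AES-GCM needs a third-party library in Python.

```python
import os
import secrets
import stat

def load_or_create_key(path: str) -> bytes:
    """Return the 32-byte key at path, creating it with mode 0600 if absent."""
    if not os.path.exists(path):
        # O_EXCL refuses to overwrite a key created by a concurrent process.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
        with os.fdopen(fd, "wb") as handle:
            handle.write(secrets.token_bytes(32))
    # Check permissions before reading any key material.
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode != 0o600:
        raise PermissionError(f"insecure key file permissions: {oct(mode)}")
    with open(path, "rb") as handle:
        key = handle.read()
    if len(key) != 32:
        raise ValueError("key file must contain exactly 32 bytes")
    return key
```

Failing closed on a mode mismatch means a developer has to fix the permissions by hand, which is noisier than a silent fallback but much easier to notice.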

Context Defaults

Typing --org acme --project api --env production on every command is a developer experience failure. The CLI should store a default context:

# Set once
secretscli context set --org acme --project api --env production

# Use everywhere (no flags needed)
secretscli secret list
secretscli secret get DB_PASSWORD
secretscli inject -- npm start

This is the difference between a CLI that developers tolerate and one they actually use. Flags can override the saved context for one-off operations across different environments.

Request Integrity

Every CLI request should include an HMAC checksum header computed from the request body. The server validates the checksum to detect tampering in transit. If the checksum does not match, the request is rejected before any secret operation occurs. This adds a layer of integrity verification on top of TLS.
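An HMAC needs a key that both sides know, which the description above leaves open. The sketch below assumes a per-session shared secret and HMAC-SHA256; the key source, the hash choice and the function names are all assumptions made for illustration.

```python
import hashlib
import hmac

def sign_body(shared_key: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw request body, sent as a header value."""
    return hmac.new(shared_key, body, hashlib.sha256).hexdigest()

def verify_body(shared_key: bytes, body: bytes, received: str) -> bool:
    """Recompute the checksum and compare in constant time."""
    expected = sign_body(shared_key, body)
    return hmac.compare_digest(expected.encode("ascii"), received.encode("utf-8"))
```

The constant-time comparison matters: a plain == on the hex strings could leak how many leading characters matched.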

Part 10: Secret Lifecycle

Versioning

Every mutation creates a new version row. The secret itself is a container; the versions hold the encrypted data.

Secret: DB_PASSWORD (env: production)
├── v1: created Jan 15 by alice   (encrypted blob A)
├── v2: created Feb 01 by bob     (encrypted blob B)
└── v3: created Mar 10 by alice   (encrypted blob C) ← current

Rolling back to v2 creates v4 with the same plaintext as v2 but a fresh encryption (new DEK, current KEK version). The history is append-only. You always see that a rollback happened and who triggered it.
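Rollback-as-append can be modelled in a few lines. In this sketch a plaintext string stands in for the encrypted blob (a real rollback would decrypt the old version and re-encrypt it with a fresh DEK), and the class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Version:
    number: int
    author: str
    plaintext: str                       # stand-in for the encrypted blob
    rolled_back_from: Optional[int] = None

class SecretHistory:
    def __init__(self):
        self.versions = []               # append-only list

    def write(self, author, plaintext, rolled_back_from=None):
        version = Version(len(self.versions) + 1, author, plaintext, rolled_back_from)
        self.versions.append(version)
        return version

    def current(self):
        return self.versions[-1]

    def rollback(self, target: int, author: str):
        """Append a new version that copies the target's value."""
        if not 1 <= target <= len(self.versions):
            raise ValueError(f"no such version: {target}")
        source = self.versions[target - 1]
        return self.write(author, source.plaintext, rolled_back_from=target)
```

Nothing is ever overwritten, so the history itself records who rolled back and to what.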

Secret Sharing

Sometimes you need to share a credential with someone outside the platform: a contractor, a partner team, a support engineer. Design time-limited share links with:

Optional password protection
Maximum view count (“viewable 3 times, then self-destructs”)
Expiration timestamp
Audit trail for every view

The share link decrypts the secret server-side and displays it to the recipient on each permitted view. The recipient never gets permanent access. Once it expires or hits the view limit, the link is dead.
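The expiry and view-count rules reduce to a small state check that runs before any decryption. A sketch with hypothetical names, leaving out password protection and the audit write:

```python
import time
from dataclasses import dataclass

class LinkUnavailable(Exception):
    pass

@dataclass
class ShareLink:
    expires_at: float    # Unix timestamp
    max_views: int
    views: int = 0

    def consume_view(self, now=None) -> int:
        """Record one view and return how many remain, or raise if dead."""
        now = time.time() if now is None else now
        if now >= self.expires_at:
            raise LinkUnavailable("link expired")
        if self.views >= self.max_views:
            raise LinkUnavailable("view limit reached")
        self.views += 1
        return self.max_views - self.views
```

In a real deployment the counter increment would need to be atomic in the database so two simultaneous requests cannot both claim the last view.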

Leak Detection

When a developer sets a secret typed as password, check it against the Have I Been Pwned Pwned Passwords API using k-anonymity (send only the first 5 hex characters of the SHA-1 hash, so the full password never leaves your server). If the password appears in a known breach database, return a warning. Require explicit confirmation before storing a known-compromised credential.
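The k-anonymity lookup has two local steps around the network call: split the SHA-1 hash into a 5-character prefix and a suffix, then scan the returned SUFFIX:COUNT lines for a match. The sketch below covers only those two steps (function names are illustrative); the HTTP request to the range endpoint is deliberately left to the caller.

```python
import hashlib

def split_sha1(password: str):
    """Return (prefix, suffix): only the 5-char prefix is sent to the API."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]

def breach_count(suffix: str, range_response: str) -> int:
    """Find our suffix among the 'SUFFIX:COUNT' lines returned for a prefix."""
    for line in range_response.splitlines():
        candidate, _, count = line.strip().partition(":")
        if candidate.upper() == suffix:
            return int(count)
    return 0
```

A count above zero would trigger the warning and the explicit confirmation step; a count of zero only means the password is absent from that data set, not that it is strong.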

Sync to Third-Party Providers

Secrets can be synced to GitHub Actions, GitLab CI or other providers through integration connections. When a secret changes, a background job pushes the updated value to the configured provider via their API. Developers update secrets in one place and the CI/CD pipeline picks up changes automatically. No manual copy-paste between platforms.

Part 11: Failure Handling and Recovery

A secrets platform sits on the critical path of every deployment. When it fails, applications cannot start. Failure handling must be designed, not bolted on.

Scenario 1: KMS Provider Unavailable

Problem: Cloud KMS API (AWS KMS, GCP) returns errors or times out.
Impact: Cannot resolve KEKs → cannot decrypt any secrets.

Mitigation:
  1. KEK cache (Redis) serves requests during outage (5-min TTL)
  2. Extend cache TTL automatically when KMS health check fails
  3. Circuit breaker opens after 3 consecutive KMS failures
  4. Alert fires immediately - this is a P0 incident

Recovery:
  When KMS recovers, circuit breaker closes automatically.
  No data loss - all encrypted data remains intact in PostgreSQL.
  Cache repopulates on next request.
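The circuit breaker in step 3 can be sketched as a small wrapper around the KMS call. The threshold of 3 comes from the scenario above; the 30-second cool-down, the class names and the injectable clock are assumptions for illustration.

```python
import time

class CircuitOpenError(RuntimeError):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after      # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("KMS circuit is open")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # open, or re-open after a failed trial
            raise
        self.failures = 0                       # success closes the circuit
        self.opened_at = None
        return result
```

While the circuit is open, callers fail fast and can fall back to the cached KEK instead of waiting on a KMS timeout.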

Scenario 2: Redis Cache Failure

Problem: Redis cluster goes down.
Impact: Every decrypt request hits KMS directly (slow + expensive).

Mitigation:
  1. Fall back to in-memory LRU cache on the application server
  2. In-memory cache has shorter TTL (60 seconds) to limit stale data
  3. Rate-limit KMS calls per organization to prevent bill shock
  4. Queue non-urgent decrypt requests, process urgent ones only

Recovery:
  Redis recovery is automatic with Sentinel or Cluster failover.
  In-memory cache continues serving during failover window.

Scenario 3: Write Fails Mid-Operation

Problem: Application creates a new secret version, encrypts the DEK,
         but the database write fails after encryption.
Impact: Orphaned encryption operation. Wasted KMS call.

Mitigation:
  1. Wrap encrypt + DB write in a database transaction
  2. If DB write fails, the transaction rolls back
  3. The generated DEK is discarded (never stored)
  4. Return error to client: "Secret update failed, please retry"
  5. No partial state: either the full version row exists or nothing does

This is why encryption happens in the application layer, not the database.
The DB transaction guarantees atomicity of the write.

Scenario 4: Corrupt Ciphertext

Problem: Ciphertext or wrapped_dek is corrupted (bit flip, storage error).
Impact: AES-GCM decryption fails — the auth tag check catches corruption.

Mitigation:
  1. GCM's authentication tag detects tampering/corruption automatically
  2. Return clear error: "Decryption failed for secret X version Y"
  3. Fall back to previous version if available
  4. Alert on repeated decryption failures (possible attack indicator)

This is one of the key benefits of AES-GCM over AES-CBC: corruption
is detected rather than producing silently wrong plaintext.

Idempotency

CLI retries and network timeouts can cause duplicate requests. Every secret write operation must be idempotent:

POST /secrets/{key} with the same payload and a client-generated idempotency key returns the existing version instead of creating a duplicate
The idempotency key is stored in Redis with a 10-minute TTL
If the server processed the request but the client never received the response, the retry returns the cached result
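The retry behaviour can be sketched with an in-memory dict standing in for Redis. The class name and injectable clock are assumptions; a Redis-backed version would also need an atomic reserve step (for example SET with the NX and EX options) so two concurrent retries cannot both run the write.

```python
import time

class IdempotencyStore:
    def __init__(self, ttl_seconds=600.0, clock=time.monotonic):
        self._ttl = ttl_seconds           # 10 minutes, as described above
        self._clock = clock
        self._results = {}                # idempotency key -> (result, expires_at)

    def run_once(self, key: str, operation):
        """Run operation at most once per key within the TTL window."""
        now = self._clock()
        cached = self._results.get(key)
        if cached is not None and cached[1] > now:
            return cached[0]              # retry: replay the stored result
        result = operation()
        self._results[key] = (result, now + self._ttl)
        return result
```

A retry that arrives inside the window gets the original version back rather than creating a duplicate version row.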

Part 12: Security and Compliance

Authentication Flow

Human User → OAuth/Email+2FA → JWT (1-hour expiry, refresh token)
CLI         → OAuth device flow → Access token + Refresh token (encrypted at rest)
CI/CD       → Scoped API token → No expiry, revocable, env-locked permissions

Each identity type gets the minimum access it needs. A CI/CD token for the staging environment cannot see production. A developer with read on production cannot reveal without explicit grant.

Defense in Depth

The system has multiple independent security layers:

Transport: TLS 1.3 for all connections. HSTS headers. Certificate pinning in the CLI.
Authentication: Multi-factor for humans. Scoped tokens for machines. Short-lived JWTs with refresh rotation.
Authorization: Per-environment RBAC with read/reveal separation. Protected environment approval workflows.
Encryption at rest: Envelope encryption (DEK + KEK). Per-secret isolation. KEK never stored plaintext in application database.
Encryption in cache: KEKs are encrypted before they are cached in Redis. If Redis is compromised, the attacker gets only encrypted blobs.
Audit: Every access logged immutably. Alert on anomalous patterns (bulk reveal, off-hours access, new IP).

Secret Scanning Integration

Run pre-commit hooks with tools like gitleaks (v8.21+) or truffleHog to catch secrets before they reach the repository. The platform can expose a webhook that receives alerts from GitHub’s secret scanning partner program and automatically rotates the leaked credential.

Part 13: Scaling Strategies

Vertical Scaling (Do This First)

At the scale estimated in Part 3, a single PostgreSQL primary with read replicas handles the load. Before adding architectural complexity, exhaust these simpler options:

Add read replicas for dashboard queries and audit log searches
Use connection pooling (PgBouncer) to handle connection spikes from CI/CD bursts
Optimize the hot path: batch-fetch secrets in a single query for inject calls instead of N+1 queries
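
The batch-fetch idea in the last item can be sketched as a single parameterized query. The example uses the standard-library `sqlite3` module as a stand-in for PostgreSQL, and the `secrets` table layout, column names and `fetch_secrets_batch` helper are assumptions made for illustration.

```python
import sqlite3
from typing import Dict, Iterable, Tuple


def fetch_secrets_batch(
    conn: sqlite3.Connection, environment_id: str, keys: Iterable[str]
) -> Dict[str, Tuple[bytes, bytes]]:
    """Fetch ciphertext and wrapped DEK for many keys in one round trip."""
    wanted = list(keys)
    if not wanted:
        return {}
    # One placeholder per key keeps every value parameterized.
    placeholders = ", ".join("?" for _ in wanted)
    rows = conn.execute(
        "SELECT secret_key, ciphertext, wrapped_dek FROM secrets "
        f"WHERE environment_id = ? AND secret_key IN ({placeholders})",
        [environment_id, *wanted],
    ).fetchall()
    return {name: (ciphertext, wrapped_dek) for name, ciphertext, wrapped_dek in rows}


# Usage against a throwaway in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE secrets ("
    "environment_id TEXT, secret_key TEXT, ciphertext BLOB, wrapped_dek BLOB)"
)
conn.executemany(
    "INSERT INTO secrets VALUES (?, ?, ?, ?)",
    [
        ("env-staging", "DB_URL", b"ct1", b"dek1"),
        ("env-staging", "API_KEY", b"ct2", b"dek2"),
        ("env-prod", "DB_URL", b"ct3", b"dek3"),
    ],
)
batch = fetch_secrets_batch(conn, "env-staging", ["DB_URL", "API_KEY", "MISSING"])
```

An inject call that needs 40 secrets then costs one database round trip instead of 40, and missing keys simply do not appear in the result.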

Horizontal Scaling (When You Need It)

Trigger: > 500 req/sec sustained or > 10,000 organizations

Strategy:
  1. Stateless API servers behind a load balancer
  2. PostgreSQL read replicas per region
  3. Redis Cluster for distributed caching
  4. Audit log partitioning by month (drop old partitions for retention)

Database Scaling Path

Stage 1 (< 1,000 orgs): Single primary + 2 read replicas
Stage 2 (< 10,000 orgs): Shard by organization_id (tenant isolation)
Stage 3 (< 100,000 orgs): Per-region deployments with data residency

Sharding by organization_id is natural because secrets never cross organization boundaries. Each shard contains a complete set of data for a set of organizations. Cross-shard queries are never needed for operational endpoints.
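
A shard router for this scheme can be very small. The sketch below assumes hash-based placement; `shard_for_org` is a hypothetical helper, and a stable hash is used because Python's built-in `hash()` for strings varies between processes.

```python
import hashlib


def shard_for_org(organization_id: str, shard_count: int) -> int:
    """Map an organization to a shard index in [0, shard_count)."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    # SHA-256 gives the same answer on every API server and across restarts.
    digest = hashlib.sha256(organization_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


assignments = {org: shard_for_org(org, 4) for org in ("org-1", "org-2", "org-3")}
```

Plain modulo placement moves most organizations when `shard_count` changes. A lookup table that maps each organization to its shard is a common alternative when resharding is expected.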

Caching Strategy

Layer 1: Application in-memory (LRU, 60-second TTL)
  → KEKs, permission lookups, org metadata

Layer 2: Redis (distributed, 5-minute TTL for KEKs)
  → KEKs (encrypted), session data, rate limit counters

Layer 3: PostgreSQL (source of truth)
  → All persistent data

Cache invalidation on permission changes must be immediate. When an admin revokes a developer’s reveal permission, the cached permission must be evicted. Use Redis pub/sub to broadcast invalidation events across application instances.
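
The broadcast pattern can be sketched without a Redis dependency. In the example below, `InvalidationBus` is an in-process stand-in for a Redis pub/sub channel, and `PermissionCache` and `revoke` are hypothetical names. The 60-second TTL is omitted to keep the sketch short.

```python
from typing import Callable, Dict, FrozenSet, List, Set, Tuple

PermissionDb = Dict[Tuple[str, str], Set[str]]


class InvalidationBus:
    """In-process stand-in for a Redis pub/sub channel."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[str], None]] = []

    def subscribe(self, handler: Callable[[str], None]) -> None:
        self._subscribers.append(handler)

    def publish(self, message: str) -> None:
        for handler in list(self._subscribers):
            handler(message)


class PermissionCache:
    """Per-instance cache that evicts entries when an invalidation arrives."""

    def __init__(
        self, bus: InvalidationBus, load: Callable[[str, str], FrozenSet[str]]
    ) -> None:
        self._load = load
        self._entries: Dict[str, FrozenSet[str]] = {}
        bus.subscribe(self._on_invalidate)

    def get(self, user_id: str, environment: str) -> FrozenSet[str]:
        cache_key = f"{user_id}:{environment}"
        if cache_key not in self._entries:
            self._entries[cache_key] = self._load(user_id, environment)
        return self._entries[cache_key]

    def _on_invalidate(self, cache_key: str) -> None:
        self._entries.pop(cache_key, None)


def revoke(db: PermissionDb, bus: InvalidationBus, user_id: str,
           environment: str, action: str) -> None:
    """Update the source of truth first, then tell every instance to evict."""
    db[(user_id, environment)].discard(action)
    bus.publish(f"{user_id}:{environment}")


# Usage: two caches play the role of two application instances.
db: PermissionDb = {("alice", "production"): {"read", "reveal"}}
bus = InvalidationBus()
load = lambda user, env: frozenset(db.get((user, env), set()))
instance_a = PermissionCache(bus, load)
instance_b = PermissionCache(bus, load)
before = instance_a.get("alice", "production"), instance_b.get("alice", "production")
revoke(db, bus, "alice", "production", "reveal")
after = instance_a.get("alice", "production"), instance_b.get("alice", "production")
```

Redis pub/sub delivers each message at most once, so an instance that is disconnected when the event is published will miss it. The short in-memory TTL acts as the backstop for that case.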

Part 14: Cost Analysis

Disclaimer: These are rough estimates using publicly available cloud pricing as of early 2026. Actual costs vary significantly based on reserved instances, enterprise agreements, region, and optimization. Use your cloud provider’s pricing calculator for accurate projections.

Infrastructure Cost (500 Organizations, Year 1)

Compute:
  2× Application servers (4 vCPU, 16GB RAM)    $300/month
  1× Background worker (2 vCPU, 8GB RAM)        $100/month


Database:
  1× PostgreSQL primary (4 vCPU, 32GB RAM)      $400/month
  2× Read replicas                               $600/month
  100 GB storage                                  $12/month

Cache:
  1× Redis cluster (3 nodes, 4GB each)          $250/month

KMS:
  AWS KMS: 1 CMK ($1/month) + API calls          $50/month
  (~150,000 decrypt calls/month at $0.03/10K is about $0.45; $50 leaves ample headroom)

Networking:
  Load balancer + data transfer                  $100/month

Monitoring:
  Logging, metrics, alerting                     $150/month

Total: ~$1,962/month (~$23,500/year)

Cost per Secret

2,250,000 secrets under management
$1,962/month operating cost
Cost per secret: $0.00087/month

Compare: AWS Secrets Manager at $0.40/secret/month
  2,250,000 × $0.40 = $900,000/month

Self-hosted cost advantage: ~460× cheaper at this scale

The cost advantage of self-hosting grows with scale because the major costs (compute, database) increase sub-linearly while managed service pricing is per-secret. The trade-off is operational complexity: you own the uptime, the patching, and the on-call rotation.
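
The arithmetic behind these figures is easy to re-run with your own prices. The snippet below only reproduces the numbers listed above; the dictionary keys are labels chosen for this sketch.

```python
# Monthly line items from the 500-organization estimate above (USD).
line_items = {
    "application servers": 300,
    "background worker": 100,
    "postgres primary": 400,
    "read replicas": 600,
    "storage": 12,
    "redis": 250,
    "kms": 50,
    "networking": 100,
    "monitoring": 150,
}

secrets_under_management = 2_250_000
managed_price_per_secret = 0.40  # AWS Secrets Manager figure quoted above

total_monthly = sum(line_items.values())
cost_per_secret = total_monthly / secrets_under_management
managed_monthly = secrets_under_management * managed_price_per_secret
advantage = managed_monthly / total_monthly

print(f"Total: ${total_monthly:,}/month (${total_monthly * 12:,}/year)")
print(f"Cost per secret: ${cost_per_secret:.5f}/month")
print(f"Managed equivalent: ${managed_monthly:,.0f}/month ({advantage:.0f}x)")
```

The ratio comes out at roughly 459, which the text rounds to about 460×.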

Cost at Larger Scale (5,000 Organizations)

Compute (auto-scaled):       $2,000/month
Database (sharded):          $3,500/month
Cache (Redis Cluster):         $800/month
KMS:                           $200/month
Networking + LB:               $500/month
Monitoring:                    $400/month

Total: ~$7,400/month (~$88,800/year)
Cost per secret (22.5M): $0.00033/month

Part 15: Monitoring and Observability

Key Metrics

Business Metrics:

Total secrets under management (gauge)
Secret operations per minute by type (read, reveal, create, update, delete)
Active organizations and users (daily/monthly)
Sync success/failure rate by provider

System Metrics:

API response time (p50, p95, p99) — target: p99 < 100ms for reads
Decryption latency (p50, p95, p99) — target: p99 < 50ms
KMS call latency and error rate
Redis cache hit ratio — target: > 95%
Database connection pool utilization
Background job queue depth

Security Metrics:

Failed authentication attempts per minute
Bulk reveal operations (> 10 secrets in one call)
Access from new IP addresses
Permission escalation events
Decryption failures (possible tampering indicator)

Alerting Rules

CRITICAL (page on-call):
  - API error rate > 1% for 5 minutes
  - KMS provider unreachable for 2 minutes
  - Decryption failure rate > 0.1%
  - Database primary unreachable

WARNING (Slack notification):
  - API p99 latency > 200ms for 10 minutes
  - Redis cache hit ratio < 80%
  - Audit log ingestion lag > 60 seconds
  - Certificate expiry < 14 days

SECURITY (immediate page):
  - Bulk reveal from new IP
  - > 10 failed auth attempts from same source in 1 minute
  - Service account used outside allowed CIDR range
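
As an illustration of the failed-authentication rule, a sliding-window counter per source is usually enough. `FailedAuthMonitor` is a hypothetical name, and timestamps are passed in explicitly so the logic can be tested without a real clock.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class FailedAuthMonitor:
    """Flags a source with more than `threshold` failures inside the window."""

    def __init__(self, threshold: int = 10, window_seconds: float = 60.0) -> None:
        self._threshold = threshold
        self._window = window_seconds
        self._failures: Dict[str, Deque[float]] = defaultdict(deque)

    def record_failure(self, source: str, now: float) -> bool:
        """Record one failure; return True if the source should trigger an alert."""
        events = self._failures[source]
        events.append(now)
        # Drop failures that have aged out of the window.
        while events and now - events[0] >= self._window:
            events.popleft()
        return len(events) > self._threshold


# Usage: eleven failures from one source within eleven seconds.
monitor = FailedAuthMonitor()
alerts = [monitor.record_failure("203.0.113.7", float(t)) for t in range(11)]
other_source = monitor.record_failure("198.51.100.2", 5.0)
after_quiet_period = monitor.record_failure("203.0.113.7", 120.0)
```

The first ten failures stay below the threshold and the eleventh trips the alert. In a multi-instance deployment the counters would need to live in a shared store such as Redis rather than in process memory.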

Distributed Tracing

Trace every secret operation end-to-end:

Secret Reveal (10.5ms on the critical path)
├── Auth middleware (3ms)
│   └── JWT validation + permission check (cached)
├── Secret Service (7.5ms)
│   ├── DB query: fetch secret + version (5ms)
│   ├── KEK resolution (2ms, cache hit)
│   ├── DEK unwrap: AES-GCM decrypt (0.3ms)
│   └── Secret decrypt: AES-GCM decrypt (0.2ms)
└── Audit log write (2ms, async, off the critical path)

The async audit log write is important. Writing to the audit log should never add latency to the secret retrieval path. Use a buffered queue (Redis list or in-memory channel) that flushes to PostgreSQL in batches.
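
A buffered writer along these lines might look like the sketch below. It uses an in-memory queue and a background thread. `AuditBuffer` and the `flush_batch` callback are hypothetical names, and the callback stands in for a batched INSERT into PostgreSQL.

```python
import queue
import threading
from typing import Callable, Dict, List

AuditEvent = Dict[str, object]


class AuditBuffer:
    """Queues audit events on the request path and flushes them in batches."""

    def __init__(
        self,
        flush_batch: Callable[[List[AuditEvent]], None],
        batch_size: int = 100,
        flush_interval: float = 1.0,
    ) -> None:
        self._flush_batch = flush_batch
        self._batch_size = batch_size
        self._flush_interval = flush_interval
        self._queue: "queue.Queue[AuditEvent]" = queue.Queue()
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def record(self, event: AuditEvent) -> None:
        """Called on the request path: enqueue and return immediately."""
        self._queue.put_nowait(event)

    def _flush_all(self) -> None:
        while True:
            batch: List[AuditEvent] = []
            while len(batch) < self._batch_size:
                try:
                    batch.append(self._queue.get_nowait())
                except queue.Empty:
                    break
            if not batch:
                return
            self._flush_batch(batch)

    def _run(self) -> None:
        while not self._stop.is_set():
            self._stop.wait(self._flush_interval)
            self._flush_all()

    def close(self) -> None:
        """Stop the worker and flush whatever is still queued."""
        self._stop.set()
        self._worker.join()
        self._flush_all()


# Usage: record five events, then shut down cleanly.
batches: List[List[AuditEvent]] = []
buffer = AuditBuffer(batches.append, batch_size=2, flush_interval=0.01)
for seq in range(5):
    buffer.record({"seq": seq, "action": "secret.reveal"})
buffer.close()
```

The trade-off is durability: events held only in process memory are lost if the process crashes before a flush. That is one reason to consider the Redis list option when the audit trail has compliance weight.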

Part 16: Trade-offs Discussed

1. Relational DB vs. Dedicated Secrets Backend: PostgreSQL gives relational query power for the complex access control model, versioning and audit joins. A dedicated backend like Consul or etcd would give better clustering semantics. At thousands of secrets, PostgreSQL is the right choice. At millions with high-throughput decryption, you would need to revisit.

2. Per-Secret DEKs vs. Shared Encryption Key: Per-secret DEKs increase storage (roughly 120 bytes of wrapped_dek per version). For 10,000 secrets with 5 versions each, that is about 6 MB of additional storage. Negligible. The security benefit of key isolation is worth orders of magnitude more storage.

3. Lazy Key Rotation vs. Immediate Re-encryption: Lazy rotation means old secrets stay encrypted with the previous KEK version until they are next updated. The trade-off: if the old KEK is compromised, those secrets remain vulnerable until re-encrypted. Immediate re-encryption eliminates this window but causes a latency spike. For most teams, lazy rotation is the right default with an option to trigger bulk re-encryption when needed.

4. Local KMS vs. Cloud KMS: Local KMS has a circular dependency: the system that manages secrets relies on a secret (the application key) stored as an environment variable. Cloud KMS removes this by pushing the trust to HSMs. But cloud KMS adds latency (API call per cache miss), cost ($1/month per key plus $0.03 per 10,000 API calls on AWS KMS) and vendor dependency. Support both and let operators choose.

5. Monolith vs. Microservices: A monolithic backend is simpler to deploy, debug and reason about. Splitting into microservices adds network hops, distributed tracing needs and deployment complexity. Start monolithic. Extract services only when you have a concrete scaling or isolation reason (e.g., audit log ingestion overwhelming the main database).

6. Strong Consistency vs. Eventual Consistency: Secret reads must be strongly consistent, because a revoked secret must be unreadable immediately. Audit logs can be eventually consistent (a few seconds of lag is acceptable). Permission changes need near-immediate consistency, achieved through cache invalidation via pub/sub.

Part 17: Key Takeaways

Envelope encryption is non-negotiable. Per-secret DEKs limit the blast radius of any key compromise. The KEK protects the DEKs. The KMS protects the KEK. Each layer reduces risk independently.
The secure path must be the easy path. Context defaults, runtime injection, share links, leak detection. If the CLI requires more steps than copying a .env file, developers will copy the .env file.
Separate read from reveal. Most operations do not need plaintext. This single permission split reduces attack surface, audit noise and KMS costs simultaneously.
Cache encryption keys, not secrets. Caching decrypted secrets is dangerous (cache compromise = data breach). Caching KEKs with encryption and short TTLs balances performance with security.
Design for key rotation from day one. Every encrypted blob records which key version encrypted it. Lazy rotation means zero downtime. Bulk re-encryption is available when needed.
Audit everything, query it later. Append-only audit logs are cheap to write and invaluable during incident response. Partition by month for retention management.
PostgreSQL is enough. Secrets management is metadata-heavy and low-throughput. Do not over-engineer the storage layer. A single PostgreSQL instance with read replicas handles most deployments comfortably.

Part 18: Homework Assignment

If you want to go deeper, here are extensions worth designing:

Shamir’s Secret Sharing for the root key. Instead of a single environment variable holding the master key, split it into N shares where any K-of-N shares can reconstruct the key (e.g., 3-of-5). Design the unseal ceremony flow: how does the system start if it needs 3 different people to provide their shares?
Cross-region replication with data residency. Some organizations require secrets to stay within specific geographic boundaries (EU data in EU, US data in US). Design a multi-region architecture where each region has its own KMS but organizations can span regions.
Automated secret rotation. Design a system that automatically rotates database passwords: generate new credential, update the target database, verify the new credential works, then update the secret in the platform. Handle the failure case where the database accepts the new password but the platform fails to store it.
Break-glass access. Design an emergency access mechanism that allows a pre-authorized admin to bypass normal RBAC and access any secret, with mandatory audit trail and post-incident review workflow. How do you prevent abuse while enabling legitimate emergency response?
End-to-end encryption. Redesign the system so the server never sees plaintext. The CLI encrypts before sending, the server stores blobs, and only the CLI can decrypt. How does this change the sharing, search and sync features?
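
As a starting point for the first extension, the K-of-N split itself fits in a few lines. This is an illustrative sketch of Shamir's scheme over a prime field, not a vetted implementation: `split` and `combine` are names chosen here, the prime is an assumption picked to fit a 256-bit key, and a production system should rely on an audited library.

```python
import secrets
from typing import List, Sequence, Tuple

PRIME = 2**521 - 1  # Mersenne prime, comfortably larger than a 256-bit key

Share = Tuple[int, int]


def split(secret: int, k: int, n: int) -> List[Share]:
    """Split `secret` into n shares; any k of them reconstruct it."""
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    if not 0 <= secret < PRIME:
        raise ValueError("secret out of range")
    # Random polynomial of degree k-1 whose constant term is the secret.
    coefficients = [secret] + [secrets.randbelow(PRIME) for _ in range(k - 1)]

    def evaluate(x: int) -> int:
        result = 0
        for coefficient in reversed(coefficients):  # Horner's rule
            result = (result * x + coefficient) % PRIME
        return result

    return [(x, evaluate(x)) for x in range(1, n + 1)]


def combine(shares: Sequence[Share]) -> int:
    """Recover the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        numerator, denominator = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                numerator = (numerator * -xj) % PRIME
                denominator = (denominator * (xi - xj)) % PRIME
        # pow(d, -1, p) is the modular inverse (Python 3.8+).
        secret = (secret + yi * numerator * pow(denominator, -1, PRIME)) % PRIME
    return secret


# Usage: a 3-of-5 split of a random 256-bit root key.
root_key = secrets.randbits(256)
shares = split(root_key, k=3, n=5)
recovered = combine([shares[0], shares[2], shares[4]])
```

The unseal ceremony in the exercise is then about collecting three shares from three people without any single machine or person holding them longer than necessary.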

28.65 million secrets leaked on GitHub last year because the insecure path was easier than the secure one. The architecture does not fix that by making the insecure path harder. It fixes it by making the secure path effortless.

That is the entire system design in one sentence.

If this article helped you think about secrets management differently, consider sharing it with your team. The best time to design a secrets platform is before the next breach. The second best time is now.

Disclosure: This post includes affiliate and partnership links.