How to design a system that encrypts every secret with its own key, rotates credentials without downtime and gives developers a CLI they actually use.

GitGuardian's State of Secrets Sprawl 2026 report found 28.65 million new hardcoded secrets in public GitHub commits during 2025. That is a 34% year-over-year increase. 64% of secrets leaked in 2022 were still active four years later. Nobody revoked them. Nobody rotated them. They just sat there.
IBM's Cost of a Data Breach Report 2024 puts the average breach at $4.88 million. Stolen credentials remain the most common initial attack vector. Uber's 2022 breach started with a hardcoded secret in a PowerShell script. CircleCI's 2023 incident required every customer to rotate every secret stored on the platform.
Most teams eventually land on one of two options: a managed service like AWS Secrets Manager or a self-hosted tool like HashiCorp Vault. But what if you needed to build your own? Maybe you need multi-tenant isolation. Maybe you need per-environment access control that existing tools do not offer. Maybe you need a platform that fits your team's workflow instead of forcing your team to fit the platform.
This article walks through the system design of a secrets management platform from scratch. Every architectural decision, every trade-off, every component explained.
Table of Contents
- Part 1: The Real Problem
- Part 2: Requirements and Constraints
- Part 3: Back-of-the-Envelope Estimation
- Part 4: High-Level Architecture
- Part 5: Encryption Deep Dive (DEK + KEK)
- Part 6: Database Design
- Part 7: Access Control Model
- Part 8: API Design
- Part 9: CLI Architecture
- Part 10: Secret Lifecycle (Versioning, Rotation, Sharing)
- Part 11: Failure Handling and Recovery
- Part 12: Security and Compliance
- Part 13: Scaling Strategies
- Part 14: Cost Analysis
- Part 15: Monitoring and Observability
- Part 16: Trade-offs Discussed
- Part 17: Key Takeaways
- Part 18: Homework Assignment
Part 1: The Real Problem
Most teams go through the same progression with secrets. First, someone hardcodes a database password. A senior engineer catches it in code review and says "use environment variables." The team moves to .env files. Everyone feels secure.
Except .env files are plain text on disk. They end up in Docker image layers when a COPY instruction picks them up and no .dockerignore entry excludes them. They get committed to repositories when someone typos the .gitignore. They get pasted into Slack when a new developer needs access. They get captured in crash dumps and debug logs.
The real problem is not that developers are careless. The problem is that the insecure path (copy this file, paste this string) requires fewer steps than the secure path (authenticate, fetch from vault, inject at runtime). Any system you design has to invert that equation.
The tools that exist are good but imperfect. HashiCorp Vault is powerful and complex: it requires dedicated operational knowledge most small teams do not have. AWS Secrets Manager is simple but vendor-locked at $0.40 per secret per month. Doppler and Infisical (25,000+ GitHub stars) are strong but may not fit every multi-tenant or compliance model.
If you were designing a secrets management platform from zero, what would the architecture look like?
Part 2: Requirements and Constraints
Functional Requirements
- Store secrets with typed values (text, JSON, YAML, boolean, integer, password) organized by organization, project and environment
- Encrypt at rest using envelope encryption where each secret gets its own unique data encryption key
- Version every change with full history, diff capability and rollback
- Control access at the environment level with separate permissions for viewing metadata vs. decrypting values
- Authenticate via OAuth, email/password with 2FA, scoped API tokens for CI/CD, and CLI tokens
- Audit every access with immutable append-only logs
- Sync secrets to third-party CI/CD providers (GitHub Actions, GitLab CI) on change
- Inject at runtime through a CLI so .env files never need to touch disk
Non-Functional Requirements
- Latency: Secret retrieval under 50ms p99 (dominated by decryption and key resolution time)
- Availability: 99.9% uptime minimum. If the secrets platform is down, no application can cold-start
- Consistency: A revoked secret must be unreadable within seconds, not minutes
- Multi-tenancy: Complete data isolation between organizations at the database level
- Key rotation: Rotate encryption keys without re-encrypting every secret immediately
Constraints
- PostgreSQL as the primary datastore (mature, relational, well understood)
- Must support both self-hosted and cloud-managed deployment
- CLI must work offline for local config, online for secret operations
Core Challenges
- Encryption at scale: Per-secret DEKs mean every read and write involves cryptographic operations
- Key management chicken-and-egg: The system that manages secrets needs its own secret (the root key) to operate
- Multi-tenant isolation: One organization must never see another's data, even at the database level
- CLI developer experience: If the CLI is harder than copying a .env file, nobody uses it
- Zero-downtime key rotation: Rotating encryption keys without re-encrypting thousands of secrets simultaneously
Part 3: Back-of-the-Envelope Estimation
Before designing the architecture, let us estimate the scale this system needs to handle.
Traffic Estimation
Target: A mid-scale SaaS serving 500 organizations with 50 projects each, averaging 3 environments per project.
Organizations:              500
Projects per org:           50
Environments per project:   3
Total environments:         500 × 50 × 3 = 75,000
Secrets per environment:    30 (average)
Total secrets:              75,000 × 30 = 2,250,000
Secret reads per day:
  - CI/CD pipeline runs:    ~200,000 inject calls/day
  - Developer CLI pulls:    ~50,000 pulls/day
  - Dashboard views:        ~30,000 views/day
  Total reads:              ~280,000/day → ~3.2 reads/sec (average)
  Peak reads (deploy hour): ~10× average → 32 reads/sec
Secret writes per day:
  - Manual updates:         ~5,000/day
  - Sync pushes:            ~2,000/day
  Total writes:             ~7,000/day → ~0.08 writes/sec
This is not Twitter-scale traffic. Secrets management is read-heavy but low-throughput compared to consumer apps. The bottleneck is not requests per second. The bottleneck is cryptographic operations per request and KMS call latency.
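The traffic figures above can be reproduced with a few lines of arithmetic. This is only a sketch of the estimate; the variable names are illustrative and the 10× peak multiplier is the article's own assumption.

```python
SECONDS_PER_DAY = 24 * 60 * 60

orgs, projects_per_org, envs_per_project = 500, 50, 3
secrets_per_env = 30

total_envs = orgs * projects_per_org * envs_per_project
total_secrets = total_envs * secrets_per_env

reads_per_day = 200_000 + 50_000 + 30_000   # CI/CD + CLI pulls + dashboard views
writes_per_day = 5_000 + 2_000              # manual updates + sync pushes

avg_reads_per_sec = reads_per_day / SECONDS_PER_DAY
peak_reads_per_sec = avg_reads_per_sec * 10  # assumed deploy-hour burst factor
avg_writes_per_sec = writes_per_day / SECONDS_PER_DAY

print(f"environments:   {total_envs:,}")
print(f"secrets:        {total_secrets:,}")
print(f"avg reads/sec:  {avg_reads_per_sec:.1f}")
print(f"peak reads/sec: {peak_reads_per_sec:.0f}")
print(f"avg writes/sec: {avg_writes_per_sec:.2f}")
```

Changing any input (for example, 100 secrets per environment) shows quickly whether the single-node assumption later in the article still holds.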
Storage Estimation
Secret version row:
  - UUID (16 bytes) + secret_id (16 bytes) + type (10 bytes)
  - ciphertext (~200 bytes avg, base64-encoded encrypted value)
  - wrapped_dek (~120 bytes, base64-encoded wrapped key)
  - version int (4 bytes) + kms metadata (20 bytes)
  - timestamps (16 bytes)
  Total per version: ~400 bytes
Total secret versions (assume avg 5 versions per secret):
  2,250,000 × 5 = 11,250,000 versions
  11,250,000 × 400 bytes = ~4.5 GB
Audit log row: ~300 bytes
  280,000 reads/day × 365 days × 300 bytes = ~30 GB/year
Total storage (Year 1): ~35 GB
Total storage (3-year retention): ~100 GB
PostgreSQL handles this comfortably on a single node. No sharding required at this scale.
Compute Estimation
Per secret read (inject call for 30 secrets):
  - Auth token validation: ~1ms
  - Permission check: ~2ms (cached)
  - KEK resolution: ~1ms (cache hit) or ~50ms (cache miss + KMS call)
  - 30× DEK unwrap + decrypt: ~30 × 0.5ms = ~15ms
  - DB query (batch): ~5ms
  Total: ~25ms (cache hit) or ~75ms (cache miss)
CPU per decrypt operation: ~0.05ms on modern hardware
Peak: 32 req/sec × 30 secrets × 0.05ms = 48ms CPU time/sec
A single 4-core server handles peak load at under 2% CPU utilization. The platform is I/O-bound (database, KMS calls), not CPU-bound. Note that the cache-miss path (~75ms) overshoots the 50ms p99 target from Part 2, so that target only holds while the KEK cache hit rate stays high.
Part 4: High-Level Architecture
The platform needs three client surfaces: a web dashboard for humans, a CLI for developers, and an API for CI/CD service accounts.
┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│    Web UI    │   │     CLI      │   │  CI/CD API   │
│ (dashboard)  │   │   (binary)   │   │   (tokens)   │
└──────┬───────┘   └──────┬───────┘   └──────┬───────┘
       │                  │                  │
       └──────────────────┼──────────────────┘
                          │
             ┌────────────▼────────────┐
             │       Backend API       │
             │   (auth, encryption,    │
             │      RBAC, audit)       │
             └────────────┬────────────┘
                          │
       ┌──────────────────┼──────────────────┐
       │                  │                  │
┌──────▼───────┐   ┌──────▼───────┐   ┌──────▼───────┐
│  PostgreSQL  │   │    Redis     │   │  KMS Layer   │
│  (secrets,   │   │   (cache,    │   │  (local or   │
│   versions,  │   │    queues,   │   │  cloud HSM)  │
│  audit logs) │   │    locks)    │   │              │
└──────────────┘   └──────────────┘   └──────────────┘
Why PostgreSQL and not a dedicated secrets backend? Secrets platforms are metadata-heavy: versions, audit logs, permissions, organizations, projects, environments. That is relational data. PostgreSQL handles it well. The encryption barrier means PostgreSQL never sees plaintext. It stores ciphertext. The security boundary lives in the application layer, not the database.
Why a standalone CLI binary? Developers need to install the CLI without runtime dependencies. A language like Go compiles to a single static binary, handles cross-platform builds and has standard library support for AES-GCM encryption. The CLI also needs to encrypt its own stored tokens at rest, which means running its own encryption independent of the server.
Why Redis? Two jobs. First, caching resolved encryption keys so the system does not call the KMS provider on every decryption. Second, distributed locks to prevent thundering herd problems when a cached key expires and 50 concurrent requests all try to resolve it at once.
Services Decomposition
At moderate scale, a monolithic backend serves well. As the platform grows, these are the natural service boundaries:
┌──────────────────────────────────────────────────┐
│                   API Gateway                    │
│          (rate limiting, routing, auth)          │
└────────┬──────────┬──────────┬──────────┬────────┘
         │          │          │          │
    ┌────▼───┐ ┌────▼───┐ ┌────▼───┐ ┌────▼────┐
    │  Auth  │ │ Secret │ │ Audit  │ │  Sync   │
    │Service │ │Service │ │Service │ │ Service │
    └────────┘ └────────┘ └────────┘ └─────────┘
- Auth Service: Token issuance, OAuth flows, 2FA verification, permission resolution
- Secret Service: CRUD operations, encryption/decryption, version management
- Audit Service: Append-only log ingestion, search, compliance reporting
- Sync Service: Background jobs pushing secrets to GitHub Actions, GitLab CI, etc.
The Secret Service is the only component that touches encryption keys. This minimizes the blast radius: a vulnerability in the Sync Service cannot access the KMS layer.
Part 5: Encryption Deep Dive (DEK + KEK)
This is the core of the system. Get this wrong and nothing else matters.
The Envelope Encryption Pattern
Every secret is encrypted with its own unique Data Encryption Key (DEK). The DEK is then encrypted by a Key Encryption Key (KEK). The database stores two blobs per secret version: the encrypted secret and the encrypted DEK.
Why not encrypt everything with one master key? Because a single key compromise exposes every secret in the system. With per-secret DEKs, a compromised DEK exposes one secret. The KEK is the crown jewel, but it never touches the database in plaintext.
The Encryption Flow
Plaintext: "postgres://user:pass@db.prod:5432/app"
                │
                ▼
┌───────────────────────────────┐
│ 1. Generate random DEK        │
│    32 bytes from CSPRNG       │
└───────────────┬───────────────┘
                │
                ▼
┌───────────────────────────────┐
│ 2. Encrypt plaintext with DEK │
│    Algorithm: AES-256-GCM     │
│    IV: 12 random bytes        │
│    Tag: 16 bytes (integrity)  │
└───────────────┬───────────────┘
                │
   ciphertext = base64(IV || Tag || Ciphertext)
                │
                ▼
┌───────────────────────────────┐
│ 3. Encrypt DEK with KEK       │
│    AES-256-GCM (separate IV)  │
│    IV2: 12 random bytes       │
│    Tag2: 16 bytes             │
└───────────────┬───────────────┘
                │
   wrapped_dek = base64(IV2 || Tag2 || WrappedDEK)
                │
                ▼
┌───────────────────────────────┐
│ 4. Store in database          │
│    • ciphertext (secret data) │
│    • wrapped_dek (encrypted   │
│      DEK)                     │
│    • kms_key_version          │
│      (which KEK version)      │
└───────────────────────────────┘
Implementation
function encrypt(plaintext, kmsKey):
    // Step 1: Unique DEK per secret
    dek = crypto.randomBytes(32)
    // Step 2: Encrypt plaintext with DEK
    iv1 = crypto.randomBytes(12)
    (ciphertext, tag1) = AES_256_GCM_encrypt(plaintext, dek, iv1)
    // Step 3: Resolve KEK from KMS, then wrap the DEK
    kek = kmsLayer.resolveKey(kmsKey)
    iv2 = crypto.randomBytes(12)
    (wrappedDek, tag2) = AES_256_GCM_encrypt(dek, kek, iv2)
    // Step 4: Pack for storage
    return {
        ciphertext: base64(iv1 + tag1 + ciphertext),
        wrapped_dek: base64(iv2 + tag2 + wrappedDek),
        key_version: kmsKey.materialVersion
    }
Pseudocode. The actual implementation depends on your language's crypto library (OpenSSL for PHP/Python, crypto/aes for Go, Web Crypto for Node). Verify IV sizes and tag handling against your specific library's API.
Three things to notice:
- Every secret gets a fresh DEK from a cryptographically secure random number generator. No key reuse.
- The IV is packed with the ciphertext, not stored separately. One column, one blob, all the data needed for decryption.
- GCM provides authenticated encryption. The auth tag means tampering with ciphertext in the database causes decryption to fail rather than producing corrupted output.
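The single-blob layout described above (IV, then tag, then ciphertext, all base64-encoded into one column) can be sketched with the standard library alone. The helper names `pack_blob` and `unpack_blob` are hypothetical, and the encryption step itself is left to whatever AES-GCM library you use.

```python
import base64

IV_LEN = 12   # GCM nonce size used in the flow above
TAG_LEN = 16  # GCM authentication tag size

def pack_blob(iv: bytes, tag: bytes, ciphertext: bytes) -> str:
    """Concatenate IV + tag + ciphertext and base64-encode for one TEXT column."""
    if len(iv) != IV_LEN or len(tag) != TAG_LEN:
        raise ValueError("unexpected IV or tag length")
    return base64.b64encode(iv + tag + ciphertext).decode("ascii")

def unpack_blob(blob: str):
    """Split a stored blob back into (iv, tag, ciphertext) by fixed offsets."""
    raw = base64.b64decode(blob, validate=True)
    if len(raw) < IV_LEN + TAG_LEN:
        raise ValueError("blob too short to contain IV and tag")
    iv = raw[:IV_LEN]
    tag = raw[IV_LEN:IV_LEN + TAG_LEN]
    ciphertext = raw[IV_LEN + TAG_LEN:]
    return iv, tag, ciphertext
```

Because the offsets are fixed, the reader needs no extra metadata to find the IV and tag. Some libraries append the tag to the ciphertext instead of returning it separately, so check the order your library produces before settling on a layout.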
KEK Resolution and Caching
The KEK must be resolved from a KMS layer, which is the most latency-sensitive operation in the system. You need a caching strategy:
- Check the in-memory cache for the KEK keyed by kms_key_id:version
- If cache miss, acquire a distributed lock (Redis) to prevent thundering herd
- After acquiring the lock, double-check the cache (another request may have populated it)
- If still missing, call the KMS provider (local decryption or cloud API call)
- Store the result in cache with a short TTL (5 minutes is reasonable)
- Encrypt the KEK before storing it in cache. If Redis is compromised, the attacker gets encrypted blobs
The double-check locking pattern is critical. Without it, a cache expiry under high concurrency causes N simultaneous KMS calls, which is both slow and expensive if you are using a cloud KMS billed per API call.
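A minimal in-process sketch of the check, lock, re-check flow follows. It makes two simplifying assumptions that are not in the design above: a `threading.Lock` stands in for the Redis distributed lock, and the cached KEK is held unencrypted in a dict. The class name `KekCache` and the `fetch_from_kms` callback are hypothetical.

```python
import threading
import time

class KekCache:
    """Double-checked locking around a slow KMS lookup."""

    def __init__(self, fetch_from_kms, ttl_seconds=300):
        self._fetch = fetch_from_kms
        self._ttl = ttl_seconds
        self._entries = {}              # "key_id:version" -> (kek, expires_at)
        self._lock = threading.Lock()   # stand-in for the distributed lock

    def _fresh(self, cache_key):
        entry = self._entries.get(cache_key)
        if entry and entry[1] > time.monotonic():
            return entry[0]
        return None

    def resolve(self, kms_key_id, version):
        cache_key = f"{kms_key_id}:{version}"
        kek = self._fresh(cache_key)        # first check, no lock held
        if kek is not None:
            return kek
        with self._lock:
            kek = self._fresh(cache_key)    # second check, under the lock
            if kek is not None:
                return kek
            kek = self._fetch(kms_key_id, version)  # only one caller gets here
            self._entries[cache_key] = (kek, time.monotonic() + self._ttl)
            return kek
```

With 50 concurrent callers and a cold cache, only the first caller reaches the KMS; the rest block on the lock and then hit the cache on the second check. A single global lock keeps the sketch short; a per-key lock would avoid serializing unrelated keys.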
Supported KMS Providers
Design the KMS layer as an abstraction with swappable backends:
Local KMS: The KEK is stored in the database encrypted by the application's master key (an environment variable). This is a chicken-and-egg problem: the system that manages secrets has its own secret stored as an environment variable. For self-hosted deployments where the operator controls the server, this is acceptable.
Cloud KMS (AWS KMS, GCP Cloud KMS, Azure Key Vault): The KEK is wrapped by the cloud provider's customer master key. The application calls the cloud API to unwrap it at runtime. The master key material never leaves the provider's hardware security module; only the unwrapped KEK is returned to the application. This removes the chicken-and-egg problem by pushing the trust boundary to the cloud provider's HSMs.
Key Rotation Without Downtime
When you rotate a KEK, new secrets use the new key version. Old secrets continue working because each stored secret version records which KEK version encrypted its DEK.
This is lazy rotation: old secrets are not re-encrypted immediately. They continue decrypting fine with the old KEK version (which remains in the key table but marked as retired). When a secret is next updated, the new version uses the current KEK. No bulk migration. No downtime. No "please wait while we re-encrypt 50,000 secrets."
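The bookkeeping behind lazy rotation is small: writes always use the active KEK version, and reads look up whichever version the row recorded. The sketch below models only that lookup, with a hypothetical `KeyRing` class and raw bytes standing in for real key material.

```python
from dataclasses import dataclass, field

@dataclass
class KeyRing:
    """Tracks KEK versions; retired versions stay resolvable for reads."""
    versions: dict = field(default_factory=dict)   # version -> key material
    retired: set = field(default_factory=set)
    active_version: int = 0

    def rotate(self, new_material: bytes) -> int:
        """Retire the current version (if any) and activate a new one."""
        if self.active_version:
            self.retired.add(self.active_version)
        self.active_version += 1
        self.versions[self.active_version] = new_material
        return self.active_version

    def key_for_write(self):
        """New secret versions are always wrapped with the active KEK."""
        return self.active_version, self.versions[self.active_version]

    def key_for_read(self, recorded_version: int) -> bytes:
        """Old rows decrypt with the version stored in kms_key_version."""
        return self.versions[recorded_version]

ring = KeyRing()
ring.rotate(b"A" * 32)
stored_version, _ = ring.key_for_write()   # a row written now records version 1
ring.rotate(b"B" * 32)                     # rotation: nothing is re-encrypted
```

After the second `rotate`, the row that recorded version 1 still resolves its original key, while any new write picks up version 2.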
Part 6: Database Design
Core Schema
The data model follows a hierarchical multi-tenant structure:
Organization
└── Project
    └── Environment (dev, staging, production)
        └── Secret
            └── SecretVersion (encrypted data + metadata)
The critical tables:
-- Secrets are key-value pairs scoped to an environment
CREATE TABLE secrets (
    id          UUID PRIMARY KEY,
    env_id      UUID REFERENCES environments(id) ON DELETE CASCADE,
    key         VARCHAR(255),
    description TEXT,
    sort_order  SMALLINT DEFAULT 1,
    is_deleted  BOOLEAN DEFAULT false,
    created_at  TIMESTAMP,
    updated_at  TIMESTAMP,
    UNIQUE(env_id, key)
);
-- KMS key material with versioning for rotation
-- (created before secret_versions, which references it)
CREATE TABLE kms_keys (
    id               UUID PRIMARY KEY,
    organization_id  UUID REFERENCES organizations(id),
    provider         VARCHAR(50),                   -- 'local' or 'aws'
    algorithm        VARCHAR(50) DEFAULT 'aes-256-gcm',
    material_version INTEGER DEFAULT 1,
    wrapped_material TEXT,                          -- encrypted key material
    activated_at     TIMESTAMP,
    retired_at       TIMESTAMP
);
-- Every mutation creates a new version (append-only)
CREATE TABLE secret_versions (
    id              UUID PRIMARY KEY,
    secret_id       UUID REFERENCES secrets(id) ON DELETE CASCADE,
    data_type       VARCHAR(50) DEFAULT 'text',
    ciphertext      TEXT,                           -- base64(IV + Tag + EncryptedData)
    wrapped_dek     TEXT,                           -- base64(IV + Tag + WrappedDEK)
    version         INTEGER DEFAULT 1,
    kms_key_id      UUID REFERENCES kms_keys(id),
    kms_key_version INTEGER DEFAULT 1,
    created_at      TIMESTAMP,
    UNIQUE(secret_id, version)
);
-- Immutable audit log
CREATE TABLE audit_logs (
    id            UUID PRIMARY KEY,
    org_id        UUID REFERENCES organizations(id),
    actor_id      UUID,
    actor_type    VARCHAR(20),                      -- 'user', 'service_account', 'system'
    action        VARCHAR(50),                      -- 'secret.read', 'secret.reveal', 'secret.create'
    resource_type VARCHAR(50),
    resource_id   UUID,
    metadata      JSONB,                            -- request IP, user agent, etc.
    created_at    TIMESTAMP
);
Design Decisions
UUIDs over auto-increment. Randomly generated (version 4) UUIDs make enumeration attacks impractical (an attacker cannot guess the next secret ID). They also allow distributed ID generation without sequence coordination if you ever shard.
Soft deletes on secrets. When someone accidentally deletes DB_PASSWORD at 3 AM, you want a restore command, not a post-mortem. Soft-deleted secrets are invisible to normal queries but recoverable through a restore endpoint. A separate permanent-delete operation exists for compliance where data must be fully purged.
Append-only versioning. Every secret change creates a new secret_version row. The current value is the highest version number. This gives you a full audit trail, instant rollback by pointing to a previous version, and the ability to diff between versions during an incident. The secret row is a container. The versions hold the encrypted data.
Two separate columns for ciphertext and wrapped DEK. You could combine them into one blob, but separating them makes key rotation operations cleaner. When re-encrypting with a new KEK, you only touch wrapped_dek. The ciphertext column (encrypted by the DEK, not the KEK) does not change.
Separate audit log table. Audit logs are append-only and never updated. They grow fast and are queried differently from operational data (time-range scans, not point lookups). Keeping them in a separate table allows independent indexing, partitioning by month, and archival to cold storage without touching the secrets tables.
Part 7: Access Control Model
The Permission Hierarchy
Access control operates at three levels, each narrowing the scope:
Organization (admin, billing, team management)
└── Project (view, manage environments)
    └── Environment (read, reveal, create, update, delete, export)
The critical design decision is separating read from reveal. A developer with read permission can see that DB_PASSWORD exists, view its type, version number and last-updated timestamp. A developer with reveal permission can decrypt and view the actual value.
Why separate them? Because most operations do not need the plaintext. Listing secrets for a diff, checking version history, running audit reports: none of these require decryption. Separating the permissions reduces the number of users who trigger decryption, which shrinks both the attack surface and the audit noise.
Identity Sources
Different consumers authenticate differently. Design for multiple auth methods:
- Human users via OAuth (Google, GitHub) or email/password with 2FA
- CI/CD pipelines via scoped service account tokens that grant access to specific projects and environments
- CLI sessions via short-lived tokens with refresh capability, stored encrypted on the developer's machine
A service account token for a deployment pipeline might have read, reveal and export on production but no create, update or delete. If the token leaks, the attacker can read secrets from one environment but cannot modify them or access other environments.
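The read/reveal split and environment-scoped tokens reduce to a small lookup. The following is one possible shape, not the platform's actual model: the `Grant` type, the `is_allowed` function and the example grants are all hypothetical.

```python
from dataclasses import dataclass

# Environment-level actions listed in the permission hierarchy above.
ENV_ACTIONS = {"read", "reveal", "create", "update", "delete", "export"}

@dataclass(frozen=True)
class Grant:
    project: str
    environment: str
    actions: frozenset

def is_allowed(grants, project, environment, action):
    """True only if some grant covers this exact project, environment and action."""
    if action not in ENV_ACTIONS:
        raise ValueError(f"unknown action: {action}")
    return any(
        g.project == project and g.environment == environment and action in g.actions
        for g in grants
    )

# A deploy pipeline token: can fetch values from production, cannot change them.
deploy_token = [Grant("api", "production", frozenset({"read", "reveal", "export"}))]
# A developer who may list production secrets but not decrypt them.
developer = [Grant("api", "production", frozenset({"read"}))]
```

Because `reveal` is a separate action, the decryption path can call `is_allowed(..., "reveal")` on its own, and every `True` result there is a natural point to write an audit log entry.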
Protected Environments
Production environments can be marked as protected. Any secret change in a protected environment requires an approval workflow: a second team member reviews and approves the change before it takes effect. This prevents a single compromised account from silently modifying production credentials.
Part 8: API Design
Core Endpoints
GET    /secrets                   → List secrets (metadata only)
GET    /secrets/{key}             → Get single secret (with optional reveal)
POST   /secrets/{key}             → Create or update a secret
DELETE /secrets/{key}             → Soft delete
DELETE /secrets/{key}/permanent   → Irreversible purge
GET    /secrets/{key}/history     → Version history
POST   /secrets/{key}/rollback    → Restore a previous version
POST   /secrets/{key}/restore     → Undelete a soft-deleted secret
Sync Endpoints
POST /sync/push   → Upload local .env/json/yaml to server
GET  /sync/pull   → Download secrets as .env/json/yaml
POST /sync/diff   → Compare local file against remote state
The push endpoint parses the uploaded file, diffs against existing secrets and returns a summary: { created: 3, updated: 1, unchanged: 12, errors: 0 }. It does not blindly overwrite. The developer sees what will change before confirming.
Inject Endpoint
POST /inject → Returns all decrypted secrets as key-value pairs
This powers the CLI's runtime injection. The CLI calls this endpoint, receives all secrets for the target environment and injects them as environment variables into a subprocess:
# Secrets injected into the process, never written to disk
secretscli inject -- npm start
# Or start a subshell with secrets loaded
secretscli inject --shell
The inject response should filter dangerous environment variables (LD_PRELOAD, PATH, LD_LIBRARY_PATH, DYLD_INSERT_LIBRARIES) that could be abused for privilege escalation if an attacker managed to write malicious values to the secrets store.
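The filter-then-inject step can be sketched as below. The blocklist is the one named above; the function names are hypothetical, and the article places the filter on the server side, so applying it in the client as well is an extra assumption made here for defense in depth.

```python
import os
import subprocess

# Variable names that must never be overridden from the secrets store.
BLOCKED_NAMES = {"LD_PRELOAD", "PATH", "LD_LIBRARY_PATH", "DYLD_INSERT_LIBRARIES"}

def filter_secrets(secrets: dict) -> dict:
    """Drop blocked names. The comparison is case-insensitive to stay conservative."""
    return {k: v for k, v in secrets.items() if k.upper() not in BLOCKED_NAMES}

def run_with_secrets(command: list, secrets: dict) -> int:
    """Run a child process with secrets added to its environment only."""
    env = {**os.environ, **filter_secrets(secrets)}
    return subprocess.run(command, env=env).returncode
```

The secrets live only in the child's environment block, so nothing is written to disk, and the parent shell's environment is left unchanged.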
Response Envelope
Every response follows a consistent envelope:
{
  "error": false,
  "message": "Secrets retrieved successfully",
  "data": { ... }
}
Consistent envelopes simplify CLI parsing and error handling. The CLI does not need to guess whether a 200 response contains data or an error.
Part 9: CLI Architecture
Token Encryption at Rest
The CLI stores authentication tokens in a local config file. Those tokens must be encrypted at rest because a compromised laptop should not give an attacker plaintext API tokens.
The pattern: generate a 32-byte AES key on first use, store it in a key file with restricted permissions (0600 on Unix), and use AES-256-GCM to encrypt each token before writing the config file. Encrypted tokens use a prefix like enc: so the CLI can distinguish them from plaintext during migration.
function encryptToken(token, keyPath):
    key = loadOrCreateKey(keyPath)  // 32-byte AES key, file perms 0600
    nonce = crypto.randomBytes(12)
    ciphertext = AES_256_GCM_seal(key, nonce, token)
    return "enc:" + base64(nonce + ciphertext)
function decryptToken(encryptedToken, keyPath):
    // Check permissions before reading the key at all
    if permissions(keyPath) != 0600:
        ABORT("Insecure key file permissions")
    key = loadKey(keyPath)
    payload = base64.decode(encryptedToken.removePrefix("enc:"))
    nonce = payload[0:12]
    ciphertext = payload[12:]
    return AES_256_GCM_open(key, nonce, ciphertext)
Pseudocode. Go's crypto/aes + cipher.NewGCM, Python's cryptography.hazmat, and Node's crypto.createCipheriv all support this pattern.
The permission check on the key file is important. If a backup tool or misconfigured deployment changes the permissions, the CLI should refuse to operate rather than silently using a key file that anyone on the system can read.
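The key-file handling can be sketched with the standard library on Unix-like systems. The function names and the path used in the usage note are hypothetical, and Windows would need a different check since POSIX mode bits do not apply there.

```python
import os
import stat

def create_key_file(path: str) -> bytes:
    """Create a new 32-byte key file; O_EXCL fails if the file already exists."""
    key = os.urandom(32)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as f:
        f.write(key)
    os.chmod(path, 0o600)  # enforce the mode regardless of the process umask
    return key

def assert_private_key_file(path: str) -> None:
    """Refuse to continue unless the key file mode is exactly 0600."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode != 0o600:
        raise PermissionError(f"insecure key file permissions: {oct(mode)}")
```

A CLI would call `assert_private_key_file` before every decrypt, for example on a path such as `~/.config/secretscli/key` (an illustrative location), and exit with a clear message when the check fails.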
Context Defaults
Typing --org acme --project api --env production on every command is a developer experience failure. The CLI should store a default context:
# Set once
secretscli context set --org acme --project api --env production
# Use everywhere (no flags needed)
secretscli secret list
secretscli secret get DB_PASSWORD
secretscli inject -- npm start
This is the difference between a CLI that developers tolerate and one they actually use. Flags can override the saved context for one-off operations across different environments.
Request Integrity
Every CLI request should include an HMAC checksum header computed from the request body. The server validates the checksum to detect tampering in transit. If the checksum does not match, the request is rejected before any secret operation occurs. This adds a layer of integrity verification on top of TLS.
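A body checksum of this kind might look like the sketch below. Several details are assumptions, since the text does not specify them: HMAC-SHA256 as the algorithm, a hex-encoded value, and a shared key already known to both the CLI and the server.

```python
import hashlib
import hmac

def sign_body(shared_key: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw request body."""
    return hmac.new(shared_key, body, hashlib.sha256).hexdigest()

def verify_body(shared_key: bytes, body: bytes, received_checksum: str) -> bool:
    """Recompute the checksum and compare in constant time."""
    expected = sign_body(shared_key, body)
    return hmac.compare_digest(expected.encode(), received_checksum.encode())
```

The CLI would send the result of `sign_body` in a request header and the server would call `verify_body` on the raw bytes before parsing them. `hmac.compare_digest` avoids leaking how many leading characters matched through timing.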
Part 10: Secret Lifecycle
Versioning
Every mutation creates a new version row. The secret itself is a container; the versions hold the encrypted data.
Secret: DB_PASSWORD (env: production)
├── v1: created Jan 15 by alice (encrypted blob A)
├── v2: created Feb 01 by bob (encrypted blob B)
└── v3: created Mar 10 by alice (encrypted blob C) ← current
Rolling back to v2 creates v4 with the same plaintext as v2 but a fresh encryption (new DEK, current KEK version). The history is append-only. You always see that a rollback happened and who triggered it.
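Rollback as an append can be modeled in a few lines. This sketch keeps plaintext in memory purely for illustration; in the real design each entry would hold a ciphertext and wrapped DEK, and `rollback` would decrypt the target version and re-encrypt it under a fresh DEK. The `SecretHistory` class is hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class SecretHistory:
    versions: list = field(default_factory=list)  # append-only, oldest first

    def write(self, value: str, actor: str, note: str = "") -> int:
        """Append a new version and return its number."""
        number = len(self.versions) + 1
        self.versions.append(
            {"version": number, "value": value, "actor": actor, "note": note}
        )
        return number

    def current(self) -> dict:
        return self.versions[-1]

    def rollback(self, target: int, actor: str) -> int:
        """Copy an old version forward as a new one; nothing is rewritten."""
        if not 1 <= target <= len(self.versions):
            raise ValueError(f"no such version: {target}")
        source = self.versions[target - 1]
        return self.write(source["value"], actor, note=f"rollback to v{target}")
```

Rolling back to v2 from v3 yields v4, and v3 stays in the list, so the history shows both the bad change and who reverted it.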
Secret Sharing
Sometimes you need to share a credential with someone outside the platform: a contractor, a partner team, a support engineer. Design time-limited share links with:
- Optional password protection
- Maximum view count ("viewable 3 times, then self-destructs")
- Expiration timestamp
- Audit trail for every view
The share link decrypts the secret server-side and displays it to the recipient, counting each view. The recipient never gets permanent access. Once it expires or hits the view limit, the link is dead.
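The expiry and view-count rules can be captured in a small state check. This is a sketch with a hypothetical `ShareLink` type; password protection and audit logging from the list above are left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class ShareLink:
    expires_at: datetime
    max_views: int
    views: int = 0

    def consume(self, now: datetime) -> bool:
        """Record one view; return False once the link is expired or used up."""
        if now >= self.expires_at or self.views >= self.max_views:
            return False
        self.views += 1
        return True

created = datetime(2026, 1, 1, tzinfo=timezone.utc)
link = ShareLink(expires_at=created + timedelta(hours=1), max_views=3)
```

The server would call `consume` before decrypting anything, so a dead link never triggers a decryption. In a multi-instance deployment the view counter would need an atomic increment in the database rather than an in-memory field.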
Leak Detection
When a developer sets a secret typed as password, hash it against the Have I Been Pwned API using k-anonymity (send only the first 5 characters of the SHA-1 hash, so the full password never leaves your server). If the password appears in a known breach database, return a warning. Require explicit confirmation before storing a known-compromised credential.
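The k-anonymity step works as follows: hash the password with SHA-1, send only the first five hex characters, and compare the remaining 35 characters locally against the list of `SUFFIX:COUNT` lines the range endpoint returns. The sketch below covers the hashing and matching; the HTTP request is deliberately omitted, and the function names are hypothetical.

```python
import hashlib

def split_sha1(password: str):
    """Return (prefix, suffix): the 5 hex chars to send and the 35 kept locally."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]

def breach_count(range_response: str, suffix: str) -> int:
    """Scan 'SUFFIX:COUNT' lines for our suffix; 0 means not found."""
    for line in range_response.splitlines():
        candidate, _, count = line.partition(":")
        if candidate.strip().upper() == suffix:
            return int(count.strip())
    return 0
```

Only the prefix leaves the server, and thousands of unrelated hashes share any given prefix, so the remote service cannot tell which password was checked. A non-zero count is the signal to show the warning and ask for explicit confirmation.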
Sync to Third-Party Providers
Secrets can be synced to GitHub Actions, GitLab CI or other providers through integration connections. When a secret changes, a background job pushes the updated value to the configured provider via their API. Developers update secrets in one place and the CI/CD pipeline picks up changes automatically. No manual copy-paste between platforms.
Part 11: Failure Handling and Recovery
A secrets platform sits on the critical path of every deployment. When it fails, applications cannot start. Failure handling must be designed, not bolted on.
Scenario 1: KMS Provider Unavailable
Problem: Cloud KMS API (AWS KMS, GCP) returns errors or times out.
Impact: Cannot resolve KEKs → cannot decrypt any secrets.
Mitigation:
  1. KEK cache (Redis) serves requests during outage (5-min TTL)
  2. Extend cache TTL automatically when KMS health check fails
  3. Circuit breaker opens after 3 consecutive KMS failures
  4. Alert fires immediately - this is a P0 incident
Recovery:
  When KMS recovers, circuit breaker closes automatically.
  No data loss - all encrypted data remains intact in PostgreSQL.
  Cache repopulates on next request.
Scenario 2: Redis Cache Failure
Problem: Redis cluster goes down.
Impact: Every decrypt request hits KMS directly (slow + expensive).
Mitigation:
  1. Fall back to in-memory LRU cache on the application server
  2. In-memory cache has shorter TTL (60 seconds) to limit stale data
  3. Rate-limit KMS calls per organization to prevent bill shock
  4. Queue non-urgent decrypt requests, process urgent ones only
Recovery:
  Redis recovery is automatic with Sentinel or Cluster failover.
  In-memory cache continues serving during failover window.
Scenario 3: Write Fails Mid-Operation
Problem: Application creates a new secret version, encrypts the DEK,
  but the database write fails after encryption.
Impact: Orphaned encryption operation. Wasted KMS call.
Mitigation:
  1. Wrap encrypt + DB write in a database transaction
  2. If DB write fails, the transaction rolls back
  3. The generated DEK is discarded (never stored)
  4. Return error to client: "Secret update failed, please retry"
  5. No partial state: either the full version row exists or nothing does
This is why encryption happens in the application layer, not the database.
The DB transaction guarantees atomicity of the write.
Scenario 4: Corrupt Ciphertext
Problem: Ciphertext or wrapped_dek is corrupted (bit flip, storage error).
Impact: AES-GCM decryption fails → the auth tag check catches corruption.
Mitigation:
  1. GCM's authentication tag detects tampering/corruption automatically
  2. Return clear error: "Decryption failed for secret X version Y"
  3. Fall back to previous version if available
  4. Alert on repeated decryption failures (possible attack indicator)
This is one of the key benefits of AES-GCM over AES-CBC: corruption
is detected rather than producing silently wrong plaintext.
Idempotency
CLI retries and network timeouts can cause duplicate requests. Every secret write operation must be idempotent:
- POST /secrets/{key} with the same payload and a client-generated idempotency key returns the existing version instead of creating a duplicate
- The idempotency key is stored in Redis with a 10-minute TTL
- If the server processed the request but the client never received the response, the retry returns the cached result
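The replay behaviour can be sketched with a dict standing in for Redis. The `IdempotencyStore` class is hypothetical, and unlike a real Redis-backed version (which would typically claim the key atomically, for example with `SET ... NX EX`), this in-process sketch does not protect against two servers handling the same key at the same instant.

```python
import time

class IdempotencyStore:
    """Dict-backed stand-in for a Redis key with a TTL."""

    def __init__(self, ttl_seconds=600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._results = {}   # idempotency_key -> (result, expires_at)

    def run_once(self, idempotency_key, operation):
        """Return the cached result for a repeated key, else run and cache."""
        entry = self._results.get(idempotency_key)
        if entry and entry[1] > self._clock():
            return entry[0]
        result = operation()
        self._results[idempotency_key] = (result, self._clock() + self._ttl)
        return result
```

A retry that arrives within the TTL gets the original response back without creating a second version row. After the TTL the key is treated as new, which is why the client should generate a fresh key per logical write rather than reuse one.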
Part 12: Security and Compliance
Authentication Flow
Human User → OAuth/Email+2FA   → JWT (1-hour expiry, refresh token)
CLI        → OAuth device flow → Access token + Refresh token (encrypted at rest)
CI/CD      → Scoped API token  → No expiry, revocable, env-locked permissions
Each identity type gets the minimum access it needs. A CI/CD token for the staging environment cannot see production. A developer with read on production cannot reveal without explicit grant.
Defense in Depth
The system has multiple independent security layers:
- Transport: TLS 1.3 for all connections. HSTS headers. Certificate pinning in the CLI.
- Authentication: Multi-factor for humans. Scoped tokens for machines. Short-lived JWTs with refresh rotation.
- Authorization: Per-environment RBAC with read/reveal separation. Protected environment approval workflows.
- Encryption at rest: Envelope encryption (DEK + KEK). Per-secret isolation. KEK never stored plaintext in application database.
- Encryption in cache: KEKs encrypted before caching in Redis. If Redis is compromised, attacker gets encrypted blobs.
- Audit: Every access logged immutably. Alert on anomalous patterns (bulk reveal, off-hours access, new IP).
Secret Scanning Integration
Run pre-commit hooks with tools like gitleaks (v8.21+) or truffleHog to catch secrets before they reach the repository. The platform can expose a webhook that receives alerts from GitHub's secret scanning partner program and automatically rotates the leaked credential.
Part 13: Scaling Strategies
Vertical Scaling (Do This First)
At the scale estimated in Part 3, a single PostgreSQL instance with read replicas handles the load. Before adding complexity:
- Add read replicas for dashboard queries and audit log searches
- Use connection pooling (PgBouncer) to handle connection spikes from CI/CD bursts
- Optimize the hot path: batch-fetch secrets in a single query for inject calls instead of N+1 queries
Horizontal Scaling (When You Need It)
Trigger: > 500 req/sec sustained or > 10,000 organizations
Strategy:
  1. Stateless API servers behind a load balancer
  2. PostgreSQL read replicas per region
  3. Redis Cluster for distributed caching
  4. Audit log partitioning by month (drop old partitions for retention)
Database Scaling Path
Stage 1 (< 1,000 orgs):   Single primary + 2 read replicas
Stage 2 (< 10,000 orgs):  Shard by organization_id (tenant isolation)
Stage 3 (< 100,000 orgs): Per-region deployments with data residency
Sharding by organization_id is natural because secrets never cross organization boundaries. Each shard contains a complete set of data for a set of organizations. Cross-shard queries are never needed for operational endpoints.
Caching Strategy
Layer 1: Application in-memory (LRU, 60-second TTL)
→ KEKs, permission lookups, org metadata
Layer 2: Redis (distributed, 5-minute TTL for KEKs)
→ KEKs (encrypted), session data, rate limit counters
Layer 3: PostgreSQL (source of truth)
→ All persistent data
Cache invalidation on permission changes must be immediate. When an admin revokes a developer's reveal permission, the cached permission must be evicted. Use Redis pub/sub to broadcast invalidation events across application instances.
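The invalidation flow can be sketched without a Redis server. In the example below, InvalidationBus is an in-process stand-in for a Redis pub/sub channel, and LocalTTLCache plays the role of the Layer 1 cache inside each application instance. Both class names and the cache key format are assumptions for illustration, and LRU eviction is left out to keep the sketch short.

```python
import time
from typing import Any, Callable, Dict, List, Optional, Tuple


class InvalidationBus:
    """In-process stand-in for a Redis pub/sub channel (assumption for this sketch)."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[str], None]] = []

    def subscribe(self, callback: Callable[[str], None]) -> None:
        self._subscribers.append(callback)

    def publish(self, key: str) -> None:
        # Every subscribed application instance receives the invalidation event.
        for callback in self._subscribers:
            callback(key)


class LocalTTLCache:
    """Per-instance cache with a TTL; evicts immediately on bus events."""

    def __init__(self, bus: InvalidationBus, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}
        bus.subscribe(self.evict)

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: fall through to the next layer
            return None
        return value

    def put(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)

    def evict(self, key: str) -> None:
        self._entries.pop(key, None)


# Two application instances share one invalidation channel.
bus = InvalidationBus()
instance_a = LocalTTLCache(bus)
instance_b = LocalTTLCache(bus)

key = "perm:user-42:env-prod"  # hypothetical cache key format
instance_a.put(key, {"reveal": True})
instance_b.put(key, {"reveal": True})

# An admin revokes the permission: the write path publishes the key once.
bus.publish(key)
print(instance_a.get(key), instance_b.get(key))
```

The TTL still matters as a safety net: if an instance misses an event, for example during a network partition, the stale permission can survive for at most one TTL period.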
Part 14: Cost Analysis
Disclaimer: These are rough estimates using publicly available cloud pricing as of early 2026. Actual costs vary significantly based on reserved instances, enterprise agreements, region, and optimization. Use your cloud provider's pricing calculator for accurate projections.
Infrastructure Cost (500 Organizations, Year 1)
Compute:
2× Application servers (4 vCPU, 16GB RAM) $300/month
1× Background worker (2 vCPU, 8GB RAM) $100/month
Database:
1× PostgreSQL primary (4 vCPU, 32GB RAM) $400/month
2× Read replicas $600/month
100 GB storage $12/month
Cache:
1× Redis cluster (3 nodes, 4GB each) $250/month
KMS:
AWS KMS: 1 CMK ($1/month) + API calls $50/month
(~150,000 decrypt calls/month at $0.03/10K is about $0.45 of usage; the $50 figure leaves generous headroom)
Networking:
Load balancer + data transfer $100/month
Monitoring:
Logging, metrics, alerting $150/month
Total: ~$1,962/month (~$23,500/year)
Cost per Secret
2,250,000 secrets under management
$1,962/month operating cost
Cost per secret: $0.00087/month
Compare: AWS Secrets Manager at $0.40/secret/month
2,250,000 × $0.40 = $900,000/month
Self-hosted cost advantage: ~460× cheaper at this scale
The cost advantage of self-hosting grows with scale because the major costs (compute, database) increase sub-linearly while managed service pricing is per-secret. The trade-off is operational complexity: you own the uptime, the patching, and the on-call rotation.
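The figures above are easy to re-run with your own numbers. The short script below reproduces the arithmetic from the line items in this part. The dictionary keys are labels chosen for the example, and the dollar amounts are the article's rough estimates, not quoted prices.

```python
# Monthly line items from the 500-organization estimate, in USD.
line_items = {
    "app_servers": 300,
    "background_worker": 100,
    "postgres_primary": 400,
    "read_replicas": 600,
    "storage": 12,
    "redis": 250,
    "kms": 50,
    "networking": 100,
    "monitoring": 150,
}

SECRETS_UNDER_MANAGEMENT = 2_250_000
MANAGED_PRICE_PER_SECRET = 0.40  # per secret per month, as cited in the comparison

self_hosted_monthly = sum(line_items.values())
cost_per_secret = self_hosted_monthly / SECRETS_UNDER_MANAGEMENT
managed_monthly = SECRETS_UNDER_MANAGEMENT * MANAGED_PRICE_PER_SECRET
advantage = managed_monthly / self_hosted_monthly

print(f"Self-hosted total:  ${self_hosted_monthly:,}/month")
print(f"Cost per secret:    ${cost_per_secret:.5f}/month")
print(f"Managed equivalent: ${managed_monthly:,.0f}/month")
print(f"Advantage:          ~{advantage:.0f}x")
```

The comparison covers infrastructure only. Engineering time for patching, upgrades and on-call is not in any of these line items, and for a small team it can be the largest cost.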
Cost at Larger Scale (5,000 Organizations)
Compute (auto-scaled): $2,000/month
Database (sharded): $3,500/month
Cache (Redis Cluster): $800/month
KMS: $200/month
Networking + LB: $500/month
Monitoring: $400/month
Total: ~$7,400/month (~$88,800/year)
Cost per secret (22.5M): $0.00033/month
Part 15: Monitoring and Observability
Key Metrics
Business Metrics:
- Total secrets under management (gauge)
- Secret operations per minute by type (read, reveal, create, update, delete)
- Active organizations and users (daily/monthly)
- Sync success/failure rate by provider
System Metrics:
- API response time (p50, p95, p99), target: p99 < 100ms for reads
- Decryption latency (p50, p95, p99), target: p99 < 50ms
- KMS call latency and error rate
- Redis cache hit ratio, target: > 95%
- Database connection pool utilization
- Background job queue depth
Security Metrics:
- Failed authentication attempts per minute
- Bulk reveal operations (> 10 secrets in one call)
- Access from new IP addresses
- Permission escalation events
- Decryption failures (possible tampering indicator)
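As an illustration of how one of these security metrics might be computed, the sketch below tracks failed authentication attempts per source in a sliding window. The threshold of more than 10 failures from one source within a minute comes from the alerting rules later in this part. The class name FailedAuthMonitor and the in-memory storage are assumptions, and a multi-instance deployment would more likely keep these counters in the shared cache.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class FailedAuthMonitor:
    """Sliding-window counter of failed logins per source (in-memory sketch)."""

    def __init__(self, threshold: int = 10, window_seconds: float = 60.0) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def record_failure(self, source: str, timestamp: float) -> bool:
        """Record one failure; return True when the source exceeds the threshold."""
        events = self._events[source]
        events.append(timestamp)
        # Drop failures that have aged out of the window.
        cutoff = timestamp - self.window_seconds
        while events and events[0] <= cutoff:
            events.popleft()
        return len(events) > self.threshold


monitor = FailedAuthMonitor()

# Eleven failures from one address within eleven seconds (timestamps in seconds).
alerts = [monitor.record_failure("203.0.113.7", float(t)) for t in range(11)]
print(alerts[-1])  # only the eleventh failure crosses the "> 10" threshold
```

This version never forgets a source it has seen, so a real deployment would also expire idle entries to keep memory bounded.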
Alerting Rules
CRITICAL (page on-call):
- API error rate > 1% for 5 minutes
- KMS provider unreachable for 2 minutes
- Decryption failure rate > 0.1%
- Database primary unreachable
WARNING (Slack notification):
- API p99 latency > 200ms for 10 minutes
- Redis cache hit ratio < 80%
- Audit log ingestion lag > 60 seconds
- Certificate expiry < 14 days
SECURITY (immediate page):
- Bulk reveal from new IP
- > 10 failed auth attempts from same source in 1 minute
- Service account used outside allowed CIDR range
Distributed Tracing
Trace every secret operation end-to-end:
Secret Reveal (45ms total)
├── Auth middleware (3ms)
│   └── JWT validation + permission check (cached)
├── Secret Service (40ms)
│   ├── DB query: fetch secret + version (5ms)
│   ├── KEK resolution (2ms, cache hit)
│   ├── DEK unwrap: AES-GCM decrypt (0.3ms)
│   └── Secret decrypt: AES-GCM decrypt (0.2ms)
└── Audit log write (2ms, async)
The async audit log write is important. Writing to the audit log should never add latency to the secret retrieval path. Use a buffered queue (Redis list or in-memory channel) that flushes to PostgreSQL in batches.
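A minimal sketch of the buffered approach, using the in-memory channel option: the request path only enqueues, and a separate flush step writes batches in one transaction. sqlite3 stands in for PostgreSQL so the example runs as-is. The AuditBuffer class, the audit_log table and its columns are assumptions for illustration.

```python
import queue
import sqlite3


class AuditBuffer:
    """Buffers audit events in memory and writes them to the database in batches."""

    def __init__(self, conn: sqlite3.Connection, batch_size: int = 100) -> None:
        self._conn = conn
        self._batch_size = batch_size
        self._queue: "queue.Queue[tuple]" = queue.Queue()

    def record(self, actor: str, action: str, secret_id: str) -> None:
        """Called on the request path: enqueue only, no database I/O."""
        self._queue.put((actor, action, secret_id))

    def flush(self) -> int:
        """Drain up to batch_size events and insert them in one transaction."""
        batch = []
        while len(batch) < self._batch_size:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        if batch:
            with self._conn:  # commits on success, rolls back on error
                self._conn.executemany(
                    "INSERT INTO audit_log (actor, action, secret_id) VALUES (?, ?, ?)",
                    batch,
                )
        return len(batch)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE audit_log (actor TEXT, action TEXT, secret_id TEXT)")

audit = AuditBuffer(conn, batch_size=2)
audit.record("alice", "reveal", "sec_1")
audit.record("alice", "reveal", "sec_2")
audit.record("ci-bot", "read", "sec_3")

# In a real service a background worker would call flush() on a timer.
first = audit.flush()
second = audit.flush()
print(first, second)
```

An in-memory buffer loses any unflushed events if the process crashes, which sits uneasily with the goal of logging every access. A durable queue such as the Redis list mentioned above narrows that window at the cost of one extra network call.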
Part 16: Trade-offs Discussed
1. Relational DB vs. Dedicated Secrets Backend: PostgreSQL gives relational query power for the complex access control model, versioning and audit joins. A dedicated backend like Consul or etcd would give better clustering semantics. At thousands of secrets, PostgreSQL is the right choice. At millions with high-throughput decryption, you would need to revisit.
2. Per-Secret DEKs vs. Shared Encryption Key: Per-secret DEKs increase storage (roughly 120 bytes of wrapped_dek per version). For 10,000 secrets with 5 versions each, that is about 6 MB of additional storage. Negligible. The security benefit of key isolation would justify orders of magnitude more storage.
3. Lazy Key Rotation vs. Immediate Re-encryption: Lazy rotation means old secrets stay encrypted with the previous KEK version until they are next updated. The trade-off: if the old KEK is compromised, those secrets remain vulnerable until re-encrypted. Immediate re-encryption eliminates this window but causes a latency spike. For most teams, lazy rotation is the right default with an option to trigger bulk re-encryption when needed.
4. Local KMS vs. Cloud KMS: Local KMS has a circular dependency: the system that manages secrets relies on a secret (the application key) stored as an environment variable. Cloud KMS removes this by pushing the trust to HSMs. But cloud KMS adds latency (API call per cache miss), cost ($1/month per key plus $0.03 per 10,000 API calls on AWS KMS) and vendor dependency. Support both and let operators choose.
5. Monolith vs. Microservices: A monolithic backend is simpler to deploy, debug and reason about. Splitting into microservices adds network hops, distributed tracing needs and deployment complexity. Start monolithic. Extract services only when you have a concrete scaling or isolation reason (e.g., audit log ingestion overwhelming the main database).
6. Strong Consistency vs. Eventual Consistency: Secret reads must be strongly consistent: a revoked secret must be unreadable immediately. Audit logs can be eventually consistent (a few seconds of lag is acceptable). Permission changes need near-immediate consistency, achieved through cache invalidation via pub/sub.
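The lazy rotation trade-off (item 3) is easier to reason about with the bookkeeping written out. The sketch below shows only the version tracking: each stored record remembers which KEK version wrapped its DEK, reads use that recorded version, and rewrapping happens on update or in a bulk pass. The names StoredSecret and KeyRing are hypothetical, and toy_wrap is a deliberately insecure XOR placeholder standing in for AES-GCM key wrapping so the example stays dependency-free.

```python
import os
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class StoredSecret:
    name: str
    wrapped_dek: bytes
    kek_version: int  # which KEK version wrapped this DEK


def toy_wrap(kek: bytes, dek: bytes) -> bytes:
    """Placeholder for AES-GCM key wrapping. XOR is NOT encryption."""
    return bytes(a ^ b for a, b in zip(dek, kek))


toy_unwrap = toy_wrap  # XOR is its own inverse


class KeyRing:
    """Keeps every KEK version so that older records stay readable."""

    def __init__(self, wrap: Callable[[bytes, bytes], bytes] = toy_wrap,
                 unwrap: Callable[[bytes, bytes], bytes] = toy_unwrap) -> None:
        self._wrap, self._unwrap = wrap, unwrap
        self._keks: Dict[int, bytes] = {}
        self.current_version = 0

    def rotate(self, new_kek: bytes) -> int:
        """Add a new KEK version; existing records are left untouched (lazy)."""
        self.current_version += 1
        self._keks[self.current_version] = new_kek
        return self.current_version

    def wrap_dek(self, dek: bytes) -> Tuple[bytes, int]:
        """New writes always use the current KEK version."""
        return self._wrap(self._keks[self.current_version], dek), self.current_version

    def unwrap_dek(self, record: StoredSecret) -> bytes:
        """Reads use whichever version the record says wrapped it."""
        return self._unwrap(self._keks[record.kek_version], record.wrapped_dek)

    def rewrap(self, record: StoredSecret) -> bool:
        """Move one record to the current KEK; returns False if already current."""
        if record.kek_version == self.current_version:
            return False
        dek = self.unwrap_dek(record)
        record.wrapped_dek, record.kek_version = self.wrap_dek(dek)
        return True


ring = KeyRing()
ring.rotate(os.urandom(32))  # KEK v1
dek = os.urandom(32)
wrapped, version = ring.wrap_dek(dek)
record = StoredSecret("DATABASE_URL", wrapped, version)

ring.rotate(os.urandom(32))  # KEK v2: the record is still wrapped with v1
still_readable = ring.unwrap_dek(record) == dek
rewrapped = ring.rewrap(record)  # what an update or a bulk pass would do
print(still_readable, rewrapped, record.kek_version)
```

Bulk re-encryption is then a loop over records whose kek_version is behind the current one. Only the wrapped DEK changes, so the secret ciphertext itself does not need to be rewritten.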
Part 17: Key Takeaways
- Envelope encryption is non-negotiable. Per-secret DEKs limit the blast radius of any key compromise. The KEK protects the DEKs. The KMS protects the KEK. Each layer reduces risk independently.
- The secure path must be the easy path. Context defaults, runtime injection, share links, leak detection. If the CLI requires more steps than copying a .env file, developers will copy the .env file.
- Separate read from reveal. Most operations do not need plaintext. This single permission split reduces attack surface, audit noise and KMS costs simultaneously.
- Cache encryption keys, not secrets. Caching decrypted secrets is dangerous (cache compromise = data breach). Caching KEKs with encryption and short TTLs balances performance with security.
- Design for key rotation from day one. Every encrypted blob records which key version encrypted it. Lazy rotation means zero downtime. Bulk re-encryption is available when needed.
- Audit everything, query it later. Append-only audit logs are cheap to write and invaluable during incident response. Partition by month for retention management.
- PostgreSQL is enough. Secrets management is metadata-heavy and low-throughput. Do not over-engineer the storage layer. A single PostgreSQL instance with read replicas handles most deployments comfortably.
Part 18: Homework Assignment
If you want to go deeper, here are extensions worth designing:
- Shamir's Secret Sharing for the root key. Instead of a single environment variable holding the master key, split it into N shares where any K-of-N shares can reconstruct the key (e.g., 3-of-5). Design the unseal ceremony flow: how does the system start if it needs 3 different people to provide their shares?
- Cross-region replication with data residency. Some organizations require secrets to stay within specific geographic boundaries (EU data in EU, US data in US). Design a multi-region architecture where each region has its own KMS but organizations can span regions.
- Automated secret rotation. Design a system that automatically rotates database passwords: generate new credential, update the target database, verify the new credential works, then update the secret in the platform. Handle the failure case where the database accepts the new password but the platform fails to store it.
- Break-glass access. Design an emergency access mechanism that allows a pre-authorized admin to bypass normal RBAC and access any secret, with mandatory audit trail and post-incident review workflow. How do you prevent abuse while enabling legitimate emergency response?
- End-to-end encryption. Redesign the system so the server never sees plaintext. The CLI encrypts before sending, the server stores blobs, and only the CLI can decrypt. How does this change the sharing, search and sync features?
28.65 million secrets leaked on GitHub last year because the insecure path was easier than the secure one. The architecture does not fix that by making the insecure path harder. It fixes it by making the secure path effortless.
That is the entire system design in one sentence.
If this article helped you think about secrets management differently, consider sharing it with your team. The best time to design a secrets platform is before the next breach. The second best time is now.
Disclosure: This post includes affiliate and partnership links.


