PromptVault

Prompt Engineering and Operations Platform

PromptVault is a prompt operations workspace for individuals and teams who need to treat AI prompts as managed, production-grade assets rather than scattered text snippets. It brings prompt authoring, version control, multi-model execution, structured evaluation, release lifecycle management, analytics, budgeting, and team collaboration into a single platform.

At its core, PromptVault closes the gap between "writing a prompt in a text file" and "running prompts reliably in production." It gives prompt engineers and AI teams the same rigor they already apply to software — versioning, testing, review gates, rollback, regression monitoring, and cost governance — at the prompt layer.

  • Prompt Library & Versioning — Organize prompts in folders with tags, search, and favorites. Every edit creates an immutable version with full diff history.
  • Multi-Model Execution — Run prompts against any model available through OpenRouter, with side-by-side comparison, streaming output, and cost estimation before execution.
  • Evaluation & Experimentation — Build reusable datasets and rubrics, run structured evaluations, launch multi-model experiments, and score outputs with LLM-as-judge, exact match, similarity, and manual grading methods.
  • Release Lifecycle — Promote prompt versions through Draft → Staging → Production with configurable release gates, approval workflows, and automated rollback.
  • Scheduled Regression Monitoring — Set up automated evaluation runs on a daily, weekly, monthly, or cron schedule with baseline thresholds and regression alerting.
  • Team Workspaces & Governance — Collaborative workspaces with role-based access (Owner, Admin, Member), pooled credits, shared API keys, budget controls, and review workflows.
  • Analytics & Cost Controls — Track usage, spend, and latency across models and prompts with budget thresholds, anomaly detection, and workspace health scoring.

Technology Stack

PromptVault is built as a TypeScript monorepo with four workspace packages and a contract-first architecture where shared Zod schemas enforce the API boundary between frontend and backend at both compile time and runtime.

  • Frontend — React 19, Vite, Wouter (routing), TanStack React Query (async state), Tailwind CSS, shadcn/ui, Radix UI primitives
  • Backend — Express 5, Node.js 22, Better Auth (email/password authentication), Redis (rate limiting, session state, job coordination)
  • Data Layer — PostgreSQL (Drizzle ORM, migrations on startup), Redis 7
  • Shared Contract — Zod schemas in a dedicated shared package consumed by both client and server
  • LLM Execution — OpenRouter (platform-managed credits and bring-your-own-key modes)
  • Billing — Stripe integration (optional; the platform degrades gracefully when unconfigured)
  • Email — Resend or SMTP (optional; verification mode is environment-driven)
  • Testing — Vitest (server unit/integration), Playwright (browser E2E)

Platform Features

Workspace Modes

PromptVault supports two workspace contexts that users can switch between from a global workspace switcher. Personal workspaces are the default for solo work and private drafts. Team workspaces enable shared prompts, datasets, rubrics, judges, schedules, pooled credits, and collaborative governance. Resources can be moved between personal and team workspaces where permissions allow, and the platform remembers the last active workspace across sessions.

Pricing & Billing

The platform offers four tiers designed around different usage patterns:

  • Free — 25 prompts, 5 comparisons/month, 20,000 monthly platform credits, limited run history, no BYO key or team features.
  • Maker — Unlimited prompts and comparisons, run history and export, BYO OpenRouter key support, optional credit top-ups. Designed for builders who want to use their own API key.
  • Pro — Everything in Maker plus 100,000 monthly platform credits with BYO optional rather than required.
  • Team — Everything in Pro plus shared team workspaces, 100,000 monthly credits per seat, pooled credits, and a team-level BYO key option.

Credit top-ups are available as a paid add-on. When Stripe is not configured in a deployment, the Free plan and BYO flows still function — paid checkout and portal actions are simply hidden.

Execution Modes

All model execution routes through OpenRouter with three practical funding paths:

  • Platform Credits — PromptVault pays the model cost and debits the user or team credit balance.
  • Personal BYO Key — The user stores their own OpenRouter API key and runs against their own account.
  • Team BYO Key — A shared team-level OpenRouter key used as workspace fallback.

Key routing follows a clear precedence: a personal BYO key overrides platform billing for that user; in a team workspace, the personal key takes priority, then the shared team key, and team credits are consumed only if neither key exists. Stored API keys are never exposed back to the client.
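
As a sketch, that precedence can be expressed as a small resolver. The type and function names here are illustrative, not PromptVault's actual implementation:

```typescript
// Illustrative sketch of BYO-key routing precedence.
type ExecutionMode = "byo_user" | "byo_team" | "platform";

interface KeyContext {
  personalKey?: string;    // user's stored OpenRouter key, if any
  teamKey?: string;        // shared team-level key, if any
  inTeamWorkspace: boolean;
}

function resolveExecutionMode(ctx: KeyContext): ExecutionMode {
  // A personal BYO key always wins, in any workspace.
  if (ctx.personalKey) return "byo_user";
  // In a team workspace, fall back to the shared team key next.
  if (ctx.inTeamWorkspace && ctx.teamKey) return "byo_team";
  // Otherwise the run is billed against platform or team credits.
  return "platform";
}
```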

Prompt Library

The Library is the central inventory for all prompts in a workspace. It supports folder trees with nesting, full-text search across title, description, and body, tag filtering and management, favorites via starring, grid and list views, and bulk operations including multi-select tagging, folder moves, and workspace transfers. Context actions let users jump directly to Playground or Compare Models from any prompt card.

Templates

A built-in template gallery provides starter-kit prompts searchable by concept, category, technique, and difficulty. Each template includes the full prompt body, system prompt framing, variable definitions with examples, example dataset rows, rubric overview, output schema, and documentation on how the template is structured. Templates can be imported directly into the active workspace as editable prompts.

Prompt Detail & Lifecycle Management

The Prompt Detail page is the core product surface where PromptVault behaves most clearly as a version-control and release platform for prompts. From this single page, users can:

Author and Edit — Edit title, description, body, system prompt, and notes. Template variables are detected automatically from the prompt body and system prompt, with configurable metadata (type, default value, allowed values, required flag). A rendered message preview shows how the prompt will look at execution time.
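
Automatic variable detection might be sketched like this, assuming a {{variable}} placeholder syntax (the actual delimiter syntax is not documented here and is an assumption):

```typescript
// Hypothetical sketch: scan prompt body and system prompt for
// {{variable}} placeholders and return the unique variable names.
function detectVariables(body: string, systemPrompt = ""): string[] {
  const pattern = /\{\{\s*([a-zA-Z_][a-zA-Z0-9_]*)\s*\}\}/g;
  const found = new Set<string>();
  for (const text of [body, systemPrompt]) {
    for (const match of text.matchAll(pattern)) {
      found.add(match[1]);
    }
  }
  return [...found]; // insertion order: first occurrence wins
}
```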

AI Assist — Generate or refresh descriptions, suggest tags, suggest variable descriptions, and run AI similarity checks against other prompts in the library.

Reusable Blocks — Insert reusable snippets and system prompts from a shared block library, save current content into reusable blocks, and link prompts to blocks for dependency tracking.

Output Contracts — Define an output schema contract using a visual schema builder or raw JSON mode, validate schemas, and infer schemas from prior playground outputs.

Version Control — Every save creates a new immutable version. Users can view full version history, compare any two versions with a diff view, and restore older versions. Versions track Draft, Staging, Production, and Archived environment states.

Sharing — Set prompts as Private or Public, generate share URLs, toggle author attribution, and track public views and import counts.

Collaborative Review — Request a review between two versions, assign reviewers from the team, set required approval counts, add descriptions and target environments. The review workflow supports threaded comments, structured suggestions, change requests, approvals, and merge actions.

Promotion & Release Gates — Promote versions to Staging or Production (directly when policy allows, or through an approval request when required). View promotion history, export the promotion changelog as Markdown, and roll back Production to a prior version. Release gates support configurable conditions including overall eval score thresholds, rubric criterion score thresholds, schema compliance, approval counts, time-in-environment, review approval state, and regression passing status. Workspace-default gates can auto-promote when all conditions pass, and prompt-specific gates can override workspace defaults.
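
A release gate of this kind reduces to checking that every configured condition passes. The following is a minimal sketch with a few of the conditions named above; the field names are illustrative stand-ins:

```typescript
// Sketch of release-gate evaluation: a version is eligible for
// (auto-)promotion only when every configured condition holds.
interface GateConditions {
  minEvalScore?: number;          // overall eval score threshold
  minApprovals?: number;          // required approval count
  requireSchemaCompliance?: boolean;
}

interface VersionState {
  evalScore: number;
  approvals: number;
  schemaCompliant: boolean;
}

function gatePasses(gate: GateConditions, v: VersionState): boolean {
  if (gate.minEvalScore !== undefined && v.evalScore < gate.minEvalScore) return false;
  if (gate.minApprovals !== undefined && v.approvals < gate.minApprovals) return false;
  if (gate.requireSchemaCompliance && !v.schemaCompliant) return false;
  return true; // all configured conditions pass
}
```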

Automation — Add scheduled regression runs directly from the prompt page, review optimization history generated from failed evaluation evidence, and see which AI suggestions were proposed, tested, and applied.

Playground

The Playground is the ad-hoc execution environment for running prompts against any supported model. It supports streaming and non-streaming execution, model selection and parameter controls, estimated token counts and costs before execution, variable form resolution, schema validation reporting, environment-aware runs against Draft/Staging/Production versions, saved run presets, and optional dataset row prefill for variables. The execution mode (platform credits vs. BYO) is always visible.
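
The pre-execution cost estimate amounts to token counts times per-token pricing. A rough sketch, assuming per-million-token pricing as OpenRouter reports it (the real estimator is more involved):

```typescript
// Naive pre-run cost estimate: input and output token counts
// multiplied by the model's per-million-token prices.
interface ModelPricing {
  promptPerMTok: number;      // USD per 1M input tokens
  completionPerMTok: number;  // USD per 1M output tokens
}

function estimateCostUsd(
  promptTokens: number,
  maxOutputTokens: number,
  pricing: ModelPricing,
): number {
  return (
    (promptTokens / 1_000_000) * pricing.promptPerMTok +
    (maxOutputTokens / 1_000_000) * pricing.completionPerMTok
  );
}
```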

Model Comparison

The Comparison page runs the same prompt across multiple models in parallel, displaying outputs side by side with streaming support. Users can save comparison history, pin important comparisons, choose a winner run, and see estimated and actual credit costs. Comparisons operate in both personal and team context.

Datasets

Datasets are reusable test inventories for prompt evaluation. The dataset system supports CSV and JSON import, row-level editing with input variables, expected output, tags, and notes, bulk operations, and export. Data quality governance includes validation rules (required columns, uniqueness, min/max rows, null thresholds, regex patterns, type constraints), validation scoring, quick-fix actions, and automatic revalidation. Coverage analysis identifies data gaps by column and type, with generate-to-fill actions. Dataset snapshots enable rollback safety, and synthetic data generation can create new rows from existing examples, failure harvesting, or coverage-fill patterns with duplicate detection and review before application.
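
Two of the validation rules mentioned above (required columns and uniqueness) can be sketched as a row-level check; the rule shapes here are assumptions, not the product's actual rule engine:

```typescript
// Illustrative dataset validation: flag missing required columns
// and duplicate values in a designated unique column.
type Row = Record<string, string | null>;

function validateRows(
  rows: Row[],
  requiredColumns: string[],
  uniqueColumn?: string,
): string[] {
  const issues: string[] = [];
  rows.forEach((row, i) => {
    for (const col of requiredColumns) {
      const value = row[col];
      if (value === undefined || value === null || value === "") {
        issues.push(`row ${i}: missing required column "${col}"`);
      }
    }
  });
  if (uniqueColumn) {
    const seen = new Set<string | null>();
    rows.forEach((row, i) => {
      const value = row[uniqueColumn];
      if (seen.has(value)) issues.push(`row ${i}: duplicate value in "${uniqueColumn}"`);
      seen.add(value);
    });
  }
  return issues;
}
```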

Rubrics

Rubrics define what good output looks like using weighted criteria with configurable scoring methods: Manual, LLM Judge, Custom Judge, Exact Match, Contains, and Similarity. Criteria can point to specific judge models or saved judges. Rubrics support both single-response and conversation-oriented evaluation patterns, and team-scoped rubrics can be shared across a workspace.
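
Weighted-criteria scoring typically reduces to a weighted average. A minimal sketch, assuming criterion scores are normalized to the 0..1 range before weighting (the product's exact normalization may differ):

```typescript
// Weighted-average rubric score over per-criterion results.
interface CriterionResult {
  weight: number; // relative weight of this criterion
  score: number;  // normalized score in [0, 1]
}

function rubricScore(results: CriterionResult[]): number {
  const totalWeight = results.reduce((sum, r) => sum + r.weight, 0);
  if (totalWeight === 0) return 0;
  const weighted = results.reduce((sum, r) => sum + r.weight * r.score, 0);
  return weighted / totalWeight;
}
```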

Judges

Judges are reusable evaluators that turn qualitative judgment into repeatable scoring logic. The judge builder supports configuring a judge model and temperature, writing system and scoring prompts, choosing scoring types (Binary, Scale, Categorical), saving versioned judges, and manual test runs against sample input/output. Calibration capabilities include building labeled calibration sets, tracking calibration history with accuracy metrics, and recalibrating as standards evolve.

Scenarios

Scenarios support multi-turn evaluation and conversation testing. Users build turn-by-turn conversations with speaker roles, variable placeholders, and expected behavior conditions per turn. Conversation evals execute a scenario against a prompt, rubric, model, and scoring mode, producing transcript-level rubric scores and per-turn scoring.

Experiments

Experiments are batch evaluations that run the same prompt and dataset across multiple models. The experiment system supports search, filtering, status tracking, baseline pinning, side-by-side comparison of two experiments (with score, cost, and latency deltas), aligned dataset row comparison, re-runs, and cancellation of in-progress experiments.

Eval Runs

The Eval Runs page is the central history and analysis surface for all evaluations. It supports browsing, searching, and filtering across experiments and standalone runs, score distribution views, inline expansion, bulk selection and export, and adding selected runs to a dataset. Run detail views show progress, schema compliance, live streaming of case activity during execution, criterion-level scoring, variation impact analysis, expected vs. actual output, manual grading, case retry, and failure harvesting into dataset-building flows.

Scheduled Runs

Scheduled Runs provide automated regression monitoring. Schedules support Daily, Weekly, Monthly, or custom Cron frequency with configurable baseline scores, regression threshold percentages, and alert channels (in-app and email). The system monitors next-run timing, recent history, sparkline trends, and detects regressions and errors. Schedules auto-pause after repeated failure conditions and integrate with release gate logic.
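
The core regression check a scheduled run applies can be sketched as comparing the new score against a floor derived from the baseline and the configured threshold percentage (parameter names here are illustrative):

```typescript
// Flag a regression when the new score falls more than thresholdPct
// below the configured baseline score.
function isRegression(
  baselineScore: number,
  newScore: number,
  thresholdPct: number,
): boolean {
  const floor = baselineScore * (1 - thresholdPct / 100);
  return newScore < floor;
}
```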

Run History

Run History preserves the operational record of prompt executions. Users can browse and filter by prompt name and billing mode (BYO vs. platform), inspect outputs and metadata, and create datasets from historical runs to convert exploratory work into formal test cases.

Analytics & Budgets

Analytics provides cost, usage, and execution visibility across 7-day, 30-day, 90-day, or custom date ranges. It tracks total runs, cost, latency, and tokens with daily volume and spend curves, cost breakdowns by prompt and model, top-consumer rankings, promotion markers overlaid on usage trends, and budget pacing with end-of-period projections. Budgets can be scoped to workspace, team, user, or model level with weekly or monthly periods, warning thresholds, and soft or hard enforcement. Budget alerts cover warnings, limit-reached events, and anomaly conditions.
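
End-of-period budget pacing can be as simple as a linear projection from spend so far. A naive sketch (the product's projection model may be more sophisticated):

```typescript
// Linear end-of-period spend projection: assume the daily rate
// observed so far continues for the rest of the period.
function projectSpend(
  spentSoFar: number,
  daysElapsed: number,
  daysInPeriod: number,
): number {
  if (daysElapsed <= 0) return 0;
  const dailyRate = spentSoFar / daysElapsed;
  return dailyRate * daysInPeriod;
}
```

Comparing the projection against the budget limit and its warning threshold is what drives pacing warnings before the period actually ends.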

Health Dashboard

The Health Dashboard provides an executive-level answer to "how healthy is the prompt estate right now?" It surfaces a composite workspace health score with weighted contributing factors, utilization snapshots, coverage-gap indicators, a prompt health matrix (last eval score, staleness, regression status, schema compliance, open promotions, recent errors, schedule/eval existence), and a recent activity feed across version commits, promotions, eval runs, and team changes.

Reusable Blocks

Blocks are reusable prompt building components in three types: Snippet, System Prompt, and Variable Preset. They support creation, editing, archiving, search, team sharing, content preview, usage tracking across linked prompts, and dependency visibility.

Dependencies

The Dependencies view provides a visual relationship graph of prompts, blocks, datasets, rubrics, judges, scenarios, and schedules within the workspace. It supports search and filtering by node type, node metadata inspection, unused asset identification, and archive-unused workflows to support workspace hygiene.

Notifications

PromptVault delivers both in-app notifications (with a global bell and unread count) and a full Notifications page with active/archived views, search and filtering by type, and configurable channel preferences. Supported notification categories span the full product surface: experiment and eval completions/failures, dataset validation issues, regression detection, schedule pauses, auto-promotions, team invites and joins, low credit balance, billing changes, prompt shares, and the full review lifecycle (requested, commented, approved, changes requested, merged).

Public Sharing

Creators can publish prompts publicly through a dedicated shared prompt page that exposes the title, description, body, system prompt, variable definitions, author attribution (optional), view and import counts, and an import flow for bringing the prompt into any PromptVault workspace.

Onboarding

New users receive a guided onboarding experience with a welcome modal, main product tour, editor tour, milestone tracking, and page-level tooltips. Milestones cover first prompt, first run, first variables, first comparison, first dataset, first rubric, first eval, first saved version, first promotion, and first team member.

Team Administration

Team Settings is the workspace administration center. It covers team profile management (name, slug, member and invitation counts), lifecycle governance (promotion approval modes, pending promotion queues, release gate configuration, auto-promotion settings), billing and pooled credits (team credit pool, subscription state, seat management, billing portal, top-ups, billing contacts, per-member credit usage, team credit ledger), shared API key management, cost controls (team budgets applied before prompt runs, evals, and scheduled regressions), and member management (invite by email as Admin or Member, revoke invitations, change roles, remove members, transfer ownership, leave with ownership transfer logic).


Technical Architecture

Monorepo Structure

The codebase is organized as a TypeScript monorepo with four npm workspace packages:

  • client — React 19 SPA built with Vite. Uses Wouter for routing, TanStack React Query for async state, Better Auth React client for session-aware auth, and Zod-backed response validation through the shared schema layer. Lazy-loads major pages to reduce initial bundle size.
  • server — Express 5 API with Better Auth, Redis integration, email delivery, OpenRouter execution, Stripe billing, and all background schedulers. Feature routers are mounted by domain. The server runs as a single process that handles both API requests and background jobs.
  • shared — The contract-first layer. Exports Zod schemas for every domain (prompts, versions, evals, teams, billing, analytics, etc.) consumed by both client and server. This makes the API contract executable at runtime, not just descriptive at compile time.
  • db — Drizzle ORM schema definitions, connection management (lazy singleton pool), migrations (run on startup), and database helpers. Architecture tests enforce that this package never imports from server or client.

Frontend Architecture

The SPA bootstraps with QueryClientProvider, ToastProvider, ErrorBoundary, and ReauthProvider for global data caching, toast feedback, runtime error containment, and reauthentication flows. API access routes through a centralized apiRequest() function that handles endpoint prefixing, credentials, JSON defaults, standard error envelope parsing, and Zod schema validation. Invalid response shapes throw a non-retriable ApiValidationError to surface schema drift rather than hide it. React Query uses a 15-second stale time, no window-focus refetch, single retry (except for schema errors), and no mutation retry.
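
The validated-fetch pattern described above can be sketched as follows. The real client validates with Zod schemas from the shared package; here a plain validator function stands in so the sketch stays dependency-free, and all names are illustrative:

```typescript
// Sketch of a centralized API helper that validates response shapes
// and surfaces schema drift as a distinct, non-retriable error.
class ApiValidationError extends Error {}

type Validator<T> = (data: unknown) => T; // throws on shape mismatch

type FetchLike = (url: string, init?: { credentials?: string }) => Promise<{
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}>;

function validateResponse<T>(body: unknown, validate: Validator<T>): T {
  try {
    return validate(body);
  } catch {
    // Surface schema drift rather than hide it.
    throw new ApiValidationError("response failed schema validation");
  }
}

async function apiRequest<T>(
  path: string,
  validate: Validator<T>,
  fetchImpl: FetchLike,
): Promise<T> {
  const res = await fetchImpl(`/api${path}`, { credentials: "include" });
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  return validateResponse(await res.json(), validate);
}
```

A React Query retry policy can then treat ApiValidationError as non-retriable, since a malformed response will not fix itself on retry.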

The protected application shell provides sidebar navigation, workspace switching, quick actions, recent prompts, notification bell, onboarding widgets, banners, credit balance, and a lazy-loaded command palette with keyboard shortcuts.

Backend Architecture

On startup, the server validates environment variables with Zod, connects Redis, runs database migrations, seeds launch models and prompt templates, then starts background schedulers for account deletion, notification cleanup, scheduled eval polling, daily analytics aggregation, daily budget anomaly detection, and periodic model catalog synchronization.

Every API request passes through sessionBootstrap to resolve the current Better Auth session. A global rate limiter applies unless disabled for E2E testing. Route handlers validate params, query strings, and bodies with Zod-backed middleware. Team-aware authorization is enforced through membership lookups and role checks. Sensitive actions require recent authentication, and the auth system clears recent-auth state on session revocation or credential changes.

Authentication & Security

Authentication is built on Better Auth with a Drizzle adapter and Redis-backed secondary storage. Sessions use a 30-day max age with short cookie caching. Redis-backed login lockout tracks failed sign-ins and returns 429 with Retry-After. Auth routes have tighter rate limits than the general API. The server can block signup when verification is required but email delivery is unavailable. Request IDs, structured error envelopes, and runtime validation make incidents traceable.

Data Layer

PostgreSQL is the system of record with Drizzle ORM for schema definitions, query building, and migrations. The database pool is a lazy singleton with bounded pool size, timeouts, and keepalive settings. Most business objects support dual ownership (personal via userId, team via teamId). History-heavy domains (versions, runs, experiments, schedule history) are modeled as append-oriented records. Schema domains span auth, teams, folders, prompts, versions, blocks, comparisons, playground, billing, model catalog, evaluations, judges, datasets, rubrics, scenarios, experiments, notifications, automation, analytics, intelligence, reviews, templates, and conversations.

Model Execution

The execution layer resolves each request into one of three modes: platform, byo_user, or byo_team. Platform runs use the platform OpenRouter key and charge credits. BYO runs decrypt stored keys and skip credit charging. The service estimates tokens and cost, constructs model messages, executes the request, optionally validates output against a configured schema, and persists the run record. Streaming uses SSE with per-user concurrency limits and timeout handling for overall duration and idle stalls. Execution errors are normalized into product-level errors (invalid BYO key, no balance, rate limiting, model unavailability, upstream timeout).

Evaluation & Scheduling

The eval API supports direct eval runs, conversation evals, and experiments. Eval progress streams via SSE for live status updates. Scheduled runs are modeled in the database with cadence fields, thresholds, run history, and alert channels. The schedule service validates workspace compatibility between prompt, dataset, and rubric before creation. The runtime polls for due schedules, starts eval runs automatically, records history, and integrates with notification and email channels. Gate logic connects scheduled evaluations to the broader quality-governance workflow.

External Integrations

  • OpenRouter — Primary upstream for model execution and catalog synchronization. Error mapping is productized so UI flows react to upstream states without exposing raw provider semantics.
  • Redis — Used for Better Auth secondary storage, rate limiting, login lockout, streaming concurrency, and job/scheduler coordination. Treated as operational state, not source of truth.
  • Stripe — Optional billing integration for subscriptions, top-ups, and webhooks. Mounted with raw-body handling before JSON parsing for webhook signature verification.
  • Email (Resend/SMTP) — Verification emails and schedule alerting through a shared sender abstraction that is environment-driven rather than transport-specific.

Environment & Development

Local infrastructure runs via Docker Compose with PostgreSQL 16 (port 5434) and Redis 7 (port 6381). Development uses the promptvault_dev database and Redis logical DB 0; tests use the promptvault_test database and Redis logical DB 15 on the same infrastructure. Isolation is database-name-driven, not port-driven. The server derives capability flags (stripeEnabled, emailEnabled, emailVerificationMode, openrouterConfigured) from validated environment state so behavior differences across environments are centralized rather than scattered through feature code.
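
Capability-flag derivation might look like the following sketch. The environment variable names and the verification-mode rule are assumptions, not the actual env contract:

```typescript
// Illustrative derivation of capability flags from (already validated)
// environment state, so feature code checks flags instead of env vars.
interface Env {
  STRIPE_SECRET_KEY?: string;
  RESEND_API_KEY?: string;
  SMTP_HOST?: string;
  OPENROUTER_API_KEY?: string;
}

function deriveCapabilities(env: Env) {
  const emailEnabled = Boolean(env.RESEND_API_KEY || env.SMTP_HOST);
  return {
    stripeEnabled: Boolean(env.STRIPE_SECRET_KEY),
    emailEnabled,
    // Assumed rule: require verification only when email can be sent.
    emailVerificationMode: emailEnabled ? "required" : "disabled",
    openrouterConfigured: Boolean(env.OPENROUTER_API_KEY),
  };
}
```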

Testing Strategy

Server tests run with Vitest under the test environment. Browser E2E uses Playwright with serial execution (shared infrastructure requires it). Playwright global setup creates a fresh auth user and team, saves auth state, and global teardown flushes Redis DB 15. Tests monitor console errors, page exceptions, error boundaries, toast failures, inline query errors, and unexpected ApiValidationError signals. Architecture tests enforce package boundaries (db cannot import server/client, shared cannot import server/client), database env access isolation, frontend accessibility patterns, and safe duck-typing guards.

Current Verified State

  • 658 server tests passing
  • 34 Playwright E2E specs passing
  • Zero "as any" casts in application code
  • Client entry chunk reduced by approximately 62% from baseline

PromptVault is developed by RJHC Development LLC.
