Challenger v3 — Self-Improving Credit Compliance Agent

Implementation spec for a document-trained, gap-aware, version-controlled credit decisioning platform. Grounds the existing v2 debate architecture in hierarchical regulatory knowledge (EU → BG → practice) with a closed-loop human feedback system and a regulator-ready audit trail.

Version: v3.0 (implementation) · Owner: Elvin + partner · Target: Local dev → AWS prod · Based on: v2 factor-based debate

0 · What's changed since the first draft

The first draft of this spec described Phase 0 (knowledge layer) and Phase 1 (citations, gaps, maker/checker, manifest) as the MVP. Both shipped. Since then we've added two operator-facing features and tightened the agent runtime. Read this first — the rest of the spec still describes the original target architecture, with the additions noted in their respective sections.

Citation breadcrumbs (P0.2 from the YC application action plan)

Citation chips on the analyst page used to render f7f7d0f0 · 87% — the first eight chars of the chunk UUID and the relevance score. Useless to a human; actively confusing in a video demo. Now they render EBA-GL-2020-06 · §5.2 · 87%.

Where the metadata comes from. The retriever already returns each chunk's sourceId, section, and breadcrumb on the RetrievedPassage shape. The agent's structured JSON output only carries chunkId (plus optional relevance + quote) — that's all the agent needs to choose a citation. The orchestrator does the last-mile enrichment: enrichCitations() walks the agent's citation array, looks each chunk id up in the passagesForFactor set that was retrieved for that factor, and stitches in sourceId, section, breadcrumb before BOTH the SSE emit and the persisted factorDebates.turns entry. Live viewers and refreshed-replay viewers see the same breadcrumbs.
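
A minimal sketch of that enrichment step, assuming the shapes described above (helper and field names illustrative, not verbatim from the codebase):

// Sketch only — types simplified from the RetrievedPassage / citation shapes described above.
type AgentCitation = { chunkId: string; relevance?: number; quote?: string };
type RetrievedPassage = { chunkId: string; sourceId: string; section?: string; breadcrumb?: string };
type EnrichedCitation = AgentCitation & Partial<Pick<RetrievedPassage, "sourceId" | "section" | "breadcrumb">>;

function enrichCitations(
  citations: AgentCitation[],
  passagesForFactor: RetrievedPassage[],
): EnrichedCitation[] {
  const byId = new Map<string, RetrievedPassage>(passagesForFactor.map((p) => [p.chunkId, p]));
  return citations.map((c) => {
    const passage = byId.get(c.chunkId);
    // Hallucinated chunk id: no match, so the citation passes through with only chunkId.
    if (!passage) return { ...c };
    const { sourceId, section, breadcrumb } = passage;
    return { ...c, sourceId, section, breadcrumb };
  });
}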

Hallucination guard. If the agent invents a chunk id not in the retrieved set, the lookup misses and the citation passes through with only the raw chunkId. The frontend's CitationChip falls back to chunkId.slice(0, 8) in that case AND switches the chip tone to badge-warning so the operator can see at a glance that the citation isn't grounded. A tooltip explains the fallback.

Display fallback chain. Each chip picks the most-specific label it has:

  1. sourceId + section → EBA-GL-2020-06 · §5.2 (preferred)
  2. sourceId only → EBA-GL-2020-06 (chunker didn't capture section)
  3. breadcrumb only → Chapter 5 > Article 5.2 (older chunks)
  4. chunkId.slice(0, 8) → fallback for unenriched / hallucinated chunks; chip switches to warning tone
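
The chain above, sketched as the chip's label selection (component and field names assumed):

// Sketch of the CitationChip label fallback; "warning" flips the chip to the badge-warning tone.
function citationLabel(c: { sourceId?: string; section?: string; breadcrumb?: string; chunkId: string }) {
  if (c.sourceId && c.section) return { text: `${c.sourceId} · ${c.section}`, warning: false };
  if (c.sourceId) return { text: c.sourceId, warning: false };
  if (c.breadcrumb) return { text: c.breadcrumb, warning: false };
  return { text: c.chunkId.slice(0, 8), warning: true }; // unenriched or hallucinated chunk id
}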

Click-through behaviour is unchanged: the chip still opens the existing CitationDialog with the source passage, agent quote, relevance score, and a deep link to the source PDF / URL when present.

Side-by-side debate cards (P0.1 from the YC application action plan)

Each FactorBlock now renders the agents' debate as a proper visual artifact — not a list of italic prose with + / − prefixes. The chronological turn list is partitioned by role (partitionFactorTurns()) into approver arguments, rejector arguments, judge verdict, guidance, and per-agent clarifications. The two argument cards then sit in a grid grid-cols-1 md:grid-cols-2 gap-4 with the judge verdict full-width below.
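
A minimal sketch of that partition, assuming the turn shape used elsewhere in this spec (speaker + optional clarificationKind):

// Sketch only; the real component likely carries more fields per turn.
type Turn = {
  speaker: "approver" | "rejector" | "judge" | "system" | "human";
  clarificationKind?: "request" | "response" | "guidance";
};

function partitionFactorTurns<T extends Turn>(turns: T[]) {
  return {
    guidance: turns.filter((t) => t.clarificationKind === "guidance"),
    approver: turns.filter((t) => t.speaker === "approver"),
    rejector: turns.filter((t) => t.speaker === "rejector"),
    judge: turns.find((t) => t.speaker === "judge"),
    clarifications: turns.filter(
      (t) => (t.speaker === "system" || t.speaker === "human") && t.clarificationKind !== "guidance",
    ),
  };
}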

AgentArgumentCard is the hero. Approver gets a success-tinted left border (border-l-4 border-success/70 bg-success/5) and Rejector gets the error tone — colour alone tells the operator who's arguing what at a glance. The header carries an SVG ConfidenceRing (44px, stroke-dasharray driven from the agent's self-reported confidence; the ring colour matches the agent's tone). The body is the argument prose at text-sm leading-relaxed — readable, not crammed. Footer surfaces the three-state grounding badge and citation chips on a divider row, so the regulatory grounding story is part of every argument the operator scans.

AgentArgumentSkeleton covers the initial-stream case: the rejector's column shows a skeleton card with a pulsing "thinking…" cue while the approver lands first, so the layout doesn't reflow when the second argument arrives. The skeleton is gated by the SSE phase event — only pulses when the orchestrator is actually mid-call on that agent.

JudgeVerdictCard renders below the two columns once the per-factor judge has picked. Verdict tone (positive / negative / neutral) drives the wrapper colour so the page reads like a sequence of debates with clear outcomes. Judge runs after both arguments land, so this card never shows a skeleton — the slot stays empty until the verdict exists.

Multi-round clarifications. If an agent ran more than once (initial low-confidence answer → operator clarification → refined answer), only the LATEST argument is the hero card. A small round X of Y chip in the card header signals the argument was refined. The system / human clarification turns sit chronologically below the card in the same column, keeping the mini-thread under its parent agent rather than floating between the two debate sides.

Timeline → sticky sidebar · solid badges

Timeline placement. The DebateTimeline moved from a top banner above the factor stack into a sticky right sidebar (w-72, lg:sticky lg:top-20). The analyst page is now a two-column flex (lg:flex-row lg:items-start): main column with factor cards / final decision / panels, and an aside that stays pinned to the topbar as the operator scrolls. On screens narrower than lg the layout collapses and the timeline drops above the main content via flex-col-reverse. Outer container widened from max-w-3xl to max-w-6xl to give the sidebar room without squeezing the factor cards.

Solid badges. Every badge-soft modifier across the app — Debates, Clarifications, Evals, Corpus, the analyst page, the timeline, the inbox — was stripped. The plain badge badge-{semantic} classes now render as filled coloured pills with the theme's --color-*-content as text (light text on the coloured background; warning is the deliberate exception because corporate's warning-content is dark text on yellow). The previous tinted look was inverted from the corporate theme's intent.

Debate timeline · analyst chrome aligned to corporate theme

New DebateTimeline component renders a vertical daisyUI timeline (timeline timeline-vertical timeline-compact timeline-snap-icon) above the factor stack on both the live and replay analyst views. One row per factor plus a final "Aggregation" step. Each step has three states: pending (hollow base-300 dot), active (filled primary dot + animated loading-dots sub-label like "Approver thinking…"), or done (filled primary check icon). The connector segment between two done steps colours primary; otherwise base-300. State derives directly from the existing SSE phase event + the per-factor verdict the SSE hook already tracks — no new backend events.
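
A small sketch of that state derivation, assuming the phase shape the SSE hook tracks (names illustrative):

// No new backend events: a step is done once its verdict exists, active while the phase points at it, pending otherwise.
type Phase = { kind: "retrieving" | "thinking" | "aggregating" | "idle"; factor?: string; speaker?: string };

function timelineStepState(
  factor: string,
  phase: Phase,
  verdicts: Record<string, "positive" | "negative" | "neutral" | undefined>,
): "pending" | "active" | "done" {
  if (verdicts[factor]) return "done";
  if (phase.factor === factor && phase.kind !== "idle") return "active";
  return "pending";
}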

Theme alignment sweep across the analyst chrome. The Re-run button moved from a hand-rolled bordered pill to a proper btn btn-outline btn-sm with an iconify rotate icon and inline spinner — picks up the corporate theme's square corners. Every text-gray-N, bg-emerald-50, border-blue-200, bg-violet-50, etc., on the analyst page and its companion components (FactorDebateAccordion, ReviewPanel, FinalDecisionCard, StreamingProgressBar, ClarificationInlineForm, CitationDialog, GapsPanel, ManifestPanel, AgentCard) was bulk-converted to daisyUI semantic tokens (text-base-content, bg-success/10, border-info/30, bg-secondary/10). Stray rounded-md / rounded-lg / rounded-full utilities stripped — the theme's --radius-*: 0rem tokens are now load-bearing across the whole debate view. "High Risk" and "Needs review" pills swapped to badge badge-error badge-soft / badge badge-warning badge-soft.

Case inbox replaces the JSON-form launcher

/decision/new used to be a single JSON editor with six preset buttons up top. It is now an operator-shaped inbox: a daisyUI table where each row is a credit case with applicant signals at a glance (income, credit score, DTI%, loan amount, employment + purpose snippet) and per-row Run / Edit / Delete actions. Last-run status renders as a click-through link to the analyst replay so the operator can re-open whatever a case produced last.

The case data layer (src/lib/cases.ts) seeds the six original presets as immutable rows and stores any operator-created drafts in localStorage. A second last-run map (also localStorage) keeps the per-case session id + timestamp so the inbox never has to call the backend to render its status column.

The JSON editor moved to a secondary route at /decision/new/case/[id] — three modes: "new" creates a blank custom case, preset-… renders read-only with a "Save as new case" button (presets stay pristine), and custom-… is fully editable with Save / Run / Delete. Structured form fields up top, raw JSON editor collapsed below for paste-from-outside flows. Sidebar label updated to "Cases" (lucide-inbox icon). Backend untouched — Run still POSTs to /decision exactly as before.

Cancel debate · live activity indicator · corporate theme

Three control + visibility upgrades shipped together.

Cancel debate. New POST /decision/:id/cancel endpoint sets a per-session flag in an in-process Set; the orchestrator checks the flag at safe checkpoints (between factors, before each agent call, before final aggregation) and throws a CancelledError. The outer catch identifies it via the brand and takes a different path from real failures — no rule-based fallback, just emit a 'cancelled' SSE event, mark the session CANCELLED (new SessionStatus variant), drop any pending clarifications so an awaitClarification() currently blocking resolves immediately, write the manifest, exit. Frontend Cancel button lives in the analyst topbar and unmounts the moment the SSE acknowledgement lands.
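
A minimal sketch of the cancellation primitive described above (names assumed from the description, not copied from the codebase):

// In-process flag store plus a branded error the outer catch can distinguish from real failures.
const cancelledSessions = new Set<string>();

class CancelledError extends Error {
  readonly isCancelled = true; // the "brand" the outer catch checks
}

// Body of the POST /decision/:id/cancel handler, simplified.
function requestCancel(sessionId: string) {
  cancelledSessions.add(sessionId);
}

// Called by the orchestrator at each safe checkpoint:
// between factors, before each agent call, before final aggregation.
function throwIfCancelled(sessionId: string) {
  if (cancelledSessions.has(sessionId)) {
    throw new CancelledError("debate cancelled by operator");
  }
}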

Live activity indicator. The StreamingProgressBar is gone. Replaced by a daisyUI loading-dots chip in the analyst page header ("Approver thinking on Income vs Loan Amount…") plus an inline placeholder turn that mounts inside the FactorBlock matching the currently-active factor. Both share state via a new 'phase' SSE event the orchestrator emits at every transition (retrieving / thinking / aggregating / idle). Stateless — every emission overwrites the prior value; the SSE hook's phase field drives both views off one source of truth so they never desync.

DaisyUI corporate theme. The two custom themes (light + dark) were replaced by daisyUI's stock corporate theme — neutral whites/grays, confident blue primary, square corners, no shadows. Light-only for now; dark mode is a follow-up if a customer asks. The legacy ThemeToggle remains in the UI but is currently a no-op since only one theme is registered (cycling to "dark" falls back to the default). Removing the toggle is a small polish task.

UI overhaul — DaisyUI 5 / Tailwind 4 / Scalo template foundation

The frontend has been rebuilt on top of a purchased Scalo daisyUI/Tailwind 4 template (Denish Navadiya). Every page now renders inside a consistent shell: a sidebar + topbar AppShell for the authenticated views, a sticky-blur LandingTopbar for the public site, daisyUI tokens for every colour/badge/button/alert. New ConfigContext + ThemeToggle drive light / dark / system theme via the data-theme attribute on <html>; daisyUI reads it and swaps colour tokens.

Stack changes: Tailwind 3 → 4 (CSS-based @theme config, no JS config needed), added daisyUI 5, @iconify/tailwind4 with the Lucide icon set, tailwindcss-motion, simplebar-react, swiper. PostCSS plugin swapped to @tailwindcss/postcss. Next.js stays at 14, React stays at 18 — the template's stack is forward-compatible and the upgrades aren't required.

New routes: / is now the public landing (Hero + Process + Features + Capabilities + Pricing + CTA); the demo decision form moved to /decision/new. Existing routes (/admin, /admin/login, /analyst/decision/:id, /decision/:id) are wrapped in the new layout but keep the same logic, hooks, and SSE flows — no functional regressions. Inline-clarification flow, citation chips, eval dashboard, maker/checker review, version manifest viewer all still work; only chrome changed.

Embedding provider switched to Voyage AI

Anthropic does not ship an embedding API of its own and recommends Voyage. We default to voyage-3-lite, 512-dim, with a 200M-tokens/month free tier — comfortably more than enough for the entire current corpus. The knowledge_chunks.embedding column is VECTOR(512) after migration 012_voyage_embeddings.sql. Adapter is a 30-line REST shim in backend/src/knowledge/embedder.ts — no SDK dependency.
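
A sketch of what that shim looks like; the request/response field names are assumed from Voyage's public REST API and should be verified against their docs before relying on them:

// Minimal fetch-based embedder (no SDK), in the spirit of backend/src/knowledge/embedder.ts.
const VOYAGE_URL = "https://api.voyageai.com/v1/embeddings";

export async function embed(texts: string[], model = "voyage-3-lite"): Promise<number[][]> {
  const res = await fetch(VOYAGE_URL, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.VOYAGE_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ input: texts, model }),
  });
  if (!res.ok) throw new Error(`Voyage embeddings failed: ${res.status} ${await res.text()}`);
  const body = (await res.json()) as { data: Array<{ embedding: number[] }> };
  return body.data.map((d) => d.embedding); // 512-dim vectors for voyage-3-lite
}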

Three-state grounding badge replaces the binary "known"

The original Phase 1 spec described a green Grounded / amber No grounding binary. Real debates exposed a third state: the retriever DID return passages but the agent declined to cite any. Conflating "no corpus material existed" with "agent chose not to lean on it" was misleading. The badge is now:

  • ✓ Grounded (green) — agent cited at least one chunk
  • ⚠ Declined (amber) — retrieval returned passages, agent didn't cite
  • ✕ No retrieval (red) — corpus had nothing for this factor; also fires a knowledge gap

Backend emits retrievalCount on every factor_turn event so the UI can compute the three states without an extra round-trip.
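
The resulting badge computation is trivial once retrievalCount rides on the event; a sketch with assumed names:

// Three-state grounding badge: grounded beats declined beats no_retrieval.
type Grounding = "grounded" | "declined" | "no_retrieval";

function groundingState(citedChunkIds: string[], retrievalCount: number): Grounding {
  if (citedChunkIds.length > 0) return "grounded"; // ✓ agent cited at least one chunk
  if (retrievalCount > 0) return "declined";       // ⚠ passages existed, agent cited none
  return "no_retrieval";                            // ✕ corpus had nothing; a knowledge gap also fires
}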

Clarification flow — agents can ask for human help

New in this iteration. When an Approver or Rejector returns confidence < 0.65 AND populates clarification_request: { question }, the orchestrator pauses the debate, emits a clarification_request SSE event, and awaits a human answer (or skip / 5-min timeout / cap). The answer is appended to the prompt and the agent re-runs.

Cap is configurable (default 2 rounds per agent per factor). Both the question and the answer are persisted as system / human turns in the factor's timeline, alongside the agent arguments — full audit trail. See section 5b (Clarification Flow) below.

Clarification UX: inline form replaces the modal

The pause prompt no longer pops up a backdrop dialog. The system question already lands in the timeline as a speaker: 'system' turn; the answer form now mounts right under it as part of that same blue mini-thread. Same submit/skip behaviour, same 409-on-late-answer handling, same SSE-driven unmount when clarification_resolved fires — but the operator's eyes never leave the debate. New ClarificationInlineForm component; the old ClarificationModal is removed.

Operator console (admin panel) at /admin

New in this iteration. Lives behind a shared bcrypt-hashed password + signed cookie. Two tabs:

  • Corpus — table of ingested bundles (source, tier, jurisdiction, version, chunks, ingest time), inline upload form (file + tier dropdown with explainer + jurisdiction + source-id + optional metadata), inline-confirm delete that cascades to chunks via FK.
  • Settings — runtime-tunable knobs (retrieval threshold, top-K, jurisdiction, retrieval on/off, clarification threshold, max clarifications per factor, early-exit threshold). Each row has a type-aware widget (toggle / slider / dropdown) and a plain-English description. Saves persist to a new runtime_settings table — no redeploy needed.

See section 6b (Operator Console) below.

Runtime settings table — config without redeploy

New runtime_settings table (migration 013) stores tunable values keyed by the same names as the env vars they replace. The retriever and orchestrator read through a 60s in-memory cache; writes invalidate immediately on the writer's machine and propagate via TTL to others. Static config.ts values become fallbacks for unset rows. This is what the Settings tab edits.
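
A sketch of the read-through cache, with the DB access abstracted behind a callback (the real module presumably uses the backend's own client):

// 60s TTL cache over runtime_settings; unset rows fall back to the static config value.
type QueryValue = (key: string) => Promise<string | undefined>;

const TTL_MS = 60_000;
const cache = new Map<string, { value: string; expires: number }>();

export async function getSetting(key: string, fallback: string, queryValue: QueryValue): Promise<string> {
  const hit = cache.get(key);
  if (hit && hit.expires > Date.now()) return hit.value;
  const value = (await queryValue(key)) ?? fallback;
  cache.set(key, { value, expires: Date.now() + TTL_MS });
  return value;
}

export function invalidateSetting(key: string) {
  cache.delete(key); // immediate on the writer's machine; other machines catch up via TTL
}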

Clickable citation deep-dive

Every citation chip in the analyst view is now a button. Clicking it opens a modal with the full retrieved passage, source ID (linked to the original PDF if a source_url was provided at ingest), tier badge, jurisdiction, version, breadcrumb, and the agent's quoted snippet for comparison. Backed by GET /knowledge/chunks/:id.

Robustness: graceful degradation + JSON sanitiser

LLMs occasionally produce malformed structured output — overshoot the quote length cap, wrap the JSON in ```json fences, jam two quotes together with " and ". The orchestrator now wraps each agent call in a per-factor try/catch: if parse still fails, that one speaker degrades to a stub with known=false, confidence=0 and the debate continues. The sanitiser strips fences and collapses the multi-quote pattern before Zod sees it.
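
A sketch of that sanitiser pass; the regexes are illustrative rather than the exact patterns in the codebase:

// Runs on the raw model output before Zod parsing.
export function sanitiseAgentJson(raw: string): string {
  let s = raw.trim();
  // Strip ```json ... ``` fences the model sometimes wraps around the payload.
  s = s.replace(/^```(?:json)?\s*/i, "").replace(/\s*```$/, "");
  // Collapse the jammed multi-quote pattern ("first quote" and "second quote") into one string.
  s = s.replace(/"\s+and\s+"/g, " … ");
  return s;
}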

Deployment: Fly.io + Neon (replaces the AWS plan for early testing)

The original spec described an AWS deployment topology. For partner-testing we shipped on Fly.io instead — two apps (frontend, backend) + Neon Postgres for pgvector. ~30 minutes to deploy, ~$0/mo on hobby tiers. Dockerfiles and fly.backend.toml / fly.frontend.toml live at the monorepo root. DEPLOY.md contains the full runbook. AWS path remains the eventual target; Fly is the proving ground.

Five factors, configurable early-exit

The factor list is the same five from v2: Income vs Loan Amount, Credit Score, Existing Debt / DTI, Missed Payments, Employment Stability. The orchestrator's "skip remaining factors after N negative verdicts" optimisation is now keyed off the EARLY_EXIT_NEGATIVE_THRESHOLD runtime setting (default 3). Set it to a value ≥ 5 in the admin panel to force a full sweep — useful for analyst review at the cost of double the LLM tokens on rejections.

Clarification learning loop — every Q&A is now persisted

Every clarification interaction is recorded in a new clarification_events table (migration 015): question, answer, status (answered / skipped / timeout / capped), confidence_before, confidence_after backfilled when the agent re-runs, computed confidence_delta as a generated column, and the wall-time the operator took to respond. Reserved analyst_feedback column is stubbed for the next phase.

A new admin Clarifications tab surfaces these events with per-factor / per-status filters and four roll-up stats (total, answered %, skipped+capped, average confidence Δ). Read-only for now; analyst feedback marks and similarity-based prompt injection (the actual "learning") are the next two steps. The data we collect now is what makes those work — building the substrate before the smarts.

Calibration: prompts rebalanced to fix the rejection bias

The first Bondora eval (20 rows) revealed the system rejected 90% of applicants and never produced an APPROVE via the LLM debate — both approvals in the run came from the rule-based fallback. Defaulter recall was 92% (great); good-loan recall was 14% (commercially non-viable). Diagnosed three structural causes in the prompts:

  • Asymmetric framing. The Rejector's prompt told it that regulator text was its natural ammunition ("prudential standards that can support a risk-rejection argument — find them"). The Approver's said "acknowledge the risk while arguing the borrower clears the bar" — defensive from the start. Rebalanced: the Approver is now told to find passages establishing the bar and argue the applicant CLEARS it; the Rejector is now told NOT to manufacture risks from generic regulator language and to use neutral / weak-negative when the applicant genuinely clears the bar on a factor.
  • Bare factor-judge prompt. Previously: "evaluate the two arguments." That's it. Now teaches the judge how to weigh concrete vs generic claims (concrete with numbers/citations beats generic), and when 'neutral' is the right verdict (genuinely balanced; mixed application data; neither side made a quantified case). Neutrals are now first-class outcomes, not fence-sits.
  • Bare final-judge prompt. Previously it said only "be consistent with the majority signal", with no rule for neutrals. Now uses an explicit count rule (sketched after this list): 3+ positives AND 0–1 negatives → APPROVE; 3+ negatives AND 0–1 positives → REJECT; everything else → REVIEW. Neutrals don't count as negatives. Default-to-REJECT bias is explicitly called out as a failure mode; REVIEW is the safe action under uncertainty.
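
The count rule expressed as code for clarity (the actual rule lives in the final-judge prompt, not in code; verdict labels taken from the judge schema in §4.3):

function finalDecision(verdicts: Array<"positive" | "negative" | "neutral">): "APPROVE" | "REJECT" | "REVIEW" {
  const positives = verdicts.filter((v) => v === "positive").length;
  const negatives = verdicts.filter((v) => v === "negative").length; // neutrals count for neither side
  if (positives >= 3 && negatives <= 1) return "APPROVE";
  if (negatives >= 3 && positives <= 1) return "REJECT";
  return "REVIEW"; // the safe action under uncertainty
}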

Operational change to make alongside the prompt rebalance: set EARLY_EXIT_NEGATIVE_THRESHOLD to 99 (or any value ≥ FACTORS.length) in the admin Settings tab. The default (3) was bailing out before the positive-leaning factors at the END of the list (Missed Payments, Employment Stability) ever got a vote. Cost: doubles LLM tokens on rejections. Benefit: positive signals get heard.

Re-run the Bondora eval after deploying these changes. Expectation: the LLM should now produce some APPROVE decisions on rows where credit + DTI + clean history clearly support it (e.g. row 11 from the first run — credit 720, DTI 9.9%, employed, 0 missed).

Manifest is now recorded on the fallback path too

Bug fix surfaced via re-runs. Previously, the version manifest was only written in the orchestrator's success path. If the debate threw — for any reason — the rule-based fallback engine produced a decision but no decision_version_manifest row got written, leaving an audit-trail hole for exactly the cases that need it most.

The settings snapshot (prompt-set hash, guardrail-set hash, retrieval params, bundle IDs touched) is now hoisted out of the inner try block and a single writeManifest() helper is called from BOTH the success path AND the fallback path. Best-effort either way — a manifest-write failure logs but never affects the decision delivered to the user.

Guidance is now visible to the analyst (not just to the agents)

First version of the guidance-injection feature put the prior Q&As into the prompt invisibly — the operator couldn't tell whether the system had used institutional memory or just got lucky. Now the orchestrator emits a new factor_turn with clarificationKind: 'guidance' + a guidanceItems array right before the approver runs. The analyst view renders it as a violet "📚 Prior operator guidance · N items" row — collapsed by default, click to expand the actual Q&A pairs the agents had access to. Persisted into factorDebates.turns so refresh / replay shows the same context.

Cross-session guidance injection — agents now have institutional memory

First piece of the clarification learning loop's "actual learning" phase. Before each factor runs, the orchestrator queries clarification_events for recent ANSWERED Q&As on the same factor (default lookback 30 days, top 5 by recency, filtered to rows where the answer measurably moved confidence). These are rendered into a new HISTORICAL OPERATOR GUIDANCE block in both Approver and Rejector prompts. The agents are instructed to treat the operator's standing answer as institutional policy and stop re-asking what's already been resolved.
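
A sketch of how the retrieved rows become the prompt block, assuming the clarification_events columns from migrations 015/017 and newest-first ordering:

// Builds the HISTORICAL OPERATOR GUIDANCE block, or null when nothing useful exists.
type GuidanceRow = { question_text: string; answer_text: string; confidence_delta: number };

function buildGuidanceBlock(rows: GuidanceRow[], maxItems: number): string | null {
  const useful = rows
    .filter((r) => r.confidence_delta > 0) // only answers that measurably moved confidence
    .slice(0, maxItems);
  if (useful.length === 0) return null;
  const lines = useful.map((r, i) => `${i + 1}. Q: ${r.question_text}\n   A (operator): ${r.answer_text}`);
  return `HISTORICAL OPERATOR GUIDANCE (treat as institutional policy; do not re-ask):\n${lines.join("\n")}`;
}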

Tunable via three new runtime settings (migration 017): GUIDANCE_INJECTION_ENABLED (default on), GUIDANCE_LOOKBACK_DAYS (1–365, default 30), GUIDANCE_MAX_ITEMS_PER_FACTOR (default 5). Set the first to false for stateless A/B-test runs against the original behaviour.

Where this fits in the wider learning roadmap: this is step 1 of 4 (recent guidance injection). Steps 2–4 — analyst feedback marks, question quality scoring, similarity retrieval — wait for more labelled data to be worth building.

Re-run button: disabled while the source debate is still running

Small UX fix. The RerunDebateButton component now accepts a sourceRunning prop fed from the SSE hook — true while there's no final_decision and no error — and renders disabled with a tooltip until the original debate settles. Prevents a class of double-debate confusion on the same applicant.

Re-run Debate now actually works

Migration 016 adds application_payload JSONB to debate_sessions. The POST /decision handler now persists the full application JSON on the session row at creation time. New POST /decisions/:id/rerun route reads it back, claims a fresh session (deliberately skipping idempotency so each rerun is its own row), and launches a new debate on the same applicant. The analyst page's "Re-run Debate" button — previously a permanently disabled stub across four hard-coded copies — is now a single RerunDebateButton component wired to the new endpoint, with a graceful 409 path for sessions that predate the migration.

Useful beyond the button: per-decision A/B testing (change a setting, re-run, compare), and the eval harness can now self-replay against new prompts without touching the source dataset.

Debates history tab — every session, replayable

New Debates tab in the admin console lists every debate_sessions row newest-first, joined with the four-eyes review status, with filters by session status and final decision. Click-through opens the analyst route in replay mode (/analyst/decision/:id?replay=true) which renders the full timeline from the persisted judge_output — including the system/human clarification turns we now save inline.

Bondora ground-truth eval — first measurable accuracy numbers

New npm run eval:bondora CLI reads the public Bondora P2P loan-data CSV, maps each historical loan to a CreditApplication, runs the orchestrator, and scores the agent's decision against the realised outcome (Repaid vs Default). Outputs a confusion-matrix scoreboard with accuracy, false-approval rate, expected loss per €1000 lent, and average debate latency. First time the system has been measured against ground truth instead of vibes. See section 5.4 for the metrics list and a data-leakage caveat.

Eval dashboard — admin-driven, cursor-based, replayable

The eval CLI has graduated to an admin-panel feature. New Evals tab at /admin with three blocks: a dataset list (CSV uploads with a per-dataset cursor showing X/Y rows processed), a run launcher (pick dataset + row limit, fire), and a runs history. Click any run to see its confusion matrix and every per-row result, with a one-click link into the live debate replay for the underlying session. New tables: eval_datasets, eval_runs, eval_run_rows (migration 018).

Cursor-based, no duplicates. Each run consumes the next N mappable rows past dataset.processed_count, advances the cursor at the end, and records the slice (cursor_start → cursor_start + row_limit) on the run row. Hitting "Start" five times in a row scores rows 0-19, 20-39, 40-59, 60-79, 80-99 — never the same row twice. Failed and unmappable rows still consume the cursor (the LLM cost was paid; replaying won't help). Re-uploading the same CSV is a no-op via SHA-256 dedupe, so the cursor isn't lost.
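
A sketch of the cursor arithmetic, with field names taken from the eval_datasets description (function names assumed):

// Select up to rowLimit mappable rows past the cursor; unmappable rows still consume it.
function selectRows<T>(allCsvRows: T[], processedCount: number, rowLimit: number, isMappable: (r: T) => boolean) {
  const remaining = allCsvRows.slice(processedCount); // rows earlier runs already consumed are skipped
  const selected: T[] = [];
  let consumed = 0;
  for (const row of remaining) {
    if (selected.length >= rowLimit) break;
    consumed++;
    if (isMappable(row)) selected.push(row);
  }
  return { selected, cursorStart: processedCount, newProcessedCount: processedCount + consumed };
}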

The runner is the same code path as the existing CLI (extracted into backend/src/eval/{csv,datasets,runner}.ts), kicked off async via setImmediate so the POST returns the run id immediately. The UI polls every 4 s while any run is queued/running.

Logging: colored, categorised, scannable

Backend stdout now uses ANSI-colored category tags ([db], [llm], [embed], [route], [gaps], [retrieval], [orch], [manifest], [ingest], [citation], [admin], [eval]) plus magnitude-graded duration colors (green <100 ms, yellow <1 s, red ≥1 s). Auto-disables on non-TTY pipes; force on with FORCE_COLOR=1.

5b · Clarification Flow (operator-in-the-loop)

A junior analyst asks senior staff when stuck. The Approver and Rejector now do the same. Each agent's structured output gained an optional fifth field:

{
  argument: string,
  known: boolean,
  confidence: number (0..1),
  citations: Citation[],
  clarification_request: { question: string } | null   // NEW
}

The agent populates clarification_request when it would otherwise have to guess — specifically, when there's a focused, answerable thing a senior credit/risk officer could tell it that would change the argument. The system prompts include examples of good vs bad questions ("Does our policy treat 6-month gig income as stable?" vs "Is the credit score good?").

Mid-debate pause-and-resume

When the orchestrator sees confidence < CLARIFICATION_THRESHOLD (default 0.65, runtime-tunable) AND a non-null clarification_request, it:

  1. Emits two SSE events: factor_turn with speaker: 'system' + the question (timeline persistence), and clarification_request with round metadata (which the analyst view's useDebateStream hook stores keyed by ${factor}::${speaker}).
  2. Calls awaitClarification() (sketched after this list), which registers a Promise resolver in an in-memory map keyed by (sessionId, factor, speaker), with a 5-minute timeout.
  3. The analyst view sees the new pending entry and mounts a ClarificationInlineForm directly underneath the system-question turn — no modal, no backdrop. The form lives inline in the blue mini-thread. Human types an answer or hits Skip.
  4. Form POSTs to /decision/:id/clarify with { factor, speaker, answer, reason } (JSON) or, when the operator attaches evidence, the same fields plus an attachment file part (multipart). The route handler calls resolveClarification() which fires the Promise.
  5. Orchestrator unblocks, emits a clarification_resolved event (the hook removes the pending entry, which unmounts the inline form) plus a factor_turn with speaker: 'human' + the answer (timeline persistence), then re-runs the same agent with the Q&A appended to the prompt.
  6. Loop until: agent reaches confidence ≥ threshold, OR doesn't ask a new question, OR the per-factor cap (MAX_CLARIFICATIONS_PER_FACTOR, default 2) is hit.
  7. If the cap fires, an amber ⚑ Needs review pill appears on that factor in the analyst view.
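
A minimal sketch of the in-memory pause/resume primitive from steps 2 and 4 (function names from this spec, shapes assumed):

type ClarificationAnswer = { answer: string | null; status: "answered" | "skipped" | "timeout" };

const pending = new Map<string, (a: ClarificationAnswer) => void>();
const keyOf = (sessionId: string, factor: string, speaker: string) => `${sessionId}::${factor}::${speaker}`;

export function awaitClarification(sessionId: string, factor: string, speaker: string, timeoutMs = 5 * 60_000) {
  return new Promise<ClarificationAnswer>((resolve) => {
    const key = keyOf(sessionId, factor, speaker);
    const timer = setTimeout(() => {
      pending.delete(key);
      resolve({ answer: null, status: "timeout" }); // 5-minute timeout counts as unanswered
    }, timeoutMs);
    pending.set(key, (a) => {
      clearTimeout(timer);
      pending.delete(key);
      resolve(a);
    });
  });
}

// Called by the POST /decision/:id/clarify route handler; a false return maps to the 409 for late answers.
export function resolveClarification(sessionId: string, factor: string, speaker: string, a: ClarificationAnswer) {
  const resolver = pending.get(keyOf(sessionId, factor, speaker));
  if (!resolver) return false;
  resolver(a);
  return true;
}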

Audit trail (two layers)

Per-decision timeline: question and answer are first-class turns in factor.turns[], with speaker: 'system' | 'human' and a clarificationKind: 'request' | 'response' tag. They render as a blue-bordered mini-thread in the analyst view ("✋ System asked" / "🗣 Operator answered" + reason badge); while a request is still pending, the ClarificationInlineForm mounts inside that same mini-thread directly under the question, so the operator answers in-context. Turns persist into judge_output.factorDebates so a refresh shows the same chronological story. The version manifest is unaffected — clarifications are operator interventions, not corpus material.

Cross-decision dataset: every clarification also writes a row to clarification_events (migration 015) capturing session_id, factor, speaker, round, question_text, answer_text, answer_status, confidence_before, confidence_after (backfilled when the agent re-runs), confidence_delta (generated column), and time_to_answer_ms. The admin Clarifications tab reads from this table; the next phase of the learning loop uses it for analyst feedback marks and per-factor "recent guidance" prompt injection. Writes are best-effort — failure logs but never blocks the debate.

Operator-attached evidence (PDFs and images)

The clarification form also accepts an optional file alongside the typed answer — used when the borrower has supplied evidence (payslip, bank statement, ID photo, screenshot of HMRC tax record). One file per answer, capped at 10 MB, MIME restricted to application/pdf, image/jpeg, image/png, and image/webp. Both client and server enforce the cap; oversized uploads return 413, wrong types return 415.

Files land on the backend's persistent volume (CLARIFICATION_ATTACHMENTS_DIR, e.g. /data/clarification-attachments/{sessionId}/{slug}), and migration 020_clarification_attachment.sql adds four nullable columns to clarification_events: attachment_path, attachment_mime, attachment_size_bytes, attachment_original_name. A CHECK constraint enforces all-or-nothing — a half-populated row can't slip through.

How the LLM consumes the file: when the orchestrator re-prompts the agent on the next round, an ATTACHED EVIDENCE block is synthesised into the operator's answer. For PDFs the file is run through pdf-parse and the extracted text is inlined (capped at ~12k chars, with a "…[truncated]" marker), so the agent reads the actual document content. For images the agent sees only a labelled note (filename + size + "the operator attached visual evidence; rely on the typed answer") — vision-aware re-prompts are a future upgrade. PDF parse failures fall back to the image/note path so a corrupted upload never kills the debate.
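
A sketch of that synthesis step; pdf-parse usage follows its default-export API, the surrounding names are assumed:

import { readFile } from "node:fs/promises";
import pdf from "pdf-parse";

const MAX_CHARS = 12_000;

// Builds the ATTACHED EVIDENCE block appended to the operator's typed answer on the re-prompt.
async function attachedEvidenceBlock(path: string, mime: string, originalName: string, sizeBytes: number) {
  if (mime === "application/pdf") {
    try {
      const data = await pdf(await readFile(path));
      const text = data.text.length > MAX_CHARS ? data.text.slice(0, MAX_CHARS) + "\n…[truncated]" : data.text;
      return `ATTACHED EVIDENCE (${originalName}, extracted text):\n${text}`;
    } catch {
      // Corrupted PDF: fall through to the note path; an attachment must never kill the debate.
    }
  }
  return `ATTACHED EVIDENCE: the operator attached visual evidence (${originalName}, ${sizeBytes} bytes). Rely on the typed answer.`;
}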

Audit surfaces: the human-response timeline turn gains an attachment field ({eventId, mimeType, sizeBytes, originalName}) so the analyst page renders an inline link to the file alongside the typed answer. The admin Clarifications tab renders the same link on each event card. The backend exposes GET /decision/:id/clarifications/:eventId/attachment which streams the file with Content-Disposition: inline so PDFs and images open in a new tab. The agent prompt + the auditor's view stay decoupled — the LLM never sees image bytes; humans always see the original file.

State management

In-memory only. Works because Fly's request affinity keeps the SSE stream and the /clarify POST on the same machine. If we ever shard runDebate across machines, this graduates to Redis pub/sub or a DB-backed coordination primitive. For two-operator testing on a single Fly machine, the in-memory map is fine. Attachment storage carries the same single-machine assumption — when we shard, the volume mount becomes S3/R2.

6b · Operator Console (Admin Panel)

Lives at /admin on the frontend. Five tabs (Corpus, Debates, Clarifications, Evals, Settings), shared bcrypt-hashed password, signed httpOnly cookie session. Replaces the prior workflow where every corpus tweak required SSH + CLI + restart.

Auth

Shared bcrypt-hashed password (entered on the /admin/login page), exchanged for a signed httpOnly cookie session; every tab below and the upload / settings endpoints sit behind it.

Corpus tab

Table of every knowledge_bundle row joined with chunk metadata (representative tier / jurisdiction / source_id / source_url / version pulled via MIN(c.<col>) across the bundle's chunks). Source ID is linked to the external source_url when present. Inline confirm-delete cascades to chunks via FK.

Upload form: PDF picker, tier dropdown with inline explainer ("Tier 1 = regulator, Tier 4 = academic"), jurisdiction text, source-id text, optional version + source-url. Submits multipart to POST /knowledge/upload, which streams the file to /tmp, runs the existing ingestDocument() pipeline (loader → chunker → embedder → DB writer), and returns the standard IngestReport. Authenticated callers only — an attacker can't fill our DB or burn our embedding tokens via curl.

Settings tab

The tab now ships in three sections, all reading from the same runtime_settings table.

Every row shows the plain-English description seeded by migrations 013 / 014. The Save button activates only when the value differs from the saved one. Updates go via PUT /admin/settings/:key; the response returns the new updated_at + updated_by values, which are displayed under the widget.

Currently exposed knobs: RETRIEVAL_ENABLED, RETRIEVAL_TOP_K, RETRIEVAL_MIN_SIMILARITY, DEFAULT_JURISDICTION, CLARIFICATION_THRESHOLD, MAX_CLARIFICATIONS_PER_FACTOR, EARLY_EXIT_NEGATIVE_THRESHOLD, GUIDANCE_INJECTION_ENABLED, GUIDANCE_LOOKBACK_DAYS, GUIDANCE_MAX_ITEMS_PER_FACTOR, LOG_SQL.

Debates tab

Newest-first listing of every debate session, joined with decision_reviews for the four-eyes status. Filters: by session status (Completed / Failed / Running) and by final decision (APPROVE / REJECT / REVIEW). An "Include eval runs" checkbox surfaces sessions produced by eval batches; off by default so live operator activity isn't drowned by them. Backed by GET /admin/debates?status=&decision=&includeEval=&limit=.

Roll-up strip at the top: total / approved / rejected / review / fallback-used / cost-loaded counts over the loaded set. Each row shows session UUID prefix, application id, status pill, decision pill (with a fallback tag if the rule engine had to step in, plus a from eval tag when surfaced via the toggle), the first few decision tags, four-eyes review status, duration, USD cost (with input/output token breakdown on hover), and created-at. View action is an eye icon in the rightmost column — opens replay (or live, for in-flight debates). Eval-vs-live is tracked via the debate_sessions.eval_run_id column added in migration 021 — eval runner sets it; the live /decision route leaves it NULL. Cost columns are populated by migration 022 + withUsageTracking in the orchestrator: each LLM call records token usage to an AsyncLocalStorage scope, and the finally block sums + persists at debate end (success, fallback, or cancellation) using the per-model pricing table in agents/usage.ts.

Failed debates render a translated banner instead of the raw ERROR_TRACE when one of the well-known LLM provider errors fires (out of credit, rate limit, invalid API key). The friendlyLlmFailure() helper in the orchestrator tags these into the fallback reasoning; the analyst page's FinalDecisionCard renders a coloured alert with the operator-readable message and tucks any unrecognised stack trace behind a "Show debug trace" disclosure.

Click-through navigates to the existing analyst route: /analyst/decision/:id?replay=true for completed/failed debates (loads from persisted judge_output), or the live streaming view for in-flight ones. No new view code — replay was already built; this tab just makes it discoverable.

Clarifications tab

Read-only feed of every clarification event written to clarification_events, newest first. Filters: factor (dropdown of the five debate factors), status (answered / skipped / timeout / capped), and an "Include eval runs" toggle (off by default). Eval batches run with maxRounds=0 and produce a flood of capped rows when an agent reports low confidence; hiding them by default keeps the operator's own clarification history readable. The toggle is implemented as a LEFT JOIN debate_sessions ON s.id = e.session_id in listEvents with a default WHERE s.eval_run_id IS NULL. Cards from eval runs render a from eval badge when surfaced. Backed by GET /admin/clarifications?factor=&status=&includeEval=&limit=.

Header strip shows four roll-up stats over the loaded set: total events, answered count + percentage, skipped+capped count, and average confidence Δ. The avg-Δ figure is the closest thing to "are these questions actually useful?" you can read at a glance — green if ≥+5%, red if negative.

Each event renders as a card: status pill + factor + speaker + round, the question and answer text in a blue mini-thread, and a footer row with confidence_before, confidence_after, the delta, and the operator's response latency. Future enhancements (analyst feedback marks, similarity-based prompt injection) build on this surface.

Evals tab

Three-block layout. Datasets at the top: list of uploaded CSVs (one row per eval_datasets entry), each showing processed_count / total_rows as a progress bar so you can see at a glance how much of the dataset is left. Inline upload button — multipart POST to /admin/eval/datasets, files written to EVAL_DATA_DIR (a Fly volume in prod, OS tmp in dev), SHA-256 deduped so re-uploads no-op.

Run launcher: dataset dropdown + row-limit input + a single Start button. Submits POST /admin/eval/runs with { datasetId, rowLimit }. The backend creates an eval_runs row in queued status, fires the worker on setImmediate, and returns the row id. The runner skips cursor_start CSV rows (= dataset.processed_count at POST time), then collects up to row_limit mappable rows and runs each through runDebate(...,{ noClarifications: true }) sequentially. Per-row results land in eval_run_rows as they complete; aggregate metrics (TP/TN/FP/FN, accuracy, expected loss, net €) are computed and written to eval_runs at the end. The cursor advances by the number of CSV rows consumed (not just successful) so failed rows aren't retried on the next run.

Runs history: newest-first table of all runs (optionally filtered by clicking a dataset above). Each row shows started timestamp, status pill, slice indices, the four confusion-matrix counts, accuracy, and net €. Clicking a run opens the run detail view: confusion-matrix grid up top (TP/TN/FP/FN + REVIEW + failed + accuracy + net €) and a per-row table below with the agent's decision, the realised outcome, the bucket, and a one-click link to the live debate replay (/analyst/decision/:session_id?replay=true) so you can drill into why any individual call went the way it did. While the run is queued/running the UI polls every 4 s.

What's NOT in the MVP: agent-written interpretation of the results (deferred — we want analysts to look at the per-row data first), file attachments on clarifications (since shipped; see section 5b), and any kind of automatic dataset selection. The cursor is per-dataset only — there is no "split into train/eval" mode.

Caching policy

Reads use a 60s in-memory map. Writes invalidate the map immediately on the writer's Fly machine; other machines pick up changes on next cache miss. So a setting saved at T+0 takes effect on the writer's debates immediately, on other machines within ≤60s. This was the right trade-off for two-person testing — no Redis dependency, negligible read cost.

1 · Overview & Product Thesis

Challenger v2 is a static multi-agent debate system (Approver → Rejector → Judge) that produces explainable credit decisions — but the intelligence lives entirely in the LLM's pretraining. There is no regulatory grounding, no memory of prior decisions, no way to improve from feedback, and no audit trail a European bank regulator would accept.

v3 keeps the debate harness and wraps it in three new layers: a hierarchical knowledge store (EU → BG → practice), a grounded reasoning layer where every argument carries citations and a self-reported confidence, and a self-improvement layer where the agent logs what it did not know, humans resolve gaps, and the system is version-pinned so every historical decision can be reproduced.

Positioning

Not "smarter credit decisions." That market is commoditising. Regulator-ready AI audit trails for credit origination in the EU. Four-eyes (Maker/Checker) aligned with EBA and the EU AI Act high-risk-system obligations. The moat is the BG+CEE practice corpus, the immutable citation trail, and the gap log.

Success criteria for v3 MVP

2 · High-Level Architecture

graph TB
  subgraph KL["Knowledge Layer"]
    L1["Tier 1 — EU<br/>EBA · ECB · AMLD6 · PSD2"]
    L2["Tier 2 — National<br/>BNB acts · BG directives"]
    L3["Tier 3 — Practice<br/>Bondora · German Credit · internal cases"]
    L4["Tier 4 — Literature<br/>FSB/BIS · OeNB · BoE"]
    ING["Ingestion Pipeline<br/>loader → chunker → embedder"]
    VDB[("pgvector<br/>(local · RDS prod)<br/>+ metadata filters")]
    L1 --> ING
    L2 --> ING
    L3 --> ING
    L4 --> ING
    ING --> VDB
  end
  subgraph RL["Reasoning Layer"]
    RET["Hierarchical Retriever<br/>hybrid search + reranker"]
    MK["Maker Agent<br/>proposes + cites"]
    CH["Checker Agent<br/>validates vs regulation"]
    JD["Judge Agent<br/>reconciles + emits verdict"]
    GP["Gap Detector<br/>retrieval + confidence"]
    GR["Hard Guardrails"]
  end
  subgraph SI["Self-Improvement Layer"]
    GQ[("Gap Queue")]
    FB[("Feedback Store")]
    EV["Eval Harness<br/>golden set + diff report"]
    VM[("Version Manifest")]
  end
  API["Fastify API<br/>+ SSE"]
  UI["Next.js UI<br/>Applicant · Analyst · Operator"]
  API --> RET
  VDB --> RET
  RET --> MK
  RET --> CH
  MK --> JD
  CH --> JD
  JD --> GP
  JD --> GR
  GP --> GQ
  GR --> API
  JD --> API
  API --> UI
  GQ --> UI
  FB --> RET
  FB --> EV
  EV --> VM
  VM --> MK
  VM --> CH
  VM --> JD
  UI -->|resolve gap| ING
  UI -->|correction| FB

Layer boundaries

Layer            | Responsibility       | Owns
Knowledge        | What the agent knows | Ingestion · vector DB · retriever
Reasoning        | How it decides       | Maker · Checker · Judge · guardrails · gap detector
Self-Improvement | How it gets better   | Gap queue · feedback store · evals · version manifest
Platform         | Delivery             | Fastify API · SSE · Next.js UI · auth

3 · Knowledge Layer

3.1 Ingestion pipeline

sequenceDiagram
  participant Op as Operator
  participant API as Fastify API
  participant Ing as Ingestion Worker
  participant S3 as Storage (local FS / S3)
  participant LLM as Embedding Model
  participant DB as pgvector
  Op->>API: POST /knowledge/ingest (file + metadata)
  API->>S3: store raw document
  API->>Ing: enqueue ingestion job
  Ing->>S3: fetch document
  Ing->>Ing: loader (PDF / HTML)
  Ing->>Ing: section-aware chunker
  Ing->>LLM: embed chunks (batched)
  LLM-->>Ing: vectors
  Ing->>DB: insert chunks + vectors + metadata
  Ing->>DB: commit bundle_id + manifest row
  Ing-->>API: job_id → COMPLETED
  API-->>Op: bundle summary

3.2 Chunking strategy

3.3 Knowledge bundle (immutable artefact)

Every ingestion batch produces a knowledge_bundle row with a content hash. Decisions reference bundle_id. Bundles are append-only; a "new version" of EBA guidelines creates a new bundle, never mutates the old one.

knowledge_bundles
├── id                UUID, primary key
├── label             VARCHAR  — e.g. "kb-2026-04-20-eu+bg"
├── content_hash      VARCHAR  — SHA256 of sorted chunk_ids
├── source_manifest   JSONB    — [{source_id, version, effective_from, sha256}]
├── chunk_count       INTEGER
├── created_at        TIMESTAMPTZ
└── created_by        VARCHAR

3.4 Chunks table

knowledge_chunks
├── id                UUID, primary key
├── bundle_id         UUID  → knowledge_bundles.id
├── source_id         VARCHAR  — e.g. "EBA-GL-2020-06"
├── source_url        TEXT
├── tier              SMALLINT  — 1=EU, 2=national, 3=practice, 4=literature
├── jurisdiction      VARCHAR  — "EU" | "BG" | "DE" | ...
├── version           VARCHAR  — "2020/06"
├── effective_from    DATE
├── section           VARCHAR  — "Article 4 §2"
├── breadcrumb        TEXT     — "Title II > Chapter 3 > Article 4 > §2"
├── language          VARCHAR  — "en" | "bg"
├── text              TEXT
├── embedding         VECTOR(1024)  — pgvector (512-dim after the Voyage migration 012; see section 0)
├── token_count       INTEGER
└── chunk_hash        VARCHAR   — SHA256 of text

INDEX ivfflat (embedding vector_cosine_ops)
INDEX (bundle_id, tier, jurisdiction)

3.5 Retrieval

Hybrid search

Dense (pgvector cosine) + sparse (tsvector BM25-style over text). Regulatory language is keyword-heavy (PD, LGD, EAD, Article numbers) — embeddings alone fumble these.

Tier-weighted fusion

Retrieval merges tier 1 + tier 2 + tier 3 with configurable weights. BG-specific queries bias tier 2; fallback to tier 1 when tier 2 returns nothing above threshold.
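
An illustrative sketch of that fusion; the weights, boost, and threshold are configurable and the values here are placeholders, not the shipped defaults:

type Scored = { chunkId: string; tier: 1 | 2 | 3 | 4; similarity: number };

const TIER_WEIGHTS: Record<number, number> = { 1: 1.0, 2: 1.0, 3: 0.8, 4: 0.6 };

function fuse(candidates: Scored[], jurisdiction: "EU" | "BG", minSimilarity = 0.72) {
  // BG-specific queries bias tier 2; tier 1 results still stand when tier 2 has nothing above threshold.
  const boost = (c: Scored) => (jurisdiction === "BG" && c.tier === 2 ? 1.1 : 1.0);
  return candidates
    .filter((c) => c.similarity >= minSimilarity)
    .map((c) => ({ ...c, fused: c.similarity * TIER_WEIGHTS[c.tier] * boost(c) }))
    .sort((a, b) => b.fused - a.fused);
}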

Reranker

Top-20 → cross-encoder reranker → top-5. Latency cost (~200ms) is worth it for compliance. Local dev: bge-reranker-v2-m3 via HuggingFace inference. AWS: SageMaker endpoint or Bedrock if available.

Coverage signal

Retriever returns {passages[], max_similarity, tier_coverage, gaps[]}. gaps[] = jurisdictions or tiers the query expected but didn't find above threshold — fed directly into the gap detector.

3.6 Retrieval contract

// packages/shared/src/retrieval.ts

export type RetrievalQuery = {
  text: string;
  factor: FactorName;                 // "Credit Score" | ...
  jurisdiction: "EU" | "BG";          // application's jurisdiction
  tiers?: Array<1 | 2 | 3 | 4>;       // default: [1,2,3,4]
  topK?: number;                      // default: 5 (after reranking)
  minSimilarity?: number;             // default: 0.72
};

export type RetrievedPassage = {
  chunkId: string;
  sourceId: string;
  sourceUrl: string;
  breadcrumb: string;
  tier: 1 | 2 | 3 | 4;
  jurisdiction: string;
  version: string;
  similarity: number;
  rerankScore: number;
  text: string;
};

export type RetrievalResult = {
  passages: RetrievedPassage[];
  maxSimilarity: number;
  tierCoverage: Record<string, number>;  // tier → count above threshold
  gaps: Array<{
    expectedTier: 1 | 2 | 3 | 4;
    expectedJurisdiction: string;
    reason: "no_match_above_threshold" | "jurisdiction_missing";
  }>;
};

4 · Reasoning Layer (Grounded Debate)

4.1 What changes from v2

Area            | v2                             | v3
Agent inputs    | application, factor            | application, factor, retrieved_passages
Agent outputs   | Free-text argument             | {claim, citation_ids[], confidence, known}
Rejector role   | Counter-argument only          | Keep Rejector for UX explainability; add Checker for compliance validation
Knowledge       | LLM pretraining only           | Retrieval-grounded; every claim cites a chunk
Gap handling    | Silent fallback to LLM priors  | Emits KNOWLEDGE_GAP event; surfaces in Analyst UI
Version pinning | Implicit (code commit)         | Explicit version_manifest tuple per session

4.2 Orchestration (updated)

sequenceDiagram
  participant Orc as Orchestrator
  participant Ret as Retriever
  participant Mk as Maker
  participant Ch as Checker
  participant Rj as Rejector
  participant Jd as Judge
  participant Gd as Gap Detector
  participant Gr as Guardrails
  participant DB as Postgres
  participant SSE as SSE Buffer
  Note over Orc: For each of 5 factors
  Orc->>Ret: query(factor, jurisdiction)
  Ret-->>Orc: passages + gaps
  Orc->>Mk: run(application, factor, passages)
  Mk-->>SSE: tokens
  Mk-->>Orc: {claim, citation_ids, confidence, known}
  Orc->>Ch: validate(Maker claim, passages)
  Ch-->>Orc: {status: OK | CONFLICT, cited_conflict?}
  Orc->>Rj: counter(Maker claim, passages)
  Rj-->>Orc: {counter_claim, citation_ids}
  Orc->>Jd: reconcile(Mk, Ch, Rj, passages)
  Jd-->>SSE: tokens
  Jd-->>Orc: {verdict, reasoning, citations}
  Orc->>Gd: assess(passages.gaps, Mk.known, Mk.confidence)
  Gd-->>Orc: [KnowledgeGap]
  Orc->>Gr: apply(application, Jd.verdict)
  Gr-->>Orc: GuardrailResult
  Orc->>DB: persist(session, version_manifest, gaps)
  Orc-->>SSE: final_decision event

4.3 Agent contracts

// packages/shared/src/agents-v3.ts

export const MakerOutputSchema = z.object({
  claim: z.string(),
  citation_ids: z.array(z.string()).min(1),   // must cite ≥1 passage
  confidence: z.number().min(0).max(1),
  known: z.boolean(),
  missing_knowledge: z.string().optional(),   // populated when known=false
});

export const CheckerOutputSchema = z.object({
  status: z.enum(["OK", "CONFLICT", "INSUFFICIENT_EVIDENCE"]),
  conflicting_citation_id: z.string().optional(),
  conflict_explanation: z.string().optional(),
});

export const JudgeFactorOutputSchema = z.object({
  verdict: z.enum(["positive", "negative", "neutral"]),
  summary: z.string(),
  cited_passages: z.array(z.string()),
  maker_confidence: z.number(),
  checker_status: z.enum(["OK", "CONFLICT", "INSUFFICIENT_EVIDENCE"]),
});

4.4 Prompt contract — Maker (excerpt)

SYSTEM
You are a credit origination analyst for a European bank.

You MUST:
- Base your claim on the RETRIEVED PASSAGES only.
- Cite every claim with passage IDs from the list below.
- Set `known: false` and fill `missing_knowledge` when no passage supports
  the claim at the given jurisdiction and tier.
- Never invent regulation names, article numbers, or dates.

HUMAN
FACTOR: {factor_name}
JURISDICTION: {jurisdiction}
APPLICATION: {application_json}
RETRIEVED PASSAGES:
{passages_with_ids}

FORMAT:
{format_instructions}

4.5 Gap detector

Trigger conditions (OR-joined)

  • Retriever returns maxSimilarity < 0.72 for the factor query.
  • Retriever returns no passage in the application's jurisdiction tier.
  • Maker emits known: false.
  • Maker self-reports confidence < 0.6.
  • Checker returns INSUFFICIENT_EVIDENCE.

Each trigger creates one knowledge_gaps row. Duplicates on (factor, missing_topic_hash) are collapsed — we only alert operators once per novel gap per 24h window.
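
A sketch of the OR-joined trigger evaluation, with signal names taken from the knowledge_gaps schema in §6.1 (input shape assumed):

type GapSignal = "low_retrieval" | "maker_unknown" | "low_confidence" | "checker_insufficient" | "jurisdiction_missing";

function gapSignals(input: {
  maxSimilarity: number;
  jurisdictionCovered: boolean;
  makerKnown: boolean;
  makerConfidence: number;
  checkerStatus: "OK" | "CONFLICT" | "INSUFFICIENT_EVIDENCE";
}): GapSignal[] {
  const signals: GapSignal[] = [];
  if (input.maxSimilarity < 0.72) signals.push("low_retrieval");
  if (!input.jurisdictionCovered) signals.push("jurisdiction_missing");
  if (!input.makerKnown) signals.push("maker_unknown");
  if (input.makerConfidence < 0.6) signals.push("low_confidence");
  if (input.checkerStatus === "INSUFFICIENT_EVIDENCE") signals.push("checker_insufficient");
  return signals; // each signal creates one knowledge_gaps row, deduped on (factor, topic_hash)
}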

5 · Self-Improvement Layer

5.1 Closing the loop

graph LR
  D["Decision produced"] --> G{"Gap detected?"}
  G -->|No| S["Session complete"]
  G -->|Yes| GQ[("Gap Queue")]
  GQ --> OP["Operator UI<br/>(Analyst)"]
  OP -->|Upload doc| ING["Ingestion"]
  OP -->|Write note| NT["Note becomes<br/>synthetic chunk"]
  OP -->|Override decision| FB[("Feedback Store")]
  ING --> VDB[("Vector DB")]
  NT --> VDB
  FB --> FS["Few-shot<br/>Retriever"]
  FS --> MK2["Maker prompt<br/>(next call)"]
  FB --> EV["Eval Harness"]
  EV --> VM[("Version Manifest")]
  VM --> MK2

5.2 Feedback taxonomy

Type        | Trigger                     | Storage                                               | Reuse
Correction  | Human overrides decision    | feedback_corrections                                  | Few-shot retrieval at inference
Knowledge   | Human answers a gap         | knowledge_chunks (new row) + feedback_knowledge audit | Live in future retrievals
Calibration | Realised loan outcome known | feedback_outcomes                                     | Eval harness accuracy metric

5.3 Version manifest

version_manifests
├── id                UUID, primary key
├── label             VARCHAR  — "v3.4.1"
├── model             VARCHAR  — "claude-sonnet-4-6"
├── prompt_set_id     VARCHAR  — "ps-2026-04-17"
├── knowledge_bundle  UUID  → knowledge_bundles.id
├── guardrail_set_id  VARCHAR  — "gr-v2"
├── eval_score        NUMERIC  — 0.87
├── eval_run_id       UUID
├── status            ENUM: DRAFT | ACTIVE | DEPRECATED
├── released_at       TIMESTAMPTZ
└── released_by       VARCHAR

-- debate_sessions gains a column
ALTER TABLE debate_sessions
  ADD COLUMN version_manifest_id UUID NOT NULL
  REFERENCES version_manifests(id);

Only ONE manifest can be ACTIVE at a time. Rollback is a single UPDATE away. Every historical decision is reproducible because the bundle, prompt set, and guardrails are all content-addressed.

5.4 Eval harness

Bondora ground-truth eval (shipped)

A first-pass eval harness is in the codebase: npm run eval:bondora -- --csv path/to/LoanData_Bondora.csv --limit 50. It reads the public Bondora P2P lending dataset, maps each historical loan to a CreditApplication, runs it through runDebate(), and scores the agent's APPROVE / REJECT / REVIEW against the realised outcome (Repaid vs Default).

Output is a confusion-matrix scoreboard: TP/TN/FP/FN counts plus accuracy, false-approval rate, expected loss per €1000 lent, and average debate latency.

Mapping is documented inline in the script: rating letter → synthetic FICO, employment status code → enum, missed-payments proxy from NewCreditCustomer + PreviousEarlyRepaymentsCountBeforeLoan. Rows with no terminal outcome (Status=Current) or missing required fields are skipped.

Eval calls bypass the clarification flow. The script passes { noClarifications: true } to runDebate(), which forces maxClarifications = 0 for that call only — the cap fires immediately on any clarification request, no awaitClarification pause, no 5-minute timeout per row. This is the only honest way to measure the system's autonomous decision-making against ground truth: feeding placeholder answers would measure how the agents react to fake operator input, not what they decide on their own. Production debates running concurrently with the eval are unaffected.

Watch for data leakage. Bondora is a test set, not training data. If you ever ingest Bondora summaries into the corpus or wire it into the prompt as "similar past cases," you'll be measuring memorisation instead of generalisation. Hold ~20% of rows as a clean test set that NEVER touches retrieval.

Domain mismatch caveat: Bondora is unsecured P2P consumer lending in EE/ES/FI, mostly pre-2020. Findings are a useful sanity check, not a precise predictor of EU regulated bank lending performance. The script is designed for "is the direction right" answers, not "is the FP rate exactly 4.7%" answers.

6 · Data Model

classDiagram
  class knowledge_bundles {
    UUID id
    VARCHAR label
    VARCHAR content_hash
    JSONB source_manifest
    TIMESTAMPTZ created_at
  }
  class knowledge_chunks {
    UUID id
    UUID bundle_id
    VARCHAR source_id
    SMALLINT tier
    VARCHAR jurisdiction
    TEXT text
    VECTOR embedding
  }
  class debate_sessions {
    UUID id
    VARCHAR application_id
    UUID version_manifest_id
    ENUM status
    JSONB maker_output
    JSONB checker_output
    JSONB rejector_output
    JSONB judge_output
    JSONB guardrail_result
    ENUM final_decision
  }
  class knowledge_gaps {
    UUID id
    UUID session_id
    VARCHAR factor
    VARCHAR trigger_signal
    TEXT missing_topic
    SMALLINT tier_needed
    ENUM status
  }
  class feedback_corrections {
    UUID id
    UUID session_id
    VARCHAR field
    JSONB agent_value
    JSONB human_value
    TEXT reason
  }
  class feedback_outcomes {
    UUID id
    UUID session_id
    ENUM realised_outcome
    INTEGER months_observed
  }
  class version_manifests {
    UUID id
    VARCHAR label
    VARCHAR model
    VARCHAR prompt_set_id
    UUID knowledge_bundle
    VARCHAR guardrail_set_id
    NUMERIC eval_score
    ENUM status
  }
  knowledge_bundles "1" --> "*" knowledge_chunks
  version_manifests "*" --> "1" knowledge_bundles : bundle
  debate_sessions "*" --> "1" version_manifests : pinned
  debate_sessions "1" --> "*" knowledge_gaps
  debate_sessions "1" --> "*" feedback_corrections
  debate_sessions "1" --> "0..1" feedback_outcomes

6.1 knowledge_gaps

knowledge_gaps
├── id                UUID, primary key
├── session_id        UUID  → debate_sessions.id
├── factor            VARCHAR
├── trigger_signal    ENUM: low_retrieval | maker_unknown | low_confidence |
│                           checker_insufficient | jurisdiction_missing
├── missing_topic     TEXT        — agent's own description
├── suggested_sources TEXT[]      — agent-proposed resources
├── tier_needed       SMALLINT
├── jurisdiction_needed VARCHAR
├── topic_hash        VARCHAR     — for deduplication
├── status            ENUM: OPEN | ANSWERED | INGESTED | DISMISSED
├── resolution_note   TEXT, nullable
├── resolved_by       VARCHAR, nullable
├── resolved_at       TIMESTAMPTZ, nullable
├── resulting_chunk_id UUID, nullable  → knowledge_chunks.id
└── created_at        TIMESTAMPTZ

UNIQUE (topic_hash, status) WHERE status = 'OPEN'

6.2 feedback_corrections

feedback_corrections
├── id                UUID, primary key
├── session_id        UUID  → debate_sessions.id
├── field             VARCHAR     — "final_decision" | "factor.verdict.Credit Score" | ...
├── agent_value       JSONB
├── human_value       JSONB
├── reason            TEXT
├── embedding         VECTOR(1024)  — embedded(application + reason) for few-shot retrieval
├── created_by        VARCHAR
└── created_at        TIMESTAMPTZ

7 · API Surface

Method | Path | Purpose
POST | /decision | Create session, start grounded debate. Returns sessionId + version_manifest_id.
GET | /decision/:id/stream | SSE — agent tokens, factor turns, gap events, final decision.
GET | /decision/:id | Applicant view (filtered).
GET | /analyst/decision/:id | Analyst view (full, incl. citations + gaps).
POST | /knowledge/ingest | Upload doc → enqueues ingestion job → returns bundle_id when done.
GET | /knowledge/bundles | List bundles with metadata.
GET | /knowledge/search | Debug retrieval — returns passages for a query.
GET | /gaps | List open knowledge gaps (operator queue).
POST | /gaps/:id/resolve | Resolve gap — accepts uploaded doc OR text note OR dismiss reason.
POST | /feedback/correction | Submit human override for a session.
POST | /feedback/outcome | Record realised loan outcome.
GET | /versions | List version manifests.
POST | /versions/:id/activate | Promote manifest to ACTIVE (atomic).
POST | /eval/run | Run golden set against a version; returns eval_run_id.
GET | /eval/:id/diff/:other | Diff two eval runs.
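A minimal client-side sketch of the entry point (paths from the table above; the request body shape is an assumption):

// Sketch: create a session, keep the ids needed to follow and audit it.
async function createDecision(application: Record<string, unknown>) {
  const res = await fetch("http://localhost:3001/decision", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(application),
  });
  if (!res.ok) throw new Error(`POST /decision failed: ${res.status}`);
  const { sessionId, version_manifest_id } = await res.json();
  return { sessionId, versionManifestId: version_manifest_id };   // session is pinned to this manifest
}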

7.1 New SSE event types

Event | When | Payload
retrieval_done | After retriever returns for a factor | factor, passage_count, max_similarity, tier_coverage
maker_done | Maker structured output parsed | factor, claim, citation_ids, confidence, known
checker_done | Checker completes | factor, status, conflict_citation_id?
gap_detected | Gap detector trips | gap_id, factor, trigger_signal, missing_topic
final_decision | Judge + guardrails complete | Full payload incl. version_manifest_id
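For orientation, a browser-side sketch of subscribing to these events (event names and payload fields come from the table; the UI handlers are hypothetical):

// Sketch of the analyst-page stream consumer. showGapBanner / renderDecision are
// hypothetical handlers; sessionId comes from POST /decision.
declare const sessionId: string;
declare function showGapBanner(gap: { gapId: string; factor: string; missingTopic: string }): void; // hypothetical
declare function renderDecision(payload: unknown): void;                                             // hypothetical

const stream = new EventSource(`/decision/${sessionId}/stream`);

stream.addEventListener("retrieval_done", (e) => {
  const { factor, passage_count, max_similarity } = JSON.parse((e as MessageEvent).data);
  console.debug(`retrieval for ${factor}: ${passage_count} passages, max sim ${max_similarity}`);
});

stream.addEventListener("gap_detected", (e) => {
  const { gap_id, factor, missing_topic } = JSON.parse((e as MessageEvent).data);
  showGapBanner({ gapId: gap_id, factor, missingTopic: missing_topic });
});

stream.addEventListener("final_decision", (e) => {
  renderDecision(JSON.parse((e as MessageEvent).data));
  stream.close();
});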

8 · Local Dev & AWS Infrastructure

One codebase, two deployment targets. Adapters behind interfaces let us swap infra without touching business logic. local = laptop dev, aws = production.

8.1 Component mapping

Component | Local | AWS | Swap via
Relational DB | Postgres in Docker + pgvector | RDS Postgres 16 + pgvector extension | DATABASE_URL
Vector store | pgvector (same DB) | pgvector on RDS · or OpenSearch Serverless at scale | Adapter interface
Object storage (raw docs) | Local FS ./data/raw/ | S3 bucket · SSE-KMS | @aws-sdk/client-s3 behind BlobStore
LLM | Anthropic API direct | Bedrock anthropic.claude-sonnet-4 | LangChain provider env var
Embeddings | OpenAI text-embedding-3-large · or local bge-m3 | Bedrock amazon.titan-embed-text-v2 OR SageMaker bge-m3 | Embedder interface
Reranker | HuggingFace inference (bge-reranker-v2-m3) | SageMaker endpoint | Reranker interface
Ingestion workers | Node worker in same process (dev only) | SQS → Lambda OR ECS Fargate task | Queue adapter
API server | npm run dev on :3001 | ECS Fargate behind ALB · or Lambda + API Gateway for lower traffic | None — 12-factor app
Frontend | Next.js dev on :3000 | Amplify Hosting OR Vercel (prefer Vercel — faster iteration) | None
Auth | Dev token in header | Cognito → JWT at ALB | Fastify auth plugin
Secrets | .env | AWS Secrets Manager · loaded at startup | Config loader
Observability | pino → stdout | CloudWatch Logs · OpenTelemetry traces → X-Ray | pino transport
Eval runs | CLI in repo | Scheduled ECS task · results to S3 + Postgres | None
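One of those seams, sketched: the BlobStore adapter behind object storage (the interface shape and class names are illustrative; only the seam matters):

// platform/blobStore.ts (sketch). Local dev writes to ./data/raw/; production swaps
// in an S3-backed class with the same interface, picked by the config loader.
import { promises as fs } from "node:fs";
import path from "node:path";

export interface BlobStore {
  put(key: string, body: Buffer): Promise<void>;
  get(key: string): Promise<Buffer>;
}

export class FsBlobStore implements BlobStore {
  constructor(private root = "./data/raw") {}

  async put(key: string, body: Buffer): Promise<void> {
    const file = path.join(this.root, key);
    await fs.mkdir(path.dirname(file), { recursive: true });
    await fs.writeFile(file, body);
  }

  async get(key: string): Promise<Buffer> {
    return fs.readFile(path.join(this.root, key));
  }
}
// An S3BlobStore built on @aws-sdk/client-s3 implements the same interface in prod.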

8.2 AWS deployment topology

graph TB
  Users["Applicants · Analysts · Operators"]
  CF["CloudFront"]
  Amplify["Amplify / Vercel Next.js"]
  APIGW["ALB (public)"]
  ECS["ECS Fargate Fastify API + Orchestrator"]
  Cognito["Cognito (Analyst/Operator auth)"]
  SQS["SQS ingestion queue"]
  Worker["ECS Fargate Ingestion Worker"]
  S3["S3 raw docs · eval artefacts"]
  RDS[("RDS Postgres + pgvector")]
  Bedrock["Bedrock Claude + Titan embed"]
  SM["SageMaker endpoint reranker (bge-m3)"]
  Secrets["Secrets Manager"]
  CWL["CloudWatch Logs + X-Ray"]
  Users --> CF --> Amplify
  Amplify --> APIGW --> ECS
  Users -->|API direct| APIGW
  ECS <--> Cognito
  ECS --> RDS
  ECS --> Bedrock
  ECS --> SM
  ECS --> SQS
  SQS --> Worker
  Worker --> S3
  Worker --> Bedrock
  Worker --> RDS
  ECS --> S3
  ECS --> Secrets
  Worker --> Secrets
  ECS --> CWL
  Worker --> CWL

8.3 Local dev stack (docker-compose)

# docker-compose.dev.yml — run: docker compose up

services:
  postgres:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: challenger_dev
    ports: ["5432:5432"]
    volumes: ["pgdata:/var/lib/postgresql/data"]

  minio:            # S3-compatible local storage
    image: minio/minio
    command: server /data --console-address ":9001"
    ports: ["9000:9000", "9001:9001"]
    environment:
      MINIO_ROOT_USER: dev
      MINIO_ROOT_PASSWORD: devpassword

  # Reranker served via HuggingFace's TEI for local parity with SageMaker
  reranker:
    image: ghcr.io/huggingface/text-embeddings-inference:latest
    command: ["--model-id", "BAAI/bge-reranker-v2-m3"]
    ports: ["8080:80"]

volumes: { pgdata: {} }

8.4 Environment variables

Updated since first draft. The current shipped config defaults to EMBED_PROVIDER=voyage + voyage-3-lite (512 dims). Tunable runtime parameters (RETRIEVAL_*, DEFAULT_JURISDICTION, CLARIFICATION_*, etc.) have moved out of env vars and into the runtime_settings table — change them from the admin Settings tab without a redeploy. Env vars below are the static infrastructure config that genuinely belongs at boot time.
# Shared
NODE_ENV=development
LLM_PROVIDER=anthropic              # anthropic | openai
LLM_MODEL=claude-haiku-4-5-20251001
LLM_API_KEY=sk-ant-...

# Embeddings — Voyage is Anthropic-recommended; voyage-3-lite has a
# generous free tier and produces 512-dim vectors.
EMBED_PROVIDER=voyage               # voyage | openai | bedrock
EMBED_MODEL=voyage-3-lite
EMBED_API_KEY=pa-...
EMBED_DIMENSIONS=512

# Database — local Postgres + pgvector for dev, Neon for Fly deploy.
DATABASE_URL=postgresql://postgres:postgres@localhost:5433/debate_db

# Static fallbacks for the runtime_settings rows (used only if a row
# is missing from the table — normally the table wins).
RETRIEVAL_ENABLED=true
RETRIEVAL_TOP_K=5
RETRIEVAL_MIN_SIMILARITY=0.55
DEFAULT_JURISDICTION=EU

# Admin panel
ADMIN_PASSWORD_HASH=$2b$12$...      # bcrypt; generate with: npm run admin:hash -- 'pwd'
COOKIE_SECRET=...                   # 32+ random hex chars; openssl rand -hex 32

# CORS — comma-separated allowlist. Wildcards in dev only.
CORS_ORIGIN=https://your-frontend.fly.dev

# Observability
LOG_SQL=1                           # SQL preview + timing in stdout
LOG_COLOR=1                         # ANSI category tags in stdout
FORCE_COLOR=1                       # force color when piped (e.g. fly logs)

# AWS-only (when we eventually flip from Fly to AWS)
AWS_REGION=eu-central-1
BEDROCK_LLM_MODEL=anthropic.claude-sonnet-4-v1:0
BEDROCK_EMBED_MODEL=amazon.titan-embed-text-v2:0
COGNITO_USER_POOL_ID=...
COGNITO_CLIENT_ID=...
S3_BUCKET_RAW=challenger-raw-docs
Region choice. Use eu-central-1 (Frankfurt) for EU data residency. Bedrock has Claude available there. RDS and ECS are native. This matters for both compliance positioning and EBA outsourcing guidelines.
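A sketch of how platform/config.ts (see the directory tree below) might read the boot-time values; the helper and object shape are illustrative:

// platform/config.ts (sketch). Env names match the list above. Values that moved to
// runtime_settings are read here only as static fallbacks; the table wins when a row exists.
function required(name: string): string {
  const value = process.env[name];
  if (!value) throw new Error(`Missing required env var ${name}`);
  return value;
}

export const config = {
  llm: {
    provider: process.env.LLM_PROVIDER ?? "anthropic",
    model: process.env.LLM_MODEL ?? "claude-haiku-4-5-20251001",
    apiKey: required("LLM_API_KEY"),
  },
  embeddings: {
    provider: process.env.EMBED_PROVIDER ?? "voyage",
    model: process.env.EMBED_MODEL ?? "voyage-3-lite",
    dimensions: Number(process.env.EMBED_DIMENSIONS ?? 512),
  },
  databaseUrl: required("DATABASE_URL"),
  retrievalFallbacks: {
    enabled: (process.env.RETRIEVAL_ENABLED ?? "true") === "true",
    topK: Number(process.env.RETRIEVAL_TOP_K ?? 5),
    minSimilarity: Number(process.env.RETRIEVAL_MIN_SIMILARITY ?? 0.55),
  },
};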

9 · Directory Structure

challenger/
├── backend/
│   ├── src/
│   │   ├── agents/
│   │   │   ├── maker.ts              # NEW
│   │   │   ├── checker.ts            # NEW (replaces rejector in compliance mode)
│   │   │   ├── rejector.ts           # kept for UX explainability
│   │   │   ├── judge.ts
│   │   │   ├── prompts/
│   │   │   │   ├── maker.ts
│   │   │   │   ├── checker.ts
│   │   │   │   ├── judge.ts
│   │   │   │   └── index.ts          # exports prompt_set_id
│   │   │   └── llmClient.ts          # provider switch (Anthropic | Bedrock)
│   │   ├── knowledge/
│   │   │   ├── ingest/
│   │   │   │   ├── loader.ts         # PDF | HTML
│   │   │   │   ├── chunker.ts        # section-aware
│   │   │   │   ├── embedder.ts       # provider switch
│   │   │   │   └── worker.ts         # SQS consumer in prod
│   │   │   ├── retriever.ts          # hybrid + rerank
│   │   │   ├── bundles.ts
│   │   │   └── index.ts
│   │   ├── engines/
│   │   │   ├── guardrail.ts
│   │   │   ├── fallback.ts
│   │   │   └── gapDetector.ts        # NEW
│   │   ├── orchestrator/
│   │   │   ├── index.ts              # v3 orchestrator
│   │   │   └── factorLoop.ts
│   │   ├── feedback/
│   │   │   ├── corrections.ts
│   │   │   ├── outcomes.ts
│   │   │   └── fewShot.ts            # retrieves top-k corrections
│   │   ├── versioning/
│   │   │   ├── manifest.ts
│   │   │   └── activate.ts
│   │   ├── eval/
│   │   │   ├── goldenSet/            # JSON cases
│   │   │   ├── runner.ts
│   │   │   └── diff.ts               # CLI-callable
│   │   ├── platform/
│   │   │   ├── blobStore.ts          # local FS | S3 adapter
│   │   │   ├── queue.ts              # in-proc | SQS adapter
│   │   │   ├── auth.ts
│   │   │   ├── config.ts
│   │   │   └── logger.ts
│   │   ├── routes/
│   │   │   ├── decision.ts
│   │   │   ├── knowledge.ts
│   │   │   ├── gaps.ts
│   │   │   ├── feedback.ts
│   │   │   ├── versions.ts
│   │   │   └── eval.ts
│   │   ├── db/
│   │   │   ├── migrations/
│   │   │   │   ├── 001_init.sql
│   │   │   │   ├── 010_knowledge.sql
│   │   │   │   ├── 011_gaps.sql
│   │   │   │   ├── 012_feedback.sql
│   │   │   │   └── 013_versioning.sql
│   │   │   ├── pool.ts
│   │   │   ├── sessions.ts
│   │   │   ├── chunks.ts
│   │   │   ├── gaps.ts
│   │   │   └── versions.ts
│   │   ├── buffer.ts                 # SSE buffer (unchanged)
│   │   └── index.ts
│   └── package.json
├── frontend/
│   └── src/
│       ├── app/
│       │   ├── decision/[id]/
│       │   ├── analyst/decision/[id]/
│       │   ├── operator/
│       │   │   ├── gaps/              # NEW — gap queue
│       │   │   ├── knowledge/         # NEW — ingestion + bundles
│       │   │   └── versions/          # NEW — manifest admin
│       ├── components/
│       │   ├── FactorDebate/
│       │   ├── CitationChip/          # NEW
│       │   ├── GapBanner/             # NEW
│       │   ├── ResolveGapDialog/      # NEW
│       │   └── VersionBadge/          # NEW
│       └── hooks/
│           ├── useDebateStream.ts
│           └── useGapStream.ts        # NEW
├── packages/
│   └── shared/
│       └── src/
│           ├── types.ts
│           ├── schemas/
│           │   ├── maker.ts
│           │   ├── checker.ts
│           │   ├── judge.ts
│           │   ├── gap.ts
│           │   └── feedback.ts
│           └── retrieval.ts
├── infra/
│   ├── docker-compose.dev.yml
│   └── aws/
│       ├── cdk/                        # CDK app (TS) — one stack per concern
│       │   ├── bin/challenger.ts
│       │   ├── lib/
│       │   │   ├── network-stack.ts     # VPC + subnets
│       │   │   ├── data-stack.ts        # RDS · S3 · SQS
│       │   │   ├── compute-stack.ts     # ECS services · ALB
│       │   │   ├── ai-stack.ts          # Bedrock IAM · SageMaker endpoints
│       │   │   └── observability-stack.ts
│       │   └── cdk.json
│       └── README.md
├── docs/
│   ├── Architecture_Recommendation_v3_Self_Improving_Agent.md
│   ├── Core_Flows_—_Factor-Based_Debate_(v2).md
│   └── Tech_Plan_—_Multi-Agent_Credit_Decision_System.md
├── challenger-spec-v3.html              # this file
└── README.md

10 · MVP Build Order & Milestones

Phase | Goal | Deliverables | Weeks
Phase 0 | Knowledge foundation | Ingest 3 top-tier docs (EBA/GL/2020/06, ECB Guide, OeNB) · pgvector table · hybrid retriever · /knowledge/ingest + /knowledge/search endpoints · Maker/Rejector prompts accept retrieved_passages. No UI changes. | 1–2
Phase 1 | Grounded debate + gap detection | New agent output schemas with citation_ids · Checker agent · gap detector · knowledge_gaps table · SSE gap_detected event · Analyst UI: citation chips + gap banner. | 3–4
Phase 2 | Feedback capture — first shippable demo | Operator UI: gap queue + resolve dialog · correction dialog on Analyst view · few-shot retriever injects corrections into Maker prompt · single-tenant auth (Cognito dev pool). Show to 3 compliance officers. | 5–6
Phase 3 | Eval + versioning | 200-case golden set · version_manifests table · eval runner CLI · diff report · version_manifest_id persisted on every session · version badge in UI. | 7–10
Phase 4 | Self-improvement automation | Scheduled re-run of failing eval cases after gap resolution · "what improved" weekly report · prompt-evolution helper · jurisdiction-aware retrieval routing · multi-bank tenant isolation (if pilot demands). | 11–14
Deferred | Fine-tuning · multi-tenancy · on-prem · classical-ML scoring model | Revisit after 3 paying pilots and ≥5k labelled corrections. Fine-tuning via Bedrock model customisation only when the business case is measurable.

10.1 Phase 0 definition of done

  • Postgres with pgvector extension running locally via docker-compose.
  • 3 documents ingested; SELECT count(*) FROM knowledge_chunks ≥ 500.
  • GET /knowledge/search?q=missed+payments&jurisdiction=EU returns 5 passages in < 300ms.
  • Existing v2 debate runs unchanged; Approver prompt now includes retrieved passages section; output unchanged.
  • Latency budget: debate completes in < 8s end-to-end (v2 was 3–5s; retrieval adds ≤ 300ms × 5 factors).

10.2 Phase 2 success metric

3 compliance officers test it. Each runs 10 cases. We capture: did they trust the citations? Did the gap queue capture real unknowns? Did they use the correction UI? Without their validation, Phase 3 is premature.

11 · Observability, Evals, Security

11.1 Observability

11.2 Eval cadence

11.3 Security baseline

12 · Key Decisions & Trade-offs

Decision | Choice | Trade-off accepted
Vector store | pgvector in same RDS | Scales to ~10M chunks comfortably; swap to OpenSearch later. Saves one service.
LLM provider | Anthropic dev → Bedrock prod | Region-locked, compliance-friendly. Costs ~15% more; acceptable for regulated buyers.
Debate vs single-pass | Debate kept, with cheap-path escape | High-confidence cases (Maker confidence ≥ 0.9 AND Checker OK) skip Rejector. Cost control.
Fine-tuning | Deferred indefinitely | RAG + few-shot corrections capture ~90% of value for ~10% of cost.
Rejector vs Checker | Both, different roles | Rejector = UX explainability. Checker = compliance validation. Judge reads both.
Multi-tenancy | Single-tenant in MVP | Postpone row-level security + per-bank knowledge isolation until paying pilot demands it.
Ingestion worker | In-process locally, SQS+ECS in prod | Queue adapter keeps code identical.
Auth | Cognito | Standard AWS pathway; OIDC for future bank SSO.
Frontend hosting | Vercel (Amplify as alternative) | Faster iteration than Amplify. Move to Amplify only if data-residency becomes blocking.
Classical ML scoring model | Out of scope | If ever needed, Checker calls it as a tool — do not entangle with agent pipeline.
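The cheap-path escape from the "Debate vs single-pass" row, sketched (the 0.9 threshold and Checker condition come from the row; types, status values, and the function name are illustrative):

// Sketch of the cheap-path gate: high-confidence, checker-clean factors skip the Rejector.
interface MakerOutput { confidence: number; known: boolean; }
interface CheckerOutput { status: "OK" | "CONFLICT" | "INSUFFICIENT"; }

export function canSkipRejector(maker: MakerOutput, checker: CheckerOutput): boolean {
  return maker.confidence >= 0.9 && checker.status === "OK";
}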

13 · Week-1 Concrete Actions

  1. Stand up the dev stack. docker compose -f infra/docker-compose.dev.yml up — Postgres+pgvector, MinIO, TEI reranker. Migration 010_knowledge.sql creates knowledge_bundles and knowledge_chunks with ivfflat index.
  2. Write the ingestion pipeline. backend/src/knowledge/ingest/ — PDF loader (via pdf-parse) · section-aware chunker (regex on Article \d+ / §; see the chunker sketch at the end of this section) · embedder (OpenAI for dev) · insert chunks with bundle_id. Target: process EBA/GL/2020/06 end-to-end.
  3. Ingest three documents: EBA/GL/2020/06, ECB Guide to Internal Models, OeNB Guidelines on Credit Risk Management. Confirm ≥500 chunks; spot-check 10 random chunks for correct section breadcrumbs.
  4. Build the retriever. backend/src/knowledge/retriever.ts — hybrid (dense + tsvector) · tier-weighted fusion · reranker POST to http://localhost:8080. Expose as GET /knowledge/search?q=&jurisdiction=&factor= for debug.
  5. Wire retrieval into existing prompts. Update backend/src/agents/prompts.ts — Approver and Rejector templates accept retrieved_passages. Orchestrator calls retriever before each factor. Ship, smoke-test, verify existing eval still passes.
  6. Output-schema change (Phase 1 prep). Draft the new Maker output schema with citation_ids and known — land it as a separate PR after Phase 0 is green.
Definition of "Week 1 done". A loan application runs through the existing v2 debate, and the orchestrator logs show retrieved passages were passed into every factor's Approver prompt. No UI changes, no schema breakage, no citations yet — just retrieval wired in. That unlocks Phase 1.
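The section-aware chunker from step 2 of the Week-1 list, sketched (the Article / § heading patterns come from that step; breadcrumb assembly and size limits are omitted, and names are illustrative):

// knowledge/ingest/chunker.ts (sketch). Splits a document on regulatory headings so
// each chunk carries its section label for citation breadcrumbs.
const HEADING = /^(Article\s+\d+[a-z]?\b.*|§\s*\d+(\.\d+)*.*)$/;

export interface SectionChunk {
  section: string;   // e.g. "Article 58" or "§ 5.2"
  text: string;
}

export function chunkBySection(docText: string): SectionChunk[] {
  const chunks: SectionChunk[] = [];
  let current: SectionChunk | null = null;
  for (const line of docText.split("\n")) {
    if (HEADING.test(line.trim())) {
      if (current) chunks.push(current);
      current = { section: line.trim(), text: "" };
    } else if (current) {
      current.text += line + "\n";
    }
  }
  if (current) chunks.push(current);
  return chunks;
}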