Marvin is a software credit committee. Eight specialised agents — five working live cases, two keeping the bank’s policy library current, and one (Marvin himself) watching the other seven from above. Every decision is regulator-explainable; every approval is grounded in a citation; every system slowdown gets caught before it ships a bad call. This document walks you through how to use the system and what happens behind the scenes when you do.
This document is a guided tour of Marvin. Read it top to bottom on your first visit and you’ll know how to submit a case, follow a debate, answer the system’s questions, read a verdict, and verify the system on your own historical data. Come back to specific sections when you need them.
We’ll keep this practical. Every concept gets a concrete example, every screen has a pointer to where you find it, and every technical detail is explained in plain language first.
Marvin is a credit-decision platform built like a real bank credit committee — except every member is a specialised AI agent, every disagreement is resolved by a piece of code that an auditor can read line by line, and every call leaves a paper trail of citations back to source policy.
The eight agents are split across three departments:
Department A: the five front-office agents that work every credit case in real time (Compliance, Underwriter, Quant, Orchestrator, Archivist). They run synchronously per case and finish in about a minute.

Department B: two background agents that keep the bank’s policy library current. Scout reads the open web for new regulations; Architect drafts policy revisions for the operator to approve.

Department C: one agent — Marvin — sits above the other seven and watches their conversations, traces, citations, and clarification logs. He never talks to applicants and never writes policy. He surfaces drift before it becomes a bad decision. The product is named after him because this watcher-of-watchers role is the thing that makes the rest of the system safe to run unattended.
A real credit committee works because each member has narrow expertise and clear authority. A compliance officer can hard-stop a deal a salesperson loves. A quant can produce numbers without caring whether the deal goes through. The chair listens to all three and follows a written hierarchy when they disagree. We ported that structure into software because it’s the structure auditors and regulators already understand.
A monolithic LLM, no matter how good, has to be one entity holding contradictory roles in its head. Marvin’s agents are specialised by design — the Compliance Guardian can’t accidentally soften a sanctions block to chase a sale, because the Compliance Guardian’s prompt and document allowlist physically don’t reach the policy clauses that an Underwriter would use to argue.
Each agent has a written persona, a fixed allowlist of documents it can read, and a narrow output contract. You’ll see them on the live committee view (Department A) and in the admin tabs (Departments B and C). Click any agent in the app to see its current system prompt under Settings; the operator can edit a prompt without a redeploy, and the change is visible to the next debate.
Archivist · RAG provider · the librarian
“You are the keeper of the keys. You don’t form opinions. You hand the right document to the agent that asked.”
Reads: the entire corpus on behalf of others. Speaks: never — the Archivist is invisible to the operator. Behind the scenes, every other Department-A agent calls the Archivist to retrieve passages from its own slice of the corpus.
Compliance Guardian · AML / KYC / sanctions / PEP · hard-stop authority
“Paranoid prosecutor. Cites article, not opinion. When I say block, the deal stops — no override.”
Reads: the AML/KYC instruction. Verdicts: pass, review-required, hard-reject. A hard-reject ends the pipeline immediately; the other agents don’t even run.
Underwriter · deal structurer · finds compensating factors
“Sales-leaning optimist. I argue the applicant’s case to the chair. I don’t fear bad numbers; I look for legal ways to work around them.”
Reads: credit policy, risk-appetite exclusion list, application-package requirements, self-employed annexes. Output: a recommendation (APPROVE / REJECT / REVIEW), a list of compensating factors, and an application-completeness check.
Data Quant · numbers · DSTI / DSCR / credit score / default probability
“Cold mathematician. I see the world in ratios. Sentimentality is noise. I produce numbers and flag whether they fit the methodology.”
Reads: scoring methodology, refinancing/consolidation rules, self-employed annex § 3 (DSCR formula). Verdicts: within-policy, breach-soft, breach-hard.
Orchestrator · chair · applies the conflict tree, writes the verdict
“Wise judge. I don’t compute formulas or read identity documents. I weigh the three dossiers and explain the call in language a regulator can read.”
Reads: the credit-committee precedents log + strategic-bonus / compensating-factor sections of the relevant annexes. Output: the final decision narrative.
Scout · web + RSS monitor
“Hyperactive digital reporter. I rove the open web for changes the bank should care about — new regulations, EBA guidelines, macro shifts. I don’t analyse deeply; I bring back the link.”
Reads: nothing internal — the open web only. Every finding lands in the Newsroom tab with a relevance score, and you can dismiss, flag, or hand it to the Architect.
Architect · drafts internal-policy revisions
“Proactive bureaucrat-visionary. I love matrices. Give me a Scout finding or an orchestrator gap, and I’ll draft the policy update for you to approve. Clean Markdown, ready to ingest.”
Reads: existing internal policies (for reference) and Scout’s findings. Output: a draft in the Policy Lab tab with title, body, target tier, and proposed bundle label. The Architect can never write Tier 1 (regulator) drafts — that’s reserved for the regulator’s actual text.
Marvin · AI auditor · the system watching itself
“I don’t care about banking. I care about the agent layer. I read the other seven’s logs, traces, citation lookups, eval runs. When something drifts, I write you a recommendation.”
Reads: system logs only — agent_outputs, decision_version_manifest, eval_runs, clarification_events. Output: findings in the System Health tab, sorted by severity. Marvin recommends — he never deploys anything himself.
Each persona above is stored in runtime_settings, and you can edit it from the admin Settings tab. The default personas are good for most banks; if your house style differs, change the prompt and the next debate uses the new wording.
Let’s follow a single application end to end. Imagine Maria, a self-employed consultant, applying for a €30,000 loan to consolidate two existing consumer loans into a single facility. Her credit score is 620 (borderline), her income is €38,000/year, and she has one missed payment from 18 months ago.
Click Cases in the left sidebar to open the inbox. Pick a preset that looks similar to Maria’s profile (or click + New case to start from scratch), fill the borrower fields, and click Run decision at the top right.
You’re sent to the analyst page where the live committee starts streaming
dossiers. The URL is /analyst/decision/<sessionId> — bookmark
it; that’s your audit-trail link for this case.
The Compliance Guardian opens the case. He runs sanctions and PEP screening, retrieves relevant AML clauses, and produces a verdict. Three outcomes are possible: pass, review-required, or hard-reject (a hard-reject stops the pipeline on the spot).
For Maria the screening comes back clean. Compliance returns pass.
Now two agents run side by side to keep wall-clock time under a minute. The Underwriter
reads policy passages, looks at Maria’s package, and produces a dossier with a
recommendation, a list of compensating factors, and a check on the application
package’s completeness. Compensating factors live in a fixed enumeration —
additional_collateral, co_signer, income_history,
liquidity_buffer, tenure, low_dti,
down_payment, relationship_value, regulatory_exception
— so the Orchestrator can branch on them later without surprises.
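Because the factor list is closed, downstream code can exhaust-check it at compile time. A minimal sketch of how that enumeration might be typed in TypeScript; the union members come from the list above, while the dossier shape around them is an illustrative assumption, not the production schema:

```typescript
// The nine compensating factors the Underwriter may cite (fixed enumeration).
type CompensatingFactor =
  | "additional_collateral"
  | "co_signer"
  | "income_history"
  | "liquidity_buffer"
  | "tenure"
  | "low_dti"
  | "down_payment"
  | "relationship_value"
  | "regulatory_exception";

// Hypothetical shape of one cited factor inside an Underwriter dossier.
interface CitedFactor {
  factor: CompensatingFactor;
  strong: boolean;    // assumption: whether the factor counts as "strong" downstream
  rationale: string;  // one-sentence, citation-backed justification
}
```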
The Quant computes Maria’s DSTI (debt-service-to-income), and because she’s
self-employed, also her DSCR (debt-service coverage ratio). It pulls credit-bureau and
social-security data through stubbed clients (real APIs in production), checks her score
bucket, computes default probability, and emits a verdict: within-policy,
breach-soft (within 10% of a threshold), or breach-hard.
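To make the Quant’s verdict labels concrete, here is a hedged sketch of a DSTI check. The guide calls a soft breach “within 10% of a threshold” without saying whether that margin is relative or in percentage points; this sketch assumes a relative margin, the function shape is illustrative, and the 35% ceiling is simply the self-employed figure used in Maria’s case below:

```typescript
type QuantVerdict = "within-policy" | "breach-soft" | "breach-hard";

// DSTI: monthly debt service divided by monthly income (illustrative definition).
function dsti(monthlyDebtService: number, monthlyIncome: number): number {
  return monthlyDebtService / monthlyIncome;
}

// Classify a ratio against a policy ceiling; "soft" means within the margin of it.
function classify(ratio: number, ceiling: number, softMargin = 0.10): QuantVerdict {
  if (ratio <= ceiling) return "within-policy";
  if (ratio <= ceiling * (1 + softMargin)) return "breach-soft";
  return "breach-hard";
}

// A 37% DSTI against a 35% ceiling lands just inside the soft band.
const verdict = classify(dsti(740, 2_000), 0.35); // "breach-soft"
```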
For Maria, the Quant reports a breach-soft on DSTI — she’s 4 points over the 35% ceiling for self-employed applicants under 680 score — but the Underwriter has flagged a strong compensating factor: her consolidation actually reduces her monthly burden by 18%.
The three dossiers go into a deterministic TypeScript function called the
conflict tree. It walks down a five-level hierarchy and returns one of eleven
branches (we’ll cover the full tree in §7). For Maria, the branch is
quant-breach-soft-with-comp: a soft Quant breach overcome by a strong
compensating factor. The decision is APPROVE with conditions.
The Orchestrator runs last. Its job is not to make the decision — the conflict tree already did that. Its job is to write the regulator-explainable narrative that names each agent’s contribution, references the precedent log, and states the conditions of approval. For Maria, that narrative names the soft DSTI breach, the 18% reduction in monthly burden that offsets it, and the conditions attached to the approval.
The analyst page is split into three columns.

A case summary (€25,000 · Self-employed · 720 score · Business expansion) sits alongside a colour-coded status chip so the operator can tell which case they’re looking at, and how it ended, at a glance: lime-pulse while running, green for an actual decision (Approved / Rejected), amber when human action is required (Needs review / Needs rework), red on Failed, neutral on Cancelled.

Each agent starts as a compact status cell (Underwriter · queued · waiting for Compliance); the cell whose agent is currently working expands into a full-height “thinking…” skeleton with the inline cancel chip; once an agent reports, its cell becomes the dossier card itself with the agent’s avatar, verdict pill, citation chips, and duration figure. Reserving full height only for active and finished agents keeps the page short and the eye drawn to the live action.

The whole thing typically completes in 30 to 60 seconds per case.
A small cancel chip appears next to the “thinking…” indicator on whichever agent is currently running — co-located with the activity, not buried in a page top bar. Clicking it stops the orchestrator at the next safe checkpoint; the partial dossiers stay attached to the session for audit. The chip moves automatically as the pipeline advances and disappears once the verdict lands.
Every case ends on one of four outcomes. Each maps to a different operator workflow.
| Outcome | What it means | Operator action |
|---|---|---|
| APPROVE | The committee agreed to lend, with or without conditions. Conditions are listed verbatim on the verdict card. | Forward to disbursal. If conditions are listed, ensure they’re met before funds release. |
| REJECT | The committee refused. The verdict names the rule that fired — sanctions hit, policy exclusion, hard methodology breach. | Send the rejection notice. The reason is regulator-citeable. |
| REVIEW | The committee couldn’t decide. Signals are mixed or compliance asked for human eyes (e.g. PEP exposure). | Open the case, read the dossiers, and decide. Use the Approve / Reject buttons in the four-eyes panel. |
| REWORK | The case is fundamentally fine — compliance OK, policy OK, risk within methodology — but the application package is incomplete. Missing forms, missing signatures, wrong template. | Not a credit decline. The verdict carries a checklist of missing items. Send the list back to the branch or the customer to complete the package, then resubmit. |
Sometimes the committee never delivers a verdict — the LLM provider is down, an API key is invalid, the network drops mid-debate. In that case the verdict card is replaced by a clear “Committee unavailable” error block and the session is marked FAILED, not COMPLETED. The four-eyes review form is suppressed (there is no automated decision to sign off on) and the case can be re-run via the Re-run debate button once the underlying issue is fixed. No work is lost — the application package, the corpus reads, and any partial dossiers stay attached to the original session for audit.
Sometimes an agent isn’t confident enough to commit. When confidence drops below the threshold (default 0.6) and the agent has a single specific question that an operator could realistically answer in a sentence, the agent pauses and asks. This is by design — you want the system to flag genuine uncertainty rather than fabricate confidence.
The dossier card on the analyst page replaces its body with a question banner: “✋ Agent paused for clarification — round 1 of 2.” Below that, the agent’s question (often multi-part, e.g. “(1) How long has the business been registered? (2) What’s the appraised collateral value? (3) Were the missed payments recent or historical?”) and a small text-area for your answer.
Each agent gets up to two clarification rounds per case. If the agent is still uncertain after round 2, the loop caps and the case proceeds to the conflict tree with whatever best-effort dossier the agent produced — usually a REVIEW recommendation, which routes to a human anyway.
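Put together, the pause condition is small enough to sketch directly: confidence below the threshold, a concrete question to ask, and an unused clarification round. The 0.6 and 2 defaults are the ones quoted above; the field names are assumptions for illustration:

```typescript
const CLARIFICATION_THRESHOLD = 0.6;     // runtime setting, editable in admin Settings
const MAX_CLARIFICATIONS_PER_AGENT = 2;  // runtime setting, cap on rounds per case

interface ClarificationState {
  confidence: number;        // the agent's self-reported confidence, 0..1
  pendingQuestion?: string;  // a single focused question, if the agent has one
  roundsUsed: number;        // clarification rounds already consumed this case
}

// Pause only when the agent is genuinely unsure, has something specific to ask,
// and has not exhausted its rounds; otherwise it ships a best-effort dossier.
function shouldPauseForClarification(s: ClarificationState): boolean {
  return (
    s.confidence < CLARIFICATION_THRESHOLD &&
    s.pendingQuestion !== undefined &&
    s.roundsUsed < MAX_CLARIFICATIONS_PER_AGENT
  );
}
```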
A Q&A you provide on one case becomes standing guidance for similar future cases. The next time an agent is about to ask the same question, it gets your earlier answer injected into its prompt as institutional memory. You can see this on the analyst page as a “📚 Prior operator guidance” row above the agent’s argument — expand it to read what context the agent was given.
When the three live agents disagree — Compliance says pass, Quant says breach-hard, Underwriter says approve — somebody has to decide who wins. In a real bank committee, that’s the chair’s job, and the chair follows a written hierarchy. We wrote the hierarchy in TypeScript. It’s a function, it’s reproducible, and it has unit tests. The LLM only writes the prose that explains which branch fired.
| Level | Owner | Question it answers |
|---|---|---|
| L1 · Absolute Veto | Compliance Guardian | Is the deal legal? Sanctions, AML, statutory limits. |
| L2 · Corporate Veto | Underwriter | Does the deal fit bank policy / risk appetite? |
| L3 · Quantitative Risk | Data Quant | Do the numbers fit the methodology? |
| L4 · Operational & Process | Underwriter | Is the application package complete? |
| L5 · Heuristic / STP | Orchestrator | Anything else? Straight-through approval or fallthrough review. |
Higher levels never give way to lower ones. A Compliance hard-reject ends the conversation regardless of how attractive the numbers look. A bank-policy decline can only be overturned by an explicit regulatory_exception factor (which routes the case to senior management, not to auto-approval).
The function picks one of these branches and returns its name. The branch name is the audit trail.
| L | Branch | Decision | What it means |
|---|---|---|---|
| L1 | compliance-hard-reject | REJECT | Sanctions / AML / forbidden sector. Hard stop. No override. |
| L1 | compliance-review-required | REVIEW | PEP exposure or weak fuzzy match. Human compliance officer signs. |
| L2 | policy-decline | REJECT | Underwriter rejects on bank-policy grounds; numbers are fine, the block is policy-driven. |
| L2 | policy-escalation | REVIEW | Same as policy-decline but Underwriter cited a regulatory_exception — routes to senior sign-off. |
| L3 | quant-breach-hard-no-comp | REJECT | Hard methodology breach with no strong compensating factor. |
| L3 | quant-breach-hard-with-comp | REVIEW | Hard breach but Underwriter cited a strong factor — non-trivial, defer to a human. |
| L3 | quant-breach-soft-with-comp | APPROVE | Soft breach + strong factor — approve with conditions. |
| L3 | quant-breach-soft-no-comp | REVIEW | Borderline numbers, no clear save — defer. |
| L4 | documents-rework | REWORK | Case otherwise clean, package incomplete — return for paperwork. |
| L5 | all-clear | APPROVE | Everything passes — straight-through approval. |
| L5 | fallthrough-review | REVIEW | Signals disagree in a way the explicit branches don’t cover — defer. |
Auditors will ask: “why did the system approve this loan?” The answer must not depend on which model version was running on the day. By implementing the tree in TypeScript, we get reproducibility (same inputs always produce the same branch), testability (every branch has a unit test), and a function a regulator can read line by line before any LLM is involved. The LLM still runs — the Orchestrator agent — but only to write the prose explanation of the branch the tree picked.
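To make “readable line by line” tangible, here is a heavily simplified sketch of that walk. The branch names and decisions match the table above; the dossier field names and the test for a “strong” compensating factor are assumptions, and the real function handles more edge cases than this:

```typescript
type Decision = "APPROVE" | "REJECT" | "REVIEW" | "REWORK";

// Assumed dossier shapes; the real inputs also carry citations, confidence, etc.
interface Dossiers {
  compliance: { verdict: "pass" | "review-required" | "hard-reject" };
  underwriter: {
    recommendation: "APPROVE" | "REJECT" | "REVIEW";
    strongFactors: string[];            // assumed: "strong" entries from the factor enum
    citedRegulatoryException: boolean;
    packageComplete: boolean;
  };
  quant: { verdict: "within-policy" | "breach-soft" | "breach-hard" };
}

// Simplified five-level walk. Higher levels never give way to lower ones.
function conflictTree(d: Dossiers): { branch: string; decision: Decision } {
  // L1: absolute veto
  if (d.compliance.verdict === "hard-reject")
    return { branch: "compliance-hard-reject", decision: "REJECT" };
  if (d.compliance.verdict === "review-required")
    return { branch: "compliance-review-required", decision: "REVIEW" };

  // L2: corporate veto
  if (d.underwriter.recommendation === "REJECT")
    return d.underwriter.citedRegulatoryException
      ? { branch: "policy-escalation", decision: "REVIEW" }
      : { branch: "policy-decline", decision: "REJECT" };

  // L3: quantitative risk
  const hasStrongComp = d.underwriter.strongFactors.length > 0;
  if (d.quant.verdict === "breach-hard")
    return hasStrongComp
      ? { branch: "quant-breach-hard-with-comp", decision: "REVIEW" }
      : { branch: "quant-breach-hard-no-comp", decision: "REJECT" };
  if (d.quant.verdict === "breach-soft")
    return hasStrongComp
      ? { branch: "quant-breach-soft-with-comp", decision: "APPROVE" }
      : { branch: "quant-breach-soft-no-comp", decision: "REVIEW" };

  // L4: operational and process
  if (!d.underwriter.packageComplete)
    return { branch: "documents-rework", decision: "REWORK" };

  // L5: heuristic / straight-through
  return d.underwriter.recommendation === "APPROVE"
    ? { branch: "all-clear", decision: "APPROVE" }
    : { branch: "fallthrough-review", decision: "REVIEW" };
}
```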
The agents’ opinions are only as good as the documents they read. The corpus is the bank’s knowledge layer: every internal policy, every regulatory text, every annex, every precedent log. The Corpus tab in the admin console is where you upload, review, and remove bundles.
Open Admin console → Corpus. The upload form has four sections, each answering a question about the document:
The document: the PDF itself. Drag in or browse.

Authority tier: how authoritative the content is. Five tiers, T1 highest. Drives retrieval ranking.

Agent access: which agents get to read this. Marvin’s least-privilege RAG means a bundle is visible only to the agents on its allowlist and invisible to every other agent. Global fans the bundle out to all eight agents (rare; reserved for shared glossaries).

Metadata: Source ID (the citation label, e.g. EBA-GL-2020-06), jurisdiction (EU, BG, Internal/Bank, etc.), volatility (static / evolving / dynamic) and version. All four propagate into every chunk produced from the document.
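Since all four metadata fields propagate into every chunk, each retrieved passage carries its own provenance. A sketch of what a chunk row plausibly looks like after ingestion; the field names are assumptions based on the upload form and the Preview strip described below, not the actual knowledge_chunks schema:

```typescript
// Illustrative shape of one ingested chunk; not the real knowledge_chunks schema.
interface KnowledgeChunk {
  chunkId: string;
  bundleLabel: string;     // which uploaded bundle the chunk came from
  tier: 1 | 2 | 3 | 4 | 5; // authority tier, T1 highest
  sourceId: string;        // citation label, e.g. "EBA-GL-2020-06"
  jurisdiction: string;    // e.g. "EU", "BG", "Internal/Bank"
  volatility: "static" | "evolving" | "dynamic";
  version: string;
  breadcrumb: string;      // section header shown in the Preview strip
  text: string;
  tokenCount: number;
}
```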
| Tier | Name | Examples | Conflict-tree role |
|---|---|---|---|
| T1 | Absolute Veto | EBA guidelines, BNB ordinances, statute law | Hard stop. No override. Authored only by the regulator. |
| T2 | Corporate Veto | Internal credit policy, risk appetite, sector exclusions | Vetoes anything below it. Authored by the bank. |
| T3 | Quantitative Risk | DSTI/DSCR formulas, scoring model, threshold tables | Numeric inputs. Override-able only by T1/T2 with documented exception. |
| T4 | Operational & Execution | Application form layout, document checklist | How-to material. Conflicts here are clerical, not policy. |
| T5 | Heuristics & Precedents | Precedents log, training material, analyst rules-of-thumb | Lowest authority. Useful as soft signal; never decisive on its own. |
When the embedder is unreachable. RAG is an enhancement, not a hard
dependency. If the embedding service times out (Voyage / OpenAI flake, network blip,
bad outbound DNS), retrieval returns an empty passage set with an explicit
embedder_unavailable gap. The agent runs without grounding, declares
known: false in its dossier, and the conflict tree typically lands the
case in REVIEW with the missing-grounding signal in the audit trail. The
committee still produces a real verdict; it just stays cautious. No silent
degradation, no fallback-engine placeholder.
Retriever weights descend with tier (1.00 → 0.92 → 0.80 → 0.70 → 0.60). A T1 chunk and a T5 chunk with identical similarity rank T1 first; a T5 chunk has to clear a meaningfully higher bar to compete.
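One plausible reading of those descending weights is a per-tier multiplier on cosine similarity. Whether the production retriever multiplies, adds, or re-ranks is not stated here, so treat the combination rule in this sketch as an assumption; the weights themselves are the ones quoted above:

```typescript
const TIER_WEIGHT: Record<number, number> = { 1: 1.0, 2: 0.92, 3: 0.8, 4: 0.7, 5: 0.6 };

interface ScoredChunk {
  chunkId: string;
  tier: 1 | 2 | 3 | 4 | 5;
  similarity: number; // cosine similarity to the query, 0..1
}

// Assumed combination: similarity scaled by tier weight, sorted descending.
// At equal similarity a T1 chunk always outranks a T5 chunk; a T5 chunk needs
// roughly 1.67x the similarity of a T1 rival (1.0 / 0.6) to win the slot.
function rank(chunks: ScoredChunk[], topK: number): ScoredChunk[] {
  return [...chunks]
    .sort((a, b) => b.similarity * TIER_WEIGHT[b.tier] - a.similarity * TIER_WEIGHT[a.tier])
    .slice(0, topK);
}
```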
Every bundle row carries a Preview button next to Delete. Clicking it
expands a strip showing the first few chunks of text the agents will see — the
breadcrumb / section header for each chunk, a 600-character excerpt, the tier badge, and
the token count. The goal is fast verification: you can scroll the corpus list and confirm
“yes, the AlfaBank consumer-loan policy actually has a chapter on DSTI in here”
without trusting the chunk count alone or opening the source PDF in a separate tab.
Preview is read-only; chunk text comes straight from knowledge_chunks.
Each agent reads only what its allowlist explicitly grants. The admin’s Corpus tab shows you, on every bundle row, which agents currently have access via the “Read by” column — small chips with each agent’s avatar.
| Agent | Reads |
|---|---|
| Archivist | Everything (privileged) |
| Underwriter | Credit policy, risk-appetite list, application-package requirements, self-employed annexes (income verification + compensating factors) |
| Data Quant | Scoring methodology, refinancing/consolidation rules (DTI sections), self-employed annex (DSCR formula) |
| Compliance Guardian | AML/KYC instruction only |
| Orchestrator | Precedents log, refinancing strategic-bonus sections, self-employed annex compensating-factor sections |
| Scout | None internal — web only |
| Architect | Old versions of internal docs (so it knows what to revise) plus regulatory reading material |
| Marvin | None — reads system logs, never policies |
Every claim an agent makes carries a chunkId citation. On the analyst page,
these render as small purple chips next to each argument. Click one to open the source
passage in a side dialog, with full text, breadcrumb, jurisdiction, and tier. If a
citation chip ever fails to resolve (the chunk vanished from the corpus), Marvin’s
next overnight run flags it as a hallucination finding.
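The overnight check this implies is simple to picture: collect every chunkId the agents cited and flag the ones that no longer resolve. A sketch under that reading; the lookup function stands in for the real query against knowledge_chunks, and the record shape is an assumption:

```typescript
interface CitationRecord {
  sessionId: string;
  agent: string;
  chunkId: string;
}

// chunkExists stands in for a real lookup against the knowledge_chunks table.
async function findUnresolvedCitations(
  citations: CitationRecord[],
  chunkExists: (chunkId: string) => Promise<boolean>,
): Promise<CitationRecord[]> {
  const unresolved: CitationRecord[] = [];
  for (const c of citations) {
    if (!(await chunkExists(c.chunkId))) unresolved.push(c); // candidate hallucination finding
  }
  return unresolved;
}
```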
A regulator changes a guideline; a new sector appears on a watchlist; an academic paper shifts the consensus on a methodology. Most credit-AI products freeze at deployment and drift quietly. Marvin has Department B watching the open world so the corpus stays current.
Open Admin console → Newsroom. The Scout polls regulatory feeds (BNB, EBA, Държавен вестник (the Bulgarian State Gazette), ECB, BIS) on a schedule and runs targeted web searches for textbooks and working papers. Each finding lands here with a relevance score, category pill, and source link. You’ll see four states: a finding is either new, flagged, dismissed, or already turned into an Architect draft.
Click a finding to open the full body. From here you can Draft policy — which sends the finding to the Architect — or Dismiss if it’s noise. Click Run Scout now to pull a fresh batch on demand.
When the Architect has already produced a draft from a finding, the row shows a small
→ View draft in Policy Lab link with the draft’s current status
(proposed / approved / rejected). Clicking it
jumps to Policy Lab, expands the matching draft card, and tints it briefly so you can
see what you were brought to. The two tabs stay separate — different agents,
different actions — but the loop between them is now legible.
Open Admin console → Policy Lab. The Architect produces drafts in response to Scout findings (or to gaps the Orchestrator flagged). Each draft is an expandable card with title, target tier, proposed bundle label, and a Markdown body you can edit inline. Three actions on each draft: edit the Markdown, Approve & ingest (which turns the draft into a corpus bundle), or Reject.
Each draft card carries a ← From news: “…” back-link to the
originating Newsroom row, joined from policy_drafts.source_news_id. Click it
to jump back to the Scout’s finding so you can see what triggered the draft before
you sign off on it. Drafts produced from an internal knowledge gap (rather than external
news) read “orchestrator gap” instead.
The Architect can never target Tier 1; a database CHECK constraint enforces this. If you want to ingest a regulator document, you upload it directly via the Corpus tab as T1.
End to end — click Run Scout now, three findings appear; click Draft policy on one, a new draft lands in Policy Lab; edit the Markdown, click Approve & ingest, the new bundle appears in Corpus. Total wall-clock: 30–60 seconds.
The eighth agent is the one the product is named after. Marvin watches the other seven — their conversations, their dossiers, their citations, their clarification logs, their evals — and surfaces problems before they touch live applicants. The System Health tab is his dashboard.
| Category | What Marvin watches for |
|---|---|
| Bottleneck | An agent’s p95 latency drifting above the cross-agent median. Usually means a retrieval timeout or a misconfigured allowlist. |
| Hallucination | Citation chunkIds that don’t resolve to a real chunk in knowledge_chunks. The agent invented the source. |
| Knowledge gap | The same clarification question being asked across multiple unrelated cases. Either the corpus is missing a policy or the prompt isn’t pointing the agent at the right section. |
| Conflict loop | The Orchestrator’s REVIEW rate climbing over a window. Possible causes: a retrieval threshold drifted, a prompt tweak softened an agent’s commitment, methodology changed. |
| Prompt drift | An eval accuracy drop tied to a specific prompt-set hash. The last edit you made to a system prompt regressed something measurable. |
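For the bottleneck category, the check boils down to comparing each agent’s tail latency against the committee as a whole. A rough sketch, assuming per-agent duration samples are available from the logs Marvin reads; the 1.5x margin is an illustrative choice, not a documented threshold:

```typescript
// Percentile over a sample of durations in milliseconds (rough nearest-rank estimate).
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
  return sorted[idx];
}

// Flag agents whose p95 latency drifts well above the cross-agent median p95.
function bottleneckCandidates(
  durationsByAgent: Record<string, number[]>,
  margin = 1.5, // assumption: how far above the median counts as drift
): string[] {
  const p95s = Object.entries(durationsByAgent).map(
    ([agent, samples]) => [agent, percentile(samples, 95)] as const,
  );
  const medianP95 = percentile(p95s.map(([, v]) => v), 50);
  return p95s.filter(([, v]) => v > medianP95 * margin).map(([agent]) => agent);
}
```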
Each finding shows in System Health with severity (block / warn / info), title, Markdown recommendation, the evidence Marvin used (specific session ids, log excerpts), and the operator actions you can take on it.
Click Run Marvin now to fire an immediate analysis pass. Otherwise Marvin runs nightly on a schedule.
Before you trust the system on a live portfolio, you want to know how it would have performed on cases where the outcome is already known. The Evals tab is for that.
Upload: a CSV with one row per historical loan, giving the application fields plus the realised outcome (defaulted / repaid). The system stores it and shows row counts.

Run: click Run eval on a dataset. The system processes rows in order through the same pipeline a live case uses (Compliance → Underwriter ‖ Quant → Orchestrator). Cursor-based, so re-running picks up where it left off without duplicates.

Results: confusion matrix (predicted vs realised), accuracy, precision/recall, expected loss per €1k lent, and a per-row replay link — click any row to open the full analyst view of that historical case as the system would have decided it today.
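The headline numbers fall out of the confusion matrix directly. A sketch of that arithmetic; treating “reject” as the positive class and the loss-given-default figure are assumptions, since the guide does not spell out the expected-loss formula:

```typescript
interface Confusion {
  approvedRepaid: number;    // predicted APPROVE, loan was repaid
  approvedDefaulted: number; // predicted APPROVE, loan defaulted (the costly error)
  rejectedRepaid: number;    // predicted REJECT, borrower would have repaid
  rejectedDefaulted: number; // predicted REJECT, borrower would have defaulted
}

// lossGivenDefault is an assumed parameter, not a figure from the guide.
function evalMetrics(c: Confusion, lossGivenDefault = 0.45) {
  const total = c.approvedRepaid + c.approvedDefaulted + c.rejectedRepaid + c.rejectedDefaulted;
  const accuracy = (c.approvedRepaid + c.rejectedDefaulted) / total;
  // Treat "reject" as the positive class: how well does the system catch defaulters?
  const precision = c.rejectedDefaulted / (c.rejectedDefaulted + c.rejectedRepaid);
  const recall = c.rejectedDefaulted / (c.rejectedDefaulted + c.approvedDefaulted);
  // Expected loss per EUR 1k lent: share of approved loans that default, times loss severity.
  const defaultRateOnApproved = c.approvedDefaulted / (c.approvedRepaid + c.approvedDefaulted);
  const expectedLossPer1k = defaultRateOnApproved * lossGivenDefault * 1000;
  return { accuracy, precision, recall, expectedLossPer1k };
}
```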
An eval is the closest you get to a back-test. It’s also the input that lets Marvin
detect prompt drift — if you edit a system prompt and the same dataset’s accuracy
drops, the next overnight run fires a prompt-drift finding linking the eval
regression to the prompt-set hash that produced it.
Every case opens with a maker (the operator who submitted) and a slot for a checker (a second operator who must approve). This is the standard regulatory-bank control: no single person can move a case from REVIEW to APPROVE/REJECT.
The analyst page shows a Four-eyes Review block under the dossier grid with four states that track where the checker’s sign-off stands.
The maker can never check their own case — the button is disabled with an
explanation. If you see the wrong checker name on a case, log out and log back in as
the correct user; the maker/checker identity comes from the X-User-Id
header which the admin login flow sets.
The Settings tab in the admin console is where operators tune the system without touching code or redeploying. Five categories of knob:
Caching note: runtime settings are cached in-process for 60 seconds. The
machine that takes the write invalidates immediately, but on a multi-machine deploy any
other backend can serve a stale value for up to 60s after a save. The current single-machine
Fly setup makes this invisible; revisit the TTL or move to LISTEN/NOTIFY /
Redis when adding a second machine.
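The behaviour described above can be pictured as a read-through cache with a 60-second TTL that the writing machine clears immediately after a save. A sketch under those assumptions; the loader stands in for the real database read of runtime_settings:

```typescript
const TTL_MS = 60_000; // runtime settings are cached in-process for 60 seconds

let cached: { value: Record<string, string>; expiresAt: number } | undefined;

// Read-through cache: a machine that did not take the write can keep serving
// its stale copy for up to 60s, which is the multi-machine caveat noted above.
async function getRuntimeSettings(
  load: () => Promise<Record<string, string>>, // stands in for the real DB read
): Promise<Record<string, string>> {
  if (cached && Date.now() < cached.expiresAt) return cached.value;
  const value = await load();
  cached = { value, expiresAt: Date.now() + TTL_MS };
  return value;
}

// The machine that takes the write invalidates its own cache immediately.
function invalidateRuntimeSettingsCache(): void {
  cached = undefined;
}
```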
- RETRIEVAL_TOP_K — how many passages each agent gets. Default 5. Higher = more context but slower and more expensive.
- RETRIEVAL_MIN_SIMILARITY — cosine-similarity floor. Lower it and you risk noise; raise it and you risk gaps.
- CLARIFICATION_THRESHOLD — agent confidence below this triggers a pause if the agent has a question. Default 0.6.
- MAX_CLARIFICATIONS_PER_AGENT — cap on clarification rounds. Default 2.

Each of the eight agents has an editable system prompt. Click any agent’s row to expand the textarea, edit, and save. The change is visible to the next debate. The version manifest records the prompt-set hash on every decision, so an auditor can tell you exactly which prompt version produced a given verdict.
Switch between Anthropic Claude variants (Haiku / Sonnet / Opus) or OpenAI models without a redeploy. Each model has its own pricing tier shown in the dropdown.
The LIVE_PIPELINE flag toggles between the v3 factor debate (legacy) and the
v4 committee. New deployments default to v4; v3 stays available for backwards compatibility
with sessions that already started under it.
| Term | Meaning |
|---|---|
| Agent | One of the eight specialised AI workers (Compliance, Underwriter, Quant, Orchestrator, Archivist, Scout, Architect, Marvin). |
| Department | A grouping of agents by purpose: A (live), B (strategy), C (evolution). |
| Dossier | An agent’s structured output for a single case — verdict, reasoning, citations, confidence. |
| Conflict tree | The deterministic TypeScript function that picks the decision branch from the three Department-A dossiers. |
| Branch | One of the eleven leaves of the conflict tree. The branch name is the audit trail. |
| Tier | The authority level of a corpus document. T1 (regulator) through T5 (precedents). Drives retrieval ranking. |
| Allowlist | Per-agent list of bundles the agent is permitted to read. Implements least-privilege RAG. |
| Bundle | A single ingested document (or document section) in the corpus. The unit of retrieval and allowlisting. |
| Clarification | A pause-and-ask round where an agent surfaces a focused question to the operator. |
| Manifest | The version pin saved per decision — prompt hash, retrieval settings, embedder, corpus bundles. Auditors love these. |
| Maker / Checker | Two-person rule. Maker submits; checker (must be different operator) approves or rejects the final call. |
| Finding | One of Marvin’s observations in the System Health tab. Severity-sorted; recommends only. |
The admin console is grouped into four buckets so the eight tools don’t feel equally urgent: Operate (live work), Knowledge (the corpus and what feeds it), Quality (testing and observability), and Settings. The path column below names the group then the leaf.
| I want to… | Go to |
|---|---|
| Submit a case | Cases (left sidebar) → pick a preset row or hit the lime + Submit new case button (top-right of the list) |
| Watch a debate | The analyst page — opens automatically after submit |
| Review live debates | Admin console → Operate → Debates |
| Resolve a clarification request | Admin console → Operate → Clarifications |
| Upload a policy document | Admin console → Knowledge → Corpus → Upload |
| See what Scout found | Admin console → Knowledge → Newsroom |
| Approve a draft policy | Admin console → Knowledge → Policy Lab |
| Run an eval | Admin console → Quality → Evals |
| Check on system health | Admin console → Quality → System Health |
| Edit an agent’s prompt | Admin console → Settings → expand the agent’s row |
| Tune retrieval / clarification thresholds | Admin console → Settings |
v3 was a single-debate model: an Approver agent and a Rejector agent argued over a fixed list of factors, a Judge agent picked sides, and the operator read the transcript. It worked, but it was hard to map onto a real bank’s controls and the audit trail was hard to read.
v4 is the committee model documented in this guide. Briefly:
| v3 | v4 |
|---|---|
| 3 generic agents (Approver / Rejector / Judge) | 8 specialised agents across 3 departments — including Marvin watching the watchers |
| 5-factor debate; one transcript per case | Per-agent dossiers; conflict tree picks the verdict deterministically |
| 3 outcomes (APPROVE / REJECT / REVIEW) | 4 outcomes — REWORK added for missing-paperwork cases that aren’t credit declines |
| Single corpus pool, shared by all agents | Per-agent allowlists; least-privilege RAG; 5-tier authority scheme |
| Manual policy updates only | Strategy loop — Scout pulls news, Architect drafts, operator approves & ingests |
| No system-level monitoring | Marvin watches drift, hallucinations, bottlenecks, prompt regressions |
| Clarifications were one-shot, factor-keyed | Clarifications are agent-keyed, multi-round, attachment-aware, and feed cross-session memory |
v3 sessions remain queryable on the analyst page — the schema_version
column on debate_sessions drives the layout, so historical v3 cases keep
their original visual treatment when you click Replay.
If you’re evaluating both versions side by side: pick the same input case from the
inbox, run it once with LIVE_PIPELINE=v3 and once with
LIVE_PIPELINE=v4, then compare the analyst pages. The v4 page will tell you
which agent saw what, why, and on the basis of which citation. The v3 page will show you
a debate transcript. Both reach a verdict; v4’s is regulator-citeable end to end.