User guide · Read me first v4 · Marvin

Marvin — How it works

Marvin is a software credit committee. Eight specialised agents — five working live cases, two keeping the bank’s policy library current, and one (Marvin himself) watching the other seven from above. Every decision is regulator-explainable; every approval is grounded in a citation; every system slowdown gets caught before it ships a bad call. This document walks you through how to use the system and what happens behind the scenes when you do.

Audience: Operators, analysts, investors · Read time: ~25 minutes · Status: v4 · production-ready

1 · Welcome — what this guide covers

This document is a guided tour of Marvin. Read it top to bottom on your first visit and you’ll know how to submit a case, follow a debate, answer the system’s questions, read a verdict, and verify the system on your own historical data. Come back to specific sections when you need them.

We’ll keep this practical. Every concept gets a concrete example, every screen has a pointer to where you find it, and every technical detail is explained in plain language first.

Three things to know up front: Marvin doesn’t replace human judgement — it produces an explainable recommendation and routes the genuinely uncertain cases to a human. Every approval comes with a citation that points back to the exact policy clause it relied on. And the system watches itself: Marvin (the agent the product is named after) flags drift, hallucinations, and bottlenecks before they touch real applicants.

2 · What Marvin is, in one minute

Marvin is a credit-decision platform built like a real bank credit committee — except every member is a specialised AI agent, every disagreement is resolved by a piece of code that an auditor can read line by line, and every call leaves a paper trail of citations back to source policy.

The shape of the system

The eight agents are split across three departments:

Department A · Live execution

The five front-office agents that work every credit case in real time. Compliance, Underwriter, Quant, Orchestrator, Archivist. They run synchronously per case and finish in about a minute.

Department B · Strategy & development

Two background agents that keep the bank’s policy library current. Scout reads the open web for new regulations; Architect drafts policy revisions for the operator to approve.

Department C · Evolution · the god view

One agent — Marvin — sits above the other seven and watches their conversations, traces, citations, and clarification logs. He never talks to applicants and never writes policy. He surfaces drift before it becomes a bad decision. The product is named after him because this watcher-of-watchers role is the thing that makes the rest of the system safe to run unattended.

Why a committee, and not one big model?

A real credit committee works because each member has narrow expertise and clear authority. A compliance officer can hard-stop a deal a salesperson loves. A quant can produce numbers without caring whether the deal goes through. The chair listens to all three and follows a written hierarchy when they disagree. We ported that structure into software because it’s the structure auditors and regulators already understand.

A monolithic LLM, no matter how good, has to be one entity holding contradictory roles in its head. Marvin’s agents are specialised by design — the Compliance Guardian can’t accidentally soften a sanctions block to chase a sale, because the Compliance Guardian’s prompt and document allowlist physically don’t reach the policy clauses that an Underwriter would use to argue.

3 · Meet the eight agents

Each agent has a written persona, a fixed allowlist of documents it can read, and a narrow output contract. You’ll see them on the live committee view (Department A) and in the admin tabs (Departments B and C). Click any agent in the app to see its current system prompt under Settings; the operator can edit a prompt without a redeploy, and the change is visible to the next debate.

Department A · the live committee · 5 agents

📚 Internal Archivist

RAG provider · the librarian

“You are the keeper of the keys. You don’t form opinions. You hand the right document to the agent that asked.”

Reads: the entire corpus on behalf of others. Speaks: never — the Archivist is invisible to the operator. Behind the scenes, every other Department-A agent calls the Archivist to retrieve passages from its own slice of the corpus.

🛡 Compliance Guardian

AML / KYC / sanctions / PEP · hard-stop authority

“Paranoid prosecutor. Cites article, not opinion. When I say block, the deal stops — no override.”

Reads: the AML/KYC instruction. Verdicts: pass, review-required, hard-reject. A hard-reject ends the pipeline immediately; the other agents don’t even run.

🤝 Underwriter

Deal structurer · finds compensating factors

“Sales-leaning optimist. I argue the applicant’s case to the chair. I don’t fear bad numbers; I look for legal ways to work around them.”

Reads: credit policy, risk-appetite exclusion list, application-package requirements, self-employed annexes. Output: a recommendation (APPROVE / REJECT / REVIEW), a list of compensating factors, and an application-completeness check.

🧮 Data Quant

Numbers · DSTI / DSCR / credit score / default probability

“Cold mathematician. I see the world in ratios. Sentimentality is noise. I produce numbers and flag whether they fit the methodology.”

Reads: scoring methodology, refinancing/consolidation rules, self-employed annex § 3 (DSCR formula). Verdicts: within-policy, breach-soft, breach-hard.

⚖ Orchestrator

Chair · applies the conflict tree, writes the verdict

“Wise judge. I don’t compute formulas or read identity documents. I weigh the three dossiers and explain the call in language a regulator can read.”

Reads: the credit-committee precedents log + strategic-bonus / compensating-factor sections of the relevant annexes. Output: the final decision narrative.

Department B · strategy & development · 2 agents

📡 External Scout

Web + RSS monitor

“Hyperactive digital reporter. I rove the open web for changes the bank should care about — new regulations, EBA guidelines, macro shifts. I don’t analyse deeply; I bring back the link.”

Reads: nothing internal — the open web only. Every finding lands in the Newsroom tab with a relevance score, and you can dismiss, flag, or hand it to the Architect.

🏗 Policy Architect

Drafts internal-policy revisions

“Proactive bureaucrat-visionary. I love matrices. Give me a Scout finding or an orchestrator gap, and I’ll draft the policy update for you to approve. Clean Markdown, ready to ingest.”

Reads: existing internal policies (for reference) and Scout’s findings. Output: a draft in the Policy Lab tab with title, body, target tier, and proposed bundle label. The Architect can never write Tier 1 (regulator) drafts — that’s reserved for the regulator’s actual text.

Department C · evolution · 1 agent

🤖 Marvin (Meta-Architect)

AI auditor · the system watching itself

“I don’t care about banking. I care about the agent layer. I read the other seven’s logs, traces, citation lookups, eval runs. When something drifts, I write you a recommendation.”

Reads: system logs only — agent_outputs, decision_version_manifest, eval_runs, clarification_events. Output: findings in the System Health tab, sorted by severity. Marvin recommends — he never deploys anything himself.

Personas are editable: each agent’s system prompt is a row in runtime_settings and you can edit it from the admin Settings tab. The default personas above are good for most banks; if your house style differs, change the prompt and the next debate uses the new wording.

4 · A walkthrough — one case from submit to verdict

Let’s follow a single application end to end. Imagine Maria, a self-employed consultant, applying for a €30,000 loan to consolidate two existing consumer loans into a single facility. Her credit score is 620 (borderline), her income is €38,000/year, and she has one missed payment from 18 months ago.

Step 1 — Submit a case

Click Cases in the left sidebar to open the inbox. Pick a preset that looks similar to Maria’s profile (or click + New case to start from scratch), fill the borrower fields, and click Run decision at the top right.

You’re sent to the analyst page where the live committee starts streaming dossiers. The URL is /analyst/decision/<sessionId> — bookmark it; that’s your audit-trail link for this case.

Step 2 — Compliance runs first, alone

The Compliance Guardian opens the case. He runs sanctions and PEP screening, retrieves relevant AML clauses, and produces a verdict. Three outcomes are possible: pass, review-required (a human compliance officer must sign off), or hard-reject (the pipeline stops immediately).

For Maria the screening comes back clean. Compliance returns pass.

Step 3 — Underwriter and Quant work in parallel

Now two agents run side by side to keep wall-clock time under a minute. The Underwriter reads policy passages, looks at Maria’s package, and produces a dossier with a recommendation, a list of compensating factors, and a check on the application package’s completeness. Compensating factors live in a fixed enumeration — additional_collateral, co_signer, income_history, liquidity_buffer, tenure, low_dti, down_payment, relationship_value, regulatory_exception — so the Orchestrator can branch on them later without surprises.

The Quant computes Maria’s DSTI (debt-service-to-income), and because she’s self-employed, also her DSCR (debt-service coverage ratio). It pulls credit-bureau and social-security data through stubbed clients (real APIs in production), checks her score bucket, computes default probability, and emits a verdict: within-policy, breach-soft (within 10% of a threshold), or breach-hard.

For Maria, the Quant reports a breach-soft on DSTI — she’s 4 points over the 35% ceiling for self-employed applicants under 680 score — but the Underwriter has flagged a strong compensating factor: her consolidation actually reduces her monthly burden by 18%.
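
To make the thresholds concrete, here is a minimal TypeScript sketch of a DSTI classification of the kind the Quant performs, assuming the dossier carries a ratio plus a ceiling. The field names, the example ceiling, and the exact width and definition of the soft band are illustrative; the real values live in the Tier 3 scoring methodology.

```typescript
type QuantVerdict = "within-policy" | "breach-soft" | "breach-hard";

interface DstiInputs {
  monthlyDebtService: number; // total debt payments after the new loan, €
  monthlyNetIncome: number;   // declared net income, €
  dstiCeiling: number;        // e.g. 0.35 for self-employed applicants below score 680
  softBand: number;           // how far past the ceiling still counts as "soft", e.g. 0.10
}

// DSTI = debt service / income. A breach inside the soft band is breach-soft;
// anything beyond it is breach-hard. DSCR for self-employed applicants would
// follow the same pattern with its own formula and floor.
function classifyDsti({ monthlyDebtService, monthlyNetIncome, dstiCeiling, softBand }: DstiInputs): {
  dsti: number;
  verdict: QuantVerdict;
} {
  const dsti = monthlyDebtService / monthlyNetIncome;
  if (dsti <= dstiCeiling) return { dsti, verdict: "within-policy" };
  if (dsti <= dstiCeiling * (1 + softBand)) return { dsti, verdict: "breach-soft" };
  return { dsti, verdict: "breach-hard" };
}
```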

Step 4 — The conflict tree picks a branch

The three dossiers go into a deterministic TypeScript function called the conflict tree. It walks down a five-level hierarchy and returns one of eleven branches (we’ll cover the full tree in §7). For Maria, the branch is quant-breach-soft-with-comp: a soft Quant breach overcome by a strong compensating factor. The decision is APPROVE with conditions.

Step 5 — The Orchestrator writes the verdict

The Orchestrator runs last. Its job is not to make the decision — the conflict tree already did that. Its job is to write the regulator-explainable narrative that names each agent’s contribution, references the precedent log, and states the conditions of approval. For Maria the verdict ends with something like:

Final decision · APPROVE with conditions Compliance returned a clean pass on sanctions, PEP, and prohibited-sector checks. The Quant flagged a soft DSTI breach (39%, ceiling 35% for self-employed below score 680), but the Underwriter cited a strong compensating factor: post-consolidation debt service drops from 41% to 33% of declared income, satisfying the mitigation logic in Annex 03-C § 3. Approval is conditional on the new loan being disbursed directly to the two existing creditors, per the consolidation rule in Annex 03-C § 4.

What the operator sees

The analyst page is split into three columns:

The whole thing typically completes in 30 to 60 seconds per case.

A small cancel chip appears next to the “thinking…” indicator on whichever agent is currently running — co-located with the activity, not buried in a page top bar. Clicking it stops the orchestrator at the next safe checkpoint; the partial dossiers stay attached to the session for audit. The chip moves automatically as the pipeline advances and disappears once the verdict lands.

[Submit case]
      ↓
Compliance Guardian           ← runs first; sanctions + PEP + RAG
      ↓
  hard-reject? ─────────────→ short-circuit: REJECT, end
      ↓ no
  ┌───┴──────────┐
  ↓              ↓
Underwriter   Data Quant      ← run in parallel
  │              │
  └───┬──────────┘
      ↓
Conflict tree                 ← deterministic TypeScript; picks 1 of 11 branches
      ↓
Orchestrator agent            ← LLM writes the verdict
      ↓
Final decision                ← APPROVE / REJECT / REVIEW / REWORK

5 · The four outcomes — what each one means

Every case ends on one of four outcomes. Each maps to a different operator workflow.

Outcome | What it means | Operator action
APPROVE | The committee agreed to lend, with or without conditions. Conditions are listed verbatim on the verdict card. | Forward to disbursal. If conditions are listed, ensure they’re met before funds release.
REJECT | The committee refused. The verdict names the rule that fired — sanctions hit, policy exclusion, hard methodology breach. | Send the rejection notice. The reason is regulator-citeable.
REVIEW | The committee couldn’t decide. Signals are mixed or compliance asked for human eyes (e.g. PEP exposure). | Open the case, read the dossiers, and decide. Use the Approve / Reject buttons in the four-eyes panel.
REWORK | The case is fundamentally fine — compliance OK, policy OK, risk within methodology — but the application package is incomplete: missing forms, missing signatures, wrong template. Not a credit decline. | The verdict carries a checklist of missing items. Send the list back to the branch or the customer to complete the package, then resubmit.
Why REWORK is its own outcome: Before v4 the only options were APPROVE, REJECT, or REVIEW. A case with missing paperwork would land as REVIEW, and a human reviewer would have to figure out that the problem wasn’t a credit question — it was a forms question. REWORK separates the “deal is alive but documents are missing” case so the operator workflow is obvious: chase the paperwork, don’t reopen the credit decision.

When the committee can’t run at all

Sometimes the committee never delivers a verdict — the LLM provider is down, an API key is invalid, the network drops mid-debate. In that case the verdict card is replaced by a clear “Committee unavailable” error block and the session is marked FAILED, not COMPLETED. The four-eyes review form is suppressed (there is no automated decision to sign off on) and the case can be re-run via the Re-run debate button once the underlying issue is fixed. No work is lost — the application package, the corpus reads, and any partial dossiers stay attached to the original session for audit.

6 · When the system asks you a question

Sometimes an agent isn’t confident enough to commit. When confidence drops below the threshold (default 0.6) and the agent has a single specific question that an operator could realistically answer in a sentence, the agent pauses and asks. This is by design — you want the system to flag genuine uncertainty rather than fabricate confidence.

What this looks like

The dossier card on the analyst page replaces its body with a question banner: “✋ Agent paused for clarification — round 1 of 2.” Below that, the agent’s question (often multi-part, e.g. “(1) How long has the business been registered? (2) What’s the appraised collateral value? (3) Were the missed payments recent or historical?”) and a small text-area for your answer.

What you can do

Each agent gets up to two clarification rounds per case. If the agent is still uncertain after round 2, the loop caps and the case proceeds to the conflict tree with whatever best-effort dossier the agent produced — usually a REVIEW recommendation, which routes to a human anyway.
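
As a rough sketch of how that gate could work, the snippet below pauses only when confidence is below the threshold, a focused question exists, and rounds remain. The type and field names are illustrative, not the production contract.

```typescript
interface DraftDossier {
  confidence: number;          // 0..1, the agent's self-reported confidence
  clarifyingQuestion?: string; // set only when the agent has one focused question
}

const CONFIDENCE_THRESHOLD = 0.6; // default; editable under Settings
const MAX_ROUNDS = 2;             // per agent, per case

type NextStep =
  | { kind: "proceed" }                         // commit the best-effort dossier
  | { kind: "ask-operator"; question: string }; // pause and surface the question

function nextStep(dossier: DraftDossier, roundsUsed: number): NextStep {
  const uncertain = dossier.confidence < CONFIDENCE_THRESHOLD;
  const canStillAsk = roundsUsed < MAX_ROUNDS;
  if (uncertain && canStillAsk && dossier.clarifyingQuestion) {
    return { kind: "ask-operator", question: dossier.clarifyingQuestion };
  }
  // After round 2 (or with no focused question) the dossier proceeds to the
  // conflict tree as-is, usually ending in a REVIEW recommendation.
  return { kind: "proceed" };
}
```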

Cross-session memory

A Q&A you provide on one case becomes standing guidance for similar future cases. The next time an agent is about to ask the same question, it gets your earlier answer injected into its prompt as institutional memory. You can see this on the analyst page as a “📚 Prior operator guidance” row above the agent’s argument — expand it to read what context the agent was given.

Treat clarifications as feature, not friction: an agent asking a focused question is the system telling you “the corpus doesn’t cover this case.” That’s a signal to either (a) answer once and move on, or (b) have the Architect draft a policy update so future cases of this shape don’t need asking. Marvin (the agent) tracks repeated clarifications across sessions and surfaces them as “knowledge gap” findings in the System Health tab.

7 · How decisions are made — the conflict tree

When the three live agents disagree — Compliance says pass, Quant says breach-hard, Underwriter says approve — somebody has to decide who wins. In a real bank committee, that’s the chair’s job, and the chair follows a written hierarchy. We wrote the hierarchy in TypeScript. It’s a function, it’s reproducible, and it has unit tests. The LLM only writes the prose that explains which branch fired.

The five levels

Level | Owner | Question it answers
L1 · Absolute Veto | Compliance Guardian | Is the deal legal? Sanctions, AML, statutory limits.
L2 · Corporate Veto | Underwriter | Does the deal fit bank policy / risk appetite?
L3 · Quantitative Risk | Data Quant | Do the numbers fit the methodology?
L4 · Operational & Process | Underwriter | Is the application package complete?
L5 · Heuristic / STP | Orchestrator | Anything else? Straight-through approval or fallthrough review.

Higher levels never give way to lower ones. A Compliance hard-reject ends the conversation regardless of how attractive the numbers look. A bank-policy decline can only be overturned by an explicit regulatory_exception factor (which routes the case to senior management, not to auto-approval).

The eleven branches

The function picks one of these branches and returns its name. The branch name is the audit trail.

L | Branch | Decision | What it means
L1 | compliance-hard-reject | REJECT | Sanctions / AML / forbidden sector. Hard stop. No override.
L1 | compliance-review-required | REVIEW | PEP exposure or weak fuzzy match. Human compliance officer signs.
L2 | policy-decline | REJECT | Underwriter rejects on bank-policy grounds; numbers are fine, the block is policy-driven.
L2 | policy-escalation | REVIEW | Same as policy-decline but Underwriter cited a regulatory_exception — routes to senior sign-off.
L3 | quant-breach-hard-no-comp | REJECT | Hard methodology breach with no strong compensating factor.
L3 | quant-breach-hard-with-comp | REVIEW | Hard breach but Underwriter cited a strong factor — non-trivial, defer to a human.
L3 | quant-breach-soft-with-comp | APPROVE | Soft breach + strong factor — approve with conditions.
L3 | quant-breach-soft-no-comp | REVIEW | Borderline numbers, no clear save — defer.
L4 | documents-rework | REWORK | Case otherwise clean, package incomplete — return for paperwork.
L5 | all-clear | APPROVE | Everything passes — straight-through approval.
L5 | fallthrough-review | REVIEW | Signals disagree in a way the explicit branches don’t cover — defer.

Why deterministic code, not an LLM?

Auditors will ask: “why did the system approve this loan?” The answer must not depend on which model version was running on the day. By implementing the tree in TypeScript, we get reproducibility (same inputs always produce the same branch), testability (every branch has a unit test), and a function a regulator can read line by line before any LLM is involved. The LLM still runs — the Orchestrator agent — but only to write the prose explanation of the branch the tree picked.
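
For a feel of what such a function looks like, here is a compressed TypeScript sketch that follows the five levels and the branch names from the table above. It is not the production tree: it treats any compensating factor as “strong”, collapses the dossier shapes to the minimum fields needed, and skips precedent and condition handling.

```typescript
type Branch =
  | "compliance-hard-reject" | "compliance-review-required"
  | "policy-decline" | "policy-escalation"
  | "quant-breach-hard-no-comp" | "quant-breach-hard-with-comp"
  | "quant-breach-soft-with-comp" | "quant-breach-soft-no-comp"
  | "documents-rework" | "all-clear" | "fallthrough-review";

type Decision = "APPROVE" | "REJECT" | "REVIEW" | "REWORK";

interface Dossiers {
  compliance: { verdict: "pass" | "review-required" | "hard-reject" };
  underwriter: {
    recommendation: "APPROVE" | "REJECT" | "REVIEW";
    compensatingFactors: string[]; // e.g. "low_dti", "regulatory_exception"
    packageComplete: boolean;
  };
  quant: { verdict: "within-policy" | "breach-soft" | "breach-hard" };
}

function decide(d: Dossiers): { branch: Branch; decision: Decision } {
  // L1: absolute veto
  if (d.compliance.verdict === "hard-reject")
    return { branch: "compliance-hard-reject", decision: "REJECT" };
  if (d.compliance.verdict === "review-required")
    return { branch: "compliance-review-required", decision: "REVIEW" };

  // L2: corporate veto
  if (d.underwriter.recommendation === "REJECT")
    return d.underwriter.compensatingFactors.includes("regulatory_exception")
      ? { branch: "policy-escalation", decision: "REVIEW" }
      : { branch: "policy-decline", decision: "REJECT" };

  // L3: quantitative risk
  const hasComp = d.underwriter.compensatingFactors.length > 0;
  if (d.quant.verdict === "breach-hard")
    return hasComp
      ? { branch: "quant-breach-hard-with-comp", decision: "REVIEW" }
      : { branch: "quant-breach-hard-no-comp", decision: "REJECT" };
  if (d.quant.verdict === "breach-soft")
    return hasComp
      ? { branch: "quant-breach-soft-with-comp", decision: "APPROVE" }
      : { branch: "quant-breach-soft-no-comp", decision: "REVIEW" };

  // L4: operational & process
  if (!d.underwriter.packageComplete)
    return { branch: "documents-rework", decision: "REWORK" };

  // L5: straight-through or fallthrough
  return d.underwriter.recommendation === "APPROVE"
    ? { branch: "all-clear", decision: "APPROVE" }
    : { branch: "fallthrough-review", decision: "REVIEW" };
}
```

The point of the shape is that every input combination lands on exactly one named branch, and it is the branch name, not the prose, that the audit trail records.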

8 · The corpus — how agents read your documents

The agents’ opinions are only as good as the documents they read. The corpus is the bank’s knowledge layer: every internal policy, every regulatory text, every annex, every precedent log. The Corpus tab in the admin console is where you upload, review, and remove bundles.

Uploading a document

Open Admin console → Corpus. The upload form has four sections, each answering a question about the document:

  1. File

    The PDF itself. Drag in or browse.

  2. Strength — conflict tier

    How authoritative the content is. Five tiers, T1 highest. Drives retrieval ranking.

  3. Routing — target agent

    Which agent gets to read this. Under Marvin’s least-privilege RAG, a routed document is invisible to every other agent. Global fans the bundle out to all eight agents (rare; reserved for shared glossaries).

  4. Meta

    Source ID (the citation label, e.g. EBA-GL-2020-06), jurisdiction (EU, BG, Internal/Bank, etc.), volatility (static / evolving / dynamic) and version. All four propagate into every chunk produced from the document.
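
As a sketch of how that metadata could travel with the document, the types below mirror the four form sections; the type and field names are illustrative rather than the actual schema, and only the fields themselves (source ID, jurisdiction, tier, routing, volatility, version) come from the form above.

```typescript
type Tier = 1 | 2 | 3 | 4 | 5;               // T1 absolute veto … T5 heuristics
type Volatility = "static" | "evolving" | "dynamic";
type TargetAgent =
  | "compliance" | "underwriter" | "quant" | "orchestrator"
  | "archivist" | "scout" | "architect" | "marvin" | "global";

interface BundleMeta {
  sourceId: string;      // citation label, e.g. "EBA-GL-2020-06"
  jurisdiction: string;  // "EU", "BG", "Internal/Bank", …
  tier: Tier;            // conflict tier; drives retrieval ranking
  routing: TargetAgent;  // which agent's allowlist receives the bundle
  volatility: Volatility;
  version: string;
}

// Every chunk produced from the PDF inherits the bundle's metadata, so a
// citation chip can always be traced back to tier, source, and version.
interface KnowledgeChunk extends BundleMeta {
  chunkId: string;
  breadcrumb: string;    // section-header path shown in Preview
  text: string;
  tokenCount: number;
}
```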

The five tiers

Tier | Name | Examples | Conflict-tree role
T1 | Absolute Veto | EBA guidelines, BNB ordinances, statute law | Hard stop. No override. Authored only by the regulator.
T2 | Corporate Veto | Internal credit policy, risk appetite, sector exclusions | Vetoes anything below it. Authored by the bank.
T3 | Quantitative Risk | DSTI/DSCR formulas, scoring model, threshold tables | Numeric inputs. Override-able only by T1/T2 with documented exception.
T4 | Operational & Execution | Application form layout, document checklist | How-to material. Conflicts here are clerical, not policy.
T5 | Heuristics & Precedents | Precedents log, training material, analyst rules-of-thumb | Lowest authority. Useful as soft signal; never decisive on its own.

When the embedder is unreachable. RAG is an enhancement, not a hard dependency. If the embedding service times out (Voyage / OpenAI flake, network blip, bad outbound DNS), retrieval returns an empty passage set with an explicit embedder_unavailable gap. The agent runs without grounding, declares known: false in its dossier, and the conflict tree typically lands the case in REVIEW with the missing-grounding signal in the audit trail. The committee still produces a real verdict; it just stays cautious. No silent degradation, no fallback-engine placeholder.

Retriever weights descend with tier (1.00 → 0.92 → 0.80 → 0.70 → 0.60). A T1 chunk and a T5 chunk with identical similarity rank T1 first; a T5 chunk has to clear a meaningfully higher bar to compete.
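
A minimal sketch of that ranking rule, assuming tier weight simply multiplies the similarity score (the exact combination rule in the retriever may differ):

```typescript
// Tier weights as listed above; lower tiers need higher raw similarity to win.
const TIER_WEIGHT: Record<number, number> = { 1: 1.0, 2: 0.92, 3: 0.8, 4: 0.7, 5: 0.6 };

interface ScoredChunk { chunkId: string; tier: number; similarity: number }

function rank(chunks: ScoredChunk[]): ScoredChunk[] {
  return [...chunks].sort(
    (a, b) => b.similarity * TIER_WEIGHT[b.tier] - a.similarity * TIER_WEIGHT[a.tier],
  );
}

// Example: a T5 chunk at similarity 0.80 scores 0.48, while a T1 chunk at 0.55
// scores 0.55, so the regulator text still ranks first.
```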

Previewing what’s actually inside a bundle

Every bundle row carries a Preview button next to Delete. Clicking it expands a strip showing the first few chunks of text the agents will see — the breadcrumb / section header for each chunk, a 600-character excerpt, the tier badge, and the token count. The goal is fast verification: you can scroll the corpus list and confirm “yes, the AlfaBank consumer-loan policy actually has a chapter on DSTI in here” without trusting the chunk count alone or opening the source PDF in a separate tab. Preview is read-only; chunk text comes straight from knowledge_chunks.

Per-agent allowlists

Each agent reads only what its allowlist explicitly grants. The admin’s Corpus tab shows you, on every bundle row, which agents currently have access via the “Read by” column — small chips with each agent’s avatar.

Agent | Reads
Archivist | Everything (privileged)
Underwriter | Credit policy, risk-appetite list, application-package requirements, self-employed annexes (income verification + compensating factors)
Data Quant | Scoring methodology, refinancing/consolidation rules (DTI sections), self-employed annex (DSCR formula)
Compliance Guardian | AML/KYC instruction only
Orchestrator | Precedents log, refinancing strategic-bonus sections, self-employed annex compensating-factor sections
Scout | None internal — web only
Architect | Old versions of internal docs (so it knows what to revise) plus regulatory reading material
Marvin | None — reads system logs, never policies

Why the allowlist matters: A model that has read everything is one that can hallucinate from anywhere. Restricting Compliance to AML/KYC means a misbehaving Compliance prompt physically can’t cite the Underwriter’s policy clauses to soften a sanctions hit — the chunks aren’t in its retrieval pool. Least-privilege isn’t just security; it’s a hallucination control.
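
A minimal sketch of that control, assuming the allowlist is a simple agent-to-bundle map applied before any similarity search; the names are illustrative.

```typescript
interface Chunk { chunkId: string; bundleId: string; text: string }

// Least-privilege retrieval: an agent's query only ever touches chunks from
// bundles on its allowlist, so off-limits clauses can never be surfaced or cited.
function retrievableFor(
  agent: string,
  allowlist: Map<string, Set<string>>, // agent → ids of bundles it may read
  corpus: Chunk[],
): Chunk[] {
  const allowed = allowlist.get(agent) ?? new Set<string>();
  return corpus.filter((c) => allowed.has(c.bundleId));
}
```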

Citation chips

Every claim an agent makes carries a chunkId citation. On the analyst page, these render as small purple chips next to each argument. Click one to open the source passage in a side dialog, with full text, breadcrumb, jurisdiction, and tier. If a citation chip ever fails to resolve (the chunk vanished from the corpus), Marvin’s next overnight run flags it as a hallucination finding.
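
The check behind that hallucination flag can be sketched as follows, assuming a lookup helper against knowledge_chunks; the helper and field names are illustrative.

```typescript
interface Citation { chunkId: string; sessionId: string; agent: string }

// Return every citation whose chunkId no longer resolves to a real chunk;
// each one becomes a "hallucination" finding in System Health.
async function findDanglingCitations(
  citations: Citation[],
  chunkExists: (chunkId: string) => Promise<boolean>, // e.g. a lookup in knowledge_chunks
): Promise<Citation[]> {
  const dangling: Citation[] = [];
  for (const c of citations) {
    if (!(await chunkExists(c.chunkId))) dangling.push(c);
  }
  return dangling;
}
```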

9 · How the policy library grows — Strategy loop

A regulator changes a guideline; a new sector appears on a watchlist; an academic paper shifts the consensus on a methodology. Most credit-AI products freeze at deployment and drift quietly. Marvin has Department B watching the open world so the corpus stays current.

Newsroom — the Scout’s feed

Open Admin console → Newsroom. The Scout polls regulatory feeds (BNB, EBA, Държавен вестник (the Bulgarian State Gazette), ECB, BIS) on a schedule and runs targeted web searches for textbooks and working papers. Each finding lands here with a relevance score, category pill, and source link. You’ll see four states:

Click a finding to open the full body. From here you can Draft policy — which sends the finding to the Architect — or Dismiss if it’s noise. Click Run Scout now to pull a fresh batch on demand.

When the Architect has already produced a draft from a finding, the row shows a small → View draft in Policy Lab link with the draft’s current status (proposed / approved / rejected). Clicking it jumps to Policy Lab, expands the matching draft card, and tints it briefly so you can see what you were brought to. The two tabs stay separate — different agents, different actions — but the loop between them is now legible.

Policy Lab — the Architect’s drafts

Open Admin console → Policy Lab. The Architect produces drafts in response to Scout findings (or to gaps the Orchestrator flagged). Each draft is an expandable card with title, target tier, proposed bundle label, and a Markdown body you can edit inline. Three actions on each draft:

Each draft card carries a ← From news: “…” back-link to the originating Newsroom row, joined from policy_drafts.source_news_id. Click it to jump back to the Scout’s finding so you can see what triggered the draft before you sign off on it. Drafts produced from an internal knowledge gap (rather than external news) read “orchestrator gap” instead.

The Architect can’t author Tier 1: Tier 1 is reserved for the regulator’s own text. The Architect’s allowed range is T2 (corporate veto) through T5 (heuristics) — both the prompt and the database CHECK constraint enforce this. If you want to ingest a regulator document, you upload it directly via the Corpus tab as T1.

End to end — click Run Scout now, three findings appear; click Draft policy on one, a new draft lands in Policy Lab; edit the Markdown, click Approve & ingest, the new bundle appears in Corpus. Total wall-clock: 30–60 seconds.

10 · Marvin’s god view — how the system catches its own drift

The eighth agent is the one the product is named after. Marvin watches the other seven — their conversations, their dossiers, their citations, their clarification logs, their evals — and surfaces problems before they touch live applicants. The System Health tab is his dashboard.

Five signal categories

Category | What Marvin watches for
Bottleneck | An agent’s p95 latency drifting above the cross-agent median. Usually means a retrieval timeout or a misconfigured allowlist.
Hallucination | Citation chunkIds that don’t resolve to a real chunk in knowledge_chunks. The agent invented the source.
Knowledge gap | The same clarification question being asked across multiple unrelated cases. Either the corpus is missing a policy or the prompt isn’t pointing the agent at the right section.
Conflict loop | The Orchestrator’s REVIEW rate climbing over a window. Possible causes: a retrieval threshold drifted, a prompt tweak softened an agent’s commitment, methodology changed.
Prompt drift | An eval accuracy drop tied to a specific prompt-set hash. The last edit you made to a system prompt regressed something measurable.
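
As an illustration of the first category, here is a rough sketch of a bottleneck check over per-agent latencies; the 1.5× margin and the data shape are assumptions, not the tuned production heuristic.

```typescript
function percentile(sortedAsc: number[], p: number): number {
  const idx = Math.min(sortedAsc.length - 1, Math.floor(p * (sortedAsc.length - 1)));
  return sortedAsc[idx];
}

function bottleneckFindings(
  latenciesByAgent: Record<string, number[]>, // per-run latency in ms, e.g. from agent_outputs
  margin = 1.5,                               // assumption: how far above median counts as drift
): string[] {
  const p95: Record<string, number> = {};
  for (const [agent, xs] of Object.entries(latenciesByAgent)) {
    p95[agent] = percentile([...xs].sort((a, b) => a - b), 0.95);
  }
  const values = Object.values(p95).sort((a, b) => a - b);
  const median = values[Math.floor(values.length / 2)];
  return Object.entries(p95)
    .filter(([, v]) => v > median * margin)
    .map(([agent, v]) => `bottleneck: ${agent} p95 ${v}ms vs cross-agent median ${median}ms`);
}
```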

What you do with a finding

Each finding shows in System Health with severity (block / warn / info), title, Markdown recommendation, and the evidence Marvin used (specific session ids, log excerpts). Three actions per finding:

Click Run Marvin now to fire an immediate analysis pass. Otherwise Marvin runs nightly on a schedule.

Marvin recommends — never deploys: In v4 Marvin produces findings only. He doesn’t edit prompts, change allowlists, or spin up new agents. The operator approves every change. Autonomous remediation is on the roadmap (v4.5), but it’s deliberately gated until the recommendation flow has earned enough trust to extend.

11 · Evals — testing on historical data

Before you trust the system on a live portfolio, you want to know how it would have performed on cases where the outcome is already known. The Evals tab is for that.

The flow

  1. Upload a dataset

    CSV with one row per historical loan: the application fields plus the realised outcome (defaulted / repaid). The system stores it and shows row counts.

  2. Start a run

    Click Run eval on a dataset. The system processes rows in order through the same pipeline a live case uses (Compliance → Underwriter ‖ Quant → Orchestrator). Cursor-based, so re-running picks up where it left off without duplicates.

  3. Read the results

    Confusion matrix (predicted vs realised), accuracy, precision/recall, expected loss per €1k lent. Per-row replay link — click any row to open the full analyst view of that historical case as the system would have decided it today.
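
As a sketch of how those headline numbers could be derived, the snippet below treats REJECT as “predicted default” and APPROVE as “predicted repaid”; that mapping, the row shape, and the loss calculation are simplifying assumptions (REVIEW and REWORK rows would need their own treatment in the real report).

```typescript
interface EvalRow {
  decision: "APPROVE" | "REJECT"; // the committee's verdict on the historical case
  defaulted: boolean;             // realised outcome from the dataset
  principal: number;              // € lent when the case was approved
  lossGivenDefault: number;       // € actually lost when the loan defaulted
}

function evalMetrics(rows: EvalRow[]) {
  let tp = 0, fp = 0, tn = 0, fn = 0, loss = 0, lent = 0;
  for (const r of rows) {
    const predictedDefault = r.decision === "REJECT";
    if (predictedDefault && r.defaulted) tp++;
    else if (predictedDefault && !r.defaulted) fp++;
    else if (!predictedDefault && !r.defaulted) tn++;
    else fn++;
    if (r.decision === "APPROVE") {
      lent += r.principal;
      if (r.defaulted) loss += r.lossGivenDefault;
    }
  }
  return {
    accuracy: (tp + tn) / rows.length,
    precision: tp / Math.max(tp + fp, 1), // of the cases it rejected, how many truly defaulted
    recall: tp / Math.max(tp + fn, 1),    // of the true defaults, how many it rejected
    expectedLossPer1kLent: lent > 0 ? (loss / lent) * 1000 : 0,
  };
}
```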

Why this matters for trust

An eval is the closest you get to a back-test. It’s also the input that lets Marvin detect prompt drift — if you edit a system prompt and the same dataset’s accuracy drops, the next overnight run fires a prompt-drift finding linking the eval regression to the prompt-set hash that produced it.

Tip · keep one dataset frozen: Designate one historical dataset as your “golden” benchmark and never edit it. Run evals against it after every prompt or settings change. Drift on the golden set is a hard signal that something regressed.

12 · Four-eyes review

Every case opens with a maker (the operator who submitted) and a slot for a checker (a second operator who must approve). This is the standard regulatory-bank control: no single person can move a case from REVIEW to APPROVE/REJECT.

What you see

The analyst page shows a Four-eyes Review block under the dossier grid with four states:

The maker can never check their own case — the button is disabled with an explanation. If you see the wrong checker name on a case, log out and log back in as the correct user; the maker/checker identity comes from the X-User-Id header which the admin login flow sets.
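
A minimal sketch of that rule, assuming the checker identity is read from the X-User-Id header mentioned above; the record shape and function name are illustrative.

```typescript
interface CaseRecord {
  sessionId: string;
  makerId: string;                  // operator who submitted the case
  checkerId?: string;               // second operator who signs off
  finalCall?: "APPROVE" | "REJECT";
}

function recordCheckerDecision(
  record: CaseRecord,
  headers: Record<string, string | undefined>,
  decision: "APPROVE" | "REJECT",
): CaseRecord {
  const checkerId = headers["x-user-id"]; // identity set by the admin login flow
  if (!checkerId) throw new Error("not signed in: no X-User-Id on the request");
  if (checkerId === record.makerId) {
    // Mirrors the disabled button in the UI: no single person moves a case
    // from REVIEW to APPROVE or REJECT.
    throw new Error("maker cannot check their own case");
  }
  return { ...record, checkerId, finalCall: decision };
}
```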

13 · Tuning the system

The Settings tab in the admin console is where operators tune the system without touching code or redeploying. Five categories of knob:

Caching note: runtime settings are cached in-process for 60 seconds. The machine that takes the write invalidates immediately, but on a multi-machine deploy any other backend can serve a stale value for up to 60s after a save. The current single-machine Fly setup makes this invisible; revisit the TTL or move to LISTEN/NOTIFY / Redis when adding a second machine.
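
A rough sketch of that cache behaviour, with loader and writer functions standing in for the real database access (names are illustrative):

```typescript
const TTL_MS = 60_000; // settings are served from memory for up to 60 seconds

let cache: { value: Record<string, string>; loadedAt: number } | null = null;

async function getSettings(load: () => Promise<Record<string, string>>) {
  if (cache && Date.now() - cache.loadedAt < TTL_MS) return cache.value;
  cache = { value: await load(), loadedAt: Date.now() };
  return cache.value;
}

async function saveSetting(
  key: string,
  value: string,
  persist: (k: string, v: string) => Promise<void>,
) {
  await persist(key, value);
  cache = null; // immediate invalidation, but only on the machine that took the write
}
```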

Retrieval

Clarification flow

System prompts

Each of the eight agents has an editable system prompt. Click any agent’s row to expand the textarea, edit, and save. The change is visible to the next debate. The version manifest records the prompt-set hash on every decision, so an auditor can tell you exactly which prompt version produced a given verdict.
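
One way such a prompt-set hash could be derived is sketched below, assuming a Node runtime and a stable ordering of the eight prompts; the function and field names are illustrative.

```typescript
import { createHash } from "node:crypto";

// Hash all system prompts in a deterministic order so the manifest can pin
// exactly which prompt set produced a given verdict.
function promptSetHash(promptsByAgent: Record<string, string>): string {
  const canonical = Object.keys(promptsByAgent)
    .sort()
    .map((agent) => `${agent}\n${promptsByAgent[agent]}`)
    .join("\n---\n");
  return createHash("sha256").update(canonical).digest("hex").slice(0, 12);
}
```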

LLM provider & model

Switch between Anthropic Claude variants (Haiku / Sonnet / Opus) or OpenAI models without a redeploy. Each model has its own pricing tier shown in the dropdown.

Live pipeline

The LIVE_PIPELINE flag toggles between the v3 factor debate (legacy) and the v4 committee. New deployments default to v4; v3 stays available for backwards compatibility with sessions that already started under it.

Operator caution: Edits to system prompts and retrieval thresholds change agent behaviour immediately. After any change, run your golden eval (§11) to confirm nothing regressed. Marvin will eventually flag a regression on its own — but a manual eval is faster and removes the worry.

14 · Quick reference

Glossary

Term | Meaning
Agent | One of the eight specialised AI workers (Compliance, Underwriter, Quant, Orchestrator, Archivist, Scout, Architect, Marvin).
Department | A grouping of agents by purpose: A (live), B (strategy), C (evolution).
Dossier | An agent’s structured output for a single case — verdict, reasoning, citations, confidence.
Conflict tree | The deterministic TypeScript function that picks the decision branch from the three Department-A dossiers.
Branch | One of the eleven leaves of the conflict tree. The branch name is the audit trail.
Tier | The authority level of a corpus document. T1 (regulator) through T5 (precedents). Drives retrieval ranking.
Allowlist | Per-agent list of bundles the agent is permitted to read. Implements least-privilege RAG.
Bundle | A single ingested document (or document section) in the corpus. The unit of retrieval and allowlisting.
Clarification | A pause-and-ask round where an agent surfaces a focused question to the operator.
Manifest | The version pin saved per decision — prompt hash, retrieval settings, embedder, corpus bundles. Auditors love these.
Maker / Checker | Two-person rule. Maker submits; checker (must be a different operator) approves or rejects the final call.
Finding | One of Marvin’s observations in the System Health tab. Severity-sorted; recommends only.

Where to find things

The admin console is grouped into four buckets so the eight tools don’t feel equally urgent: Operate (live work), Knowledge (the corpus and what feeds it), Quality (testing and observability), and Settings. The path column below names the group then the leaf.

I want to… | Go to
Submit a case | Cases (left sidebar) → pick a preset row or hit the lime + Submit new case button (top-right of the list)
Watch a debate | The analyst page — opens automatically after submit
Review live debates | Admin console → Operate → Debates
Resolve a clarification request | Admin console → Operate → Clarifications
Upload a policy document | Admin console → Knowledge → Corpus → Upload
See what Scout found | Admin console → Knowledge → Newsroom
Approve a draft policy | Admin console → Knowledge → Policy Lab
Run an eval | Admin console → Quality → Evals
Check on system health | Admin console → Quality → System Health
Edit an agent’s prompt | Admin console → Settings → expand the agent’s row
Tune retrieval / clarification thresholds | Admin console → Settings

15 · What changed from v3

v3 was a single-debate model: an Approver agent and a Rejector agent argued over a fixed list of factors, a Judge agent picked sides, and the operator read the transcript. It worked, but it was hard to map onto a real bank’s controls and the audit trail was hard to read.

v4 is the committee model documented in this guide. Briefly:

v3 | v4
3 generic agents (Approver / Rejector / Judge) | 8 specialised agents across 3 departments — including Marvin watching the watchers
5-factor debate; one transcript per case | Per-agent dossiers; conflict tree picks the verdict deterministically
3 outcomes (APPROVE / REJECT / REVIEW) | 4 outcomes — REWORK added for missing-paperwork cases that aren’t credit declines
Single corpus pool, shared by all agents | Per-agent allowlists; least-privilege RAG; 5-tier authority scheme
Manual policy updates only | Strategy loop — Scout pulls news, Architect drafts, operator approves & ingests
No system-level monitoring | Marvin watches drift, hallucinations, bottlenecks, prompt regressions
Clarifications were one-shot, factor-keyed | Clarifications are agent-keyed, multi-round, attachment-aware, and feed cross-session memory

v3 sessions remain queryable on the analyst page — the schema_version column on debate_sessions drives the layout, so historical v3 cases keep their original visual treatment when you click Replay.

If you’re evaluating both versions side by side: pick the same input case from the inbox, run it once with LIVE_PIPELINE=v3 and once with LIVE_PIPELINE=v4, then compare the analyst pages. The v4 page will tell you which agent saw what, why, and on the basis of which citation. The v3 page will show you a debate transcript. Both reach a verdict; v4’s is regulator-citeable end to end.